(Sometimes) more is less

The general consensus is that when it comes to data, more is always better. You may run into issues with processing time or storage costs, but otherwise, if a little bit of something is good, more is better. After all, more is made up of many versions of less, so you can always work with a subset of your data. However, sometimes it does turn out that more is less. Some time back I came across an interesting paper, ‘When Less is More: “Slicing” Sequencing Data Improves Read Decoding Accuracy and De Novo Assembly Quality’, which reports on the de novo assembly of BAC clones. BAC clones are relatively short DNA fragments (100–150 kbp), and given their short size, sequencing depths in the range of 1,000x–10,000x are easy to achieve. The authors study how the assembly quality changes as the amount of sequencing data increases. They find that when the depth of sequencing rises above a certain threshold, sequencing errors make both the problem of decoding reads to their assemblies and the problem of de novo assembly harder, and as a consequence the quality of the solution degrades with more and more data.

The reason is that, in the presence of noise, the absolute number of erroneous reads grows along with the data, so it becomes increasingly hard to tell a novel sequence from a sequencing error. The solution they propose is “divide and conquer”: slice the data into subsamples, decode each slice independently, then merge the results.
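To make the idea concrete, here is a toy sketch in Python. It is not the paper's decoding algorithm. It assumes a much simpler task: deciding which k-mers are "real" using a fixed count threshold. At high depth, erroneous k-mers pile up past the threshold, whereas slicing the reads, calling k-mers in each slice, and keeping only those that most slices agree on filters them out. The function names, the circular toy genome, the 1% substitution error rate and the voting rule used for merging are all my own assumptions for illustration.

```python
import random
from collections import Counter

BASES = "ACGT"

def simulate_reads(genome, n_reads, read_len, error_rate, rng):
    """Sample reads from a circular genome, adding random substitution errors."""
    circular = genome + genome[:read_len]
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome))
        read = []
        for base in circular[start:start + read_len]:
            if rng.random() < error_rate:
                # Replace the base with one of the three other bases.
                base = rng.choice([b for b in BASES if b != base])
            read.append(base)
        reads.append("".join(read))
    return reads

def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def true_kmers(genome, k):
    """The k-mers actually present in the circular genome."""
    return set(kmers(genome + genome[:k - 1], k))

def solid_kmers(reads, k, min_count):
    """Call a k-mer 'real' if it is seen at least min_count times."""
    counts = Counter(km for read in reads for km in kmers(read, k))
    return {km for km, n in counts.items() if n >= min_count}

def slice_reads(reads, n_slices, rng):
    """Shuffle the reads and deal them into n_slices roughly equal slices."""
    shuffled = list(reads)
    rng.shuffle(shuffled)
    return [shuffled[i::n_slices] for i in range(n_slices)]

def sliced_solid_kmers(reads, k, min_count, n_slices, min_votes, rng):
    """Call k-mers in each slice separately, then keep those with enough votes."""
    votes = Counter()
    for chunk in slice_reads(reads, n_slices, rng):
        votes.update(solid_kmers(chunk, k, min_count))
    return {km for km, n in votes.items() if n >= min_votes}

if __name__ == "__main__":
    rng = random.Random(1)
    genome = "".join(rng.choice(BASES) for _ in range(200))
    truth = true_kmers(genome, 15)
    reads = simulate_reads(genome, 4000, 50, 0.01, rng)  # roughly 1,000x depth
    whole = solid_kmers(reads, 15, 3)
    sliced = sliced_solid_kmers(reads, 15, 3, 10, 6, rng)
    print("false k-mers, all reads at once:", len(whole - truth))
    print("false k-mers, sliced and merged:", len(sliced - truth))
```

A fixed threshold is a deliberately naive choice, and one could instead scale it with depth. The point of the sketch is only that each slice sees every true k-mer many times but any given error rarely, so agreement across slices is a cleaner signal than a raw count over everything.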

In other situations, the choice of the wrong model can also lead to wrong conclusions. In tree inference, long branch attraction (LBA) is a form of systematic error whereby distantly related lineages are incorrectly inferred to be closely related. LBA arises when the amount of molecular change accumulated within a lineage is sufficient to cause that lineage to appear similar (and thus closely related) to another long-branched lineage, solely because they have both undergone a large amount of change, rather than because they are related by descent. Because the error is systematic rather than random, collecting more data does not average it out: for a tree of this type, the more data that is collected, the more strongly the inferred tree will tend toward the wrong tree.
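A small simulation shows the effect. This is a sketch under assumptions of my own, not something taken from a particular study: four taxa, two-state characters, a true tree ((A,B),(C,D)) in which A and C sit on long branches (0.4 chance of a change per site) while B, D and the internal branch are short (0.05), and maximum parsimony as the inference method. For four taxa, parsimony simply picks the grouping supported by the most two-against-two site patterns.

```python
import random
from collections import Counter

def flip(state, prob, rng):
    """Return the state after one branch: flipped with probability prob."""
    return state ^ int(rng.random() < prob)

def simulate_site(p_long, q_short, rng):
    """One two-state character evolved on the true tree ((A,B),(C,D))."""
    x = rng.randrange(2)          # ancestor of A and B
    y = flip(x, q_short, rng)     # ancestor of C and D, across the short internal branch
    a = flip(x, p_long, rng)      # long branch
    b = flip(x, q_short, rng)     # short branch
    c = flip(y, p_long, rng)      # long branch
    d = flip(y, q_short, rng)     # short branch
    return a, b, c, d

def parsimony_support(sites):
    """Count the informative (two-against-two) sites favouring each grouping."""
    support = Counter({"AB|CD": 0, "AC|BD": 0, "AD|BC": 0})
    for a, b, c, d in sites:
        if a == b and c == d and a != c:
            support["AB|CD"] += 1
        elif a == c and b == d and a != b:
            support["AC|BD"] += 1
        elif a == d and b == c and a != b:
            support["AD|BC"] += 1
    return support

def infer_tree(n_sites, p_long=0.4, q_short=0.05, seed=0):
    """Simulate n_sites characters and return the parsimony tree and its support."""
    rng = random.Random(seed)
    sites = [simulate_site(p_long, q_short, rng) for _ in range(n_sites)]
    support = parsimony_support(sites)
    return max(support, key=support.get), support

if __name__ == "__main__":
    for n_sites in (20, 200, 2000, 20000):
        tree, support = infer_tree(n_sites)
        print(n_sites, tree, dict(support))
```

With these branch lengths, a site is several times more likely to show A and C sharing a state by coincidence than A and B sharing one by descent, so the gap in favour of the wrong grouping AC|BD widens as sites are added.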

As a digression on more and less, back in 1978, a command called more was written by a University of California grad student named Daniel Halbert. It was a fairly basic pager, something that allowed you to view a file one screenful at a time. Very handy, except that if you wanted to scroll back, you could not. To get around its limitations, in 1984 another developer wrote a pager that allowed both forward and backward navigation through the file, among other improvements. This program was called less, and less could do a lot more than more.


Author: Farhat

I am a physicist turned bioinformatics researcher turned Data Scientist.

2 thoughts on “(Sometimes) more is less”

  1. This is an interesting perspective. We probably need to look at big data and small data in parallel. I read an article some time back which made a somewhat related point: https://www.linkedin.com/pulse/lego-engineered-remarkable-turnaround-its-business-howd-lindstrom

    1. That’s a great point. Data, by itself, is firstly limited to what’s being collected, which is in turn limited by cost, availability, and cultural biases. Most egregiously, I’ve seen it in Biology projects, where they will collect blood samples or cheek swabs ‘because that’s what we have permission for’, ignoring that the question they are trying to answer cannot be answered with data from the samples they have collected. The belief is that somehow the bioinformaticians will figure it all out later on.

      More importantly, in a chaotic (in the mathematical sense) world, with positive feedback and nonlinear effects all over the place, small changes can have massive effects. Finding and exploiting these is beyond the power of today’s models, partly because which small effect will start dominating is uncertain and unpredictable.
