The general consensus is that when it comes to data, more is always better. You may run into issues with processing time or storage costs, but otherwise, if a little bit of something is good, more is better. After all, more is made up of many versions of less, so you can always work with a subset of your data. However, sometimes it does turn out that more is less. Some time back I came across an interesting paper, ‘When Less is More: “Slicing” Sequencing Data Improves Read Decoding Accuracy and De Novo Assembly Quality’, which reports on the de novo assembly of BAC clones. BAC clones are relatively short DNA fragments (100–150 kbp), and given their short size, sequencing depths in the range of 1,000x–10,000x are easy to achieve. The authors study how assembly quality changes as the amount of sequencing data increases and find that when the sequencing depth rises above a certain threshold, sequencing errors make both the problem of decoding reads to their assemblies and the problem of de novo assembly harder. As a consequence, the quality of the solution degrades with more and more data.
The reason is that, in the presence of noise, it becomes increasingly hard to tell a novel sequence from a sequencing error as the data grows. The solution they propose is a “divide and conquer” approach: slice the data into subsamples, decode each slice independently, then merge the results.
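To get a feel for why depth can hurt, here is a toy simulation. It is not the paper's algorithm. As a stand-in for "decoding", it calls a k-mer "solid" when it is seen at least `min_count` times, and it merges slices by a simple majority vote. The circular random genome, the uniform substitution errors, and every name and parameter below are assumptions made for illustration.

```python
import random
from collections import Counter

def simulate_reads(genome, depth, read_len, error_rate, rng):
    """Sample reads from a circular genome with uniform substitution errors."""
    n_reads = depth * len(genome) // read_len
    doubled = genome + genome  # lets reads wrap around the circular genome
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome))
        bases = list(doubled[start:start + read_len])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(bases))
    return reads

def solid_kmers(reads, k, min_count):
    """Return the k-mers seen at least min_count times in the reads."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    return {kmer for kmer, c in counts.items() if c >= min_count}

def sliced_solid_kmers(reads, n_slices, k, min_count):
    """Call solid k-mers per slice, then keep those found in most slices."""
    votes = Counter()
    for s in range(n_slices):
        votes.update(solid_kmers(reads[s::n_slices], k, min_count))
    return {kmer for kmer, v in votes.items() if v > n_slices / 2}

rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(300))
k = 15
# Every k-mer that truly occurs in the circular genome.
true_kmers = {(genome + genome[:k - 1])[i:i + k] for i in range(len(genome))}

low = simulate_reads(genome, depth=20, read_len=50, error_rate=0.01, rng=rng)
high = simulate_reads(genome, depth=1000, read_len=50, error_rate=0.01, rng=rng)

# Solid k-mers that are not in the genome are sequencing errors that
# passed the filter and now look like real sequence.
spurious_low = len(solid_kmers(low, k, 3) - true_kmers)
spurious_high = len(solid_kmers(high, k, 3) - true_kmers)
spurious_sliced = len(sliced_solid_kmers(high, 50, k, 3) - true_kmers)
print(spurious_low, spurious_high, spurious_sliced)
```

With a fixed threshold, the same error simply recurs often enough at high depth to clear the filter, so the number of spurious k-mers grows with the data. Slicing the deep data set into 50 slices of roughly 20x each brings every slice back to a depth where the threshold still separates errors from real sequence.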
In other situations, the choice of the wrong model can also lead to wrong conclusions, and more data only reinforces them. In tree inference, long branch attraction (LBA) is a form of systematic error whereby distantly related lineages are incorrectly inferred to be closely related. LBA arises when the amount of molecular change accumulated within a lineage is sufficient to cause that lineage to appear similar (and thus closely related) to another long-branched lineage, solely because both have undergone a large amount of change rather than because they are related by descent. Thus, when we have a tree of this type, the more data we collect, the more strongly the inferred tree tends toward the wrong tree.
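The effect is easy to reproduce in a toy setting. The sketch below assumes a four-taxon true tree ((A,B),(C,D)) in which A and C sit on long branches, binary characters that flip with a fixed probability along each branch, and a parsimony-style count of which of the three possible groupings each site supports. The function names and the branch probabilities (0.4 for long, 0.05 for short) are illustrative choices, not taken from any particular study.

```python
import random
from collections import Counter

def flip(state, prob, rng):
    """Flip a binary character state with the given probability."""
    return 1 - state if rng.random() < prob else state

def simulate_site(long_p, short_p, rng):
    """Evolve one binary site on the true tree ((A,B),(C,D)).

    A and C sit on long branches; B, D and the internal branch are short.
    """
    u = rng.randrange(2)        # ancestor of A and B
    v = flip(u, short_p, rng)   # ancestor of C and D, across the internal branch
    a = flip(u, long_p, rng)
    b = flip(u, short_p, rng)
    c = flip(v, long_p, rng)
    d = flip(v, short_p, rng)
    return a, b, c, d

def parsimony_support(n_sites, long_p, short_p, rng):
    """Count the informative sites that support each possible grouping."""
    counts = Counter()
    for _ in range(n_sites):
        a, b, c, d = simulate_site(long_p, short_p, rng)
        if a == b and c == d and a != c:
            counts["AB|CD"] += 1   # supports the true tree
        elif a == c and b == d and a != b:
            counts["AC|BD"] += 1   # groups the two long branches
        elif a == d and b == c and a != b:
            counts["AD|BC"] += 1
    return counts

rng = random.Random(1)
for n_sites in (20, 200, 2000, 20000):
    support = parsimony_support(n_sites, long_p=0.4, short_p=0.05, rng=rng)
    best = max(("AB|CD", "AC|BD", "AD|BC"), key=lambda t: support[t])
    print(n_sites, best, dict(support))
```

Under these settings, sites where A and C happen to match by parallel change are expected to outnumber sites that reflect the true split, so adding sites widens the margin in favour of the wrong grouping rather than correcting it. Setting `long_p` equal to `short_p` removes the long branches, and the true grouping AB|CD is expected to win again.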
As a digression on more and less: back in 1978, a command called more was written by a University of California grad student named Daniel Halbert. It was a fairly basic pager, something that allowed you to view a file one screenful at a time. Very handy, except that if you wanted to scroll back, you could not. To get around its limitations, in 1984 another developer wrote a pager that allowed both forward and backward navigation through the file, among other improvements. This program was called less, and less could do a lot more than more.