(Sometimes) more is less

The general consensus is that when it comes to data, more is always better. You may run into issues with processing time or storage costs, but otherwise, if a little bit of something is good, more is better. After all, more is made up of many versions of less, so you can always work with a subset of your data. However, sometimes it does turn out that more is less.

Some time back I came across an interesting paper, ‘When Less is More: “Slicing” Sequencing Data Improves Read Decoding Accuracy and De Novo Assembly Quality’, which reports on the de novo assembly of BAC clones. BAC clones are relatively short DNA fragments (100–150 kbp), and given their short size, sequencing depths in the range of 1,000x–10,000x are easy to achieve. The authors study how assembly quality changes as the amount of sequencing data increases. They find that once the depth of sequencing rises above a certain threshold, sequencing errors make both the problem of decoding reads to their assemblies and the problem of de novo assembly harder. As a consequence, the quality of the solution degrades with more and more data.

The reason is that, in the presence of noise, it becomes increasingly hard to tell a novel sequence from a sequencing error as the amount of data grows. The solution they propose is a “divide and conquer” approach: slice the data into subsamples, decode each slice independently, then merge the results.
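The slicing step on its own can be sketched in a few lines of shell. This is only an illustration, not the paper's procedure: the function name slice_fastq, the round-robin assignment of reads to slices, and the output file naming are all assumptions made here, and the input is assumed to be a FASTQ file with exactly four lines per read. The decoding and merging steps are specific to the paper and are not shown.

```shell
# slice_fastq INPUT K PREFIX (hypothetical helper, not from the paper):
# deal FASTQ records (assumed to be 4 lines each) round-robin into K files
# named PREFIX.0.fastq ... PREFIX.(K-1).fastq.
slice_fastq() {
    input=$1
    k=$2
    prefix=$3
    awk -v k="$k" -v prefix="$prefix" '
        {
            record = int((NR - 1) / 4)                        # 0-based read index
            out = sprintf("%s.%d.fastq", prefix, record % k)  # slice for this read
            print > out
        }
    ' "$input"
}
```

Calling slice_fastq reads.fastq 10 slice would then produce ten smaller files, each holding roughly a tenth of the reads, which could be processed independently.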

In other situations, choosing the wrong model can also lead to wrong conclusions. In tree inference, long branch attraction (LBA) is a form of systematic error whereby distantly related lineages are incorrectly inferred to be closely related. LBA arises when the amount of molecular change accumulated within a lineage is sufficient to cause that lineage to appear similar (and thus closely related) to another long-branched lineage, solely because both have undergone a large amount of change rather than because they are related by descent. Thus, for a tree of this type, the more data that is collected, the more strongly the inferred tree will tend toward the wrong one.

As a digression on more and less: back in 1978, a command called more was written by a University of California grad student named Daniel Halbert. It was a fairly basic pager, something that allowed you to view a file one screenful at a time. Very handy, except that scrolling back was not possible. To get around its limitations, in 1984 another developer wrote a pager that allowed both forward and backward navigation through the file, among other improvements. This program was called less, and less could do a lot more than more.

Most frequently used commands on the console

For a bioinformatician, nothing matches the speed and efficacy of the command line once you are used to it. The point-and-click interfaces of many tools simply don’t cut it when compared with the power and flexibility of command lines and shell scripts. More importantly, scripts can be combined with other scripts, put under version control, and adapted to work with other systems.

As Matt Might says, “The continued dominance of the command line among experts is a testament to the power of linguistic abstraction: when it comes to computing, a word is worth a thousand pictures.”

So I tried to find out which commands I use most often on the command line. A first pass used the command

farhat@heracles:~$ cut -f1 -d' ' .bash_history |sort |uniq -c|sort -n|tail -20
13 mv
13 paste
14 bg
15 history|grep
17 wc
18 ~/software/bwa-0.7.5a/bwa
20 samtools
23 bedtools
28 less
32 scp
38 vi
42 rm
49 head
49 sudo
54 tail
66 screen
106 top
112 cat
397 cd
587 ls

 

Not surprisingly, ls and cd are there a lot. So is screen, which probably means that a number of commands are missing from this list. samtools comes up less frequently than bedtools, which may be because samtools is often used in the middle of a pipeline, where counting only the first word of each line misses it. So let’s correct for that:


farhat@heracles:~$ sed 's/|/\n/g' .bash_history| sed 's/^ //' | cut -f1 -d' ' |sort |uniq -c|sort -n|tail -20
14 bg
18 ~/software/bwa-0.7.5a/bwa
23 bedtools
23 history
32 scp
38 vi
39 samtools
41 wc
42 rm
49 sudo
66 screen
67 gawk
67 tail
79 grep
84 less
106 top
111 head
115 cat
397 cd
587 ls

With that change we see that samtools pulls well ahead of bedtools. The list also indicates which commands are ripe for optimization. Even a few keystrokes saved on the most frequently used commands can save a fair bit of time.
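To rerun the count later, the pipeline-aware version can be wrapped in a small shell function. This is a sketch under a few assumptions: the name top_commands is made up here, tr is used in place of the sed newline substitution so that it does not depend on GNU sed, and, like the original, it still ignores commands chained with ; or && and counts sudo rather than the command it runs.

```shell
# top_commands FILE [N] (hypothetical helper): count the first word of every
# pipeline stage in FILE and print the N most frequent (default 20),
# least frequent first.
top_commands() {
    histfile=$1
    n=${2:-20}
    tr '|' '\n' < "$histfile" |     # put each pipeline stage on its own line
        sed 's/^[[:space:]]*//' |   # strip leading whitespace
        cut -d' ' -f1 |             # keep only the command name
        grep -v '^$' |              # drop empty lines (e.g. from "||")
        sort | uniq -c | sort -n | tail -n "$n"
}
```

For example, top_commands ~/.bash_history 10 would print the ten most frequent commands from the history file.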