(Sometimes) more is less

The general consensus is that when it comes to data, more is always better. You may run into issues with processing time or storage costs, but otherwise, if a little bit of something is good, more is better. After all, more is made up of many versions of less, so you can always work with a subset of your data. However, sometimes it does turn out that more is less. Some time back I came across an interesting paper, ‘When Less is More: “Slicing” Sequencing Data Improves Read Decoding Accuracy and De Novo Assembly Quality’, which reports on the de novo assembly of BAC clones. BAC clones are relatively short DNA fragments (100–150 kbp), and given their short size, sequencing depths in the range of 1,000x–10,000x are easy to achieve. The authors study how the assembly quality changes as the amount of sequencing data increases. They find that when the depth of sequencing rises above a certain threshold, sequencing errors make both the problem of decoding reads to their assemblies and the problem of de novo assembly harder, and as a consequence the quality of the solution degrades with more and more data.

The reason is that in the presence of noise, as the data increases it becomes increasingly hard to tell a novel sequence from a sequencing error. The solution they propose is “divide and conquer”: slice the data into subsamples, decode each slice independently, then merge the results.
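The slicing step might be sketched in the shell as follows. This is an illustration only, not the paper's method: the helper name `slice_fastq`, the round-robin assignment of reads to slices, and the output file naming are all assumptions, and the decode and merge steps are left out because they depend on the tools being used.

```shell
# Hypothetical helper: deal 4-line FASTQ records round-robin into k slice files.
# Usage: slice_fastq <input.fastq> <k> <output-prefix>
slice_fastq() {
    input=$1
    k=$2
    prefix=$3
    awk -v k="$k" -v prefix="$prefix" '
    {
        rec = int((NR - 1) / 4)              # 0-based index of the current record
        out = prefix "." (rec % k) ".fastq"  # slice file this record belongs to
        print > out
    }' "$input"
}
```

For example, `slice_fastq reads.fastq 4 slice` (with `reads.fastq` standing in for your own file) would write `slice.0.fastq` through `slice.3.fastq`, each holding roughly a quarter of the reads.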

In other situations, the choice of the wrong model can also lead to wrong conclusions. In tree inference, long branch attraction (LBA) is a form of systematic error whereby distantly related lineages are incorrectly inferred to be closely related. LBA arises when the amount of molecular change accumulated within a lineage is sufficient to cause that lineage to appear similar (and thus closely related) to another long-branched lineage, solely because they have both undergone a large amount of change rather than because they are related by descent. Thus, when we have a tree of this type, the more data we collect, the more strongly the inferred tree tends toward the wrong tree.

As a digression on more and less: back in 1978, a command called more was written by a University of California grad student named Daniel Halbert. It was a fairly basic pager, something that allowed you to view a file one screenful at a time. Very handy, except that if you wanted to scroll back, it was not possible. To get around its limitations, in 1984 another developer wrote a pager that allowed both forward and backward navigation through the file, among other improvements. This program was called less, and less could do a lot more than more.


Converting outies to innies

Some sequencers (notably SOLiD), when doing mate pair sequencing, provide reads in the R3/F3 format, where both reads point in the forward direction. Some tools, e.g. scaffolders, insist on reads that point inward. Thus, one may want to convert reads from

------>R3       ------>F3

to

------>R3       <------F3

Now, one option would be to flip the reads around before aligning them. However, if the reads are already aligned, this is not necessary: we can flip the reads on the stream.

We don’t need to touch the SEQ and QUAL fields, since SAM always stores them relative to the forward reference strand, whatever the orientation of the read. All we need to do is identify the F3 reads and toggle their 0x10 flag, which indicates that SEQ is reverse complemented. That takes care of the F3 reads. On the R3 reads, we toggle the 0x20 flag, which indicates that the mate's SEQ is reverse complemented, so that it stays consistent with the flipped F3 mate. And we are done.

Here is a small code snippet that does the flipping on the fly and produces a new BAM with the reads pointed in the right direction. It treats first-in-pair reads (0x40) as R3 and second-in-pair reads (0x80) as F3.

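samtools view -h outie.bam | \
gawk 'BEGIN { OFS = "\t" }
{
    if ($1 ~ /^@/) { print; next }                    # header lines pass through untouched
    else if (and($2, 0x40)) { $2 = xor($2, 0x20) }    # first in pair: toggle "mate reversed"
    else if (and($2, 0x80)) { $2 = xor($2, 0x10) }    # second in pair: toggle "read reversed"
    print
}' | \
samtools view -bS - > innie.bam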
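
As a quick sanity check of the flag arithmetic, the same rule can be applied to plain integers in the shell. This is only a sketch: `flip_flag` is a hypothetical helper, not part of samtools, and it assumes, as the pipeline above does, that first-in-pair reads are R3 and second-in-pair reads are F3.

```shell
# Hypothetical helper: print the new FLAG value for one read,
# applying the same rule as the gawk pipeline.
flip_flag() {
    flag=$1
    if [ $(( flag & 0x40 )) -ne 0 ]; then
        # first in pair (treated as R3): toggle "mate on reverse strand"
        echo $(( flag ^ 0x20 ))
    elif [ $(( flag & 0x80 )) -ne 0 ]; then
        # second in pair (treated as F3): toggle "read on reverse strand"
        echo $(( flag ^ 0x10 ))
    else
        # unpaired or unflagged reads are left alone
        echo "$flag"
    fi
}

flip_flag 65     # paired + first in pair, both forward   -> prints 97
flip_flag 129    # paired + second in pair, both forward  -> prints 145
```

Because the rule is an XOR, applying it twice restores the original value, so running the conversion a second time would turn the innies back into outies.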