(Sometimes) more is less

The general consensus is that when it comes to data, more is always better. You may run into issues with processing time or storage costs, but otherwise, if a little bit of something is good, more is better. After all, more consists of many versions of less, so you can always work with a subset of your data. However, sometimes it does turn out that more is less. Some time back I came across an interesting paper, ‘When Less is More: “Slicing” Sequencing Data Improves Read Decoding Accuracy and De Novo Assembly Quality’, which reports on the de novo assembly of BAC clones. BAC clones are relatively short DNA fragments (100–150 kbp), and given their short size, sequencing depths in the range of 1,000x–10,000x are easy to achieve. The authors study how the assembly quality changes as the amount of sequencing data increases and find that when the depth of sequencing rises above a certain threshold, sequencing errors make both the problem of decoding reads to their assemblies and the problem of de novo assembly harder, and as a consequence the quality of the solution degrades with more and more data.

The reason here is that in the presence of noise, as the amount of data increases, it becomes increasingly hard to tell a novel sequence from a sequencing error. The solution they propose in this case is “divide and conquer”: slice the data into subsamples, decode each slice independently, then merge the results.
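To make the slicing step concrete, here is a minimal sketch of how one might split a FASTQ file into k slices. This is not the paper's procedure: the function name slice_fastq, the round-robin assignment of reads to slices, and the output file naming are all assumptions made for illustration, and the decode and merge steps are left out entirely.

```shell
# Hypothetical helper: split a FASTQ file into k slices, round-robin by read.
# Assumes plain 4-line FASTQ records; the slicing scheme is illustrative only.
# Usage: slice_fastq reads.fastq 4 slice   -> slice_0.fastq ... slice_3.fastq
slice_fastq() {
    in=$1; k=$2; prefix=$3
    awk -v k="$k" -v prefix="$prefix" '
        {
            # Record index (0-based) modulo k picks the slice for this read.
            slice = int((NR - 1) / 4) % k
            print > (prefix "_" slice ".fastq")
        }
    ' "$in"
}
```

Each slice then holds roughly 1/k of the original depth and could be processed on its own before the results are combined.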

In other situations, the choice of the wrong model can also lead to wrong conclusions. In tree inference, long branch attraction (LBA) is a form of systematic error whereby distantly related lineages are incorrectly inferred to be closely related. LBA arises when the amount of molecular change accumulated within a lineage is sufficient to make that lineage appear similar (and thus closely related) to another long-branched lineage, solely because both have undergone a large amount of change rather than because they are related by descent. Thus, when the true tree is of this type, the more data we collect, the more strongly the inferred tree converges on the wrong tree.

As a digression on more and less, back in 1978, a command called more was written by a University of California grad student named Daniel Halbert. It was a fairly basic pager, something that allowed you to view a file one screenful at a time. Very handy, except that if you wanted to scroll back, you could not. To get around this limitation, in 1984 another developer wrote a pager that allowed both forward and backward navigation through the file, among other improvements. This program was called less, and less could do a lot more than more.

On mosaics and chimeras

Whenever we sequence a genome, we assume that an individual possesses a single genome. While we are aware that mutations may happen, for example in cancer cells, the usual assumption is that all cells in the body contain more or less the same genome. In animals with multiple births, it is not uncommon to see chimeras produced by the merger of multiple fertilized eggs. This can be contrasted with mosaicism, which denotes the presence of two or more populations of cells with different genotypes in one individual who has developed from a single fertilized egg.

In most cases, mosaics or chimeras would not be detected unless some medical test happens to reveal them. There have been famous cases like that of Foekje Dillema, a female athlete who was later found to be an XX/XY chimera and stripped of her medals. In chimeric or mosaic individuals, different body cells may have different genomes. With the increase in genetic testing by parents, especially when one of their children has a genetic disorder, clinicians have to figure out when the disorder-associated mutation arose: did it spring up during the creation of the sperm or egg that contributed to the child’s genetic makeup, or did it come from the parents’ own genetic makeup?

A recent paper in The American Journal of Human Genetics shows that mosaicism may be a lot more common than previously thought. From the paper’s abstract:

However, increasing sensitivity of genomic technologies has anecdotally revealed mosaicism for mutations in somatic tissues of apparently healthy parents. Such somatically mosaic parents might also have germline mosaicism that can potentially cause unexpected intergenerational recurrences. Here, we show that somatic mosaicism for transmitted mutations among parents of children with simplex genetic disease is more common than currently appreciated.

These results indicate that many of the widely used tests for identifying CNVs either fail to detect many kinds of genetic alterations or lack the precision to distinguish mosaicism from completely constitutional alterations. They also suggest that the higher genome resolution obtained from high-throughput sequencing might allow rearrangement-specific LR-PCR to become an inexpensive yet sensitive test for CNV mosaicism. In addition, there is a need for more sensitive and specific tests for identifying disorders arising from low-level mosaicism.

Converting outies to innies

Some sequencers (notably SOLiD), when doing mate pair sequencing, provide reads in the R3/F3 format, where both reads point in the forward direction. Some tools, e.g. scaffolders, insist on reads that point inward. Thus, one may want to convert reads from

------>R3       ------>F3

to

------>R3       <------F3

Now, one option would be to flip the reads around before aligning them. However, if the reads are already aligned, this is not necessary: we can flip the reads on the stream.

We don’t need to flip the SEQ and QUAL fields, since the SAM format always stores them as they appear on the forward reference strand, whatever the flag says. All that we need to do is identify the F3 reads and toggle their 0x10 flag, which indicates that SEQ is reverse complemented. That takes care of the F3 reads. On the R3 reads, we need to toggle the 0x20 flag, which records the strand of the mate, so that each R3 record stays consistent with its flipped F3 mate. And we are done.

Here is a small code snippet that does the flipping on the fly and produces a new BAM file with the reads pointing in the right direction.

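samtools view -h outie.bam | \
gawk 'BEGIN { FS = OFS = "\t" }                  # SAM is tab-separated; keep tags with spaces intact
/^@/ { print; next }                             # header lines pass through unchanged
{
    if (and($2, 0x40))      $2 = xor($2, 0x20)   # first in pair (R3 here): toggle mate-reverse (0x20)
    else if (and($2, 0x80)) $2 = xor($2, 0x10)   # second in pair (F3 here): toggle read-reverse (0x10)
    print
}' | \
samtools view -bS - > innie.bam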
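
To see what the flag arithmetic does to a single read pair, here is a small sketch in plain shell that applies the same two rules to a bare FLAG value. The function name flip_flag is made up for this illustration, and it assumes, as the snippet above does, that R3 is the first read in the pair (0x40) and F3 the second (0x80).

```shell
# Hypothetical helper mirroring the gawk rules on a single FLAG value.
# Assumption: R3 is first in pair (0x40), F3 is second in pair (0x80).
flip_flag() {
    flag=$1
    if [ $(( flag & 0x40 )) -ne 0 ]; then
        # First in pair: toggle the "mate on reverse strand" bit.
        echo $(( flag ^ 0x20 ))
    elif [ $(( flag & 0x80 )) -ne 0 ]; then
        # Second in pair: toggle the "read on reverse strand" bit.
        echo $(( flag ^ 0x10 ))
    else
        # Unpaired or unlabelled read: leave the flag alone.
        echo "$flag"
    fi
}

flip_flag 65    # paired + first in pair, both forward: prints 97
flip_flag 129   # paired + second in pair, forward:     prints 145
```

Because the flags are toggled with XOR rather than set, running the conversion twice brings every flag back to its original value.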