Converting outies to innies

Some sequencers (notably SOLiD) when doing mate pair sequencing provide reads in the R3/F3 format, where both reads are pointing in the forward direction. Some tools, e.g. scaffolders, insist on reads that point inward. Thus, one may want to convert reads from

------>R3       ------>F3

to

------>R3       <------F3

Now, one option would be to flip the reads around before aligning them, however, if the reads are already aligned this is not necessary. We can flip the reads on the stream.

We don’t need to flip the SEQ and QUAL fields since they are always in the 5′ -> 3′ direction. All that we need to do is identify the F3 reads and change the 0x10 flag which indicates SEQ being reverse complemented. This takes care of the F3 reads, on the R3 reads, we need to change the flags of the R3 read to add the 0x20 flag (in addition to changing the F3 flag). And we are done.

Here is a small code snippet that does the flipping on the fly and produces a new bam with the reads pointed in the right direction.

samtools view -h outie.bam | \
gawk 'BEGIN{OFS="\t"}{if ($1~/^@/) {print $0; next;} \
else if (and($2, 0x40)){$2=xor($2, 0x20)} \
else if (and($2, 0x80)){$2=xor($2, 0x10)} print $0}'| \
samtools view -bS - > innie.bam

On the obsession with Impact Factors

There is another paper out on Causes for the Persistence of Impact Factor Mania. From the abstract:

Numerous essays have addressed the misuse of the journal impact factor for judging the value of science, but the practice continues, primarily as a result of the actions of scientists themselves. This seemingly irrational behavior is referred to as “impact factor mania”.

Goodhart’s law: When a measure becomes a target, it ceases to be a good measure.

First Law of Metridynamics: The observed metric will improve.

Even a cursory glance indicates that the impact factor mania is perfectly rational. Given the increasing distance between administrators who decide on grants and promotions and the science they are trying to judge a single number is easy to judge on. Deciding on the actual impact of the science is difficult, time consuming, needs deep knowledge and you are still likely to go wrong. On the other hand, some function of the impact factor, is a single easy to judge number. Even if you know that the number is not particularly accurate, you can still justify an action later on based on a third party number.

Finding variation in a multiple sequence alignment

A friend of mine had a number of sequences that had been multiply aligned and wanted to find variations within those sequences. This should be a somewhat common task for biologists but I was unable to find an already existing tool that does this. So I thought of writing a small tool which did this with Python. Python can be downloaded from http://python.org and Biopython which helps with parsing FASTA files from biopython.org

I will write it with a number of comments so it is easier for non-programmers to follow. The first thing we need to do is, load the relevant libraries and load the FASTA sequences.

import sys
from Bio import SeqIO

We create an empty dictionary to hold the sequences with the sequence ID as key and sequence as value.

seqs={} #creating an empty dictionary for holding sequences

for seq_record in SeqIO.parse(sys.argv[1], 'fasta'):
    seqs[seq_record.id]=seq_record.seq

The for loop iterates over the entire file while the second line adds sequences to the dictionary with the id as key and sequence itself as the value. If the FASTA file contains multiple sequences with same ID, the last one is the only one that will be recorded.

seqlen=len(seq_record.seq) # extract length from last sequence

#variation finding
for i in range(seqlen):
    chars=[]
    for seq in seqs:
        chars.append(seqs[seq][i])
    charset=set(chars)
    if len(charset)==1: continue
    print("Column "+str(i)+" contains "+" ".join(charset))

In the above snippet, the first line gets the length of the sequence from the last sequence (since it is a multiple alignment, we assume all lengths are the same).

The loop from line 3-10 iterates over the entire length of the sequence, with the inner loop from line 6-7 iterating over all the sequences. The list chars holds all the characters in a particular column. We use the set operation to identify columns that are variant. Set outputs only unique members of a list. If the number of members of a set is 1, it means the comlumn is invariant and we can skip that, which is done by line 9. If not, line 10 outputs the different characters in a particular column.
We can also make a particular sequence as the reference and output variations with respect to that.

A run on a small file gives me the following output

farhat@palantir:~/$ python msa_var.py test.fa
Column 2 contains A T
Column 3 contains C T G
Column 5 contains C G
Column 6 contains T G
Column 7 contains A C T
Column 8 contains A G