The dose makes the poison…

…and this goes for things that aren’t measured out in doses too. In this post, I will look at some studies regarding stress. Stress is something we usually seek to avoid, and that seems completely sensible. However, studies show that stress may have some positive sides too. Evolution has fine-tuned us to live within a certain range of various parameters, and it isn’t good when we move out of that range, even if we move out on the arguably ‘good’ side.

We are constantly bombarded with studies showing how stress can do everything from reducing immunity to shortening lives, among other deleterious effects. However, a recent study in Nature Communications, “Early life stress in fathers improves behavioral flexibility in their offspring”, points out that stress can also be a positive and, interestingly, that the effects can be seen across generations. The authors find that the pups of stressed male mice were more behaviorally flexible.

To create stress, the authors subjected mouse pups to unpredictable maternal separation combined with unpredictable maternal stress (MSUS) for two weeks. MSUS entails taking the pups’ mothers away at unpredictable intervals and subjecting the mothers to stressful situations, such as being placed in cramped tubes or in cups of cold water. The researchers then assessed behavioral flexibility in the pups by making them complete tasks that required following rapidly changing rules to get water and food. They found that mice that had been stressed early in life outperformed controls. When the researchers bred males subjected to MSUS with wild-type females, the resulting offspring similarly excelled at behavioral flexibility. There are also studies showing that acute stress can increase nerve growth, which may be related to why trauma can accelerate learning.

Similar situations can be seen with ultra-clean water. Ultra-clean water from which the minerals have been stripped is less healthy than regular water, and hard water has been associated with better cardiovascular health. The hygiene hypothesis posits that super-clean environments have contributed to an increase in allergies: the immune system, having nothing better to do, overreacts to the presence of ordinary things like dust or dander.


Most frequently used commands on the console

For a bioinformatician, nothing matches the speed and efficacy of the command line once you are used to it. The point-and-click interfaces of many tools simply don’t offer the power and flexibility of command lines and shell scripts. More importantly, scripts can be combined with other scripts, put under version control, and adapted to work with other systems.

As Matt Might says, “The continued dominance of the command line among experts is a testament to the power of linguistic abstraction: when it comes to computing, a word is worth a thousand pictures.”

So I tried to see which commands I use most often on the command line. Here is a first pass:

farhat@heracles:~$ cut -f1 -d' ' .bash_history |sort |uniq -c|sort -n|tail -20
13 mv
13 paste
14 bg
15 history|grep
17 wc
18 ~/software/bwa-0.7.5a/bwa
20 samtools
23 bedtools
28 less
32 scp
38 vi
42 rm
49 head
49 sudo
54 tail
66 screen
106 top
112 cat
397 cd
587 ls


Not surprisingly, ls and cd are there a lot. So is screen, which probably means a number of commands are missing from this list. samtools comes up less frequently than bedtools, and that may be because samtools is often part of pipelines, which makes it appear less frequently than it is actually used. So let’s correct for that:

farhat@heracles:~$ sed 's/|/\n/g' .bash_history| sed 's/^ //' | cut -f1 -d' ' |sort |uniq -c|sort -n|tail -20
14 bg
18 ~/software/bwa-0.7.5a/bwa
23 bedtools
23 history
32 scp
38 vi
39 samtools
41 wc
42 rm
49 sudo
66 screen
67 gawk
67 tail
79 grep
84 less
106 top
111 head
115 cat
397 cd
587 ls

With that change we see that samtools pulls well ahead of bedtools. This also indicates which commands are ripe for optimization: even a few keystrokes saved on the most frequently used commands can add up to a fair bit of time.
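The same counting logic can be sketched in Python (a minimal example; the sample history lines are made up for illustration, and the real input would be the lines of .bash_history):

```python
from collections import Counter

def count_commands(lines):
    """Count the first token of every pipeline stage,
    mirroring the sed 's/|/\n/g' | cut -f1 -d' ' pipeline above."""
    counts = Counter()
    for line in lines:
        # split pipelines into stages so piped commands are counted too
        for stage in line.split('|'):
            tokens = stage.strip().split()
            if tokens:
                counts[tokens[0]] += 1
    return counts

history = ["ls -l", "samtools view a.bam | head", "cd /tmp", "ls"]
print(count_commands(history).most_common(3))
```

Reading the real file would just be a matter of passing `open('.bash_history')` to the function.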

Converting outies to innies

Some sequencers (notably SOLiD), when doing mate pair sequencing, provide reads in the R3/F3 format, where both reads point in the forward direction. Some tools, e.g. scaffolders, insist on reads that point inward. Thus, one may want to convert reads from

------>R3       ------>F3

to

------>R3       <------F3

Now, one option would be to flip the reads around before aligning them; however, if the reads are already aligned this is not necessary. We can flip the reads in the stream.

We don’t need to flip the SEQ and QUAL fields, since they are always stored in the 5′ -> 3′ direction. All we need to do is flip the right flag bits: on the F3 reads, we flip the 0x10 flag, which indicates that SEQ has been reverse complemented; on the R3 reads, we flip the 0x20 flag, which indicates that the mate has been reverse complemented (since the R3 read’s mate is the F3 read we just reoriented). And we are done.

Here is a small code snippet that does the flipping on the fly and produces a new bam with the reads pointed in the right direction.

samtools view -h outie.bam | \
gawk 'BEGIN{OFS="\t"}{if ($1~/^@/) {print $0; next;} \
else if (and($2, 0x40)){$2=xor($2, 0x20)} \
else if (and($2, 0x80)){$2=xor($2, 0x10)} print $0}'| \
samtools view -bS - > innie.bam
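The flag arithmetic in the gawk one-liner can also be sketched in Python (a minimal illustration of the bit flipping only, not a full SAM parser; the example flag values are made up):

```python
def flip_flag(flag):
    """Flip orientation flags for an R3/F3 ('outie') pair:
    first-in-pair reads (0x40) get the mate-reverse bit (0x20) toggled,
    second-in-pair reads (0x80) get their own reverse bit (0x10) toggled."""
    if flag & 0x40:
        return flag ^ 0x20
    if flag & 0x80:
        return flag ^ 0x10
    return flag

# a paired, first-in-pair read (0x1 | 0x40) gains the mate-reverse bit
print(hex(flip_flag(0x41)))  # 0x61
```

Because the flip is an XOR, running the same transformation twice returns the original flags, which is a handy sanity check.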

On the obsession with Impact Factors

There is another paper out on Causes for the Persistence of Impact Factor Mania. From the abstract:

Numerous essays have addressed the misuse of the journal impact factor for judging the value of science, but the practice continues, primarily as a result of the actions of scientists themselves. This seemingly irrational behavior is referred to as “impact factor mania”.

Goodhart’s law: When a measure becomes a target, it ceases to be a good measure.

First Law of Metridynamics: The observed metric will improve.

Even a cursory glance indicates that impact factor mania is perfectly rational. Given the increasing distance between the administrators who decide on grants and promotions and the science they are trying to judge, a single number is easy to judge on. Deciding on the actual impact of a piece of science is difficult, time consuming, and needs deep knowledge, and you are still likely to go wrong. On the other hand, some function of the impact factor is a single, easy-to-judge number. Even if you know that the number is not particularly accurate, you can still justify an action later on by pointing to a third-party number.

Finding variation in a multiple sequence alignment

A friend of mine had a number of sequences that had been multiply aligned and wanted to find the variations within those sequences. This should be a fairly common task for biologists, but I was unable to find an existing tool that does it, so I thought I would write a small one in Python, using Biopython to parse the FASTA files.

I will write it with a number of comments so it is easier for non-programmers to follow. The first thing we need to do is load the relevant libraries and the FASTA sequences.

import sys
from Bio import SeqIO

We create an empty dictionary to hold the sequences, with the sequence ID as key and the sequence as value.

seqs={} #creating an empty dictionary for holding sequences

for seq_record in SeqIO.parse(sys.argv[1], 'fasta'):
    seqs[seq_record.id]=str(seq_record.seq)

The for loop iterates over the entire file, and the line inside it adds each sequence to the dictionary with the ID as key and the sequence itself as the value. If the FASTA file contains multiple sequences with the same ID, only the last one will be recorded.

seqlen=len(seq_record.seq) # extract length from last sequence

#variation finding
for i in range(seqlen):
    chars=[]
    for seq in seqs:
        chars.append(seqs[seq][i])
    charset=set(chars)
    if len(charset)==1: continue
    print("Column "+str(i)+" contains "+" ".join(charset))

In the above snippet, we first get the alignment length from the last sequence (since it is a multiple alignment, we assume all sequences have the same length).

The outer loop then iterates over every column of the alignment, with the inner loop iterating over all the sequences. The list chars holds all the characters in a particular column. We then use a set to identify variant columns: a set keeps only the unique members of a list, so if it has exactly one member the column is invariant and we skip it. Otherwise, the print statement outputs the different characters found in that column.
We could also designate a particular sequence as the reference and output variations with respect to that.
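That reference-based variant could be sketched like this (a minimal example; the sequence dictionary and the "ref" ID are made up for illustration):

```python
def variants_vs_reference(seqs, ref_id):
    """Report columns where any sequence differs from the chosen reference.

    seqs is a dict of {sequence ID: aligned sequence string}, as built above.
    Returns a list of (column index, reference character, sorted alternates).
    """
    ref = seqs[ref_id]
    rows = []
    for i, ref_char in enumerate(ref):
        # characters in this column from every non-reference sequence
        alts = {s[i] for name, s in seqs.items() if name != ref_id} - {ref_char}
        if alts:
            rows.append((i, ref_char, sorted(alts)))
    return rows

seqs = {"ref": "ACGT", "s1": "ACTT", "s2": "AGGT"}
for col, ref_char, alts in variants_vs_reference(seqs, "ref"):
    print("Column", col, "reference", ref_char, "variants", " ".join(alts))
```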

A run on a small file (with the script saved as, say, variation.py) gives me the following output

farhat@palantir:~/$ python variation.py test.fa
Column 2 contains A T
Column 3 contains C T G
Column 5 contains C G
Column 6 contains T G
Column 7 contains A C T
Column 8 contains A G