Finding variation in a multiple sequence alignment

A friend of mine had a number of sequences that had been multiply aligned and wanted to find variations within those sequences. This should be a somewhat common task for biologists but I was unable to find an already existing tool that does this. So I thought of writing a small tool which did this with Python. Python can be downloaded from http://python.org and Biopython which helps with parsing FASTA files from biopython.org

I will write it with a number of comments so it is easier for non-programmers to follow. The first thing we need to do is, load the relevant libraries and load the FASTA sequences.

import sys
from Bio import SeqIO

We create an empty dictionary to hold the sequences with the sequence ID as key and sequence as value.

seqs={} #creating an empty dictionary for holding sequences

for seq_record in SeqIO.parse(sys.argv[1], 'fasta'):
    seqs[seq_record.id]=seq_record.seq

The for loop iterates over the entire file while the second line adds sequences to the dictionary with the id as key and sequence itself as the value. If the FASTA file contains multiple sequences with same ID, the last one is the only one that will be recorded.

seqlen=len(seq_record.seq) # extract length from last sequence

#variation finding
for i in range(seqlen):
    chars=[]
    for seq in seqs:
        chars.append(seqs[seq][i])
    charset=set(chars)
    if len(charset)==1: continue
    print("Column "+str(i)+" contains "+" ".join(charset))

In the above snippet, the first line gets the length of the sequence from the last sequence (since it is a multiple alignment, we assume all lengths are the same).

The loop from line 3-10 iterates over the entire length of the sequence, with the inner loop from line 6-7 iterating over all the sequences. The list chars holds all the characters in a particular column. We use the set operation to identify columns that are variant. Set outputs only unique members of a list. If the number of members of a set is 1, it means the comlumn is invariant and we can skip that, which is done by line 9. If not, line 10 outputs the different characters in a particular column.
We can also make a particular sequence as the reference and output variations with respect to that.

A run on a small file gives me the following output

farhat@palantir:~/$ python msa_var.py test.fa
Column 2 contains A T
Column 3 contains C T G
Column 5 contains C G
Column 6 contains T G
Column 7 contains A C T
Column 8 contains A G
Advertisements

Author: Farhat

I am a physicist turned bioinformatics researcher turned Data Scientist.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s