'ungapped index for biopython alignments
My first time using biopython. Forgive me if this is a basic question.
I would like to input sequences, then align them, and be able to refer to the index position of the original sequence (ungapped) and the aligned sequence (gapped).
My real world example is enolase (Uniprot P37869 and P0A6P9). The substrate binding lysine is index 392 in E. coli and 389 in B. subtilis. If one does a muscle alignment of the two, the index of this lysine in the alignment is 394. I want to be able to convert easily between gappend and ungapped indicies.
Example 1: What is the aligned index of E. coli residue #392? (answer 394 in the aligned sequence).
Example 2 I found a conserved residue in the alignment at 394. where is that in the original (ungapped) sequence? (answer 392 in E. coli and 389 in B. subtilis).
>>>cline = MuscleCommandline(input="enolase.txt", out="enolase.aln")
>>>cline()
>>> alignment = AlignIO.read("enolase.aln","fasta")
>>> print(alignment[:,390:])
SingleLetterAlphabet() alignment with 2 rows and 45 columns
AGQIKTGAPSRTDRVAKYNQLLRIEDQLAETAQYHGINSFYNLNK sp|P37869|ENO_BACSU
AGQIKTGSMSRSDRVAKYNQLIRIEEALGEKAPYNGRKEIKGQA- sp|P0A6P9|ENO_ECOLI
>>> print(alignment[:,394])
KK
Solution 1:[1]
Fun question! And as far as I know not something built-in into BioPython
. Here is how I would solve it.
Let's start with your example 2. If you take your two files enolase.txt
and enolase.aln
with respectively your original ungapped sequences and aligned sequences in FASTA format, then we can loop over the zipped records, count the number of gaps in your aligned sequence and calculate the index of the residue in the ungapped sequence:
from Bio import SeqIO, AlignIO
def find_in_original(index, original_path, alignment_path):
alignment = AlignIO.read(alignment_path, 'fasta')
original = SeqIO.parse(original_path, 'fasta')
for original_record, alignment_record in zip(original, alignment):
alignment_seq = str(alignment_record.seq)
original_seq = str(original_record.seq)
gaps = alignment_seq[:index].count('-')
original_index = len(alignment_seq[:index]) - gaps
assert alignment_seq[index] == original_seq[original_index]
yield ("The conserved residue {} at location {} in the alignment can be"
" found at location {} in {}.".format(alignment_seq[index],
index, original_index, original_record.id.split('|')[-1]))
This gives as result:
>>> for result in find_in_original(394, 'enolase.txt', 'enolase.aln'):
... print result
The conserved residue K at location 394 in the alignment can be found at location 392 in ENO_ECOLI.
The conserved residue K at location 394 in the alignment can be found at location 389 in ENO_BACSU.
For the reverse operation, we look at all the possible indexes in the alignment and see which one equals the ungapped sequence if we subtract the gaps:
def find_in_alignment(index, organism, original_path, alignment_path):
alignment = AlignIO.read(alignment_path, 'fasta')
original = SeqIO.parse(original_path, 'fasta')
for original_record, alignment_record in zip(original, alignment):
if organism in original_record.id:
alignment_seq = str(alignment_record.seq)
original_seq = str(original_record.seq)
residue = original_seq[index]
for i in range(index, len(alignment_seq)):
if alignment_seq[i] == residue and \
alignment_seq[:i].replace('-', '') == original_seq[:index]:
return ("The residue {} at location {} in {} is at location"
" {} in the alignment.".format(residue, index,
organism, i))
This gives as result:
>>> print find_in_alignment(392, 'ENO_ECOLI', 'enolase.txt', 'enolase.aln')
The residue K at location 392 in ENO_ECOLI is at location 394 in the alignment.
>>> print find_in_alignment(389, 'ENO_BACSU', ungapped_path, alignment_path)
The residue K at location 389 in ENO_BACSU is at location 394 in the alignment.
Solution 2:[2]
def original_index_to_aln_index(aln_seq, index):
''' given aln seq and original 0-based index, return new index'''
pos = index
n_char_before = pos - aln_seq[:pos].count('-') # how many character before position
while n_char_before < index: # while it has not reach the number of char before
missed_char = index - n_char_before
pos = pos + missed_char# if it has not reach, shift at least lacked char
n_char_before = pos - aln_seq[:pos].count('-')
while aln_seq[pos] == '-': # if the last character is gap, shift 1 until meet char
pos += 1
return pos
```
Solution 3:[3]
Instead of looping through, you could generate the indices simply by counting the gaps via logical indexing. Here the gaps are denoted by '-'.
>>> x = np.asarray(list('AGQ--IKTGA-LAETAQYHG-INS-FYNL-NK')) # an aligned sequence
>>> x
array(['A', 'G', 'Q', '-', '-', 'I', 'K', 'T', 'G', 'A', '-', 'L', 'A',
'E', 'T', 'A', 'Q', 'Y', 'H', 'G', '-', 'I', 'N', 'S', '-', 'F',
'Y', 'N', 'L', '-', 'N', 'K'], dtype='<U1')
Indices post alignment are simply 0 to the length of the aligned sequence:
>>> idx_post = np.arange(x.size)
>>> idx_post
array([ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31])
For pre-alignment, we need to skip over the gaps. Here I generate a nan matrix of the same size as the aligned sequence. I then replace the value where the original sequence =/= the gap character with it's index.
>>> idx_pre = np.nan * np.ones(x.size)
>>> idx_pre
array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,
nan, nan, nan, nan, nan, nan])
>>> idx_pre[x != '-'] = np.arange(np.sum(x != '-'))
>>> idx_pre
array([ 0., 1., 2., nan, nan, 3., 4., 5., 6., 7., nan, 8., 9.,
10., 11., 12., 13., 14., 15., 16., nan, 17., 18., 19., nan, 20.,
21., 22., 23., nan, 24., 25.])
This then leaves you with two index matrices of the same size as your aligned matrix, allowing you to easily find the previous index prior to alignment.
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
Solution | Source |
---|---|
Solution 1 | |
Solution 2 | Hsuan-lin Her |
Solution 3 | Yousuf |