BioRpy: April 2015

Thursday, April 23, 2015

py41. Linear Rotation

We can use the Rotation matrix to rotate any line.

Here a line (y=2x+1) is rotated by 30 deg counter clockwise.

Since we don't specify the number of points in linspace, it uses the default num of 50.

# ex41.py
from __future__ import division, print_function
import numpy as np
from numpy import sin, cos, pi
import matplotlib.pyplot as plt
print('Original line: y=5x+1')
print('Rotation matrix: 30 deg ccw (+pi/6)')
ang = pi/6
R = np.array([[cos(ang),-sin(ang)],[sin(ang),cos(ang)]])
print('R (pi/6) =\n',R)
x = np.linspace(-2,2)
xyR = [R.dot([xv,5*xv+1]) for xv in x]
xR = [xv[0] for xv in xyR]
yR = [xv[1] for xv in xyR]
plt.plot(x,5*x+1,'r')
plt.plot(xR,yR,'b')
plt.title("x,y - Red, x',y' - Blue for R(pi/6)")
plt.xlabel("x,x'")
plt.ylabel("y,y'")
plt.xlim((-2,2))
plt.ylim((-2,2))
plt.show()

#    Original line: y=5x+1
#    Rotation matrix: 30 deg ccw (+pi/6)
#    R (pi/6) =
#     [[ 0.8660254 -0.5      ]
#     [ 0.5        0.8660254]]

Output:

py40. Matrix operations in Python

We can use numpy.matrix to create matrices, rather than numpy.array. With numpy.array, the * operation is element-wise multiplication. However, with numpy.matrix, the * operation is real matrix multiplication. With numpy.array, we can always do real matrix multiplication with numpy.dot function.

The function numpy.trace finds the trace of a matrix, which is the sum of the diagonal.

# ex40.py
from __future__ import division, print_function
import numpy as np
A = np.matrix([[1,5,1],[2,-1,6],[1,0,3]])
print('A = \n',A)
B = np.matrix([[2,3,0],[3,-1,7],[4,8,9]])
print('B = \n',B)
print('5*A-10*B+3*A*B =\n',5*A-10*B+3*A*B)
print('trace(A*B)=',np.trace(A*B))
print('trace(B*A)=',np.trace(B*A))

#    A = 
#     [[ 1  5  1]
#     [ 2 -1  6]
#     [ 1  0  3]]
#    B = 
#     [[ 2  3  0]
#     [ 3 -1  7]
#     [ 4  8  9]]
#    5*A-10*B+3*A*B =
#     [[ 48  13 137]
#     [ 55 170 101]
#     [  7   1   6]]
#    trace(A*B)= 103
#    trace(B*A)= 103

py39. Matrix inversion in Python

The function numpy.linalg.inv can be used to find the inverse of matrix.

We find x in the linear equation (Ax=b) using inversion x=inv(A)*b and also with numpy.linalg.solve.

# ex39.py
from __future__ import print_function, division
import numpy as np
A = np.array([[1, 1, 1],[1, -1, 1], [3, 5, -1] ])
print('A =\n',A)
print('shape of A =',A.shape)
inverse = np.linalg.inv(A)
print('inverse of A\n', inverse)
b = np.array([1,0,2])
print('b =\n',b)
print('For Ax=b')
x = np.linalg.solve(A,b)
print('x = ',x)
print('inv(A)*b =',inverse.dot(b))

#A =
# [[ 1  1  1]
# [ 1 -1  1]
# [ 3  5 -1]]
#shape of A = (3, 3)
#inverse of A
# [[-0.5   0.75  0.25]
# [ 0.5  -0.5  -0.  ]
# [ 1.   -0.25 -0.25]]
#b =
# [1 0 2]
#For Ax=b
#x =  [ 0.   0.5  0.5]
#inv(A)*b = [ 0.   0.5  0.5]

Tuesday, April 21, 2015

py38. Numpy arrays

Numpy arrays contain data of one kind, specified by the dtype attribute, which can be changed with astype(), with the proper string identifier.

We can select values by slicing, giving a list of indices, or a logical expression.

The hstack and vstack functions can join arrays horizontally or vertically. The T attribute can be used to find the transpose of a numpy array. A 2D array will correspond to a matrix, however we also have a matrix class in numpy.

# ex38.py
from __future__ import print_function, division
import numpy as np
A = np.arange(5)
print('A =',A)
print('A.dtype =',A.dtype)
B = A.astype('float')
print('B =',B)
print('B.dtype =',B.dtype)
print('A[2:4] =',A[2:4])
print('A[[2,3]] =',A[[2,3]])
print('A[A==2 | A==3] =',A[(A==2) | (A==3)])
M1 = np.hstack((A,A,A))
print('M1 =',M1)
M2 = np.vstack((A,A,A))
print('M2 =',M2)
M3 = M2.T
print('M3 (transpose M2) =',M3)

#A = [0 1 2 3 4]
#A.dtype = int32
#B = [ 0.  1.  2.  3.  4.]
#B.dtype = float64
#A[2:4] = [2 3]
#A[[2,3]] = [2 3]
#A[A==2 | A==3] = [2 3]
#M1 = [0 1 2 3 4 0 1 2 3 4 0 1 2 3 4]
#M2 = [[0 1 2 3 4]
# [0 1 2 3 4]
# [0 1 2 3 4]]
#M3 (transpose M2) = [[0 0 0]
# [1 1 1]
# [2 2 2]
# [3 3 3]
# [4 4 4]]

Sunday, April 19, 2015

aud1. Karplus–Strong string synthesis in Python

String and other sounds can be creating by repeating a random signal while filtering out the high frequency components.

In the note function, we take average of two nearby samples, which is a simple low-pass filter.

Below we have 2 programs. The only function of the first program is to define a dictionary Hz. You can see Dictionary Example. Here I go over the frequency dictionary in more detail.

In the second program, which imports the dictionary, we iterate over the elements of notes list. The duration is the fraction of a second for that particular note. Thus the first two notes are C4 played for 0.25 seconds each.

# Hz.py
Notes = ['C','C#','D','D#','E','F','F#','G','G#','A','A#','B']
Hz ={}
for i in range(88):
    Octave = (i+9)/12
    Pos = (i+9)%12
    S = Notes[Pos]+str(Octave)
    Hz[S] = 27.5*(2**(i/12.0))

# audio1.py

from __future__ import division
import numpy as np
from scipy.io import wavfile
from Hz import Hz
 
SR = 44100
notes = [('C4', 4), ('C4', 4), ('C4', 3), ('D4', 6),
         ('E4', 4), ('E4', 3), ('D4', 6), ('E4', 3),
         ('F4', 6), ('G4', 2), ('C5', 6), ('C5', 6),
         ('C5', 6), ('G4', 6), ('G4', 6), ('G4', 6),
         ('E4', 6), ('E4', 6), ('E4', 6), ('C4', 6),
         ('C4', 6), ('C4', 6), ('G4', 3), ('F4', 6),
         ('E4', 3), ('D4', 6), ('C4', 2)]

max_pos=32767
 
def note(f,num):
    buf=np.random.rand(SR//f)-0.5
    samples=[]
    for i in range(num):
        samples.append(buf[0])
        avg=0.5*(buf[0]+buf[1])
        buf = np.append(buf[1:],avg)
    return np.array([x*max_pos for x in samples])
 
if __name__ == '__main__':
    fname='row.wav'
    out= np.zeros(10)
    for i in range(len(notes)):
        buff = note(Hz[notes[i][0]],SR//notes[i][1])
        out = np.concatenate((out,buff))
    fade_dist = 5000
    x = np.arange(fade_dist-1,-1,-1)
    fade_out = (1-np.exp(-x/fade_dist))/(1-np.exp(-1))
    out[-fade_dist:] *= fade_out 
    wavfile.write(fname,SR,out.astype('int16'))

Saturday, April 11, 2015

bpy27. Counting Letters of a biological sequence

We use the createRecords function, that we had used in the last example, by using an appropriate import.

We also define a new function, count, which will count unique letters and return their number of occurrences. This function, in turn, can be used by other modules if they import it.

In the function count, we try to add 1 to number already there, for the appropriate dictionary entry. Should that entry not exist, it is created and set to 1.

# bpy27.py
from __future__ import print_function, division
from bpy26 import createRecords

def count(seq):
    count = {}
    for i in seq:
        try: count[i] = count[i] + 1
        except: count[i] = 1
    return count
    
if __name__ == '__main__':
    records = createRecords('data')
    for record in records:
        print('Entry Name:',record.entry_name)
        print('Sequence counts:')
        print(count(record.sequence))

#Entry Name: IGF1R_HUMAN
#Sequence counts:
#{'A': 72, 'C': 44, 'E': 113, 'D': 62, 'G': 91, 'F': 47,
# 'I': 73, 'H': 23, 'K': 69, 'M': 39, 'L': 116, 'N': 85,
# 'Q': 35, 'P': 81, 'S': 97, 'R': 80, 'T': 69, 'W': 21,
# 'V': 87, 'Y': 63}
#Entry Name: IGF1R_MOUSE
#Sequence counts:
#{'A': 68, 'C': 44, 'E': 114, 'D': 63, 'G': 91, 'F': 48,
# 'I': 73, 'H': 24, 'K': 67, 'M': 40, 'L': 115, 'N': 87,
# 'Q': 36, 'P': 83, 'S': 92, 'R': 82, 'T': 73, 'W': 22,
# 'V': 90, 'Y': 61}

Wednesday, April 8, 2015

bpy26. Parsing SwissProt files using Biopython

In the last example, we saved the insulin proteins to 'data' subfolder.

We have a function that takes one parameter, the subfolder where *.txt files are stored, each corresponding to a UniProtKB text record.

We could have given our files a different extension during saving, for example .dat or .swiss which would require that term in the list comprehension. We do not have to have a filter term if only UniProtKB records are in the subfolder; in this case we have fils = os.listdir(fol)

Several attributes are printed of the records.

# bpy26.py
from __future__ import print_function, division
import os 
from Bio import SwissProt

def createRecords(fol):
    records = []
    fils = [fil for fil in os.listdir(fol) if fil.endswith('.txt')]
    for fil in fils:
        handle = open(fol + '/' + fil)
        record = SwissProt.read(handle)
        records.append(record)
        handle.close()
    return records

if __name__ == '__main__':
    records = createRecords('data')
    for record in records:
        print('Entry Name:',record.entry_name)
        print('Organism:',record.organism)
        print('Length:',record.sequence_length)
        first_crossref = record.cross_references[0]
        print('First cross ref:')
        for i in first_crossref:
            print('\t',i)
        print()
    
#Entry Name: IGF1R_HUMAN
#Organism: Homo sapiens (Human).
#Length: 1367
#First cross ref:
#         EMBL
#         X04434
#         CAA28030.1
#         -
#         mRNA
#
#Entry Name: IGF1R_MOUSE
#Organism: Mus musculus (Mouse).
#Length: 1373
#First cross ref:
#         EMBL
#         AF056187
#         AAC12782.1
#         -
#         mRNA

Tuesday, April 7, 2015

bpy25. Saving Proteins from UniProtKB using Biopython

We save insulin proteins, from the protein database UniProtKB, for both human and mouse.

The accession strings are stored in the accessions list.

Later, we will parse these files. However, it is better to save the files, rather than always loading them. We save them to two separate files in the subfolder 'data'.

# bpy25.py
import os 
from Bio import ExPASy
accessions = ['P08069','Q60751']
records = []
for accession in accessions:
    handle = ExPASy.get_sprot_raw(accession)
    fpath = os.path.join('data',accession+'.txt')
    with open(fpath,'w') as fout:
        fout.write(handle.read())

Sunday, April 5, 2015

bpy24. The Smith-Waterman algorithm in Python

The Smith–Waterman algorithm is used for local alignment of two sequences. This algorithm is described at Wikipedia. We use the two sequences from that link.

The last example, bpy23.py, is modified. There is no initialization as the fill for first row and first column is zero, and np.zeros is used to construct the matrices. For the first step, we find the matrices H and T. In the Wikipedia page, you can see the arrows, for the pointer matrix T, which makes the traceback path clear. In the program the various directions are labelled as integers, such as 3 for diagonal.

Here the high score, 12, happens to be the last column of last row. In general, this will not be true and the highest scoring entry could be located anywhere in the matrix. From the highest entry, we proceed to 0, using the pointer information in T. At each step to 0, we record the letter of sequence 1 and sequence2. There might be cases it is null, where a gap is recorded for the particular entry. This is shown in the printouts, where sometimes the string '-' is added.

# bpy24.py

from __future__ import division, print_function
import numpy as np

pt ={'match': 2, 'mismatch': -1, 'gap': -1}

def mch(alpha, beta):
    if alpha == beta:
        return pt['match']
    elif alpha == '-' or beta == '-':
        return pt['gap']
    else:
        return pt['mismatch']

def water(s1, s2):
    m, n = len(s1), len(s2)
    H = np.zeros((m+1, n+1))    
    T = np.zeros((m+1, n+1))
    max_score = 0
    # Score, Pointer Matrix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sc_diag = H[i-1][j-1] + mch(s1[i-1], s2[j-1])
            sc_up = H[i][j-1] + pt['gap']
            sc_left = H[i-1][j] + pt['gap']
            H[i][j] = max(0,sc_left, sc_up, sc_diag)
            if H[i][j] == 0: T[i][j] = 0
            if H[i][j] == sc_left: T[i][j] = 1
            if H[i][j] == sc_up: T[i][j] = 2
            if H[i][j] == sc_diag: T[i][j] = 3
            if H[i][j] >= max_score:
                max_i = i
                max_j = j
                max_score = H[i][j];
    
    print('H=\n',H,'\n')
    print('T=\n',T,'\n')
    align1, align2 = '', ''
    i,j = max_i,max_j
    
    #Traceback
    while T[i][j] != 0:
        if T[i][j] == 3:
            a1 = s1[i-1]
            a2 = s2[j-1]
            i -= 1
            j -= 1
        elif T[i][j] == 2:
            a1 = '-'
            a2 = s2[j-1]
            j -= 1
        elif T[i][j] == 1:
            a1 = s1[i-1]
            a2 = '-'
            i -= 1
        print('%s ---> a1 = %s\t a2 = %s\n' % ('Add',a1,a2))
        align1 += a1
        align2 += a2

    align1 = align1[::-1]
    align2 = align2[::-1]
    sym = ''
    iden = 0
    for i in range(len(align1)):
        a1 = align1[i]
        a2 = align2[i]
        if a1 == a2:                
            sym += a1
            iden += 1
        elif a1 != a2 and a1 != '-' and a2 != '-': 
            sym += ' '
        elif a1 == '-' or a2 == '-':          
            sym += ' '
    
    identity = iden / len(align1) * 100
    print('Identity = %f percent' % identity)
    print('Score =', max_score)
    print(align1)
    print(sym)
    print(align2)

if __name__ == '__main__':
    water('AGCACACA','ACACACTA')

#H=
# [[  0.   0.   0.   0.   0.   0.   0.   0.   0.]
# [  0.   2.   1.   2.   1.   2.   1.   0.   2.]
# [  0.   1.   1.   1.   1.   1.   1.   0.   1.]
# [  0.   0.   3.   2.   3.   2.   3.   2.   1.]
# [  0.   2.   2.   5.   4.   5.   4.   3.   4.]
# [  0.   1.   4.   4.   7.   6.   7.   6.   5.]
# [  0.   2.   3.   6.   6.   9.   8.   7.   8.]
# [  0.   1.   4.   5.   8.   8.  11.  10.   9.]
# [  0.   2.   3.   6.   7.  10.  10.  10.  12.]] 
#
#T=
# [[ 0.  0.  0.  0.  0.  0.  0.  0.  0.]
# [ 0.  3.  2.  3.  2.  3.  2.  2.  3.]
# [ 0.  1.  3.  1.  3.  1.  3.  3.  1.]
# [ 0.  1.  3.  2.  3.  2.  3.  2.  2.]
# [ 0.  3.  1.  3.  2.  3.  2.  2.  3.]
# [ 0.  1.  3.  1.  3.  2.  3.  2.  2.]
# [ 0.  3.  1.  3.  1.  3.  2.  2.  3.]
# [ 0.  1.  3.  1.  3.  1.  3.  2.  2.]
# [ 0.  3.  1.  3.  1.  3.  1.  3.  3.]] 
#
#Add ---> a1 = A  a2 = A
#
#Add ---> a1 = -  a2 = T
#
#Add ---> a1 = C  a2 = C
#
#Add ---> a1 = A  a2 = A
#
#Add ---> a1 = C  a2 = C
#
#Add ---> a1 = A  a2 = A
#
#Add ---> a1 = C  a2 = C
#
#Add ---> a1 = G  a2 = -
#
#Add ---> a1 = A  a2 = A
#
#Identity = 77.777778 percent
#Score = 12.0
#AGCACAC-A
#A CACAC A
#A-CACACTA

Saturday, April 4, 2015

by23. Needleman–Wunsch algorithm in Python

The Needleman–Wunsch algorithm is used for global alignment of two sequences. The best global alignment can be scored, and we may find the number of identities..

In the pt dictionary, the three scores for match, mismatch and gap are given.

The first two steps in the algorithm are initialization and fill, which will create the score matrix.

The last column on the last row shows the final score, and we start traceback to first column in first row, and in so doing find the best scoring alignment.

# bpy23.py
from __future__ import print_function, division
import numpy as np

pt ={'match': 1, 'mismatch': -1, 'gap': -1}

def mch(alpha, beta):
    if alpha == beta:
        return pt['match']
    elif alpha == '-' or beta == '-':
        return pt['gap']
    else:
        return pt['mismatch']

def needle(s1, s2):
    m, n = len(s1), len(s2)
    score = np.zeros((m+1, n+1))
    
    #Initialization
    for i in range(m+1):
        score[i][0] = pt['gap'] * i
    for j in range(n+1):
        score[0][j] = pt['gap'] * j
    
    #Fill
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            diag = score[i-1][j-1] + mch(s1[i-1], s2[j-1])
            delete = score[i-1][j] + pt['gap']
            insert = score[i][j-1] + pt['gap']
            score[i][j] = max(diag, delete, insert)

    print('score matrix = \n%s\n' % score)
    align1, align2 = '', ''
    i,j = m,n
    
    #Traceback
    while i > 0 and j > 0:
        score_current = score[i][j]
        score_diag = score[i-1][j-1]
        score_left = score[i][j-1]
        score_up = score[i-1][j]
        
        print('score_current: ',score_current)
        print('score_diag: ',score_diag)
        print('score_left: ',score_left)
        print('score_up: ',score_up)

        if score_current == score_diag + mch(s1[i-1], s2[j-1]):
            print('diag')
            a1,a2 = s1[i-1],s2[j-1]
            i,j = i-1,j-1
        elif score_current == score_up + pt['gap']:
            print('up')
            a1,a2 = s1[i-1],'-'
            i -= 1
        elif score_current == score_left + pt['gap']:
            print('left')
            a1,a2 = '-',s2[j-1]
            j -= 1
        print('%s ---> a1 = %s\t a2 = %s\n' % ('Add',a1,a2))
        align1 += a1
        align2 += a2
            

    while i > 0:
        a1,a2 = s1[i-1],'-'
        print('%s ---> a1 = %s\t a2 = %s\n' % ('Add',a1,a2))
        align1 += a1
        align2 += a2
        i -= 1
        
    while j > 0:
        a1,a2 = '-',s2[j-1]
        print('%s --> a1 = %s\t a2 = %s\n' % ('Add',a1,a2))
        align1 += a1
        align2 += a2
        j -= 1
    
    align1 = align1[::-1]
    align2 = align2[::-1]
    seqN = len(align1)
    sym = ''
    seq_score = 0
    ident = 0
    for i in range(seqN):
        a1 = align1[i]
        a2 = align2[i]
        if a1 == a2:
            sym += a1
            ident += 1
            seq_score += mch(a1, a2)
    
        else: 
            seq_score += mch(a1, a2)
            sym += ' '
        
    ident = ident/seqN * 100
    
    print('Identity = %2.1f percent' % ident)
    print('Score = %d\n'% seq_score)
    print(align1)
    print(sym)
    print(align2)

if __name__ == '__main__':
    needle("TCGA","ACCGT")

#score matrix = 
#[[ 0. -1. -2. -3. -4. -5.]
# [-1. -1. -2. -3. -4. -3.]
# [-2. -2.  0. -1. -2. -3.]
# [-3. -3. -1. -1.  0. -1.]
# [-4. -2. -2. -2. -1. -1.]]
#
#score_current:  -1.0
#score_diag:  0.0
#score_left:  -1.0
#score_up:  -1.0
#diag
#Add ---> a1 = A  a2 = T
#
#score_current:  0.0
#score_diag:  -1.0
#score_left:  -1.0
#score_up:  -2.0
#diag
#Add ---> a1 = G  a2 = G
#
#score_current:  -1.0
#score_diag:  -2.0
#score_left:  0.0
#score_up:  -3.0
#diag
#Add ---> a1 = C  a2 = C
#
#score_current:  -2.0
#score_diag:  -1.0
#score_left:  -1.0
#score_up:  -2.0
#diag
#Add ---> a1 = T  a2 = C
#
#Add --> a1 = -   a2 = A
#
#Identity = 40.0 percent
#Score = -1
#
#-TCGA
#  CG 
#ACCGT

Thursday, April 2, 2015

bpy22. Using ECitMatch in Python

If we have these 5 items for an article: (1) Journal Name, (2) Year (3) Volume, (4) Page, (5) Author, we can find the pubmed id for the article using ecitmatch. This will rarely be needed as esearch will usually find the information. However parsing the html string from get requests is important in general.

We do not have a built-in Bio.Entrez utility to deal with ecitmatch yet. But esearch can also return pubmed IDs, and the search term may use limits by Author, by Journal, by Date, etc.

The tuples journal, year, vol, page, author contain strings for the first and second article. We replace spaces with +, add pipes between different items and join the two terms using '%0D'

# bpy22.py
from __future__ import print_function, division
import urllib2

ids = 'article1','article2'
journal = "Front Cell Neurosci","J Biomed Sci"
year = "2014","2013"
vol = "8","20"
page = "47","92"
author = "Iyengar BR", "Skipper KA"
base = ("http://eutils.ncbi.nlm.nih.gov/entrez/"
        "eutils/ecitmatch.cgi?db=pubmed&retmode=xml&bdata=")
s = [None]*len(ids)
for i in range(len(ids)):
    fmt = (journal[i].replace(' ','+'), year[i], vol[i],
           page[i], author[i].replace(' ','+'), ids[i])
    s[i] = '%s|%s|%s|%s|%s|%s' % fmt
addr = base + '%0D'.join(s)    
response = urllib2.urlopen(addr)
html = response.read()
pIDs = []
for i in html.split('\n'):
    if i == '': continue
    pIDs.append(i.split('|')[-1])
print("pIDs=\n",pIDs)

#pIDs=
# ['24605084', '24320156']

bpy21. Using elink EUtil in Biopython

We can use Bio.Entrez.elink to find related IDs from one database to another, or from same database. Thus elink requires we give name of database from where the IDs are located. We also need name of database where the new IDs are located. Finally we have to provide a string, or a list of strings, for the IDs.

In the example, the database from is 'protein'. We have three IDs of tyrosine kinases from human, cow and mouse.

elink returns the IDs in the 'gene' database for these proteins.

# bpy21.py
from __future__ import print_function
from Bio import Entrez

Entrez.email = "A.N.Other@example.com"
IDs = ['15718680','157427902','119703751']
handle = Entrez.elink(dbfrom = 'protein', db='gene', id = IDs)
values = Entrez.read(handle)
link_IDs = []
for value in values:
    for linkset in value['LinkSetDb']:
        for link in linkset['Link']:
            ID = link['Id']
            print('Link:', ID)
            link_IDs.append(ID)
print("link_ids=\n",link_IDs)

#Link: 3702
#Link: 522311
#Link: 16428
#link_ids=
# ['3702', '522311', '16428']