Thursday, March 19, 2015

bpy8. Translating DNA using Biopython

There are two Python files in this example. The first (gc.py) contains only data, and the second (bpy8.py) contains the actual code. We can load the data using:

from gc import *


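This imports every public name from gc.py into the current namespace: the data, plus any functions the file defines. Names that begin with an underscore are skipped unless the module lists them in __all__. One caveat: CPython also ships a built-in module named gc (the garbage collector interface), and built-in modules are found before local files. If the dictionary is not defined after the import, rename the data file and adjust the import to match.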
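As a rough illustration of what a star import brings in, the sketch below builds a throwaway module in memory and star-imports it into an empty namespace. The module name demo_data and its contents are hypothetical and are not part of this post's files.

```python
import sys
import types

# Hypothetical stand-in for a data-only file; not one of the post's files.
demo = types.ModuleType("demo_data")
source = (
    "table = {'ATG': 'M'}\n"
    "_hidden = 1\n"
    "def helper():\n"
    "    return 'ok'\n"
)
exec(source, demo.__dict__)
sys.modules["demo_data"] = demo  # make it importable by name

# Run the star import inside a fresh namespace and see what arrived.
namespace = {}
exec("from demo_data import *", namespace)
imported = sorted(name for name in namespace if not name.startswith("__"))
print(imported)  # ['helper', 'table']
```

The dictionary and the function come across, while the underscore-prefixed name stays behind.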


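The dictionary gc holds the standard genetic code: it maps each of the 64 DNA codons to a one-letter amino acid code, with '*' marking the stop codons.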


In the loop, we read the next three characters (one codon) and use the dictionary to look up which amino acid to append to the out list.


We could also use a list comprehension to make the program shorter. In practice, however, we can simply use the translate method that Biopython provides on Seq objects.

# gc.py
gc = {'TTT':'F', 'TCT':'S', 'TAT':'Y', 'TGT':'C',
      'TTC':'F', 'TCC':'S', 'TAC':'Y', 'TGC':'C',
      'TTA':'L', 'TCA':'S', 'TAA':'*', 'TGA':'*',
      'TTG':'L', 'TCG':'S', 'TAG':'*', 'TGG':'W',
      'CTT':'L', 'CCT':'P', 'CAT':'H', 'CGT':'R',
      'CTC':'L', 'CCC':'P', 'CAC':'H', 'CGC':'R',
      'CTA':'L', 'CCA':'P', 'CAA':'Q', 'CGA':'R',
      'CTG':'L', 'CCG':'P', 'CAG':'Q', 'CGG':'R',
      'ATT':'I', 'ACT':'T', 'AAT':'N', 'AGT':'S',
      'ATC':'I', 'ACC':'T', 'AAC':'N', 'AGC':'S',
      'ATA':'I', 'ACA':'T', 'AAA':'K', 'AGA':'R',
      'ATG':'M', 'ACG':'T', 'AAG':'K', 'AGG':'R',
      'GTT':'V', 'GCT':'A', 'GAT':'D', 'GGT':'G',
      'GTC':'V', 'GCC':'A', 'GAC':'D', 'GGC':'G',
      'GTA':'V', 'GCA':'A', 'GAA':'E', 'GGA':'G',
      'GTG':'V', 'GCG':'A', 'GAG':'E', 'GGG':'G' }

# bpy8.py

from __future__ import print_function, division
from Bio.Seq import Seq
import numpy as np
from gc import *

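def rand_dna(n):
    # Build a random DNA string of length n
    dna = ['A','C','T','G']
    seq = np.random.choice(dna,n)
    return "".join(seq)

data = rand_dna(45)
out = []
L = len(data)
assert L % 3 == 0  # the sequence must split evenly into codons
for i in range(L//3):
    in_data = data[3*i:3*(i+1)]  # the i-th codon
    out.append(gc[in_data])      # its one-letter amino acid code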

print("DNA: %s" % data)
print("According to gc, the protein: %s" % ("".join(out)))

dna = Seq(data)
prot = dna.translate()
print("According to Bio, the protein: %s" % prot)

# DNA: ATCTATTACTCTGCGGATATCTCAAAGGGCCGGGTGCGGCCCTGA
# According to gc, the protein: IYYSADISKGRVRP*
# According to Bio, the protein: IYYSADISKGRVRP*
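The list comprehension version mentioned earlier might look something like the sketch below. So that it runs on its own, it rebuilds the standard table from two strings instead of importing gc.py; that compact construction, and the names code and translate, are assumptions made for this example rather than part of the original program.

```python
# Assumption: rebuild the standard table compactly instead of importing
# gc.py, so that this sketch is self-contained.
bases = "TCAG"
amino = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
codons = [a + b + c for a in bases for b in bases for c in bases]
code = dict(zip(codons, amino))

def translate(dna):
    """Translate a DNA string whose length is a multiple of 3."""
    assert len(dna) % 3 == 0
    # One list comprehension replaces the explicit loop and out list.
    return "".join([code[dna[i:i + 3]] for i in range(0, len(dna), 3)])

print(translate("ATCTATTACTCTGCGGATATCTCAAAGGGCCGGGTGCGGCCCTGA"))
# IYYSADISKGRVRP*
```

Stepping the range by 3 avoids the 3*i arithmetic of the loop version, and the sample sequence from the run above gives the same protein.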

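Random DNA will often contain a stop codon somewhere in the middle. As a small sketch of how Biopython handles that, translate accepts a to_stop argument that ends the protein at the first stop codon. The sequence below is made up for the example.

```python
from Bio.Seq import Seq  # requires Biopython to be installed

dna = Seq("ATGGCCTAAGGG")  # made-up sequence with a stop codon in the middle

full = dna.translate()               # keeps going past the stop codon
short = dna.translate(to_stop=True)  # ends at the first stop codon

print(full)   # MA*G
print(short)  # MA
```

The dictionary-based loop behaves like the default call here: it emits '*' and carries on.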