Tuesday, March 31, 2015

bpy20. Using espell EUtil in Biopython

We can use Bio.Entrez.espell to find the correct spelling for a term. The default database is PubMed.


In the dictionary returned by Entrez.read(), there are two keys of interest: the search term is under the key "Query" and the corrected term is under the key "CorrectedQuery".

# bpy20.py
from __future__ import print_function, division
from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
handle = Entrez.espell(term="cance")
record = Entrez.read(handle)
print("Query:",record["Query"])
print("Corrected Query:",record["CorrectedQuery"])

#Query: cance
#Corrected Query: cancer

Monday, March 30, 2015

bpy19. Using epost EUtil in Biopython

We can use Bio.Entrez.epost to post IDs to NCBI so that we may use them later. This example is similar to the last one, except that now we do not use the usehistory='y' keyword.


After getting a list of IDs using esearch, with the same search term as in the last example, we selected a random subset of 500 IDs and posted them.


They were then used in a call to Bio.Entrez.esummary.

# bpy19.py
from __future__ import print_function
from Bio import Entrez
import numpy as np

Entrez.email = "A.N.Other@example.com"
term = ('("mitochondria"[MeSH Terms] OR "mitochondria"[All Fields])'
        ' AND (Review[ptyp] AND "loattrfree full text"[sb] AND'
        ' ("2012/01/01"[PDAT] : "2014/12/31"[PDAT]))')
handle = Entrez.esearch("pubmed", retmax = 1500, term = term)
values = Entrez.read(handle)
print("Count:",values['Count'])
print("len(IdList)",len(values['IdList']))
randIDs = np.random.choice(values['IdList'],500,replace=False)
print("len(randIDs)",len(randIDs))
handle1 = Entrez.epost("pubmed",id=','.join(randIDs))
values1 = Entrez.read(handle1)
print("QueryKey:", values1['QueryKey'])
print("len(WebEnv):", len(values1['WebEnv']))
summary = Entrez.esummary(db="pubmed", retmax=500,
                          webenv=values1["WebEnv"],
                          query_key=values1["QueryKey"])
records = Entrez.read(summary)
for i,record in enumerate(records):
    if i%100 != 0: continue
    print('%d. %s' % (i+1,record['Title']))
    print('')

#Count: 1223
#len(IdList) 1223
#len(randIDs) 500
#QueryKey: 1
#len(WebEnv): 77
#1. Mitochondrial poly(A) polymerase and polyadenylation.
#
#101. The role of mitochondrial dysfunction in sepsis-induced
# multi-organ failure.
#
#201. The involvement of the sigma-1 receptor in neurodegeneration
# and neurorestoration.
#
#301. Provitamin A metabolism and functions in mammalian biology.
#
#401. Peroxisome proliferator activated receptor α ligands as
# anticancer drugs targeting mitochondrial metabolism.

Saturday, March 28, 2015

bpy18. Posting IDs in a NCBI EUtil using Biopython

We can pass the keyword parameter usehistory='y' to Bio.Entrez.esearch.


We search for mitochondria review articles that have free full text available, from the years 2012 through 2014.


Because of the additional parameter, two new keys (QueryKey and WebEnv) are returned, which we can use in a subsequent EUtil, as sketched after the output below. We print the titles of a few of the 1223 records that were returned.

# bpy18.py

from __future__ import print_function
from Bio import Entrez
Entrez.email = "A.N.Other@example.com"
term = ('("mitochondria"[MeSH Terms] OR "mitochondria"[All Fields])'
        ' AND (Review[ptyp] AND "loattrfree full text"[sb] AND'
        ' ("2012/01/01"[PDAT] : "2014/12/31"[PDAT]))')
handle = Entrez.esearch("pubmed", retmax = 1500, term = term,
                        usehistory = 'y' )

values = Entrez.read(handle)
print("Count:",values['Count'])
print("len(IdList)",len(values['IdList']))
print("QueryKey:", values['QueryKey'])
print("len(WebEnv):", len(values['WebEnv']))

summary = Entrez.esummary(db="pubmed", retmax=values['Count'],
                          webenv=values["WebEnv"],
                          query_key=values["QueryKey"])
records = Entrez.read(summary)

for i,record in enumerate(records):
    if i%300 != 0: continue
    print('%d. %s' % (i+1,record['Title']))
    print('')

#Count: 1223
#len(IdList) 1223
#QueryKey: 1
#len(WebEnv): 78
#1. The role of SIGMAR1 gene mutation and mitochondrial dysfunction
# in amyotrophic lateral sclerosis.
#
#301. You are what you eat: multifaceted functions of autophagy
# during elegans development.
#
#601. Mitochondrial deficiency in Cockayne syndrome.
#
#901. Transcriptional repression of mitochondrial function in
# aging: a novel role for the silencing mediator of retinoid
# and thyroid hormone receptors co-repressor.
#
#1201. Signal transduction by mitochondrial oxidants.
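
As a sketch of reusing these history keys with a different EUtil, the same WebEnv and QueryKey could also be passed to Bio.Entrez.efetch, for example to download a few of the matching records in Medline text format. This continues from bpy18.py above, so the values dictionary is assumed to still be in scope.

# sketch: reuse the esearch history from bpy18.py with efetch
from Bio import Entrez
Entrez.email = "A.N.Other@example.com"
# values["WebEnv"] and values["QueryKey"] come from the esearch call above
fetch = Entrez.efetch(db="pubmed", rettype="medline", retmode="text",
                      retmax=5, webenv=values["WebEnv"],
                      query_key=values["QueryKey"])
print(fetch.read()[:500]) # start of the Medline text for 5 records
fetch.close()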

Friday, March 27, 2015

bpy17. Using efetch EUtil in Biopython

We can use Bio.Entrez.efetch to get records from a database, given an ID or a set of IDs. We use the set of IDs that we used earlier in bpy14.


There are two programs here. The first saves the records to a text file. The valid rettype and retmode keyword arguments are listed at the NCBI Entrez site; this particular combination retrieves a text file in Medline format (a sketch after the output below fetches XML instead). The variable out is the string corresponding to the text returned.


The second program uses Bio.Medline.parse to read the Medline file. Here we print the Title and the first 200 characters of the Abstract.

# bpy17a.py
from __future__ import print_function
from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
handle = Entrez.efetch(db="pubmed", id="25741283,25798216",
                       rettype ='medline', retmode = 'text')
out = handle.read()
handle.close()
with open("bpy17.txt","w") as fout:
    fout.write(out)

# bpy17b.py
from __future__ import print_function
from Bio import Medline
fin = open('bpy17.txt')
records = Medline.parse(fin)

for record in records:
    print('Title:',record['TI'])
    print('Abstract\n',record['AB'][:200], end = '')
    print('...\n')
    
fin.close()

#Title: Nitric oxide and mitochondria in metabolic syndrome.
#Abstract
# Metabolic syndrome (MS) is a cluster of metabolic disorders
# that collectively increase the risk of cardiovascular disease.
# Nitric oxide (NO) plays a crucial role in the pathogeneses of
# MS components a...
#
#Title: Maternal ancestry and population history from whole
# mitochondrial genomes.
#Abstract
# MtDNA has been a widely used tool in human evolutionary and
# population genetic studies over the past three decades. Its
# maternal inheritance and lack of recombination have offered
# the opportunity to e...
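
Other rettype/retmode combinations are possible. Below is a minimal sketch, assuming the same two IDs, of fetching XML instead of Medline text; Entrez.read can parse the XML handle directly, although the exact structure of the parsed result may differ between Biopython versions.

# sketch: fetch the same records as XML and parse with Entrez.read
from __future__ import print_function
from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
handle = Entrez.efetch(db="pubmed", id="25741283,25798216", retmode="xml")
records = Entrez.read(handle) # parsed XML; layout varies by version
handle.close()
print(type(records))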

Thursday, March 26, 2015

bpy16. Using esearch EUtil in Biopython

We can use Bio.Entrez.esearch to find IDs from a database. Besides the database, we have to give the search term. We may, additionally, give the maximum number of IDs to return, as a keyword argument.


Here, 20 IDs are returned, since that is the default value for the maximum number of IDs returned (RetMax equals 20); a sketch after the output below shows passing retmax to change this. We can also see the total number of matches.


We may search any of the fields; here, however, we search in [All Fields].

# bpy16.py
from __future__ import print_function, division
from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
handle = Entrez.esearch("omim","cancer")
record = Entrez.read(handle)
handle.close()
print("There are a total of %s results." % record['Count'])
r = record['IdList']
NumIds = len(r)
print("%d Ids were returned. They are:" % NumIds)
for i in range(0,NumIds,4):
    r_fmt = i+1,r[i],i+2,r[i+1],i+3,r[i+2],i+4,r[i+3]
    print("%d: %s\t%d: %s\t%d: %s\t%d: %s" % r_fmt)

#There are a total of 2802 results.
#20 Ids were returned. They are:
#1: 616093       2: 616092       3: 616073       4: 616043
#5: 616041       6: 616016       7: 616015       8: 616000
#9: 615993       10: 615977      11: 615943      12: 615930
#13: 615920      14: 615909      15: 615908      16: 615874
#17: 615869      18: 615854      19: 615845      20: 615825
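
A minimal sketch of changing the default of 20 by passing the retmax keyword argument (assuming the search has at least 50 matches):

# sketch: ask for up to 50 IDs instead of the default 20
from __future__ import print_function
from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
handle = Entrez.esearch("omim", "cancer", retmax=50)
record = Entrez.read(handle)
handle.close()
print(len(record['IdList'])) # 50, if there are at least 50 matches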

bR1. Getting the top 5 databases for 'asthma' at NCBI using R

We will read count.csv, which we created in the last example, sort and normalize it, and plot the top 5 databases as a barplot.


There is no reason we could not do this from within Python. There is also no reason to use Biopython rather than Bioconductor; however, for me, Biopython is easier to understand.

# bR1.R
cnt <- read.table("count.csv", sep = ',',
                  header = FALSE, stringsAsFactors = FALSE,
                  col.names = c("db","count"))
cnt.ord <- cnt[order(cnt$count, decreasing = TRUE),]
cnt.ord$count <- cnt.ord$count/max(cnt.ord$count)
barplot(cnt.ord$count[1:5], names.arg = cnt.ord$db[1:5],
        ylab = 'Relative Number', xlab = 'Database',
        main = "Top 5 databases for 'asthma' at NCBI")

bpy15. Writing a query function using Biopython

Using Bio.Entrez.egquery, which we used earlier in bpy12.py, we will encapsulate the table creation into a function, which writes that table to a csv file.


Additionally, we pass a boolean value to indicate whether we also want the table printed.


In later examples, we will use it as an import. However, it is good to have test code in case the module is run directly. The test code just prints the table returned for 'asthma' and saves the result to count.csv. Note that there are more PubMed and PubMed Central records than in our previous search.

# query.py

from __future__ import print_function, division
from Bio import Entrez

def query(term, fname, to_print):
    Entrez.email = "Your.Name.Here@example.org"
    handle = Entrez.egquery(term=term)
    record = Entrez.read(handle)
    handle.close()
    X=[]
    Y=[]
    out=[]
    for row in record["eGQueryResult"]:
        for i in row:
            if i == 'DbName':
               Y.append(row[i])
            elif i=='Count':
                X.append(row[i])
    for i in range(len(X)):
        if int(X[i])==0:
            continue
        out.append("%s,%s\n" % (Y[i],X[i]))
    if fname is not None:
        with open(fname+'.csv','w') as fout:
            fout.writelines(out)
    if to_print:
        for i in out:
            print(i, end = '')
  
if __name__ == '__main__':
    query('asthma','count',True)
    
#    pubmed,149164
#    pmc,106784
#    mesh,11
#    books,7165
#    pubmedhealth,1461
#    omim,270
#    ncbisearch,75
#    nuccore,32446
#    nucgss,91
#    nucest,210
#    protein,2559
#    genome,1
#    structure,390
#    dbvar,9
#    gene,1036
#    sra,8245
#    biosystems,125
#    unigene,1
#    cdd,22
#    popset,7
#    geoprofiles,731699
#    gds,4417
#    homologene,18
#    pccompound,122
#    pcsubstance,2125
#    pcassay,3515
#    nlmcatalog,2610
#    probe,34
#    gap,2387
#    bioproject,211
#    biosample,12542  

Wednesday, March 25, 2015

bpy14. Using esummary EUtil in Biopython

We can use Bio.Entrez.esummary to get summaries (not full details) of records in a database. The parameters are the IDs of the records we want to retrieve, as well as the database name. If we want information for more than one ID, we can pass the IDs as a comma-delimited string.


You should give your email for Entrez.email. Biopython returns a list whose length corresponds to the number of IDs provided in the id string.


For the two IDs, we get the title of the article, the authors, and the journal name. We have to find IDs by using other Entrez EUtils, such as esearch.

# bpy14.py
from __future__ import print_function
from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
handle = Entrez.esummary(db="pubmed", id="25741283,25798216")
record = Entrez.read(handle)
handle.close()
print('type(record) =',type(record))
print('len(record) =',len(record))
for i in range(len(record)):
    print("record:",i)
    print("title:",record[i]["Title"])
    for author in record[i]["AuthorList"]:
        print("author:",author)
    print("journal:",record[i]["FullJournalName"])
    print("")

#type(record) = <class 'Bio.Entrez.Parser.ListElement'>
#len(record) = 2
#record: 0
#title: Nitric oxide and mitochondria in metabolic syndrome.
#author: Litvinova L
#author: Atochin DN
#author: Fattakhov N
#author: Vasilenko M
#author: Zatolokin P
#author: Kirienkova E
#journal: Frontiers in physiology
#
#record: 1
#title: Maternal ancestry and population history from whole
# mitochondrial genomes.
#author: Kivisild T
#journal: Investigative genetics

Tuesday, March 24, 2015

bpy13. Using einfo EUtil in Biopython

We can use Bio.Entrez.einfo to find information about databases. If we call einfo without any parameters, it returns the names of all the databases (a short sketch after the output below illustrates this). However, if we pass a particular db, such as 'pubmed', it returns information about that database.


For Entrez.email, you should give your email. Biopython returns a dictionary of length 1, with the result in the key 'DbInfo'. Some information about the database is printed, such as its name and record count.


The 'FieldList' entries within the result are then iterated over (with i being the current field), and we list each field's name and description.

# bpy13.py
from __future__ import print_function, division
from Bio import Entrez

Entrez.email = "Your.Name.Here@example.org"
handle = Entrez.einfo(db="pubmed")
record = Entrez.read(handle)
handle.close()
db = record['DbInfo']
name = db['MenuName']
dbName = db['DbName']
desc = db['Description']
count = db['Count']
print(" - ".join([name,dbName,desc,count]))
for i in db['FieldList']:
    print(i['Name'],end='\t')
    print(i['Description'])
    
#PubMed - pubmed - PubMed bibliographic record - 24738242
#ALL     All terms from all searchable fields
#UID     Unique number assigned to publication
#FILT    Limits the records
#TITL    Words in title of publication
#WORD    Free text associated with publication
#MESH    Medical Subject Headings assigned to publication
#MAJR    MeSH terms of major importance to publication
#AUTH    Author(s) of publication
#JOUR    Journal abbreviation of publication
#AFFL    Author's institutional affiliation and address
#ECNO    EC number for enzyme or CAS registry number
#SUBS    CAS chemical name or MEDLINE Substance Name
#PDAT    Date of publication
#EDAT    Date publication first accessible through Entrez
#VOL     Volume number of publication
#PAGE    Page number(s) of publication
#PTYP    Type of publication (e.g., review)
#LANG    Language of publication
#ISS     Issue number of publication
#SUBH    Additional specificity for MeSH term
#SI      Cross-reference from publication to other databases
#MHDA    Date publication was indexed with MeSH terms
#TIAB    Free text associated with Abstract/Title
#OTRM    Other terms associated with publication
#INVR    Investigator
#COLN    Corporate Author of publication
#CNTY    Country of publication
#PAPX    MeSH pharmacological action pre-explosions
#GRNT    NIH Grant Numbers
#MDAT    Date of last modification
#CDAT    Date of completion
#PID     Publisher ID
#FAUT    First Author of publication
#FULL    Full Author Name(s) of publication
#FINV    Full name of investigator
#TT      Words in transliterated title of publication
#LAUT    Last Author of publication
#PPDT    Date of print publication
#EPDT    Date of Electronic publication
#LID     ELocation ID
#CRDT    Date publication first accessible through Entrez
#BOOK    ID of the book that contains the document
#ED      Section's Editor
#ISBN    ISBN
#PUBN    Publisher's name
#AUCL    Author Cluster ID
#EID     Extended PMID
#DSO     Additional text from the summary
#AUID    Author Identifier
#PS      Personal Name as Subject
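
A minimal sketch of the no-argument call mentioned above; the names of all the databases are returned in the key 'DbList'.

# sketch: einfo with no database returns the names of all databases
from __future__ import print_function
from Bio import Entrez
Entrez.email = "Your.Name.Here@example.org"
handle = Entrez.einfo()
record = Entrez.read(handle)
handle.close()
print(record['DbList'][:5]) # the first few database names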

bpy12. Using egquery EUtil in Biopython

We can use Bio.Entrez.egquery to find counts in different databases, for a query term.


For Entrez.email, you should give your email. The query term here is 'asthma'. Biopython returns a dictionary of length 2, with the result in the key 'eGQueryResult'. Each entry inside 'eGQueryResult' is itself a dictionary.


The entries within 'eGQueryResult' are iterated over (with row being the current entry). Usually, we are only interested in two keys: the database name 'MenuName' and the count 'Count'.


We populate the lists X and Y with the strings from these two keys. Finally, we print the two lists, provided the count is not zero.

# bpy12.py
from __future__ import print_function, division
from Bio import Entrez

Entrez.email = "Your.Name.Here@example.org"
handle = Entrez.egquery(term="asthma")
record = Entrez.read(handle)
handle.close()
X=[]
Y=[]
for row in record["eGQueryResult"]:
    for i in row:
        if i == 'MenuName':
           Y.append(row[i])
        elif i=='Count':
            X.append(row[i])
for i in range(len(X)):
    if int(X[i])==0:
        continue
    print("%20s\t%s" % (Y[i],X[i]))

#              PubMed    149142
#      PubMed Central    106753
#                MeSH    11
#               Books    7165
#       PubMed Health    1461
#                OMIM    270
#         Site Search    75
#          Nucleotide    32446
#                 GSS    91
#                 EST    210
#             Protein    2558
#              Genome    1
#           Structure    390
#               dbVar    9
#                Gene    1034
#                 SRA    8245
#          BioSystems    125
#             UniGene    1
#   Conserved Domains    22
#              PopSet    7
#        GEO Profiles    731699
#        GEO DataSets    4417
#          HomoloGene    18
#    PubChem Compound    122
#   PubChem Substance    2125
#    PubChem BioAssay    3515
#         NLM Catalog    2610
#               Probe    34
#               dbGaP    2387
#          BioProject    209
#           BioSample    12542

Monday, March 23, 2015

bpy11. Using Entrez EUtils in Biopython

We can use Bio.Entrez to access the different EUtils at NCBI Entrez.


There are 9 utilities, and currently 8 of them can be accessed using Bio.Entrez; the 9th, ECitMatch, is the most recent. Other than using Biopython, you can also make plain HTTP requests with the appropriate EUtil name, query terms, and so on (a sketch after the output below shows this). The data can be returned as XML, a Python object, etc.


Before going into the details of the different EUtils, this program just lists some documentation. You can get the complete docstrings by removing the [:2] slice on docL, in which case all the lines will be printed.

# bpy11.py

from __future__ import print_function, division
from Bio import Entrez

eUtils = [e for e in Entrez.__dict__ if e[0]=='e' and e != 'email']
for i,eUtil in enumerate(eUtils):
    print('%d.%s' % (i+1,eUtil))
    doc = getattr(Entrez, eUtil).__doc__ # same lookup, without exec
    docL = doc.split('\n')
    print(''.join(docL[:2]))
    print('')

#1.espell
#ESpell retrieves spelling suggestions, returned in a results handle.
#
#2.egquery
#EGQuery provides Entrez database counts for a global search.
#
#3.esummary
#ESummary retrieves document summaries as a results handle.
#
#4.efetch
#Fetches Entrez results which are returned as a handle.
#
#5.esearch
#ESearch runs an Entrez search and returns a handle to the results.
#
#6.elink
#ELink checks for linked external articles and returns a handle.
#
#7.einfo
#EInfo returns a summary of the Entez databases as a results handle.
#
#8.epost
#Post a file of identifiers for future use.
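
As a sketch of the direct HTTP approach mentioned above, we can call an EUtil without Biopython and get the raw XML back. The base URL and the esearch.fcgi endpoint below are as given in NCBI's E-utilities documentation.

# sketch: call esearch directly over HTTP, without Biopython
from __future__ import print_function
try:
    from urllib.request import urlopen # Python 3
except ImportError:
    from urllib2 import urlopen # Python 2

url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
       "esearch.fcgi?db=pubmed&term=asthma&retmax=5")
print(urlopen(url).read()[:300]) # raw XML returned by the server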

Thursday, March 19, 2015

bpy10. Reading a 1-record Genbank file using Biopython

Search for 'GenBank: BC135714.1' at NCBI.


In the results, there are Literature, Health, Genomes, and other sections. In the Genomes section, select Nucleotide. If there is only one nucleotide record, it will go directly to the GenBank record with that locus ID.


In the upper right, click the down arrow next to 'Send to'. Make sure 'Complete Record' is selected, and then choose 'File' as the destination. Download options will appear; download the GenBank file.


Rename the file to BC135714.1.gb and save it to the working directory or a subfolder, such as data, under the working directory.


In this program, the function Bio.SeqIO.read is used to parse the text file. We print some information, along with the translated protein from one feature. The selected feature is the coding sequence (CDS) for the ppm1a protein.

# bpy10.py
from __future__ import print_function, division
from Bio import SeqIO

gb = SeqIO.read('data/BC135714.1.gb','genbank')
print('type(gb) =',type(gb), end='\n\n')
print('id =', gb.id, end='\n\n')
print('desc =',gb.description, end='\n\n')
for annot in gb.annotations:
    print("--> ",annot)
    if annot == 'source':
        print(gb.annotations[annot], end='\n\n')
for ft in gb.features:
    print('--ft loc-->',ft.location)
dna = ft.extract(gb.seq) # ft is last feature ('ppm1a protein')
print("3rd feature translation: ",dna.translate())

#type(gb) = <class 'Bio.SeqRecord.SeqRecord'>
#
#id = BC135714.1
#
#desc = Xenopus tropicalis protein phosphatase 1A (formerly 2C),
# magnesium-dependent, alpha isoform, mRNA (cDNA clone MGC:121664
# IMAGE:7625995), complete cds.
#
#-->  comment
#-->  sequence_version
#-->  source
#Xenopus (Silurana) tropicalis (western clawed frog)
#
#-->  taxonomy
#-->  keywords
#-->  references
#-->  accessions
#-->  data_file_division
#-->  date
#-->  organism
#-->  gi
#--ft loc--> [0:3385](+)
#--ft loc--> [0:3385](+)
#--ft loc--> [336:1488](+)
#3rd feature translation:  
#MGAFLDKPKMEKHNAHGQGNGLRYGLSSMQGWRVEMEDAHTAVIGLPNGLD
#AWSFFAVYDGHAGSQVAKYCCEHLLDHITSNQDFKGTDGHLSVWSVKNGIR
#TGFLQIDEHMRVISEKKHGADRSGSTAVGVMTSPNHIYFINCGDSRGLLCR
#SKKVHFFTQDHKPSNPLEKERIQNAGGSVMIQRVNGSLAVSRALGDFDYKC
#VHGKGPTEQLVSPEPEVYEIERSEEDDQFIILACDGIWDVMGNEELCDFVW
#SRLEVTDDLERVCNEIVDTCLYKGSRDNMSVILICFPSAPKVLPEAVKREA
#ELDKYLEGRVEDIIKKQGEEGVPDLVHVMRTLASENIPNLPPGGELASKRS
#VIEAVYNRLNPYRNDDTDSASTDDMW*

bpy9. Counting k-mer words in Python

The function count, below, finds the number of times a pattern appears in a text.


For every possible starting position of the pattern within the text, it checks whether the slice at that position equals the pattern, and returns the total number of matches (a shorter generator-expression version is sketched after the output below).

# bpy9.py

from __future__ import print_function, division
import numpy as np

def rand_dna(n):
    dna = ['A','C','T','G']
    seq = np.random.choice(dna,n)
    return "".join(seq)

def count(text, pattern):
    cnt = 0
    for i in range(len(text)-len(pattern)+1): # +1 so the last window is checked
        if text[i:i+len(pattern)] == pattern: cnt+=1
    return cnt

np.random.seed(34)
dna = rand_dna(50)
dna_pattern = rand_dna(3)
c = count(dna, dna_pattern)
print("dna = ",dna)
print("dna_pattern = ", dna_pattern)
print("c =",c)

# dna =  CTTCTCAGCGTTCTCCACTCTACATCAGGCCGAAGTCAATTCATCTTTGT
# dna_pattern =  CTT
# c = 2
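
A minimal sketch of the same sliding-window count written as a generator expression. Note that the built-in str.count only counts non-overlapping occurrences, so it is not equivalent.

# sketch: generator-expression version of count
def count(text, pattern):
    k = len(pattern)
    return sum(text[i:i+k] == pattern for i in range(len(text) - k + 1))

print(count("AAAA", "AA"))  # 3 (overlapping matches counted)
print("AAAA".count("AA"))   # 2 (non-overlapping only)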

bpy8. Translating DNA using Biopython

There are two Python files in this example. The first (gc.py) has only data, and the second one (bpy8.py) has the actual code. We can load our data using:

from gc import *


This imports all the data into the namespace. It would also import any functions, if there were any present.


The dictionary gc is the standard genetic code.


In the loop, we read the next 3 characters (a codon), and then use the dictionary to look up which amino acid to append to the out list.


We could also use a list comprehension to make the program shorter (see the sketch after the output below). However, we can simply use the translate method in Biopython.

# gc.py
gc = {'TTT':'F', 'TCT':'S', 'TAT':'Y', 'TGT':'C',
      'TTC':'F', 'TCC':'S', 'TAC':'Y', 'TGC':'C',
      'TTA':'L', 'TCA':'S', 'TAA':'*', 'TGA':'*',
      'TTG':'L', 'TCG':'S', 'TAG':'*', 'TGG':'W',
      'CTT':'L', 'CCT':'P', 'CAT':'H', 'CGT':'R',
      'CTC':'L', 'CCC':'P', 'CAC':'H', 'CGC':'R',
      'CTA':'L', 'CCA':'P', 'CAA':'Q', 'CGA':'R',
      'CTG':'L', 'CCG':'P', 'CAG':'Q', 'CGG':'R',
      'ATT':'I', 'ACT':'T', 'AAT':'N', 'AGT':'S',
      'ATC':'I', 'ACC':'T', 'AAC':'N', 'AGC':'S',
      'ATA':'I', 'ACA':'T', 'AAA':'K', 'AGA':'R',
      'ATG':'M', 'ACG':'T', 'AAG':'K', 'AGG':'R',
      'GTT':'V', 'GCT':'A', 'GAT':'D', 'GGT':'G',
      'GTC':'V', 'GCC':'A', 'GAC':'D', 'GGC':'G',
      'GTA':'V', 'GCA':'A', 'GAA':'E', 'GGA':'G',
      'GTG':'V', 'GCG':'A', 'GAG':'E', 'GGG':'G' }

# bpy8.py

from __future__ import print_function, division
from Bio.Seq import Seq
import numpy as np
from gc import *

def rand_dna(n):
    dna = ['A','C','T','G']
    seq = np.random.choice(dna,n)
    return "".join(seq)

data = rand_dna(45)
out = []
L = len(data)
assert(L%3==0)
for i in range(L//3):
    in_data = data[3*i:3*(i+1)]
    out.append(gc[in_data])

print("DNA: %s" % data)
print("According to gc, the protein: %s" % ("".join(out)))

dna = Seq(data)
prot = dna.translate()
print("According to Bio, the protein: %s" % prot)

# DNA: ATCTATTACTCTGCGGATATCTCAAAGGGCCGGGTGCGGCCCTGA
# According to gc, the protein: IYYSADISKGRVRP*
# According to Bio, the protein: IYYSADISKGRVRP*
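
The list-comprehension version mentioned above might look like the following sketch; it reuses the local gc.py dictionary and a rand_dna function like the one in bpy8.py.

# sketch: the translation loop as a list comprehension
from __future__ import print_function, division
import numpy as np
from gc import * # the local gc.py above (it shadows the stdlib gc module)

def rand_dna(n):
    return "".join(np.random.choice(['A', 'C', 'T', 'G'], n))

data = rand_dna(45)
# one codon per element; gc maps each codon to an amino acid
protein = "".join([gc[data[3*i:3*i + 3]] for i in range(len(data)//3)])
print(data)
print(protein)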

Wednesday, March 18, 2015

bpy7. Writing multi-record sequences using Biopython

We write 3 random DNA sequences of different lengths. The function rand_dna returns a Seq object of length n.


The function Bio.SeqIO.write is used to write a list of sequences. While we use a list here, it can also be given a generator object (see the sketch after the output below). It can also write just a single record.


If we open the created file, we will see an '<unknown description>' placeholder for all three records, since we did not specify descriptions.

# bpy7.py
from __future__ import print_function
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
from Bio import SeqIO
import numpy as np

def rand_dna(n):
    dna = ['A','C','T','G']
    seq = np.random.choice(dna,n)
    return Seq("".join(seq))
    
record1 = SeqRecord(rand_dna(5),"seq1")
record2 = SeqRecord(rand_dna(8),"seq2")
record3 = SeqRecord(rand_dna(12),"seq3")
records = [record1,record2,record3]
count = SeqIO.write(records,'bpy7.fna','fasta')
print("%d records have been written." % count)
for i in records:
    print('id = %s' % i.id)
    print('seq = %s' % i.seq)

#3 records have been written.
#id = seq1
#seq = GTGGC
#id = seq2
#seq = ATTAGTTA
#id = seq3
#seq = AAAGCCAGGTCC
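
A minimal sketch of passing a generator expression, rather than a list, to Bio.SeqIO.write, reusing the idea of rand_dna from bpy7.py; the output file name bpy7b.fna is just a made-up example.

# sketch: SeqIO.write with a generator instead of a list
from __future__ import print_function
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
from Bio import SeqIO
import numpy as np

def rand_dna(n):
    return Seq("".join(np.random.choice(['A', 'C', 'T', 'G'], n)))

gen = (SeqRecord(rand_dna(n), "seq%d" % i)
       for i, n in enumerate((5, 8, 12), 1))
count = SeqIO.write(gen, 'bpy7b.fna', 'fasta')
print("%d records have been written." % count)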

Tuesday, March 17, 2015

bpy6. Reading a multi-record FASTA file into Biopython

This text file is saved as bpy6.fna in the working directory.

>seq1
ATTGGA
>seq2
ATG
>seq3
GCTA

We have to use the function Bio.SeqIO.parse to read a multi-record file. This function returns a generator object.


Usually, we will just loop over the generator. However, we can also use the next method to yield a new SeqRecord object each time (this is Python 2 syntax; see the sketch after the output below for the built-in next() function, which also works in Python 3). When there are no more SeqRecord objects to yield, it raises a StopIteration exception.


We re-create the records object to show the next method example. The previously named records object is garbage collected, as we no longer have a reference to it.

# bpy6.py

from __future__ import print_function
from Bio import SeqIO
records = SeqIO.parse('bpy6.fna','fasta')
print('type(records)=',type(records))
for record in records:
    print('id:',record.id)
    print('alpha:',record.seq.alphabet)
    print('seq:',record.seq)
records = SeqIO.parse('bpy6.fna','fasta')
record1 = records.next()
record2 = records.next()
record3 = records.next()
rec_ids = [r.id for r in [record1,record2,record3]]
print('rec_ids =',rec_ids)
#type(records)= 
#id: seq1
#alpha: SingleLetterAlphabet()
#seq: ATTGGA
#id: seq2
#alpha: SingleLetterAlphabet()
#seq: ATG
#id: seq3
#alpha: SingleLetterAlphabet()
#seq: GCTA
#rec_ids = ['seq1', 'seq2', 'seq3']
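
A note on portability: the .next() method used above is Python 2 only. A minimal sketch using the built-in next() function, which works in both Python 2 and Python 3:

# sketch: the built-in next() works in Python 2 and Python 3
from __future__ import print_function
from Bio import SeqIO

records = SeqIO.parse('bpy6.fna', 'fasta')
record1 = next(records)
print(record1.id) # seq1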

py37. split in Python

We have an input tuple of strings, each with two values separated by ':'. The value on the left is a string, whereas the value on the right is an integer.


We want to get a list of strings for the values that go before ':'.


We want to get a list of integers for the values that go after ':'.


We use list comprehensions to get the lists.


In addition, we create a combined list, with the values extracted from the input sequence returned as a list of 2-tuples. The split method of a string object is used. Notice that we write the string, then a period, to get to any of its methods or attributes; this leads to better management of the global namespace. The split method is called on the string and is given the character to split on.

# ex37.py
from __future__ import print_function
chr_vec = "time:30","pos:45","loc:5"
first_vec = [c.split(":")[0] for c in chr_vec]
second_vec = [int(c.split(":")[1]) for c in chr_vec]
combined_vec = [(c.split(":")[0],
                 int(c.split(":")[1])) for c in chr_vec]
print("original: ", chr_vec)
print("first_vec = ", first_vec)
print("second_vec = ", second_vec)
print("combined_vec = ", combined_vec)
#original:  ('time:30', 'pos:45', 'loc:5')
#first_vec =  ['time', 'pos', 'loc']
#second_vec =  [30, 45, 5]
#combined_vec =  [('time', 30), ('pos', 45), ('loc', 5)]

R37. strsplit in R

We have an input character vector, with each string having two values separated by ':'. The value on the left is a string, whereas the value on the right is an integer.


We want to get a character vector for the values that go before ':'.


We want to get an integer vector for the values that go after ':'.


In addition, we create a combined list with the values extracted from the input vector. The strsplit function returns a list when given a character vector and the character to split on.

# ex37.R
chr.vec <- c("time:30","pos:45","loc:5")
sep <- strsplit(chr.vec,":")
first.vec <- sapply(sep, function(x) x[1])
second.vec <- as.integer(sapply(sep, function(x) x[2]))
combined.vec <- lapply(sep, function(x) c(x[1],x[2]))
cat("original: ", chr.vec, '\n')
cat("first.vec = ", first.vec,'\n')
cat("second.vec = ", second.vec,'\n')
cat("combined.vec = \n")
print(combined.vec)
# original:  time:30 pos:45 loc:5 
# first.vec =  time pos loc 
# second.vec =  30 45 5 
# combined.vec = 
#   [[1]]
# [1] "time" "30"  
# 
# [[2]]
# [1] "pos" "45" 
# 
# [[3]]
# [1] "loc" "5"

Sunday, March 15, 2015

bpy5. Reading a FASTA file with one record in Biopython

A test fasta file is available from Biopython. This file contains only one sequence. You should download the text file to your working directory.


We can use Bio.SeqIO.read to read a file with only one record. The arguments are the file name, the format, and an optional alphabet. The format here is 'fasta'. If the alphabet is omitted, it defaults to SingleLetterAlphabet(). You should provide an alphabet, since FASTA files do not carry alphabet information inside them; however, you may be able to infer it from the file extension, such as .fna (nucleic acids) here.


We can find the attributes of the SeqRecord object which is returned, such as id and seq.


We also use some slicing to select a few 10-letter subsequences.

# bpy5.py

from __future__ import print_function
from Bio import SeqIO
from Bio.Alphabet.IUPAC import unambiguous_dna as dna
record = SeqIO.read('NC_005816.fna','fasta',dna) 
print('id:',record.id)
print('alpha:',record.seq.alphabet)
print('first 10 letters:',record.seq[:10])
print('last 10 letters:',record.seq[-11:-1])
print('first 10 even letters:',record.seq[:20:2])
print('first 10 odd letters:',record.seq[1:21:2])
#id: gi|45478711|ref|NC_005816.1|
#alpha: IUPACUnambiguousDNA()
#first 10 letters TGTAACGAAC
#last 10 letters CCCGACCCCT
#first 10 even letters TTAGAGTCAA
#first 10 odd letters GACACGGATG

Saturday, March 14, 2015

bpy4. SeqRecord in Biopython

A SeqRecord object contains annotations describing the sequence.


An example, based on one from their docs, is presented.


You can get the help docs using SeqRecord? in IPython or help(SeqRecord).


The docs pertaining to creating a SeqRecord are printed. This is only partial; the complete docstring is printed if the [:317] subscript is omitted.

# bpy4.py

from __future__ import print_function
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet.IUPAC import protein
seqStr = 'MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF'  
record = SeqRecord(Seq(seqStr,protein),
                       id="YP_025292.1", name="HokC",
                       description="toxic membrane protein")
print("record.annotations:",record.annotations)
record.annotations['where']="from docs, 'SeqRecord?'"
print("record.annotations:",record.annotations)
print("record.annotations['where']:",record.annotations['where'])
createSeqRec = SeqRecord.__init__.__doc__[:317]
for i in createSeqRec.split("\n"): print(i.strip())
#record.annotations: {}
#record.annotations: {'where': "from docs, 'SeqRecord?'"}
#record.annotations['where']: from docs, 'SeqRecord?'
#Create a SeqRecord.
#
#Arguments:
#- seq         - Sequence, required (Seq, MutableSeq or UnknownSeq)
#- id          - Sequence identifier, recommended (string)
#- name        - Sequence name, optional (string)
#- description - Sequence description, optional (string)

bpy3. Generic alphabet and strings

If you do not specify an alphabet, the generic Alphabet() is used.


If you print a sequence, it converts the sequence to a string by implicitly using str(), regardless of the alphabet used. You can also use the str() function to explicitly convert a sequence into a string. Whenever you perform an operation on a Seq, it returns a new object, since Seq is an immutable Python object. This is unlike Python lists, which are mutable. There is also a MutableSeq object (see the sketch after the output below).


However, in many cases the sequence is treated as a string, such as when looping over it or when doing comparisons.

# bpy3.py
from __future__ import print_function
from Bio.Seq import Seq
dna_str = "CTGAATCG"
dna = Seq(dna_str)
print("dna =", dna)
print("type(dna) = ", type(dna))
for i in dna:
    print('type(i) =', type(i), end = ' with value of ')
    print('i =',i)
print("comp:",dna_str==dna)

#dna = CTGAATCG
#type(dna) =  <class 'Bio.Seq.Seq'>
#type(i) = <type 'str'> with value of i = C
#type(i) = <type 'str'> with value of i = T
#type(i) = <type 'str'> with value of i = G
#type(i) = <type 'str'> with value of i = A
#type(i) = <type 'str'> with value of i = A
#type(i) = <type 'str'> with value of i = T
#type(i) = <type 'str'> with value of i = C
#type(i) = <type 'str'> with value of i = G
#comp: True
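
A minimal sketch of the explicit str() conversion and of the MutableSeq object mentioned above, assuming a Biopython version from this era (which still has the alphabet system and Seq.tomutable()):

# sketch: explicit str() and an in-place edit with MutableSeq
from __future__ import print_function
from Bio.Seq import Seq

dna = Seq("CTGAATCG")
s = str(dna)          # explicit conversion to a plain Python string
mut = dna.tomutable() # a MutableSeq copy of the same sequence
mut[0] = 'A'          # allowed: MutableSeq supports item assignment
print(s, mut)         # CTGAATCG ATGAATCG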

Friday, March 13, 2015

bpy2. Alphabet for Sequences

The different alphabets a sequence, such as DNA, RNA, or protein, might have can be found by using Bio.Alphabet.IUPAC.__dict__.


We remove, from the printout, any entries starting with a '_'.

# bpy2.py
from __future__ import print_function
from Bio.Alphabet import IUPAC
for i in IUPAC.__dict__:
    if i[0]=='_': continue
    print(i)

#Alphabet
#unambiguous_dna
#IUPACAmbiguousDNA
#protein
#ambiguous_rna
#IUPACProtein
#IUPACUnambiguousDNA
#unambiguous_rna
#ExtendedIUPACProtein
#extended_protein
#IUPACUnambiguousRNA
#ExtendedIUPACDNA
#extended_dna
#IUPACAmbiguousRNA
#IUPACData
#ambiguous_dna

bpy1. Biopython

If using Anaconda, you can install Biopython, from the terminal, using:

$ conda install biopython

Not every package is on conda. You might first search:

$ conda search biopython

or some other package. For example, we can search for Beautiful Soup, used for HTML parsing, with:

$ conda search beautiful-soup

or a partial name such as

$ conda search beau

If it is not on conda, you can always use pip to install it:

$ <path to Anaconda folder>/bin/pip install package

You should get the exact package name from their website.


The most important data structures in Biopython are the sequence (Seq) and the SeqRecord, which holds a sequence with annotation. Here, we create a 4-base DNA sequence.

# bpy1.py
from __future__ import print_function
import Bio
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
print("version:",Bio.__version__)
four = Seq("AGTG",IUPAC.unambiguous_dna)
print("seq:",four)
print("seq alphabet:",four.alphabet)
print("complement:",four.complement())
#version: 1.65
#seq: AGTG
#seq alphabet: IUPACUnambiguousDNA()
#complement: TCAC

R36. Drawing Polygons in R

To draw polygons, we have to give vertices as we loop over the polygon counterclockwise.


We use a curve with roots at -1, 0, and +1. We divide the x-axis based on these roots (these will become 4 shaded regions).


We make loops over x.1, x.2, x.3, and x.4, the four divisions of x that we chose. For x.2 and x.3, we do not have to complete the loop, as by default the polygon is closed.

# ex36.R

y <- function(x) {
  x^3 - x
}
x <- seq(-1.5, 1.5, .01)
x.1 <- x[x >= -1.5 & x <= -1]
x.2 <- x[x >= -1 & x <= 0]
x.3 <- x[x >= 0 & x <= 1]
x.4 <- x[x >= 1 & x <= 1.5]
plot(x,y(x), type = 'l', col = 'blue', xlim = c(-1.5,1.5))
abline(h = 0, col = 'magenta')
polygon(c(x.1,rev(x.1)),
        c(y(x.1),rep(0,length(x.1))), col = 'violet')
polygon(c(x.2), c(y(x.2)), col='green')
polygon(c(x.3), c(y(x.3)), col='red')
polygon(c(x.4,rev(x.4)),
        c(rep(0,length(x.4)),y(rev(x.4))), col='yellow')
title(expression('y = x'^3*'-x'))

Output:

py36. Drawing Polygons in Python

To draw polygons, we have to give vertices as we loop over the polygon counterclockwise.


We use a curve with roots at -1, 0, and +1. We divide the x-axis based on these roots (these will become 4 shaded regions).


Using list comprehensions, we make the loops over x_1, x_2, x_3, and x_4, the four divisions of x that we chose.

# ex36.py

from __future__ import print_function, division
import numpy as np
import matplotlib.pyplot as plt

def y(x):
    return x**3 - x

x = np.arange(-1.5,1.51,.01)
tol = 0.01
    
x_1 = x[(x >= -1.5) & (x <= (-1+tol))]
x_2 = x[(x >= -1) & (x <= (0+tol))]
x_3 = x[(x >= 0) & (x <= (1+tol))]
x_4 = x[(x >= 1) & (x <= (1.5+tol))]

fig = plt.figure()
ax = fig.add_subplot(111) 
ax.plot(x,y(x),'b')
ax.axhline(0, color = 'm')
plt.xlim(-1.5,1.5)
plt.xlabel('x')
plt.ylabel('y(x)')
plt.title('y(x) = $x^3$ - x')

# Counter clockwise loops
verts_left_to_right = [(xv,y(xv)) for xv in x_1]
verts_right_to_left = [(xv,0) for xv in np.flipud(x_1)]
verts = verts_left_to_right + verts_right_to_left
poly = plt.Polygon(verts, color = 'violet')
ax.add_patch(poly)

verts_left_to_right = [(xv,0) for xv in x_2]
verts_right_to_left = [(xv,y(xv)) for xv in np.flipud(x_2)]
verts = verts_left_to_right + verts_right_to_left
poly = plt.Polygon(verts, color = 'green')
ax.add_patch(poly)

verts_left_to_right = [(xv,y(xv)) for xv in x_3]
verts_right_to_left = [(xv,0) for xv in np.flipud(x_3)]
verts = verts_left_to_right + verts_right_to_left
poly = plt.Polygon(verts, color = 'red')
ax.add_patch(poly)

verts_left_to_right = [(xv,0) for xv in x_4]
verts_right_to_left = [(xv,y(xv)) for xv in np.flipud(x_4)]
verts = verts_left_to_right + verts_right_to_left
poly = plt.Polygon(verts, color = 'yellow')
ax.add_patch(poly)

plt.show()

Output:

Monday, March 9, 2015

py35. Fitting Data using Newton's method in Python

Test vectors, x and y, are created. x, the input, goes from 0 to 100 in steps of 1.


y, the output, corresponds to x/(25+x) with some added noise. This signal reaches about 0.8.


Now, when using Newton's method, we have a vector of length 101. Some of the values will be positive and some negative. The function fit finds the sum over all the terms, so finding the root of fit will place the curve in the middle of the data points. We find the best fit of the form x/(k+x); k has to be near 25. There are many functions for finding best fits in scipy (see the sketch at the end of this post).


In many cases, Newton's method will not converge, such as when y' keeps changing sign. In those cases, you can always plot the function over a grid of points, find approximately where it crosses zero, and use that as the starting value in Newton's method (the argument x0).

# ex35.py
from __future__ import print_function, division
import numpy as np
import matplotlib.pyplot as plt

def fit(k):
    return sum(y-x/(k+x))
    
def newton(f, tol=1E-10, x0=1, N=50, h=0.2):
    for i in range(N):
        df_dx = (f(x0+h/2)-f(x0-h/2))/h
        x1 = x0 - f(x0)/df_dx
        error = abs(x1-x0)
        print("i = %d,x1 = %.4f, error = %.4e" % (i,x1,error))
        if error < tol: break
        x0 = x1
    return x1

print('Test data x and y')
x = np.arange(101)
r = 0.2*np.random.rand(len(x))-0.1
y = x/(25+x) + r
plt.xlabel('x')
plt.ylabel('y')
plt.title(r'Fitting y = x/($k_0$+x)')
k0 = newton(fit)
print('k0 =',k0)
plt.plot(x, y, 'b', x, x/(k0+x),'m')
plt.legend(('actual','fit'), loc = 4)
plt.show()

#Test data x and y
#i = 0,x1 = 11.2652, error = 1.0265e+01
#i = 1,x1 = 22.3195, error = 1.1054e+01
#i = 2,x1 = 25.9416, error = 3.6221e+00
#i = 3,x1 = 26.1740, error = 2.3239e-01
#i = 4,x1 = 26.1749, error = 8.3630e-04
#i = 5,x1 = 26.1749, error = 1.3040e-08
#i = 6,x1 = 26.1749, error = 3.9080e-14
#k0 = 26.1748507877

Output:
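
As a sketch of the scipy route mentioned above, scipy.optimize.curve_fit can fit the same x/(k+x) model directly (assuming scipy is installed):

# sketch: fit y = x/(k+x) with scipy.optimize.curve_fit
from __future__ import print_function, division
import numpy as np
from scipy.optimize import curve_fit

def model(x, k):
    return x/(k + x)

x = np.arange(101)
y = x/(25.0 + x) + 0.2*np.random.rand(len(x)) - 0.1 # same noisy test data
popt, pcov = curve_fit(model, x, y, p0=[1.0])
print("k estimate:", popt[0]) # should come out near 25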

Sunday, March 8, 2015

R35. Fitting Data using Newton's method in R

Test vectors, x and y, are created. x, the input, goes from 0 to 100 in steps of 1.


y, the output, corresponds to the function x/(25+x) with some added noise.


Now, when using Newton's method, we have a vector of length 101. Some of the values will be positive and some negative. The function fit finds the sum over all the terms, so finding the root of fit will place the curve in the middle of the data points. We find the best fit according to the equation x/(k+x); we know that k has to be near 25. There are many functions for finding best fits in R.


In many cases, Newton's method will not converge, such as when y' keeps changing sign. In those cases, you can always plot the function over a grid of points, find approximately where it crosses zero, and use that as the starting value in Newton's method (the argument x0).

# ex35.R
fit <- function(k) {
  sum(y-x/(k+x))
}

newton <- function(f, tol=1E-10, x0=1, N=50, h=.001) {
  for (i in 1:N) {
    df.dx <- (f(x0+h/2)-f(x0-h/2))/h
    x1 <- x0 - f(x0)/df.dx
    error <- abs(x1-x0)
    cat("i = ",i,", x1 = ",x1,", error = ",error,'\n')
    if (error < tol) break
    x0 <- x1
  }
  x1
}

cat('Test data x and y\n')
x <- 0:100
y <- x/(25+x)+ runif(length(x), -.1, .1)
plot(x, y, type='l',
     col = 'blue',
     main = expression("Fitting y = x/(k"[0]*"+x)"))
k0 <- newton(fit)
cat('k0 =',k0,'\n')
legend("center",  pch=c('--'),
        col = c("blue", "magenta"), 
        legend = c("actual","fit"))
lines(x, x/(k0+x),col='magenta')

#Test data x and y
#i =  1 , x1 =  11.33843 , error =  10.33843 
#i =  2 , x1 =  22.53979 , error =  11.20136 
#i =  3 , x1 =  26.25 , error =  3.710206 
#i =  4 , x1 =  26.49259 , error =  0.2425991 
#i =  5 , x1 =  26.4935 , error =  0.0009042225 
#i =  6 , x1 =  26.4935 , error =  1.245337e-08 
#i =  7 , x1 =  26.4935 , error =  1.065814e-14 
#k0 = 26.4935 

Output:

py34. Newton's method in Python

Newton's method is based on approximating a function by its Taylor series truncated after the first-order term:


y(x+dx) = y(x) + dx y'(x)


We are interested in dx, which will result in y(x + dx) = 0.


0 = y(x) + dx y'(x).


-y(x) = dx y'(x).


dx = -y(x)/y'(x).


A function newton is written implementing this, and is used to calculate the cube root of 3. Later, we will use it for more complicated cases.

# ex34.py
from __future__ import print_function, division

def fit(x):
    return 3-x**3

def newton(f, tol=1E-10, x0=1, N=50, h=0.2):
    for i in range(N):
        df_dx = (f(x0+h/2)-f(x0-h/2))/h
        x1 = x0 - f(x0)/df_dx
        error = abs(x1-x0)
        print("i = %d,x1 = %.4f, error = %.4e" % (i,x1,error))
        if error < tol: break
        x0 = x1
    return x1

cube_root_3 = newton(fit)
print('cube_root_3 =',cube_root_3,
      '\nfit(cube_root_3) = ',fit(cube_root_3))
#i = 0,x1 = 1.6645, error = 6.6445e-01
#i = 1,x1 = 1.4708, error = 1.9363e-01
#i = 2,x1 = 1.4428, error = 2.7982e-02
#i = 3,x1 = 1.4423, error = 5.9354e-04
#i = 4,x1 = 1.4422, error = 1.1936e-06
#i = 5,x1 = 1.4422, error = 1.9106e-09
#i = 6,x1 = 1.4422, error = 3.0567e-12
#cube_root_3 = 1.44224957031 
#fit(cube_root_3) =  -3.15303338994e-14

R34. Newton's method in R

Newton's method is based on approximating a function by its Taylor series truncated after the first-order term:


y(x+dx) = y(x) + dx y'(x)


We are interested in dx, which will result in y(x + dx) = 0.


0 = y(x) + dx y'(x).


-y(x) = dx y'(x).


dx = -y(x)/y'(x).


A function newton is written implementing this, and is used to calculate the cube root of 3. Later, we will use it for more complicated cases.

# ex34.R
fit <- function(x) {
  3-x^3
}

newton <- function(f, tol=1E-10, x0=1, N=50, h=0.2) {
  for (i in 1:N) {
    df.dx <- (f(x0+h/2)-f(x0-h/2))/h
    x1 <- x0 - f(x0)/df.dx
    error <- abs(x1-x0)
    cat("i = ",i,", x1 = ",x1,", error = ",error,'\n')
    if (error < tol) break
    x0 <- x1
  }
  x1
}

cube.root.3 <- newton(fit)
cat('cube.root.3 =',cube.root.3,
    '\nfit(cube.root.3) = ',fit(cube.root.3))
# i =  1 , x1 =  1.664452 , error =  0.6644518 
# i =  2 , x1 =  1.470826 , error =  0.1936256 
# i =  3 , x1 =  1.442844 , error =  0.02798194 
# i =  4 , x1 =  1.442251 , error =  0.0005935402 
# i =  5 , x1 =  1.44225 , error =  1.193565e-06 
# i =  6 , x1 =  1.44225 , error =  1.910613e-09 
# i =  7 , x1 =  1.44225 , error =  3.056666e-12 
# cube.root.3 = 1.44225 
# fit(cube.root.3) =  -3.153033e-14

Friday, March 6, 2015

R33. Population with Harvesting in R

A harvest term, H, is used in the equation of the population, which is:


dN/dt = rN(1-N/K) - H


Steady-state solutions can be found from the roots of the equation above, after setting dN/dt to zero. We use the polyroot function to find the roots; it takes the coefficients in increasing order of power, hence c(H, -r, r/K).

# ex33.R
r <- .10 # growth rate / yr
K <- 100000 # carrying capacity
t <- 100 # number of years
H <- 1600 # Harvesting / yr
num <- num0 <- numeric(t+1)
num[1] <- num0[1] <- 40000
for (i in 1:t) {
  num[i+1] <- num[i]+r*num[i]*(1-num[i]/K)-H
  num0[i+1] <- num0[i]+r*num0[i]*(1-num0[i]/K)
}
sol <- polyroot(c(H,-r,r/K))
cat('\nSolutions are ',
    Re(sol[1])/1000, 'k and ',
    Re(sol[2])/1000, 'k.\n',
    sep = '')
main <- paste("Growth rate: ",r,
              ", Carrying Capacity: ",
              K/1000,"k\nHarvesting: ", H, sep = '')
plot(num/1000, type = 'l', col = 'blue',
     xlab = 'Year', ylab = 'Number (Thousands)',
     main = main, ylim =c(num[1]/1000,K/1000))
lines(num0/1000, col = 'magenta' )
text(x = 80, y = 84, paste('80K (H =',H,')'), col = 'blue')
text(x = 80, y = 97, '100K (H = 0)', col = 'magenta')

#Solutions are 20k and 80k.

Output:

py33. Population with Harvesting in Python

A harvest term, H, is used in the equation of the population, which is:


dN/dt = rN(1-N/K) - H


Steady-state solutions can be found from the roots of the equation above, after setting dN/dt equal to zero. We find the roots using the numpy.roots function.
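
Writing out the steady-state condition makes the call to numpy.roots clearer. Setting dN/dt = 0 above and rearranging gives a quadratic in N:

(r/K) N^2 - r N + H = 0

numpy.roots expects the coefficients in decreasing powers of N, which is why the call below is np.roots([r/K, -r, H]).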

# ex33.py
from __future__ import print_function, division
import numpy as np
import matplotlib.pyplot as plt

r = .10 # growth rate / yr
K = 100000 # carrying capacity
t = 100 # number of years
H = 1600 # Harvesting / yr
num = np.zeros(t+1)
num0 = np.zeros(t+1)
num[0] = 40000
num0[0] = 40000
for i in range(t):
    num[i+1] = num[i]+r*num[i]*(1-num[i]/K)-H
    num0[i+1] = num0[i]+r*num0[i]*(1-num0[i]/K)
sol = np.roots([r/K,-r,H])
print('\nSolutions are %.0fk and %.0fk.' %
      (sol[0]/1000,sol[1]/1000))
mStr = ("Growth rate: %.2f,"
        "Carrying Capacity: %dk"
        "\nHarvesting %d")
tStr = r,K/1000,H
main =  mStr % tStr
plt.plot(range(t+1),num/1000, 'b',
         range(t+1),num0/1000,'m')
plt.ylim(38,102)
plt.xlabel('Year')
plt.ylabel('Number (Thousands)')
plt.title(main)
plt.text(70, 84, '80K (H = %d)' % H)
plt.text(70, 97, '100K (H = 0)')

#Solutions are 80k and 20k.

Output:

R32. Logistic Growth in R

Logistic growth of fisheries, or other populations, tends towards a carrying capacity. The growth starts exponentially, and the overall curve is S-shaped.


The population is shown for a specific growth rate and carrying capacity.


The vertical line marks the maximum derivative, that is, where the maximum growth occurs.

# ex32.R
library(nnet)
r <- .25 # growth rate / yr
K <- 100 # carrying capacity
t <- 40 # number of years
num <- numeric(t+1)
num[1] <- 1
for (i in 1:t) {
  num[i+1] <- num[i]+r*num[i]*(1-num[i]/K)
}
plot(num, type = 'l', col = 'blue',
     xlab = 'Year', ylab = 'Number',
     main = "Growth rate: 0.25, Carrying Capacity = 100")
abline(v=which.is.max(diff(num)))

Output:

py32. Logistic Growth in Python

Logistic growth of fisheries, or other populations, tends towards a carrying capacity. The growth starts exponentially, and the overall curve is S-shaped.


The population is shown for a specific growth rate and carrying capacity.


The vertical line marks the maximum derivative, that is, where the maximum growth occurs.

# ex32.py
from __future__ import print_function, division
import numpy as np
import matplotlib.pyplot as plt

r = .25 # growth rate / yr
K = 100 # carrying capacity
t = 40 # number of years
num = np.zeros(t+1)
num[0] = 1
for i in range(t):
    num[i+1] = num[i]+r*num[i]*(1-num[i]/K)
plt.plot(range(t+1),num, 'b')
plt.xlabel('Year')
plt.ylabel('Number')
plt.title('Growth rate: 0.25, Carrying Capacity = 100')
plt.axvline(np.argmax(np.diff(num)),  color = 'k'  )
plt.show()

Output:

py31. Reading a json file in Python

We use the example JSON file from Wikipedia.


The function json.load is used.


A dictionary is read, and we can print its components. Since a Python dictionary does not have any order, the keys are not listed in the order in which they were read from the file.


Since len() cannot be applied to all types of data, the length is only printed if len() does not raise an exception; otherwise, nothing is printed.

# ex31.py
from __future__ import print_function, division
import json
with open("ex31data.json") as f:
    data = json.load(f)
for k in data:
    print('\n', k, sep='')
    print('type:',type(data[k]))
    try: 
        print('length:',len(data[k]))
    except: pass
    
print("\ndata['phoneNumber']")
for i in data['phoneNumber']:
    print(i)

#firstName
#type: <type 'unicode'>
#length: 4
#
#lastName
#type: <type 'unicode'>
#length: 5
#
#age
#type: <type 'int'>
#
#address
#type: <type 'dict'>
#length: 4
#
#phoneNumber
#type: <type 'list'>
#length: 2
#
#gender
#type: <type 'dict'>
#length: 1
#
#data['phoneNumber']
#{u'type': u'home', u'number': u'212 555-1234'}
#{u'type': u'fax', u'number': u'646 555-4567'}

R31. Reading json file in R

We use the example JSON file from Wikipedia.


The fromJSON function of the jsonlite package is used.


A list is read, and we can print its components.

# ex31.R
require(jsonlite)
data <- fromJSON("ex31data.json")
for (i in names(data)) {
  print(i)
  cat('class:',class(data[[i]]),'\n')
  cat('length:',length(data[[i]]),'\n\n')
}
print("data[['phoneNumber']]")
print(data[['phoneNumber']])

# [1] "firstName"
# class: character 
# length: 1 
# 
# [1] "lastName"
# class: character 
# length: 1 
# 
# [1] "age"
# class: integer 
# length: 1 
# 
# [1] "address"
# class: list 
# length: 4 
# 
# [1] "phoneNumber"
# class: data.frame 
# length: 2 
# 
# [1] "gender"
# class: list 
# length: 1 
# 
# [1] "data[['phoneNumber']]"
# type       number
# 1 home 212 555-1234
# 2  fax 646 555-4567

Tuesday, March 3, 2015

py30. Reading and parsing XML document in Python

We will use a test XML file, country_data.xml, from the Python documentation.


The module bs4, Beautiful Soup 4, is used to parse the document. If this module is not installed, you will have to install it using the methods described at their website, such as pip. If using a distribution like Anaconda, use the pip inside the distribution. It is possible to use other modules, as described in the Python documentation linked above.


The name of the country, the year, as well as the neighbors are returned for all 3 countries in the document.

# ex30.py
from __future__ import print_function, division
from bs4 import BeautifulSoup
doc = BeautifulSoup(open('ex30data.xml'),"xml")
country = doc.findAll("country")
for c in country:
    print("\n",c.attrs['name'],sep = "")
    y = c.find('year')
    print('year :',y.text)
    n = c.findAll('neighbor')
    for i,ni in enumerate(n):
        print("neighbor",i, end=" ", sep=" : ")
        print(ni.attrs)
    
#Liechtenstein
#year : 2008
#neighbor : 0 {'direction': u'E', 'name': u'Austria'}
#neighbor : 1 {'direction': u'W', 'name': u'Switzerland'}
#
#Singapore
#year : 2011
#neighbor : 0 {'direction': u'N', 'name': u'Malaysia'}
#
#Panama
#year : 2011
#neighbor : 0 {'direction': u'W', 'name': u'Costa Rica'}
#neighbor : 1 {'direction': u'E', 'name': u'Colombia'}

R30. Reading and parsing XML document in R

We will use a test XML file, country_data.xml, from the Python documentation.


The package XML is used to parse the document.


The name of the country, the year, as well as the neighbors are returned for all 3 countries in the document.

# ex30.R
require('XML')
doc <- xmlParse('ex30data.xml')
r <- xmlRoot(doc)
country <- xmlChildren(r)
for (i in 1:length(country)) {
  cat('\n',xmlAttrs(country[[i]]),'\n', sep="")
  cat("year : ")
  cyear <- country[[i]][['year']][[1]]
  print(cyear)
  cneigh <- country[[i]]["neighbor"]
  for (i in 1:length(cneigh)) {
    cat("neighbor",i,":")
    print(cneigh[[i]])
  }
}

# Liechtenstein
# year : 2008 
# neighbor 1 :<neighbor name="Austria" direction="E"/> 
# neighbor 2 :<neighbor name="Switzerland" direction="W"/> 
#   
# Singapore
# year : 2011 
# neighbor 1 :<neighbor name="Malaysia" direction="N"/> 
#   
# Panama
# year : 2011 
# neighbor 1 :<neighbor name="Costa Rica" direction="W"/> 
# neighbor 2 :<neighbor name="Colombia" direction="E"/>