Thursday, May 21, 2015

nlp1. Reading a text in Python NLTK

The NLTK module in Python can be used to load a text, or corpus. The included texts are stored in the nltk_data folder. This assumes the data files have already been downloaded to the computer using nltk.download().


Here Shakespeare’s Julius Caesar is read as a raw string. We could also use the corpus reader's XML loader, shakespeare.xml(), which returns a parsed tree and makes it possible to walk its elements, for example the <LINE> elements.
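As a rough sketch of the tree-based approach: with the corpus installed, shakespeare.xml("j_caesar.xml") should return an ElementTree element. So that the example runs without the corpus, it parses a small inline fragment instead. The `sample` string and its <SPEECH>/<SPEAKER> layout are assumptions made for illustration, not copied from the corpus file.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a fragment of the play's XML (illustration only).
sample = """<SPEECH>
<SPEAKER>MARULLUS</SPEAKER>
<LINE>Knew you not Pompey? Many a time and oft</LINE>
<LINE>Have you climb'd up to walls and battlements,</LINE>
</SPEECH>"""

root = ET.fromstring(sample)

# iter("LINE") visits every <LINE> element at any depth;
# itertext() joins the element's text with that of any nested child elements.
lines = ["".join(el.itertext()) for el in root.iter("LINE")]

# Keep only the lines that mention Pompey.
pompey_lines = [text for text in lines if "Pompey" in text]
print(pompey_lines)
```

Walking the tree avoids slicing tags off by hand and keeps working if a <LINE> contains nested markup.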


Here the <LINE> elements are extracted using a regular expression. Only a subset of the lines is printed: those containing the word 'Pompey'.

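# nlp1.py
from __future__ import print_function  # keeps print() working on Python 2
import re
from nltk.corpus import shakespeare

sp = " " * 2  # two-space indent for the printed lines

# Read the whole play as one raw XML string.
jc = shakespeare.raw("j_caesar.xml")

# Capture the text between <LINE> and </LINE>. "." does not match a newline,
# so each match stays within a single line of the file.
jc_lines = re.findall(r"<LINE>(.+)</LINE>", jc)

# Print only the lines that mention Pompey.
for lin in jc_lines:
    if "Pompey" in lin:
        print(sp + lin)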
        
#  Knew you not Pompey? Many a time and oft
#  To see great Pompey pass the streets of Rome:
#  That comes in triumph over Pompey's blood? Be gone!
#  In Pompey's porch: for now, this fearful night,
#  Repair to Pompey's porch, where you shall find us.
#  That done, repair to Pompey's theatre.
#  Who rated him for speaking well of Pompey:
#  That now on Pompey's basis lies along
#  Even at the base of Pompey's statua,
#  As Pompey was, am I compell'd to set
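
A note on the regular expression: a greedy .+ between the tags relies on each <LINE> element sitting on its own line of the file. The small sketch below, using a made-up `sample` string (an assumption, not taken from the corpus), shows what happens when two elements share a line, and how a non-greedy .+? handles that case.

```python
import re

# Hypothetical input with two <LINE> elements on one physical line.
sample = "<LINE>first</LINE><LINE>second</LINE>"

# Greedy: .+ runs to the last </LINE> on the line, swallowing the inner tags.
greedy = re.findall(r"<LINE>(.+)</LINE>", sample)

# Non-greedy: .+? stops at the first </LINE> it can.
lazy = re.findall(r"<LINE>(.+?)</LINE>", sample)

print(greedy)  # ['first</LINE><LINE>second']
print(lazy)    # ['first', 'second']
```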
