Thursday, May 28, 2015

nlp21. Chinking in Python NLTK

In the last program, we got three chunks: (1) DT/JJ/NN (2) DT/JJ/NN (3) DT/NN.


With chinking, we can remove tag sequences from the chunks that the chunk rule produced. As a trivial example, we can chink every sequence with the signature DT/JJ/NN, so that only the last chunk (DT/NN) remains.


To do this, we add a chinking rule on the second line of the chunk grammar. A chink pattern is written between } and {, the reverse of the { } delimiters used for a chunk pattern. Because the grammar now spans more than one line, we have to write it as a triple-quoted multi-line string.

# nlp21.py
from __future__ import print_function
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import RegexpParser
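text = "The big dog barked at the little cat. The cat ran away."
# pos_tag() already returns a list of (word, tag) tuples
tag_text = pos_tag(word_tokenize(text))

# Line 1 chunks an optional DT, any number of JJ, then an NN.
# Line 2 chinks (removes) every DT JJ NN sequence from those chunks.
chunk = """CHNK: {<DT>?<JJ>*<NN>}
 }<DT><JJ><NN>{
"""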

cp = RegexpParser(chunk)
result = cp.parse(tag_text)
print(result)
result.draw()

#(S
#  The/DT
#  big/JJ
#  dog/NN
#  barked/VBD
#  at/IN
#  the/DT
#  little/JJ
#  cat/NN
#  ./.
#  (CHNK The/DT cat/NN)
#  ran/VBD
#  away/RB
#  ./.)
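
To see what the chink line changes, we can run the same grammar with and without it and compare the chunks. The following is only a minimal sketch: the tokens are tagged by hand (copied from the output above) so that no tagger model is needed, and chunk_phrases is a hypothetical helper name that is not part of NLTK.

```python
# nlp21_compare.py (sketch; hand-tagged input is an assumption)
from nltk.chunk import RegexpParser

# (word, tag) pairs copied from the printed result above
tagged = [
    ("The", "DT"), ("big", "JJ"), ("dog", "NN"), ("barked", "VBD"),
    ("at", "IN"), ("the", "DT"), ("little", "JJ"), ("cat", "NN"),
    (".", "."), ("The", "DT"), ("cat", "NN"), ("ran", "VBD"),
    ("away", "RB"), (".", "."),
]

# Grammar with the chunk rule only
chunk_only = "CHNK: {<DT>?<JJ>*<NN>}"

# Same chunk rule, followed by the chink rule from nlp21.py
chunk_and_chink = """CHNK: {<DT>?<JJ>*<NN>}
 }<DT><JJ><NN>{
"""


def chunk_phrases(grammar, tokens):
    """Hypothetical helper: return the words of each CHNK subtree as one string."""
    tree = RegexpParser(grammar).parse(tokens)
    return [
        " ".join(word for word, tag in subtree.leaves())
        for subtree in tree.subtrees(lambda t: t.label() == "CHNK")
    ]


before = chunk_phrases(chunk_only, tagged)
after = chunk_phrases(chunk_and_chink, tagged)
print(before)
print(after)
```

With these hand-written tags, the first list should hold the three chunks from the last program, and the second list should hold only "The cat".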

