Monday, May 25, 2015

nlp6. Treebank tokenizer in Python NLTK

The program below works the same as the last one, since the Treebank tokenizer is the default word tokenizer.


Part of the docstring is printed. We import nltk.tokenize.TreebankWordTokenizer under the alias TWT.


Instead of using the Penn Treebank regular expressions, we may also define our own rules.
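As a rough illustration of what creating new rules could look like, the sketch below uses only Python's re module rather than NLTK. The names CUSTOM_RULES and custom_tokenize are made up for this example, and the rules are not the Penn Treebank ones: here a contraction such as I'd stays a single token and every punctuation mark is split off.

```python
import re

# Hypothetical rule set for illustration only (not the Penn Treebank rules).
# Alternatives are tried left to right at each position.
CUSTOM_RULES = re.compile(r"""
    [A-Z][a-z]{1,2}\.     # short title abbreviations such as Dr. or Mr.
  | \w+(?:'\w+)?          # words, keeping contractions like I'd whole
  | [^\w\s]               # any other single punctuation character
""", re.VERBOSE)

def custom_tokenize(text):
    """Return the list of tokens matched by CUSTOM_RULES."""
    return CUSTOM_RULES.findall(text)

tokens = custom_tokenize("Dr. Brown gave a speech. I'd wish I knew then.")
print(tokens)
# ['Dr.', 'Brown', 'gave', 'a', 'speech', '.', "I'd", 'wish', 'I', 'knew', 'then', '.']
```

The first rule is deliberately crude: any short capitalized word followed by a period would be treated as an abbreviation, so a real rule set would need a proper abbreviation list.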

# nlp6.py
from __future__ import print_function, division
from nltk.tokenize import TreebankWordTokenizer as TWT

# Sample text: several sentences in one string.
lines = """Dr. Brown gave a speech. I'd wish I knew then
what I know now. Finally, he praised Python! At 8 o'clock,
he went home."""

t = '\t'
A = TWT()                # tokenizer instance
B = A.tokenize(lines)    # list of tokens

# Print the first lines of the class docstring.
print("DocString")
for i in TWT.__doc__.split('\n')[1:4]:
    print(i)

# Print each token with its index.
for i, b in enumerate(B):
    print(t, i, b)

#DocString
#    The Treebank tokenizer uses regular expressions to tokenize
# text as in Penn Treebank. This is the method that is invoked by
# ``word_tokenize()``.  It assumes that the text has already been
# segmented into sentences, e.g. using ``sent_tokenize()``.
#         0 Dr.
#         1 Brown
#         2 gave
#         3 a
#         4 speech.
#         5 I
#         6 'd
#         7 wish
#         8 I
#         9 knew
#         10 then
#         11 what
#         12 I
#         13 know
#         14 now.
#         15 Finally
#         16 ,
#         17 he
#         18 praised
#         19 Python
#         20 !
#         21 At
#         22 8
#         23 o'clock
#         24 ,
#         25 he
#         26 went
#         27 home
#         28 .
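In the output above, tokens 4 and 14 ("speech." and "now.") keep their periods, while the final period of the text is split off as token 28. This is consistent with the docstring, which says the tokenizer assumes the text has already been segmented into sentences. The sketch below tokenizes one sentence at a time; the sentences are split by hand here, which is an assumption made for this example (a sentence tokenizer such as sent_tokenize() could do that step instead).

```python
from nltk.tokenize import TreebankWordTokenizer as TWT

# Sentences split by hand for this illustration (an assumption, not
# part of the original program).
sentences = [
    "Dr. Brown gave a speech.",
    "I'd wish I knew then what I know now.",
    "Finally, he praised Python!",
    "At 8 o'clock, he went home.",
]

A = TWT()
# Tokenize each sentence separately and flatten the results into one list.
tokens = [tok for s in sentences for tok in A.tokenize(s)]

for i, b in enumerate(tokens):
    print('\t', i, b)
```

With the text fed in sentence by sentence, the period at the end of each sentence should become its own token, while "Dr." stays intact because its period is not at the end of the string.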
