Monday, May 25, 2015

nlp4. Directly loading a tokenizer in Python NLTK

This program does the same thing as the last.


Now we explicitly load our tokenizer. It has to be present in the nltk_data folder. In the last program this load happened implicitly.
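To see where NLTK looks for that folder, a small sketch like the following may help. It assumes NLTK is installed; `nltk.data.path` and `nltk.data.find` are part of NLTK, while the helper name `locate_resource` is hypothetical and only used here for illustration.

```python
from __future__ import print_function
from nltk.data import find, path


def locate_resource(name):
    """Hypothetical helper: return the location of an NLTK resource as a
    string, or None if it is not found in any nltk_data folder."""
    try:
        return str(find(name))
    except LookupError:
        # find() raises LookupError when the resource is missing.
        return None


# Directories that NLTK searches for nltk_data, in order.
for d in path:
    print("search dir:", d)

# Prints a location if the Punkt data is installed, otherwise None.
print(locate_resource("tokenizers/punkt/english.pickle"))
```

If the second print shows None, the tokenizer data is probably not installed on this machine, and the load in the program below would fail with a LookupError.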


We also print the docstring of the tokenize method, after stripping its leading and trailing whitespace.
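As a quick illustration of what that stripping does, here is a minimal sketch in plain Python. The function `greet` is made up for this example and has nothing to do with NLTK.

```python
from __future__ import print_function


def greet(name):
    """
    Return a greeting for the given name.
    """
    return "Hello, " + name


raw = greet.__doc__   # docstring exactly as written, including newlines and indentation
clean = raw.strip()   # leading and trailing whitespace removed

print(repr(raw))
print(repr(clean))
```

Note that strip() only touches the two ends of the string. Whitespace inside a multi-line docstring is left as it is.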

# nlp4.py
from __future__ import print_function, division
from nltk.data import load

lines = """This is the first sentence. Dr. Brown gave a speech.
Finally, he praised Python! At 8 o'clock, he went home."""

# Explicitly load the pre-trained Punkt sentence tokenizer from nltk_data.
tok = load("tokenizers/punkt/english.pickle")
print("DocString:\n", tok.tokenize.__doc__.strip())
A = tok.tokenize(lines)  # list of sentence strings

print('type(A)=', type(A))
for i, j in enumerate(A):
    print(i, ': ', j)

# DocString:
#  Given a text, returns a list of the sentences in that text.
# type(A)= <type 'list'>
# 0 :  This is the first sentence.
# 1 :  Dr. Brown gave a speech.
# 2 :  Finally, he praised Python!
# 3 :  At 8 o'clock, he went home.
