Usually we want to break a text into words, and that is exactly what nltk.tokenize.word_tokenize does. As we can see from its DocString, it relies on sentence tokenizing as well. Instead of using enumerate, we can always iterate over the indices with range(len(...)); an enumerate version is sketched after the output below.
# nlp5.py
from __future__ import print_function, division
from nltk.tokenize import word_tokenize
lines = """This is the first sentence. Dr. Brown gave a speech.
Finally, he praised Python! At 8 o'clock, he went home."""
A = word_tokenize(lines)
print("DocString for %s:\n%s" % ("word_tokenize",
word_tokenize.__doc__.strip()))
for i in range(len(A)):
print(i,A[i])
# DocString for word_tokenize:
# Return a tokenized copy of *text*,
# using NLTK's recommended word tokenizer
# (currently :class:`.TreebankWordTokenizer`
# along with :class:`.PunktSentenceTokenizer`).
# 0 This
# 1 is
# 2 the
# 3 first
# 4 sentence
# 5 .
# 6 Dr.
# 7 Brown
# 8 gave
# 9 a
# 10 speech
# 11 .
# 12 Finally
# 13 ,
# 14 he
# 15 praised
# 16 Python
# 17 !
# 18 At
# 19 8
# 20 o'clock
# 21 ,
# 22 he
# 23 went
# 24 home
# 25 .
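As mentioned above, enumerate is the usual alternative to looping over indices. A minimal self-contained sketch of the same loop, on a shorter example sentence chosen just for illustration:
# enumerate pairs each token with its index directly
from nltk.tokenize import word_tokenize
A = word_tokenize("Dr. Brown gave a speech.")
for i, token in enumerate(A):
    print(i, token)
# 0 Dr.
# 1 Brown
# 2 gave
# 3 a
# 4 speech
# 5 .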
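Since the DocString mentions PunktSentenceTokenizer, we can also inspect the sentence boundaries directly with nltk.tokenize.sent_tokenize. Here is a minimal sketch on the same text; it assumes the pretrained English Punkt model is available (via nltk.download('punkt')), which knows abbreviations such as "Dr." and therefore does not split the sentence there. The file name nlp5b.py is just a hypothetical companion to the script above.
# nlp5b.py
from __future__ import print_function, division
from nltk.tokenize import sent_tokenize
lines = """This is the first sentence. Dr. Brown gave a speech.
Finally, he praised Python! At 8 o'clock, he went home."""
# Expected output with the standard English Punkt model:
for i, sentence in enumerate(sent_tokenize(lines)):
    print(i, sentence)
# 0 This is the first sentence.
# 1 Dr. Brown gave a speech.
# 2 Finally, he praised Python!
# 3 At 8 o'clock, he went home.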