Monday, May 25, 2015

nlp5. Word Tokenization in Python NLTK

Usually, we want a text to be broken into words, which is done by nltk.tokenize.word_tokenize.

As the docstring shows, it relies on a sentence tokenizer as well.

Instead of using enumerate, we can iterate over the indices to print each token with its position.

from __future__ import print_function, division
from nltk.tokenize import word_tokenize
lines = """This is the first sentence. Dr. Brown gave a speech.
Finally, he praised Python! At 8 o'clock, he went home.""" 

A = word_tokenize(lines)
# Show the docstring, then each token with its index
print("DocString for %s:\n%s" % ("word_tokenize", word_tokenize.__doc__))
for i in range(len(A)):
    print(i, A[i])

#    DocString for word_tokenize:
#    Return a tokenized copy of *text*,
#        using NLTK's recommended word tokenizer
#        (currently :class:`.TreebankWordTokenizer`
#        along with :class:`.PunktSentenceTokenizer`).
#    0 This
#    1 is
#    2 the
#    3 first
#    4 sentence
#    5 .
#    6 Dr.
#    7 Brown
#    8 gave
#    9 a
#    10 speech
#    11 .
#    12 Finally
#    13 ,
#    14 he
#    15 praised
#    16 Python
#    17 !
#    18 At
#    19 8
#    20 o'clock
#    21 ,
#    22 he
#    23 went
#    24 home
#    25 .
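
For comparison, here is a minimal sketch of the same numbered listing written with enumerate. The token list is an assumption for illustration: it is copied by hand from the first lines of the output above, so the snippet runs without NLTK or its data files. The names tokens, by_index and by_enumerate are made up for this example.

```python
from __future__ import print_function

# Tokens copied by hand from the first lines of the output above
# (hard-coded so the sketch does not depend on NLTK's data files).
tokens = ["This", "is", "the", "first", "sentence", "."]

# Index-based loop, in the style used in the post.
by_index = ["%d %s" % (i, tokens[i]) for i in range(len(tokens))]

# Equivalent loop with enumerate, which yields (index, item) pairs.
by_enumerate = ["%d %s" % (i, tok) for i, tok in enumerate(tokens)]

for line in by_enumerate:
    print(line)
```

Both loops build the same lines; enumerate just avoids indexing into the list by hand.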

