Monday, May 25, 2015

nlp7. Stop word removal in Python NLTK

The function nltk.corpus.stopwords.words returns a list of stop words (127 for English in the NLTK version used here). These are common words that usually add little to the meaning of a sentence, although it is always possible to find exceptions.


The list is stored in S. If too many words are being filtered out, try shortening the stoplist.

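# nlp7.py
# Needs the NLTK "stopwords" corpus and tokenizer data (see nltk.download).
from __future__ import print_function, division
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
lines = """Dr. Brown gave a speech. I'd wish I knew then
what I know now. Finally, he praised Python! At 8 o'clock,
he went home."""
S = stopwords.words("english")    # the English stoplist
t = '\t'
A = word_tokenize(lines.lower())  # lowercase first: the stoplist is lowercase
for a in A:
    if a not in S:                # print only tokens that are not stop words
        print(t, a)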

#    dr.
#    brown
#    gave
#    speech
#    .
#    'd
#    wish
#    knew
#    know
#    .
#    finally
#    ,
#    praised
#    python
#    !
#    8
#    o'clock
#    ,
#    went
#    home
#    .
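
Shortening the stoplist can be sketched as follows. This is only a minimal illustration: SAMPLE_STOPLIST is a small hand-written stand-in (an assumption, not the real NLTK list), and the helper names shorten_stoplist and remove_stop_words are made up for this example. In practice the stoplist would come from stopwords.words("english").

```python
# Hypothetical stand-in for a real stoplist, so the example runs without NLTK data.
SAMPLE_STOPLIST = ["a", "i", "then", "what", "now", "he", "at"]

def shorten_stoplist(stoplist, keep):
    """Return the stoplist as a set, minus the words we want to keep in the text."""
    return set(stoplist) - set(keep)

def remove_stop_words(tokens, stoplist):
    """Return the tokens that are not in the stoplist, in their original order."""
    return [tok for tok in tokens if tok not in stoplist]

# Already-tokenized, lowercased sample text.
tokens = ["i", "wish", "i", "knew", "then", "what", "i", "know", "now"]

full = set(SAMPLE_STOPLIST)
short = shorten_stoplist(SAMPLE_STOPLIST, keep=["then", "now"])

print(remove_stop_words(tokens, full))   # ['wish', 'knew', 'know']
print(remove_stop_words(tokens, short))  # ['wish', 'knew', 'then', 'know', 'now']
```

With the full sample stoplist the contrast between "then" and "now" is lost. After removing those two words from the stoplist, they survive the filtering.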
