Wednesday, May 27, 2015

nlp16. Bigrams and Trigrams in Python NLTK

A bigram is a sequence of two contiguous words. A trigram is a sequence of three contiguous words.
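As a minimal sketch of the definition, independent of NLTK, the n-grams of a token list can be built by zipping the list with shifted copies of itself. The helper name simple_ngrams is made up for this illustration and is not part of any library.

```python
def simple_ngrams(tokens, n):
    """Return all runs of n contiguous tokens as a list of tuples.

    Hypothetical helper for illustration; not an NLTK function.
    """
    # zip stops at the shortest slice, so only complete n-grams are produced
    return list(zip(*(tokens[i:] for i in range(n))))


tokens = ["human", "language", "data", "processing"]
bigrams = simple_ngrams(tokens, 2)
trigrams = simple_ngrams(tokens, 3)
print(bigrams)
print(trigrams)
```

A list of k tokens yields k - n + 1 n-grams, so the four tokens here give three bigrams and two trigrams.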


We can use nltk.collocations.ngrams to create n-grams (in recent NLTK releases the function itself is defined in nltk.util). Depending on the n parameter, we get bigrams (n=2), trigrams (n=3), or n-grams of any other size. The function returns a generator object, which can be turned into a list if needed, for example A = list(A).
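Because the result is a generator, it can be iterated over only once. The sketch below shows this with a stand-in generator function, lazy_ngrams, a hypothetical name written here so that the example runs without NLTK. It is not NLTK's implementation.

```python
def lazy_ngrams(tokens, n):
    """Yield n-grams one at a time (hypothetical stand-in, not NLTK's code)."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


words = ["leading", "platform", "building", "Python"]

A = lazy_ngrams(words, 2)
first_pass = [a for a in A]    # consumes the generator
second_pass = [a for a in A]   # nothing left to iterate over

A = list(lazy_ngrams(words, 2))  # materialize once, reuse freely
print(first_pass)
print(second_pass)
print(len(A), A[0])
```

Converting to a list is worthwhile when the n-grams are needed more than once, or when indexing or len() is required.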


The first sentence of the text is from the NLTK website; the second is a made-up sentence. After tokenizing into words, we first remove stop words and then remove any remaining token shorter than three characters. Note that the numbers 2 and 30 are dropped by this length filter.

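# nlp16.py
from __future__ import print_function
from nltk.collocations import ngrams
# Note: PunktWordTokenizer was removed from later NLTK releases;
# this listing needs a version that still ships it.
from nltk.tokenize import PunktWordTokenizer
from nltk.corpus import stopwords
text = """
NLTK is a leading platform for building Python
programs to work with human language data. It's
considered the best by 2 out of 30 computational
linguists.
"""
# English stop words (requires the NLTK stopwords corpus)
S = stopwords.words('english')
tok = PunktWordTokenizer().tokenize
# Drop stop words (the comparison is case-sensitive)
words = [w for w in tok(text) if w not in S]
# Drop tokens shorter than three characters, e.g. "2" and "30"
words = [w for w in words if len(w) > 2]
# Each call returns a generator, so it can be iterated only once
A = ngrams(words, 2)  # bigrams
B = ngrams(words, 3)  # trigrams
print("Bigrams")
for a in A:
    print(a)
print()
print("Trigrams")
for b in B:
    print(b)
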
#    Bigrams
#    ('NLTK', 'leading')
#    ('leading', 'platform')
#    ('platform', 'building')
#    ('building', 'Python')
#    ('Python', 'programs')
#    ('programs', 'work')
#    ('work', 'human')
#    ('human', 'language')
#    ('language', 'data.')
#    ('data.', 'considered')
#    ('considered', 'best')
#    ('best', 'computational')
#    ('computational', 'linguists.')
#    
#    Trigrams
#    ('NLTK', 'leading', 'platform')
#    ('leading', 'platform', 'building')
#    ('platform', 'building', 'Python')
#    ('building', 'Python', 'programs')
#    ('Python', 'programs', 'work')
#    ('programs', 'work', 'human')
#    ('work', 'human', 'language')
#    ('human', 'language', 'data.')
#    ('language', 'data.', 'considered')
#    ('data.', 'considered', 'best')
#    ('considered', 'best', 'computational')
#    ('best', 'computational', 'linguists.')
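
The effect of the two filters in the listing can be checked in isolation. The sketch below is only an approximation: the tokens are written out by hand rather than produced by a tokenizer, and STOP is a small hypothetical stop list standing in for stopwords.words('english').

```python
# Hand-written tokens for the second sentence (assumption: no tokenizer is run)
tokens = ["considered", "the", "best", "by", "2", "out", "of",
          "30", "computational", "linguists."]

# Hypothetical stand-in for NLTK's English stop word list
STOP = {"the", "by", "out", "of"}

# Step 1: remove stop words
no_stop = [w for w in tokens if w not in STOP]
# Step 2: keep only tokens with at least three characters
kept = [w for w in no_stop if len(w) > 2]
dropped_by_length = [w for w in no_stop if len(w) <= 2]

print(kept)
print(dropped_by_length)
```

The numbers survive the stop word filter and are removed only by the length filter, which is why 'best' and 'computational' end up adjacent in the output above.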
