BioRpy: nlp16. Bigrams and Trigrams in Python NLTK

Bigrams are sequences of 2 contiguous words. Trigrams are sequences of 3 contiguous words.

We can use nltk.collocations.ngrams to create n-grams. Depending on the n parameter, we get bigrams, trigrams, or n-grams of any other size. The function returns a generator object, which can be converted to a list if needed, for example A = list(A).
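To make the sliding-window idea concrete, here is a minimal pure-Python sketch. The helper simple_ngrams is hypothetical and written only for illustration; it is not NLTK's implementation, but it is also a generator, so it shows why converting to a list can be useful.

```python
def simple_ngrams(tokens, n):
    """Yield each run of n contiguous tokens as a tuple.

    Hypothetical helper for illustration only; not NLTK's code.
    """
    tokens = list(tokens)
    # Slide a window of width n across the token list.
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


words = ["NLTK", "leading", "platform", "building"]

gen = simple_ngrams(words, 2)
bigrams = list(gen)    # consume the generator once and keep the results
leftover = list(gen)   # the generator is now exhausted, so this is empty
trigrams = list(simple_ngrams(words, 3))

print(bigrams)
print(leftover)
print(trigrams)
```

A generator can only be iterated once, so storing the result in a list is the simplest option when the n-grams are needed more than one time.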

The first sentence of the text is from the NLTK website. The second is a made-up sentence. After tokenizing into words, we first filter out stop words and then drop any remaining word shorter than 3 characters. Note that the numbers are removed by this length filter.

# nlp16.py
from __future__ import print_function
from nltk.collocations import ngrams
from nltk.tokenize import PunktWordTokenizer
from nltk.corpus import stopwords
text = """
NLTK is a leading platform for building Python
programs to work with human language data. It's
considered the best by 2 out of 30 computational
linguists.
"""
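# English stop-word list (requires the NLTK 'stopwords' corpus to be downloaded)
S = stopwords.words('english')
# note: PunktWordTokenizer may not be available in newer NLTK releases
tok = PunktWordTokenizer().tokenize
# drop stop words (case-sensitive match), then words shorter than 3 characters
words = [w for w in tok(text) if w not in S]
words = [w for w in words if len(w)>2]
# ngrams() returns generators: bigrams (n=2) and trigrams (n=3)
A = ngrams(words,2)
B = ngrams(words,3)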
print("Bigrams")
for a in A:
    print(a)
print()
print("Trigrams")
for b in B:
    print(b)
    
#    Bigrams
#    ('NLTK', 'leading')
#    ('leading', 'platform')
#    ('platform', 'building')
#    ('building', 'Python')
#    ('Python', 'programs')
#    ('programs', 'work')
#    ('work', 'human')
#    ('human', 'language')
#    ('language', 'data.')
#    ('data.', 'considered')
#    ('considered', 'best')
#    ('best', 'computational')
#    ('computational', 'linguists.')
#    
#    Trigrams
#    ('NLTK', 'leading', 'platform')
#    ('leading', 'platform', 'building')
#    ('platform', 'building', 'Python')
#    ('building', 'Python', 'programs')
#    ('Python', 'programs', 'work')
#    ('programs', 'work', 'human')
#    ('work', 'human', 'language')
#    ('human', 'language', 'data.')
#    ('language', 'data.', 'considered')
#    ('data.', 'considered', 'best')
#    ('considered', 'best', 'computational')
#    ('best', 'computational', 'linguists.')
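
The output goes straight from 'best' to 'computational' because of the two filtering passes. The sketch below replays them on part of the second sentence, assuming a tiny hypothetical stop-word set (NLTK's English list is much longer) and a hand-written token list instead of a real tokenizer.

```python
# Hypothetical, abbreviated stop-word set; NLTK's English list is much longer.
STOP = {"the", "by", "out", "of"}

# Hand-written tokens standing in for the tokenizer output (an assumption).
tokens = ["considered", "the", "best", "by", "2", "out", "of", "30", "computational"]

# Pass 1: remove stop words.
no_stop = [w for w in tokens if w not in STOP]
# Pass 2: remove anything shorter than 3 characters, which also drops "2" and "30".
kept = [w for w in no_stop if len(w) > 2]

print(no_stop)
print(kept)
```

The numbers survive the stop-word pass and are only removed by the length check, since "2" and "30" are 1 and 2 characters long.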

Wednesday, May 27, 2015
