Bigrams are sequences of two contiguous words. Trigrams are sequences of three contiguous words.
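We can use nltk.collocations.ngrams to create n-grams (the same function is also available as nltk.util.ngrams). Depending on the n parameter, we get bigrams, trigrams, or n-grams of any other length. The function returns a generator object, so it can be iterated only once; to keep the results, create a list from it, for example A = list(A).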
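To make the role of n and the generator behaviour concrete, here is a minimal pure-Python sketch. The helper simple_ngrams is a hypothetical stand-in written for illustration, not NLTK's implementation; it only assumes the input is a list of tokens.

```python
def simple_ngrams(tokens, n):
    """Yield each run of n contiguous tokens as a tuple (illustrative helper)."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


tokens = ["NLTK", "leading", "platform", "building"]

bigrams = simple_ngrams(tokens, 2)
bigram_list = list(bigrams)   # materialize the generator so it can be reused
leftover = list(bigrams)      # the generator is already exhausted here

trigram_list = list(simple_ngrams(tokens, 3))

print(bigram_list)
print(leftover)
print(trigram_list)
```

Iterating the same generator a second time yields nothing, which is why converting to a list is useful when the n-grams are needed more than once.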
The first sentence of the text is from the NLTK website. The second sentence is made up. After tokenizing into words, we first filter out stop words, and then drop any remaining token shorter than three characters. Note that the numbers 2 and 30 are removed by this length filter.
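The two filtering steps can be sketched in isolation, without NLTK, to show why the numbers disappear. This is only an illustration under stated assumptions: STOP is a tiny hypothetical stop-word set (the real list comes from stopwords.words('english')), and str.split() stands in for the tokenizer.

```python
# Hypothetical, tiny stop-word set used only for this sketch.
STOP = {"is", "a", "for", "to", "with", "the", "by", "out", "of"}

# str.split() stands in for the tokenizer here.
tokens = "considered the best by 2 out of 30 computational linguists".split()

# Step 1: drop stop words. Step 2: drop tokens shorter than 3 characters.
no_stops = [w for w in tokens if w not in STOP]
kept = [w for w in no_stops if len(w) > 2]

print(no_stops)
print(kept)
```

The numbers survive the stop-word filter but are dropped by the length check, since "2" and "30" are both shorter than three characters.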
# nlp16.py
from __future__ import print_function
from nltk.collocations import ngrams
# Note: PunktWordTokenizer is available only in older NLTK releases (it was removed in NLTK 3).
from nltk.tokenize import PunktWordTokenizer
from nltk.corpus import stopwords
text = """
NLTK is a leading platform for building Python
programs to work with human language data. It's
considered the best by 2 out of 30 computational
linguists.
"""
S = stopwords.words('english')
tok = PunktWordTokenizer().tokenize
# Drop stop words, then drop tokens shorter than 3 characters
words = [w for w in tok(text) if w not in S]
words = [w for w in words if len(w) > 2]
# ngrams returns a generator for the given n
A = ngrams(words, 2)
B = ngrams(words, 3)
print("Bigrams")
for a in A:
    print(a)
print()
print("Trigrams")
for b in B:
    print(b)
# Bigrams
# ('NLTK', 'leading')
# ('leading', 'platform')
# ('platform', 'building')
# ('building', 'Python')
# ('Python', 'programs')
# ('programs', 'work')
# ('work', 'human')
# ('human', 'language')
# ('language', 'data.')
# ('data.', 'considered')
# ('considered', 'best')
# ('best', 'computational')
# ('computational', 'linguists.')
#
# Trigrams
# ('NLTK', 'leading', 'platform')
# ('leading', 'platform', 'building')
# ('platform', 'building', 'Python')
# ('building', 'Python', 'programs')
# ('Python', 'programs', 'work')
# ('programs', 'work', 'human')
# ('work', 'human', 'language')
# ('human', 'language', 'data.')
# ('language', 'data.', 'considered')
# ('data.', 'considered', 'best')
# ('considered', 'best', 'computational')
# ('best', 'computational', 'linguists.')