Monday, June 1, 2015

nlp31. Reading text with PlaintextCorpusReader in Python NLTK

Rather than creating our own corpus, it is instructive to read one of the corpora already installed in nltk_data/corpora.


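The nltk.corpus module defines the variable movie_reviews, which we can use once we import it (from nltk.corpus import movie_reviews). Here we read the same files directly with PlaintextCorpusReader instead.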


The corpus holds 2000 reviews, 1000 positive and 1000 negative. The two categories are stored in separate folders, pos and neg.

# nlp31.py

from __future__ import print_function, division
import nltk.data
from nltk.corpus.reader import PlaintextCorpusReader

# Locate the installed corpus by searching every directory in nltk.data.path,
# rather than assuming it sits under the first entry of that list.
root = nltk.data.find('corpora/movie_reviews')

# Read every .txt file under root (the dot before txt is escaped in the regex)
mov_rev = PlaintextCorpusReader(root, r'.*\.txt')
fils = mov_rev.fileids()
print('len(fils) =', len(fils))
print('First 10 files')
for fil in fils[:10]:
    print(fil)
# Each file id begins with its category folder, pos/ or neg/
nPos = len([fil for fil in fils if fil.startswith('pos')])
nNeg = len([fil for fil in fils if fil.startswith('neg')])
print("pos - %d, neg - %d" % (nPos, nNeg))
print("First 10 words of 1st review")
words = mov_rev.words(fils[0])
for word in words[:10]:
    print(word, end=' ')

#len(fils) = 2000
#First 10 files
#neg/cv000_29416.txt
#neg/cv001_19502.txt
#neg/cv002_17424.txt
#neg/cv003_12683.txt
#neg/cv004_12641.txt
#neg/cv005_29357.txt
#neg/cv006_17022.txt
#neg/cv007_4992.txt
#neg/cv008_29326.txt
#neg/cv009_29417.txt
#pos - 1000, neg - 1000
#First 10 words of 1st review
#plot : two teen couples go to a church party
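
Since each file id starts with its category folder, the counting step can also be written without hard-coding 'pos' and 'neg'. The following is only a minimal sketch using the standard library: the helper names category_of and count_by_category are made up for illustration, and the pos/ file id in the sample list is a placeholder, not a real file from the corpus.

```python
from collections import Counter


def category_of(fileid):
    """Return the top-level folder of a file id such as 'neg/cv000_29416.txt'."""
    return fileid.split('/', 1)[0]


def count_by_category(fileids):
    """Count how many file ids fall under each top-level folder."""
    return Counter(category_of(fileid) for fileid in fileids)


# Two file ids taken from the listing above, plus one hypothetical pos/ id.
sample_ids = [
    'neg/cv000_29416.txt',
    'neg/cv001_19502.txt',
    'pos/example_review.txt',  # placeholder name, assumed for illustration
]

counts = count_by_category(sample_ids)
for category in sorted(counts):
    print(category, '-', counts[category])
```

Passing mov_rev.fileids() in place of sample_ids should give the same totals as nPos and nNeg, assuming every file id has the folder/filename form shown in the output.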
