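PunktWordTokenizer and WordPunctTokenizer produce different tokens for contractions such as I'll: PunktWordTokenizer keeps the apostrophe attached to the suffix ('ll), while WordPunctTokenizer splits the apostrophe off as a token of its own (' and ll).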
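As a rough illustration of where that split comes from: WordPunctTokenizer is documented as a regular-expression tokenizer built on the pattern \w+|[^\w\s]+, so an apostrophe always ends up as a separate token. The sketch below applies that pattern with Python's re module instead of NLTK. The helper name word_punct_split is made up for this example and is not part of NLTK.

```python
import re

# Pattern WordPunctTokenizer is documented to use: runs of word characters,
# or runs of characters that are neither word characters nor whitespace.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def word_punct_split(text):
    """Hypothetical helper that mimics WordPunctTokenizer with plain re."""
    return WORD_PUNCT.findall(text)

tokens = word_punct_split("I'll finish my project today.")
print(tokens)
# ['I', "'", 'll', 'finish', 'my', 'project', 'today', '.']
```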
The script below tokenizes one sentence with three word tokenizers and stores the resulting lists in B1, B2 and B3, which have different lengths. To print every row up to the last index of the longest list, each print sits in a try clause, so a token is printed only if that index exists. Since each clause holds a single statement, that statement can go on the same line, right after the colon.
# nlp10.py
from __future__ import print_function, division
from nltk.tokenize import (PunktWordTokenizer,
                           WordPunctTokenizer, word_tokenize)
A = "I'll finish my project today."
PWT = PunktWordTokenizer()
WPT = WordPunctTokenizer()
w = word_tokenize
B1 = PWT.tokenize(A)
B2 = WPT.tokenize(A)
B3 = w(A)
L1,L2,L3 = len(B1),len(B2),len(B3)
print('B1\tB2\tB3')
for i in range(max(L1,L2,L3)):
    # print token i if the list is long enough, otherwise only the tab
    try: print(B1[i],end='\t')
    except IndexError: print(end='\t')
    try: print(B2[i],end='\t')
    except IndexError: print(end='\t')
    try: print(B3[i],end='\t')
    except IndexError: print(end='\t')
    print()  # end the current row
# B1       B2       B3
# I        I        I
# 'll      '        'll
# finish   ll       finish
# my       finish   my
# project  my       project
# today.   project  today
#          today    .
#          .
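The try clauses can also be avoided by padding the shorter lists. This is a minimal sketch of that alternative, not the script's own method, and it assumes Python 3 and itertools.zip_longest from the standard library. The token lists are typed in by hand, following the output above, so that the snippet runs without NLTK.

```python
from itertools import zip_longest

# Token lists typed in by hand (not computed here).
# B2 includes the final '.' that WordPunctTokenizer emits for this sentence.
B1 = ["I", "'ll", "finish", "my", "project", "today."]
B2 = ["I", "'", "ll", "finish", "my", "project", "today", "."]
B3 = ["I", "'ll", "finish", "my", "project", "today", "."]

# zip_longest pads the shorter lists with fillvalue, so no index check is needed.
rows = list(zip_longest(B1, B2, B3, fillvalue=""))

print("B1\tB2\tB3")
for row in rows:
    print("\t".join(row))
```

The loop runs once per row of the longest list, and the empty-string fill value plays the role of the bare tab printed by the except clauses.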
Works only on nltk ver 3.3