Tuesday, May 26, 2015

nlp10. PunktWordTokenizer and WordPunctTokenizer in Python NLTK

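PunktWordTokenizer and WordPunctTokenizer produce different tokens for contractions such as I'll: the former keeps 'll as one token, while the latter splits off the apostrophe as a token of its own.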
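
The way WordPunctTokenizer splits I'll can be reproduced without NLTK. This is only a minimal sketch, assuming the tokenizer behaves like the regular expression \w+|[^\w\s]+ (a run of word characters, or a run of punctuation). The helper name word_punct_tokens is made up for this illustration and is not part of NLTK.

```python
import re

# Assumption: WordPunctTokenizer matches either a run of word characters
# or a run of characters that are neither word characters nor whitespace.
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")

def word_punct_tokens(text):
    """Hypothetical stand-in for WordPunctTokenizer().tokenize(text)."""
    return WORD_PUNCT.findall(text)

tokens = word_punct_tokens("I'll finish my project today.")
print(tokens)
# ['I', "'", 'll', 'finish', 'my', 'project', 'today', '.']
```

The apostrophe is not a word character, so it ends the match for I and becomes a separate token before ll.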


The same line is tokenized with three different word tokenizers, and the resulting lists, B1, B2, and B3, have different lengths. To print every row up to the length of the longest list, each column is wrapped in a try clause that prints the token only if that index exists. Since each clause contains a single statement, that statement can go on the same line, right after the colon.

# nlp10.py
from __future__ import print_function, division
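# Note: PunktWordTokenizer was removed from later NLTK 3.x releases,
# so this import needs an older NLTK version.
from nltk.tokenize import (PunktWordTokenizer,
                           WordPunctTokenizer, word_tokenize)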
A = "I'll finish my project today."
PWT = PunktWordTokenizer()
WPT = WordPunctTokenizer()
w = word_tokenize
B1 = PWT.tokenize(A)
B2 = WPT.tokenize(A)
B3 = w(A)
L1,L2,L3 = len(B1),len(B2),len(B3)
print('B1\tB2\tB3')
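for i in range(max(L1,L2,L3)):
    # Print the token if index i exists; otherwise print an empty cell.
    try: print(B1[i],end='\t')
    except IndexError: print(end='\t')
    try: print(B2[i],end='\t')
    except IndexError: print(end='\t')
    try: print(B3[i],end='\t')
    except IndexError: print(end='\t')
    print()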

#    B1      B2      B3
#    I       I       I       
#    'll     '       'll     
#    finish  ll      finish  
#    my      finish  my      
#    project my      project 
#    today.  project today   
#            today   .
#            .
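
The try clauses can also be avoided altogether. The following is a sketch only, assuming Python 3 (where itertools.zip_longest is available; Python 2 calls it izip_longest). The token lists are hard-coded as the three tokenizers would return them for the sample sentence, so the snippet runs without NLTK, and the helper name side_by_side is made up for this illustration.

```python
from itertools import zip_longest

def side_by_side(*columns, headers=None):
    """Return tab-separated rows; shorter columns are padded with ''."""
    rows = []
    if headers:
        rows.append('\t'.join(headers))
    # zip_longest keeps going until the longest column is exhausted
    # and fills the missing cells with fillvalue.
    for row in zip_longest(*columns, fillvalue=''):
        rows.append('\t'.join(row))
    return rows

# Hard-coded token lists for "I'll finish my project today."
B1 = ['I', "'ll", 'finish', 'my', 'project', 'today.']           # PunktWordTokenizer
B2 = ['I', "'", 'll', 'finish', 'my', 'project', 'today', '.']   # WordPunctTokenizer
B3 = ['I', "'ll", 'finish', 'my', 'project', 'today', '.']       # word_tokenize

table = side_by_side(B1, B2, B3, headers=('B1', 'B2', 'B3'))
print('\n'.join(table))
```

Compared with the try clauses, this removes the need to compute the maximum length by hand, at the cost of one extra import.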
