Tuesday, May 26, 2015

nlp11. RegexpTokenizer in Python NLTK

We can use RegexpTokenizer to write our own tokenizers.


The example string here alternates between words and numbers. The regular expression splits it into word tokens and number tokens. A period (.) is treated as part of a number, so 2.2 stays together as one token.


Only letters, apostrophes, digits, and periods are selected as tokens. Thus ? or ! will not be selected.

# nlp11.py
from __future__ import print_function, division
from nltk.tokenize import RegexpTokenizer
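A = "I'll3finish45my987project2.2today!3a"
# Match runs of letters/apostrophes, or runs of digits/periods.
# Note A-Z, not A-z: the range A-z would also match [ \ ] ^ _ and `.
tok = RegexpTokenizer(r"[a-zA-Z']+|[0-9.]+")
B = tok.tokenize(A)
for b in B:
    print('\t' + b)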
#        I'll
#        3
#        finish
#        45
#        my
#        987
#        project
#        2.2
#        today
#        3
#        a
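
The matching behaviour can also be checked without NLTK. The sketch below assumes that RegexpTokenizer, in its default mode, simply returns every match of the pattern from left to right, which is what re.findall does. The helper name tokenize is made up for this illustration. The letter range is written as A-Z, since A-z would also cover a few punctuation characters such as the underscore.

```python
import re

# Letters/apostrophes form one kind of token, digits/periods the other.
PATTERN = re.compile(r"[a-zA-Z']+|[0-9.]+")

def tokenize(text):
    """Return every non-overlapping match of PATTERN, left to right."""
    return PATTERN.findall(text)

print(tokenize("I'll3finish45my987project2.2today!3a"))

# A period on its own still matches the "number" branch; ? is dropped.
print(tokenize("Done. Really?"))

# Why the range matters: A-z spans the underscore, A-Z does not.
print(re.findall(r"[a-zA-z']+", "foo_bar"))  # one token
print(re.findall(r"[a-zA-Z']+", "foo_bar"))  # two tokens
```

Because the number branch is just [0-9.]+, a lone period or something like 1.2.3 also comes out as a single "number" token. A stricter pattern would be needed if that matters.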
