We can use RegexpTokenizer to write our own tokenizers.
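Our sample sentence alternates between numbers and words, with no spaces in between. The regular expression separates the two: a run of letters and apostrophes becomes a word token, and a run of digits and periods becomes a number token, so a period (.) is treated as part of a number.
Only letters, apostrophes, digits, and periods are matched. Any other character, such as ? or !, is skipped and never appears in the output.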
# nlp11.py
from __future__ import print_function, division
from nltk.tokenize import RegexpTokenizer
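A = "I'll3finish45my987project2.2today!3a"
# Words: letters and apostrophes. Numbers: digits and periods.
# Note A-Z, not A-z: the range A-z would also match [ \ ] ^ _ and `.
tok = RegexpTokenizer(r"[a-zA-Z']+|[0-9.]+")
B = tok.tokenize(A)
for b in B:
    print('\t' + b)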
# I'll
# 3
# finish
# 45
# my
# 987
# project
# 2.2
# today
# 3
# a
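To see what the pattern does without NLTK, here is a minimal sketch using only the standard re module. It assumes that RegexpTokenizer in its default (non-gap) mode returns the pattern's matches in order, much as re.findall does. The names LOOSE and STRICT are hypothetical, and STRICT is an illustrative variant that is not part of the original script: it accepts a period only between digits.

```python
import re

# Same alternation as the tokenizer pattern: letters/apostrophes, or digits/periods.
LOOSE = r"[a-zA-Z']+|[0-9.]+"
# Hypothetical stricter variant (an assumption, not from the original post):
# a period counts only when it sits between digits.
STRICT = r"[a-zA-Z']+|[0-9]+(?:\.[0-9]+)?"

text = "I'll3finish45my987project2.2today!3a"
loose_tokens = re.findall(LOOSE, text)
strict_tokens = re.findall(STRICT, text)
print(loose_tokens)

# A period with no digit before it is kept by LOOSE but dropped by STRICT.
print(re.findall(LOOSE, "end.3"))   # ['end', '.3']
print(re.findall(STRICT, "end.3"))  # ['end', '3']
```

On the sample sentence both patterns give the same tokens. They differ only when a period is not surrounded by digits, which is the case the loose pattern still counts as a number.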