Thursday, May 28, 2015

nlp22. Named Entities in Python NLTK

Related to chunking, we can use nltk.chunk.ne_chunk to find named entities such as PERSON, ORGANIZATION, etc. in words that have been tokenized and POS-tagged. The sample text is a passage from the Wikipedia article on NLP.


Note that we do not have to write nltk.chunk.ne_chunk; the shorter nltk.ne_chunk works as well. However, I believe the longer form better shows the structure of NLTK, and it appears only once anyway, in the import statement: from nltk.chunk import ne_chunk, rather than the shorthand from nltk import ne_chunk.


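Also note that only the results that are not tuples are printed; these correspond to named entities. Ordinary tokens come back as (word, tag) tuples, while each named entity comes back as a subtree. In Python, (0,0) and (0,) are tuples, but (0) is just the integer 0, so the trailing comma matters when writing a one-element tuple.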
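The tuple remark can be checked in plain Python without NLTK. This is only a small sketch: the items list is made-up data, and the inner list is a stand-in for a named-entity subtree, not real ne_chunk output.

```python
# (0,) and (0, 0) are tuples; (0) is just the integer 0 in parentheses.
print(type((0,)).__name__)    # tuple
print(type((0, 0)).__name__)  # tuple
print(type((0)).__name__)     # int

# Made-up data: two (word, tag) tuples and one non-tuple item.
# The inner list is only a stand-in for a named-entity subtree.
items = [("The", "DT"), ["stand-in", "subtree"], ("history", "NN")]

# Keep everything that is not a tuple, as the loop in nlp22.py does.
non_tuples = [item for item in items if not isinstance(item, tuple)]
print(non_tuples)  # [['stand-in', 'subtree']]
```

Using isinstance instead of comparing type objects also accepts subclasses of tuple, which is usually what is wanted for this kind of filter.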

# nlp22.py
from __future__ import print_function
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import ne_chunk
text = """The history of NLP generally starts in the
1950s, although work can be found from earlier periods.
In 1950, Alan Turing published an article titled
"Computing Machinery and Intelligence" which proposed
what is now called the Turing test as a criterion of
intelligence.
"""
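# Tokenize, then POS-tag; pos_tag already returns a list of (word, tag) tuples.
tag_text = pos_tag(word_tokenize(text))
result = ne_chunk(tag_text)

# Ordinary tokens are (word, tag) tuples; named entities are subtrees.
for r in result:
    if not isinstance(r, tuple):
        print(r)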

#    (ORGANIZATION NLP/NNP)
#    (PERSON Alan/NNP Turing/NNP)
#    (ORGANIZATION Intelligence/NNP)
#    (GPE Turing/NNP)
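Printing the subtrees is fine for a quick look, but it is often handier to get (label, text) pairs. Below is a minimal sketch of that idea. It relies on the label() and leaves() methods that NLTK 3 trees provide. FakeTree, named_entities, and the sample list are hypothetical names made up for this illustration, so that the snippet runs without NLTK or its models; with real data you would pass the result of ne_chunk instead.

```python
class FakeTree(list):
    """Hypothetical stand-in for an NLTK subtree (not part of NLTK)."""

    def __init__(self, label, leaves):
        super().__init__(leaves)
        self._label = label

    def label(self):
        return self._label

    def leaves(self):
        return list(self)


def named_entities(chunks):
    """Return (label, text) pairs for every non-tuple item in chunks."""
    entities = []
    for node in chunks:
        if isinstance(node, tuple):
            continue  # ordinary (word, tag) token, not a named entity
        words = [word for word, tag in node.leaves()]
        entities.append((node.label(), " ".join(words)))
    return entities


# Hand-written sample shaped like ne_chunk output (assumed, not real output).
sample = [
    ("In", "IN"),
    ("1950", "CD"),
    (",", ","),
    FakeTree("PERSON", [("Alan", "NNP"), ("Turing", "NNP")]),
    ("published", "VBD"),
]

print(named_entities(sample))  # [('PERSON', 'Alan Turing')]
```

With the real pipeline, named_entities(result) should work the same way, since the function only needs each subtree to offer label() and leaves().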
