Thursday, May 28, 2015

nlp19. Part-of-Speech Tagging in Python NLTK

With nltk.tag.pos_tag you can tag each word token of a text with its part of speech. Each token becomes a 2-tuple: the first element is the token, and the second element is its part-of-speech tag.


We also print the help for each tag using nltk.help.upenn_tagset, which describes the Penn Treebank tags and lists example words. Once you are familiar with the tag abbreviations, you will rarely need this info.

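# nlp19.py
from __future__ import print_function
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.help import upenn_tagset as hlp

# Needs the NLTK data for the tokenizer, the tagger and the tagset help;
# fetch anything that is missing with nltk.download().
text = "Python is a great language."

words = word_tokenize(text)          # split the text into word tokens
for tagged_word in pos_tag(words):   # each item is a (token, tag) tuple
    print(tagged_word)
    hlp(tagged_word[1])              # prints the description of the tag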
    
# ('Python', 'NNP')
# NNP: noun, proper, singular
#    Motown Venneboerger Czestochwa Ranzer Conchita
#    Trumplane Christos Oceanside Escobar Kreisler
#    Sawyer Cougar Yvette Ervin ODI Darryl CTCA
#    Shannon A.K.C. Meltex Liverpool ...
# ('is', 'VBZ')
# VBZ: verb, present tense, 3rd person singular
#    bases reconstructs marks mixes displeases
#    seals carps weaves snatches slumps stretches
#    authorizes smolders pictures emerges stockpiles
#    seduces fizzes uses bolsters slaps speaks pleads ...
# ('a', 'DT')
# DT: determiner
#    all an another any both del each either every
#    half la many much nary neither no some such
#    that the them these this those 
# ('great', 'JJ')
# JJ: adjective or numeral, ordinal
#    third ill-mannered pre-war regrettable oiled
#    calamitous first separable ectoplasmic
#    battery-powered participatory fourth
#    still-to-be-named multilingual
#    multi-disciplinary ...
# ('language', 'NN')
# NN: noun, common, singular or mass
#    common-carrier cabbage knuckle-duster Casino
#    afghan shed thermostat investment slide humour
#    falloff slick wind hyena override subhumanity
#    machinist ...
# ('.', '.')
# .: sentence terminator
#    . ! ?
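
Since every item is a plain (token, tag) tuple, you can unpack it directly in a loop. Here is a minimal sketch of that idea. The tagged list is copied by hand from the output above so the snippet runs without NLTK, and tokens_by_tag and nouns are hypothetical helper names, not part of NLTK.

```python
from collections import defaultdict

# Copied from the output above; in a real script this list
# would come from pos_tag(words).
tagged = [("Python", "NNP"), ("is", "VBZ"), ("a", "DT"),
          ("great", "JJ"), ("language", "NN"), (".", ".")]


def tokens_by_tag(tagged_words):
    """Group tokens under their tag (hypothetical helper)."""
    groups = defaultdict(list)
    for token, tag in tagged_words:   # unpack the 2-tuple
        groups[tag].append(token)
    return dict(groups)


def nouns(tagged_words):
    """Keep tokens whose tag starts with NN (NN, NNS, NNP, NNPS)."""
    return [token for token, tag in tagged_words if tag.startswith("NN")]


by_tag = tokens_by_tag(tagged)
print(by_tag["NNP"])
print(nouns(tagged))
```

Matching on the "NN" prefix is just a convenient shortcut for the noun tags in this tagset; for other word classes you would compare against the exact tags you care about.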
