The most common pattern when collecting Tweets from various users is to extract and use specific words contained in them.
This time, we will use the morphological analyzer MeCab to split the text into word units and extract nouns, verbs, and adjectives.
The output format depends on the option:

* 'mecabrc': default format
* '-Ochasen': ChaSen-compatible format
* '-Owakati': word-separated output only
* '-Oyomi': readings only
By default, each word is output in the form:

Surface form\tPart of speech,Part-of-speech subclassification 1,Part-of-speech subclassification 2,Part-of-speech subclassification 3,Conjugation form,Conjugation type,Base form,Reading,Pronunciation
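For a quick look at these formats, here is a minimal sketch (it assumes MeCab and its Python binding are installed; the sentence is just a stock example):

```python
# -*- coding: utf-8 -*-
# Print the same sentence in the default, word-separated, and ChaSen-compatible formats.
import MeCab

text = u"すもももももももものうち".encode('utf-8')

print MeCab.Tagger('mecabrc').parse(text)    # default: surface \t feature string
print MeCab.Tagger('-Owakati').parse(text)   # word-separated output only
print MeCab.Tagger('-Ochasen').parse(text)   # ChaSen-compatible format
```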
Below is a program that splits a sentence into word units (keeping the surface forms as-is) and extracts them in four ways: all words, nouns, verbs, and adjectives.
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import MeCab

### Constants
MECAB_MODE = 'mecabrc'
PARSE_TEXT_ENCODING = 'utf-8'

### Functions
def main():
    # "I want to be the catcher of the rye field. I know it's ridiculous.
    # But that's the only thing I really want to be."
    sample_u = u"ライ麦畑の捕まえ役、そういうものに僕はなりたいんですよ。馬鹿げてることは知ってるよ。でも、ほんとうになりたいものといったらそれしかないね。"
    words_dict = parse(sample_u)
    print "All:", ",".join(words_dict['all'])
    print "Nouns:", ",".join(words_dict['nouns'])
    print "Verbs:", ",".join(words_dict['verbs'])
    print "Adjs:", ",".join(words_dict['adjs'])
    return

def parse(unicode_string):
    tagger = MeCab.Tagger(MECAB_MODE)
    # MeCab misbehaves when given a unicode object, so convert to str first
    text = unicode_string.encode(PARSE_TEXT_ENCODING)
    node = tagger.parseToNode(text)

    words = []
    nouns = []
    verbs = []
    adjs = []
    while node:
        pos = node.feature.split(",")[0]
        # Convert the surface form back to unicode
        word = node.surface.decode("utf-8")
        if pos == "名詞":      # noun
            nouns.append(word)
        elif pos == "動詞":    # verb
            verbs.append(word)
        elif pos == "形容詞":  # adjective
            adjs.append(word)
        words.append(word)
        node = node.next

    parsed_words_dict = {
        "all": words[1:-1],  # drop the empty BOS/EOS entries at both ends
        "nouns": nouns,
        "verbs": verbs,
        "adjs": adjs
    }
    return parsed_words_dict

### Execute
if __name__ == "__main__":
    main()
(twi-py)$ python tweet_parser.py
All: ライ麦,畑,の,捕まえ,役,、,そういう,もの,に,僕,は,なり,たい,ん,です,よ,。,馬鹿げ,てる,こと,は,知っ,てる,よ,。,でも,、,ほんとう,に,なり,たい,もの,と,いっ,たら,それ,しか,ない,ね,。
Nouns: ライ麦,畑,役,もの,僕,ん,こと,ほんとう,もの,それ
Verbs: 捕まえ,なり,馬鹿げ,てる,知っ,てる,なり,いっ
Adjs: ない
Now you can extract words by feeding the retrieved Tweets to parse().
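For example, here is a rough sketch that counts frequent nouns over a batch of Tweets; `tweet_texts` is a hypothetical list of unicode Tweet bodies fetched elsewhere:

```python
from collections import Counter

# Tally nouns across every fetched Tweet (tweet_texts is assumed, not defined here)
noun_counter = Counter()
for tweet_text in tweet_texts:
    words_dict = parse(tweet_text)
    noun_counter.update(words_dict['nouns'])

# Print the ten most frequent nouns
for noun, count in noun_counter.most_common(10):
    print noun, count
```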
In this sample code, I used the surface form from node.surface. If you want to normalize words whose endings change, such as verbs, you can use the base form included in node.feature instead.
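For instance, the word-extraction step inside the while loop could be changed as in this minimal sketch, assuming the default IPAdic feature layout shown above, where the base form is the seventh field:

```python
# Prefer the base form (7th field of node.feature) over the surface form,
# falling back to the surface when no base form is available ("*").
features = node.feature.split(",")
if len(features) > 6 and features[6] != "*":
    word = features[6].decode("utf-8")   # e.g. "知っ" -> "知る"
else:
    word = node.surface.decode("utf-8")
```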