The most common pattern when collecting Tweets from various users is to extract and use specific words contained in them.
This time, we will use the morphological analyzer MeCab to split the text into word units and extract nouns, verbs, and adjectives.
The output format depends on the option:

* 'mecabrc': default format
* '-Ochasen': ChaSen-compatible format
* '-Owakati': word-separated output only
* '-Oyomi': readings only
By default, each word is output in the form:

Surface form\tPart of speech,Part-of-speech subclassification 1,Part-of-speech subclassification 2,Part-of-speech subclassification 3,Conjugation form,Conjugation type,Base form,Reading,Pronunciation
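For a quick look at these formats, here is a minimal sketch (it assumes MeCab and its Python binding are installed; the sentence is just a stock example):

```python
# -*- coding: utf-8 -*-
# Print the same sentence in the default, word-separated, and ChaSen-compatible formats.
import MeCab

text = u"すもももももももものうち".encode('utf-8')

print MeCab.Tagger('mecabrc').parse(text)    # default: surface \t feature string
print MeCab.Tagger('-Owakati').parse(text)   # word-separated output only
print MeCab.Tagger('-Ochasen').parse(text)   # ChaSen-compatible format
```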
Below is a program that splits a sentence into word units (keeping the surface forms as-is) and extracts them in four ways: all words, nouns, verbs, and adjectives.
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import MeCab

### Constants
MECAB_MODE = 'mecabrc'
PARSE_TEXT_ENCODING = 'utf-8'

### Functions
def main():
    # "I want to be the catcher of the rye field. I know it's ridiculous.
    # But that's the only thing I really want to be."
    sample_u = u"ライ麦畑の捕まえ役、そういうものに僕はなりたいんですよ。馬鹿げてることは知ってるよ。でも、ほんとうになりたいものといったらそれしかないね。"
    words_dict = parse(sample_u)
    print "All:", ",".join(words_dict['all'])
    print "Nouns:", ",".join(words_dict['nouns'])
    print "Verbs:", ",".join(words_dict['verbs'])
    print "Adjs:", ",".join(words_dict['adjs'])
    return

def parse(unicode_string):
    tagger = MeCab.Tagger(MECAB_MODE)
    # MeCab misbehaves when given a unicode object, so convert to str first
    text = unicode_string.encode(PARSE_TEXT_ENCODING)
    node = tagger.parseToNode(text)

    words = []
    nouns = []
    verbs = []
    adjs = []
    while node:
        pos = node.feature.split(",")[0]
        # Convert the surface form back to unicode
        word = node.surface.decode("utf-8")
        if pos == "名詞":      # noun
            nouns.append(word)
        elif pos == "動詞":    # verb
            verbs.append(word)
        elif pos == "形容詞":  # adjective
            adjs.append(word)
        words.append(word)
        node = node.next

    parsed_words_dict = {
        "all": words[1:-1],  # drop the empty BOS/EOS entries at both ends
        "nouns": nouns,
        "verbs": verbs,
        "adjs": adjs
    }
    return parsed_words_dict

### Execute
if __name__ == "__main__":
    main()
(twi-py)$ python tweet_parser.py
All: ライ麦,畑,の,捕まえ,役,、,そういう,もの,に,僕,は,なり,たい,ん,です,よ,。,馬鹿げ,てる,こと,は,知っ,てる,よ,。,でも,、,ほんとう,に,なり,たい,もの,と,いっ,たら,それ,しか,ない,ね,。
Nouns: ライ麦,畑,役,もの,僕,ん,こと,ほんとう,もの,それ
Verbs: 捕まえ,なり,馬鹿げ,てる,知っ,てる,なり,いっ
Adjs: ない
Now you can extract words by feeding the retrieved Tweets to parse().
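For example, here is a rough sketch that counts frequent nouns over a batch of Tweets; `tweet_texts` is a hypothetical list of unicode Tweet bodies fetched elsewhere:

```python
from collections import Counter

# Tally nouns across every fetched Tweet (tweet_texts is assumed, not defined here)
noun_counter = Counter()
for tweet_text in tweet_texts:
    words_dict = parse(tweet_text)
    noun_counter.update(words_dict['nouns'])

# Print the ten most frequent nouns
for noun, count in noun_counter.most_common(10):
    print noun, count
```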
In this sample code, I used the surface form from node.surface. If you want to normalize words whose endings change, such as verbs, you can use the base form included in node.feature instead.
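For instance, the word-extraction step inside the while loop could be changed as in this minimal sketch, assuming the default IPAdic feature layout shown above, where the base form is the seventh field:

```python
# Prefer the base form (7th field of node.feature) over the surface form,
# falling back to the surface when no base form is available ("*").
features = node.feature.split(",")
if len(features) > 6 and features[6] != "*":
    word = features[6].decode("utf-8")   # e.g. "知っ" -> "知る"
else:
    word = node.surface.decode("utf-8")
```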