Hi, this is another newcomer post. It's a continuation of last time.
Last time I used plain MeCab, and it didn't work well. When I put that post out into the world as it was, feedback came flying in from all directions (Qiita, Twitter, pretty much everywhere): "You can't handle modern Japanese with just the IPA dictionary — use NEologd." I took that advice head-on.
I also tried `elasticsearch-analysis-kuromoji`, an Elasticsearch plugin that can analyze Japanese.
- NEologd is said to keep up with modern language, but how well does it cope with Twitter, where brand-new words that are only just becoming popular run rampant?
- kuromoji can apparently segment Japanese, but how up to date is it? I also wasn't sure how to strip out replies and hashtags.
Please refer to the previous post for the setup so far.
-Installation of NEologd
Pull it from GitHub:
$ git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
Then run the installer (the `-n` option installs the newest version of the dictionary):
$ ./bin/install-mecab-ipadic-neologd -n
Find the location where NEologd is installed with the following command:
$ echo `mecab-config --dicdir`"/mecab-ipadic-neologd"
Its usage appears in the Python code below.
-Install elasticsearch-analysis-kuromoji
In the directory where Elasticsearch is installed (`$ES_HOME`), run:
$ sudo bin/elasticsearch-plugin install analysis-kuromoji
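If the install succeeded, the plugin should show up in the plugin list:

```shell
# list installed Elasticsearch plugins; analysis-kuromoji should appear
bin/elasticsearch-plugin list
```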
This time we will analyze the `text` field, so submit the following template. (After starting Elasticsearch, of course.)
curl -XPUT --user elastic:changeme localhost:9200/_template/text_analysis?pretty -d '{
  "template": "twitter-*",
  "settings": {
    "analysis": {
      "tokenizer": {
        "kuromoji_user_dict": {
          "type": "kuromoji_tokenizer",
          "mode": "normal"
        }
      }
    }
  },
  "mappings": {
    "twitter": {
      "properties": {
        "text": {
          "type": "text",
          "fielddata": true,
          "analyzer": "kuromoji",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}'
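Before indexing anything, it's worth checking how the analyzer tokenizes a sample sentence. A quick sanity check against the `_analyze` API (assuming the same local instance and X-Pack credentials as above; the sentence is just an example):

```shell
# ask the kuromoji analyzer to tokenize a sample sentence
curl -XGET --user elastic:changeme 'localhost:9200/_analyze?pretty' -d '{
  "analyzer": "kuromoji",
  "text": "星野源が好きです"
}'
```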
`--user elastic:changeme` is how you call curl after installing X-Pack; the defaults are `username=elastic`, `password=changeme`. (Sorry, I haven't changed the password yet.)
-Gathering tweets from Twitter
For how to use the Twitter API, see the previous post.
search.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from requests_oauthlib import OAuth1Session
import json
import MeCab

CK = '***********'
CS = '***********'
AT = '***********'
AS = '***********'

url = "https://api.twitter.com/1.1/search/tweets.json"
# max_id (tweets ordered by time) and count can also be set here;
# count defaults to 20, with a maximum of 200 per request
params = {'q': '#逃げ恥', 'count': '200'}

# GET request
twitter = OAuth1Session(CK, CS, AT, AS)
req = twitter.get(url, params=params)

f = open("json/search_nigehaji.json", "a")

if req.status_code == 200:
    timeline = json.loads(req.text)
    # NEologd is applied here instead of the default IPA dictionary
    tagger = MeCab.Tagger(' -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd')
    for tweet in timeline["statuses"]:
        word_array = []
        mecab_combo = [[] for j in range(3)]  # [POS, surface form, base form]
        word_combo = []
        mecab_array_noun = []
        mecab_array_verb = []
        for word in tweet["text"].split(" "):
            word_array.append(word)
            # skip URL links, replies, and retweet markers
            if (not word.startswith('http')) and (not word.startswith('@')) and (word != 'RT'):
                node = tagger.parseToNode(word)
                while node:
                    mecab_word = node.surface
                    pos = node.feature.split(",")[0]
                    mecab_combo[0].append(pos)
                    mecab_combo[1].append(mecab_word)
                    mecab_combo[2].append(node.feature.split(",")[6])
                    if pos == "名詞":    # noun
                        mecab_array_noun.append(mecab_word)
                    elif pos == "動詞":  # verb
                        mecab_array_verb.append(mecab_word)
                    node = node.next
        # join runs such as noun+noun+... and noun+particle+noun
        for i in range(len(mecab_combo[0])):
            stage_count = 0
            if mecab_combo[0][i] == "名詞":  # noun
                l = []
                for j in range(i, len(mecab_combo[0])):
                    if mecab_combo[0][j] == "名詞":
                        l.append(mecab_combo[1][j])
                        word_combo.append(''.join(l))
                    elif mecab_combo[0][j] in ["助詞", "助動詞", "動詞"]:  # particle, auxiliary verb, verb
                        if stage_count != 0:
                            break
                        l.append(mecab_combo[1][j])
                        word_combo.append(''.join(l))
                        stage_count += 1
                    else:
                        break
            if mecab_combo[0][i] == "動詞":  # verb
                l = []
                for j in range(i, len(mecab_combo[0])):
                    if mecab_combo[0][j] == "動詞":
                        l.append(mecab_combo[1][j])
                        word_combo.append(''.join(l))
                    elif mecab_combo[0][j] in ["形容詞", "助詞"]:  # adjective, particle
                        l.append(mecab_combo[1][j])
                        word_combo.append(''.join(l))
                        break
                    else:
                        break
        # injection: store the analysis results on the tweet and append as a JSON line
        tweet['words'] = word_array
        tweet['mecab_noun'] = mecab_array_noun
        tweet['mecab_verb'] = mecab_array_verb
        tweet['word_combo'] = word_combo
        json.dump(tweet, f)
        f.write('\n')
else:
    print("Error: %d" % req.status_code)
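Once the script has run, the JSON-lines file it appends to can be tallied directly. A minimal sketch (a hypothetical helper, not part of the script) that counts the extracted nouns, i.e. the same tally the analysis graphs later visualize:

```python
import json
from collections import Counter

def top_nouns(path, n=10):
    """Count nouns stored in the 'mecab_noun' field of a JSON-lines file."""
    counter = Counter()
    with open(path, encoding='utf-8') as fh:
        for line in fh:
            counter.update(json.loads(line).get('mecab_noun', []))
    return counter.most_common(n)
```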
The difference from last time is that NEologd is applied first:
search.py
tagger = MeCab.Tagger(' -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd')
After that, reflecting the feedback on the previous post, I also removed URL links (starting with http), replies (starting with @), and RT markers. But the biggest change is joining sequences like noun + noun + ... and noun + particle + noun into single hit words; you can add as many patterns as you like. With this, for example...
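The joining rule can be condensed into a small standalone function. This is a simplified sketch with English POS labels and a hypothetical token format, not the script's actual loop: it greedily extends a noun run, allowing one connecting particle between nouns.

```python
def compound_words(tokens):
    """tokens: list of (surface, pos) pairs; returns joined candidate phrases."""
    combos = []
    for i, (surface, pos) in enumerate(tokens):
        if pos != 'noun':
            continue
        parts = [surface]
        particle_used = False
        for nxt_surface, nxt_pos in tokens[i + 1:]:
            if nxt_pos == 'noun':
                parts.append(nxt_surface)
                combos.append(''.join(parts))  # emit each extended run
            elif nxt_pos == 'particle' and not particle_used:
                parts.append(nxt_surface)      # allow one noun-particle-noun bridge
                particle_used = True
            else:
                break
    return combos

tokens = [('月曜', 'noun'), ('の', 'particle'), ('たわわ', 'noun')]
print(compound_words(tokens))  # → ['月曜のたわわ']
```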
- Gen Hoshino
- Tawawa on Monday

(Somehow "hugging" from the best scene of the final episode also shows up, but... let's leave that alone.)
I think the results are pretty good!
For the Nigehaji analysis graphs excluding hashtags, go to the X-Axis settings, `Advanced` -> `Exclude Pattern`, and specify:
.*https.*|\#.*|逃げ恥|.|..
As for the Japanese analysis itself, both kuromoji and MeCab (noun analysis) let you pick out characters, actors, and scenes. If you don't want to go as far as writing Python, kuromoji is enough.
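To see what that exclude pattern does, it can be applied by hand with Python's `re` (the sample terms are illustrative). The pattern must match a whole term to exclude it: URLs, hashtags, the series name itself, and any one- or two-character fragment are all dropped.

```python
import re

# the Kibana Exclude Pattern from above
exclude = re.compile(r'.*https.*|\#.*|逃げ恥|.|..')

terms = ['https://t.co/abc', '#逃げ恥', '逃げ恥', 'の', 'です', '星野源']
kept = [t for t in terms if not exclude.fullmatch(t)]
print(kept)  # → ['星野源']  (only terms of 3+ chars that aren't URLs/hashtags survive)
```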
And in the hit-word analysis, everyone seems to tweet the show's lines as they are: things like "the best", "free labor", "people's good intentions", and "15-minute extended episode" all make the ranking.
I've actually only watched two episodes, plus the first and last five minutes of the finale, so I don't know what these keywords refer to, but I'm sure there were lines like that (probably).
Well, time for bed. That last scene was too much for me...