Hi, this is another newcomer post. It's a continuation of last time.
Last time I used plain MeCab, and it didn't work well. When I put that post out into the world as it was, feedback came flying in from all directions (Qiita, Twitter, pretty much everywhere): "You can't handle modern Japanese with just the IPA dictionary — use NEologd." I took that advice head-on.
I also tried `elasticsearch-analysis-kuromoji`, an Elasticsearch plugin that can analyze Japanese.
- NEologd is said to keep up with modern language, but how well does it cope with Twitter, where brand-new words that are only just becoming popular run rampant?
- kuromoji can apparently segment Japanese, but how up to date is it? I also wasn't sure how to strip out replies and hashtags.
Please refer to the previous post for the setup so far.
-Installation of NEologd
Pull it from GitHub:
$ git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
Then run the installer (the `-n` option installs the newest version of the dictionary):
$ ./bin/install-mecab-ipadic-neologd -n
Find the location where NEologd is installed with the following command:
$ echo `mecab-config --dicdir`"/mecab-ipadic-neologd"
Its usage appears in the Python code below.
-Install elasticsearch-analysis-kuromoji
In the directory where Elasticsearch is installed (`$ES_HOME`), run:
$ sudo bin/elasticsearch-plugin install analysis-kuromoji
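If the install succeeded, the plugin should show up in the plugin list:

```shell
# list installed Elasticsearch plugins; analysis-kuromoji should appear
bin/elasticsearch-plugin list
```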
This time we will analyze the `text` field, so submit the following template. (After starting Elasticsearch, of course.)
curl -XPUT --user elastic:changeme localhost:9200/_template/text_analysis?pretty -d '{
  "template": "twitter-*",
  "settings": {
    "analysis": {
      "tokenizer": {
        "kuromoji_user_dict": {
          "type": "kuromoji_tokenizer",
          "mode": "normal"
        }
      }
    }
  },
  "mappings": {
    "twitter": {
      "properties": {
        "text": {
          "type": "text",
          "fielddata": true,
          "analyzer": "kuromoji",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}'
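Before indexing anything, it's worth checking how the analyzer tokenizes a sample sentence. A quick sanity check against the `_analyze` API (assuming the same local instance and X-Pack credentials as above; the sentence is just an example):

```shell
# ask the kuromoji analyzer to tokenize a sample sentence
curl -XGET --user elastic:changeme 'localhost:9200/_analyze?pretty' -d '{
  "analyzer": "kuromoji",
  "text": "星野源が好きです"
}'
```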
`--user elastic:changeme` is how you call curl after installing X-Pack; the defaults are `username=elastic`, `password=changeme`. (Sorry, I haven't changed the password yet.)
-Gathering tweets from Twitter
For how to use the Twitter API, see the previous post.
search.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from requests_oauthlib import OAuth1Session
import json
import MeCab

CK = '***********'
CS = '***********'
AT = '***********'
AS = '***********'

url = "https://api.twitter.com/1.1/search/tweets.json"
# max_id (tweets ordered by time) and count can also be set here;
# count defaults to 20, with a maximum of 200 per request
params = {'q': '#逃げ恥', 'count': '200'}

# GET request
twitter = OAuth1Session(CK, CS, AT, AS)
req = twitter.get(url, params=params)

f = open("json/search_nigehaji.json", "a")

if req.status_code == 200:
    timeline = json.loads(req.text)
    # NEologd is applied here instead of the default IPA dictionary
    tagger = MeCab.Tagger(' -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd')
    for tweet in timeline["statuses"]:
        word_array = []
        mecab_combo = [[] for j in range(3)]  # [POS, surface form, base form]
        word_combo = []
        mecab_array_noun = []
        mecab_array_verb = []
        for word in tweet["text"].split(" "):
            word_array.append(word)
            # skip URL links, replies, and retweet markers
            if (not word.startswith('http')) and (not word.startswith('@')) and (word != 'RT'):
                node = tagger.parseToNode(word)
                while node:
                    mecab_word = node.surface
                    pos = node.feature.split(",")[0]
                    mecab_combo[0].append(pos)
                    mecab_combo[1].append(mecab_word)
                    mecab_combo[2].append(node.feature.split(",")[6])
                    if pos == "名詞":    # noun
                        mecab_array_noun.append(mecab_word)
                    elif pos == "動詞":  # verb
                        mecab_array_verb.append(mecab_word)
                    node = node.next
        # join runs such as noun+noun+... and noun+particle+noun
        for i in range(len(mecab_combo[0])):
            stage_count = 0
            if mecab_combo[0][i] == "名詞":  # noun
                l = []
                for j in range(i, len(mecab_combo[0])):
                    if mecab_combo[0][j] == "名詞":
                        l.append(mecab_combo[1][j])
                        word_combo.append(''.join(l))
                    elif mecab_combo[0][j] in ["助詞", "助動詞", "動詞"]:  # particle, auxiliary verb, verb
                        if stage_count != 0:
                            break
                        l.append(mecab_combo[1][j])
                        word_combo.append(''.join(l))
                        stage_count += 1
                    else:
                        break
            if mecab_combo[0][i] == "動詞":  # verb
                l = []
                for j in range(i, len(mecab_combo[0])):
                    if mecab_combo[0][j] == "動詞":
                        l.append(mecab_combo[1][j])
                        word_combo.append(''.join(l))
                    elif mecab_combo[0][j] in ["形容詞", "助詞"]:  # adjective, particle
                        l.append(mecab_combo[1][j])
                        word_combo.append(''.join(l))
                        break
                    else:
                        break
        # injection: store the analysis results on the tweet and append as a JSON line
        tweet['words'] = word_array
        tweet['mecab_noun'] = mecab_array_noun
        tweet['mecab_verb'] = mecab_array_verb
        tweet['word_combo'] = word_combo
        json.dump(tweet, f)
        f.write('\n')
else:
    print("Error: %d" % req.status_code)
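Once the script has run, the JSON-lines file it appends to can be tallied directly. A minimal sketch (a hypothetical helper, not part of the script) that counts the extracted nouns, i.e. the same tally the analysis graphs later visualize:

```python
import json
from collections import Counter

def top_nouns(path, n=10):
    """Count nouns stored in the 'mecab_noun' field of a JSON-lines file."""
    counter = Counter()
    with open(path, encoding='utf-8') as fh:
        for line in fh:
            counter.update(json.loads(line).get('mecab_noun', []))
    return counter.most_common(n)
```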
The difference from last time is that NEologd is applied first:
search.py
tagger = MeCab.Tagger(' -d /usr/local/lib/mecab/dic/mecab-ipadic-neologd')
After that, reflecting the feedback on the previous post, I also removed URL links (starting with http), replies (starting with @), and RT markers. But the biggest change is joining sequences like noun + noun + ... and noun + particle + noun into single hit words; you can add as many patterns as you like. With this, for example...
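The joining rule can be condensed into a small standalone function. This is a simplified sketch with English POS labels and a hypothetical token format, not the script's actual loop: it greedily extends a noun run, allowing one connecting particle between nouns.

```python
def compound_words(tokens):
    """tokens: list of (surface, pos) pairs; returns joined candidate phrases."""
    combos = []
    for i, (surface, pos) in enumerate(tokens):
        if pos != 'noun':
            continue
        parts = [surface]
        particle_used = False
        for nxt_surface, nxt_pos in tokens[i + 1:]:
            if nxt_pos == 'noun':
                parts.append(nxt_surface)
                combos.append(''.join(parts))  # emit each extended run
            elif nxt_pos == 'particle' and not particle_used:
                parts.append(nxt_surface)      # allow one noun-particle-noun bridge
                particle_used = True
            else:
                break
    return combos

tokens = [('月曜', 'noun'), ('の', 'particle'), ('たわわ', 'noun')]
print(compound_words(tokens))  # → ['月曜のたわわ']
```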
- Gen Hoshino
- Tawawa on Monday

(Somehow "hugging" from the best scene of the final episode also shows up, but... let's leave that alone.)
I think the results are pretty good!
For the Nigehaji analysis graphs excluding hashtags, go to the X-Axis settings, `Advanced` -> `Exclude Pattern`, and specify:
.*https.*|\#.*|逃げ恥|.|..
As for the Japanese analysis itself, both kuromoji and MeCab (noun analysis) let you pick out characters, actors, and scenes. If you don't want to go as far as writing Python, kuromoji is enough.
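To see what that exclude pattern does, it can be applied by hand with Python's `re` (the sample terms are illustrative). The pattern must match a whole term to exclude it: URLs, hashtags, the series name itself, and any one- or two-character fragment are all dropped.

```python
import re

# the Kibana Exclude Pattern from above
exclude = re.compile(r'.*https.*|\#.*|逃げ恥|.|..')

terms = ['https://t.co/abc', '#逃げ恥', '逃げ恥', 'の', 'です', '星野源']
kept = [t for t in terms if not exclude.fullmatch(t)]
print(kept)  # → ['星野源']  (only terms of 3+ chars that aren't URLs/hashtags survive)
```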
And in the hit-word analysis, everyone seems to tweet the show's lines as they are: things like "the best", "free labor", "people's good intentions", and "15-minute extended episode" all make the ranking.
I've actually only watched two episodes, plus the first and last five minutes of the finale, so I don't know what these keywords refer to, but I'm sure there were lines like that (probably).
Well, time for bed. That last scene was too much for me...