Now that Pokemon GO has been released, I want an efficient way to sift through Twitter. Let's try word2vec with Python 3 and mine word-of-mouth data to reach the information we actually want.
See here for installing Python, and here for installing MeCab + neologd.
[murotanimari]$ python3 --version
Python 3.5.2
[murotanimari]$ pip3 --version
pip 8.1.2 from /Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages (python 3.5)
If you install PyDev from http://pydev.org/updates/, you can create a PyDev project, so create a new one.
pip3 install gensim
pip3 install nltk    # not needed for Japanese-only processing
pip3 install tweepy
pip3 install scipy
# for japanese
brew install mecab
brew install mecab-ipadic
pip3 install mecab-python3
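Before moving on, it may be worth confirming that everything imported cleanly. A minimal stdlib-only sketch (the module list is my assumption, based on the imports used later in this article):

```python
import importlib.util

# Import names, which sometimes differ from the pip package names
# (e.g. MeCab comes from mecab-python3).
REQUIRED = ["gensim", "tweepy", "MeCab", "nltk"]

def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    gone = missing_modules(REQUIRED)
    if gone:
        print("Missing:", ", ".join(gone))
    else:
        print("All modules found.")
```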
HelloWorld.py
import argparse
import nltk
from gensim.models import word2vec

nltk.download('all')  # one-time download of the NLTK corpora
print("Hello, World!")
ParseJP.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import nltk
import sys
import tweepy
import json
import subprocess
import datetime
import MeCab
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
#Variables that contains the user credentials to access Twitter API
access_token = "*****************"
access_token_secret = "*****************"
consumer_key = "*****************"
consumer_secret = "*****************"
#This is a basic listener that writes the extracted words of each received tweet to a file.
class StdOutListener(StreamListener):

    def on_data(self, data):
        jsondata = json.loads(data)
        sentence = jsondata["text"]
        try:
            t = MeCab.Tagger("-Ochasen")
            tagged = t.parse(sentence)
            out = ""
            for item in tagged.split('\n'):
                item = str(item).strip()
                if item == '':
                    continue
                fields = item.split("\t")
                found = ""
                if 'EOS' not in item and len(fields) >= 4:
                    # -Ochasen reports parts of speech in Japanese:
                    # 名詞 = noun, 動詞 = verb, 助動詞 = auxiliary verb
                    if "名詞" in fields[3]:
                        found = fields[2]
                    if "動詞" in fields[3] and "助動詞" not in fields[3]:
                        found = fields[2]
                if ("//" not in found.lower()
                        and found.lower() not in ["rt", "@", "sex", "fuck", "https", "http", "#", ".", ",", "/"]
                        and len(found.strip()) != 0):
                    out += found + " "
            today = datetime.date.today()
            # append one space-separated line per tweet to the day's corpus file
            # (a direct write avoids the quoting problems of shelling out to echo)
            with open("/tmp/JP" + today.isoformat() + ".txt", "a") as f:
                f.write(out + "\n")
            return True
        except Exception:
            print("Unexpected error:", sys.exc_info()[0])
            return True

    def on_error(self, status):
        print(status)

#### main method
if __name__ == '__main__':
    #This handles Twitter authentication and the connection to the Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)
    #Filter the stream to Japanese-language Pokemon GO tweets
    #stream.filter(track=['#pokemongo','#PokemonGo', '#PokémonGo', '#Pokémon' ,'#Pokemon', '#pokemon'], languages=["en"])
    stream.filter(track=['#pokemongo', '#PokemonGo', '#PokémonGo', '#Pokémon', '#Pokemon', '#pokemon'], languages=["ja"])
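The noun/verb filtering inside on_data can be isolated into a plain function and exercised without a Twitter connection. A sketch, assuming the `-Ochasen` field layout (base form in column 3, part of speech in column 4); the sample below is hand-written for illustration, not real MeCab output:

```python
def extract_words(tagged):
    """Pull the base forms of nouns and verbs out of MeCab -Ochasen output."""
    stop = {"rt", "@", "https", "http", "#", ".", ",", "/"}
    words = []
    for line in tagged.split("\n"):
        line = line.strip()
        if not line or "EOS" in line:
            continue
        fields = line.split("\t")
        if len(fields) < 4:
            continue
        pos = fields[3]
        # -Ochasen labels parts of speech in Japanese:
        # 名詞 = noun, 動詞 = verb, 助動詞 = auxiliary verb
        if "名詞" in pos or ("動詞" in pos and "助動詞" not in pos):
            base = fields[2].strip()
            if base and base.lower() not in stop and "//" not in base.lower():
                words.append(base)
    return words

# Illustrative -Ochasen-style output for "ポケモンを捕まえた" (columns abridged):
sample = (
    "ポケモン\tポケモン\tポケモン\t名詞-固有名詞\n"
    "を\tヲ\tを\t助詞-格助詞\n"
    "捕まえ\tツカマエ\t捕まえる\t動詞-自立\n"
    "た\tタ\tた\t助動詞\n"
    "EOS"
)
print(extract_words(sample))  # → ['ポケモン', '捕まえる']
```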
For the time being, let's check from the interactive interpreter whether the collected data looks right.
python3
>>> # !/usr/bin/env python
... # -*- coding:utf-8 -*-
... from gensim.models import word2vec
>>>
>>> data = word2vec.Text8Corpus('/tmp/JP2016-07-23.txt')
>>> model = word2vec.Word2Vec(data, size=200)
>>> model.most_similar(positive=u'Pokemon')
[('Pokémon', 0.49616560339927673), ('ND', 0.47942256927490234), ('Yo-Kai Watch', 0.4783376455307007), ('I', 0.44967448711395264), ('9', 0.4415249824523926), ('j', 0.4309641122817993), ('B', 0.4284788966178894), ('CX', 0.42728638648986816), ('l', 0.42639225721359253), ('bvRxC', 0.41929835081100464)]
>>>
>>> model.most_similar(positive=u'Pikachu')
[('SolderingArt', 0.7791135311126709), ('61', 0.7604312896728516), ('Pokemon', 0.7314165830612183), ('suki', 0.7087007761001587), ('Chu', 0.6967192888259888), ('docchi', 0.6937340497970581), ('Latte art', 0.6864794492721558), ('EjPbfZEhIS', 0.6781727075576782), ('Soldering', 0.6571916341781616), ('latteart', 0.6411304473876953)]
>>>
>>> model.most_similar(positive=u'Pikachu')
[('tobacco', 0.9689614176750183), ('Create', 0.9548219442367554), ('Shibuya', 0.9207605123519897), ('EXCJ', 0.9159889221191406), ('Littering', 0.8906601667404175), ('Get trash', 0.7719830274581909), ('There is there', 0.6942187547683716), ('Thank you', 0.6873651742935181), ('Please', 0.6714405417442322), ('GET', 0.6686745285987854)]
>>>
>>> model.most_similar(positive=u'Rare Pokemon')
[('table', 0.8076062202453613), ('Hayami', 0.8065655827522278), ('Habitat', 0.7529213428497314), ('obtain', 0.7382372617721558), ('latest', 0.7039971351623535), ('Japanese version', 0.6925774216651917), ('base', 0.6455932855606079), ('300', 0.6433809995651245), ('YosukeYou', 0.6330702900886536), ('Enoshima', 0.6322115659713745)]
>>>
>>> model.most_similar(positive=u'Mass generation')
[('Area', 0.9162761569023132), ('chaos', 0.8581807613372803), ('Sakuragicho Station', 0.7103563547134399), ('EjPbfZEhIS', 0.702730655670166), ('Okura', 0.6720583438873291), ('Tonomachi', 0.6632444858551025), ('Imai Shoten', 0.6514744758605957), ('丿', 0.6451742649078369), ('Paris', 0.6437439918518066), ('entrance', 0.640221893787384)]
The "base" and "Enoshima" hits for rare Pokemon are intriguing! And do Sakuragicho Station, Okura, Tonomachi, Imai Shoten, and so on really see mass outbreaks?
I started processing the data on an EC2 deployment. Without a budget, though, I can't publish it as an API (lol).
Note: the accuracy is still low, so take these results with a grain of salt!
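Under the hood, most_similar simply ranks every vocabulary word by cosine similarity to the query vector. A dependency-free sketch of that ranking with made-up toy vectors (the words and values here are illustrative only, not from the trained model):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def most_similar(vectors, query, topn=3):
    """Rank every other word by cosine similarity to `query`."""
    q = vectors[query]
    scores = [(w, cosine(v, q)) for w, v in vectors.items() if w != query]
    return sorted(scores, key=lambda t: t[1], reverse=True)[:topn]

# Toy 3-dimensional "embeddings" -- purely illustrative values.
toy = {
    "Pokemon": [0.9, 0.1, 0.0],
    "Pikachu": [0.8, 0.2, 0.1],
    "tobacco": [0.0, 0.9, 0.3],
}
print(most_similar(toy, "Pokemon", topn=2))
```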
▼ Pokemon "Spot" word-of-mouth keyword ranking by Twitter & word2vec
1. Kinshi Park
2. Aichi prefecture
3. gamespark
4. Nagoya
5. park
6. Shopping street
7. Three places
8. Ohori Park
▼ Pokemon "mass outbreak" word-of-mouth keyword ranking by Twitter & word2vec
1. Pokemon event collaboration
2. Sakuragicho Station
3. Okura
4. Nishi-Shinjuku
5. Shopping street
6. Paris
7. Central park
8. Fukushima
9. Imai Shoten
▼ Pokemon "Rare Pokemon" word-of-mouth keyword ranking by Twitter & word2vec
1. Legend
2. False rumor
3. Habitat
4. Midnight
5. Private house
6. east
7. Mewtwo
8. Hoax information
9. update
10. Evaluation
11. Mamizukamachi, Isesaki City, Gunma Prefecture
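The rankings above were assembled from most_similar output. A small helper like the following could automate the formatting (a sketch; the score threshold and titles are my own choices, and the pairs below are illustrative):

```python
def format_ranking(title, pairs, min_score=0.5):
    """Turn (word, score) pairs into a numbered keyword-ranking list."""
    lines = ["▼ " + title]
    rank = 1
    for word, score in pairs:
        if score < min_score:
            continue  # drop low-confidence neighbors
        lines.append(f"{rank}. {word}")
        rank += 1
    return "\n".join(lines)

# Illustrative (word, score) pairs, as returned by model.most_similar(...)
pairs = [("Kinshi Park", 0.91), ("Aichi prefecture", 0.85), ("noise", 0.12)]
print(format_ranking('Pokemon "Spot" keyword ranking', pairs))
```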
If you read https://github.com/neologd/mecab-ipadic-neologd carefully, it turns out you can install the latest version with install-mecab-ipadic-neologd, as shown below.
git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
cd mecab-ipadic-neologd
./bin/install-mecab-ipadic-neologd -n
echo `mecab-config --dicdir`"/mecab-ipadic-neologd"
# => /usr/local/lib/mecab/dic/mecab-ipadic-neologd
vi /usr/local/etc/mecabrc
# set the dictionary path in mecabrc:
dicdir = /usr/local/lib/mecab/dic/mecab-ipadic-neologd
Add a user dictionary by referring to the page here. I registered a station name list, parks in Tokyo, and the monster names.
cd /usr/local/lib/mecab/dic/ipadic
# add pokemon list
/usr/local/libexec/mecab/mecab-dict-index -u pokemon.dic -f utf-8 -t utf-8 /mnt/s3/resources/pokemons.csv
# add station list
/usr/local/libexec/mecab/mecab-dict-index -u station.dic -f utf-8 -t utf-8 /mnt/s3/resources/stations.csv
# add park list
/usr/local/libexec/mecab/mecab-dict-index -u park.dic -f utf-8 -t utf-8 /mnt/s3/resources/park.csv
# copy into dict folder
cp pokemon.dic /usr/local/lib/mecab/dic/mecab-ipadic-neologd/
cp station.dic /usr/local/lib/mecab/dic/mecab-ipadic-neologd/
cp park.dic /usr/local/lib/mecab/dic/mecab-ipadic-neologd/
I'll also add the results of trying this with a Wikipedia data model in 2018.
A woman's life is "love".
A woman's life plus "marriage" is "affair".
A woman's life minus "marriage" is "wisdom".
So says the venerable Wikipedia data model.
By the way, if you query around job hunting, success, and "case", the Rokkasho Reprocessing Plant comes up ...
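These "plus/minus" queries are gensim analogy queries, i.e. most_similar(positive=[...], negative=[...]), which adds and subtracts word vectors before ranking by cosine similarity. A toy reimplementation of that arithmetic, using the classic king - man + woman ≈ queen example (the 2-d vectors are invented for illustration, not taken from any real model):

```python
import math

def analogy_vector(vectors, positive, negative):
    """Sum the positive word vectors and subtract the negative ones --
    the arithmetic behind gensim's most_similar(positive=, negative=)."""
    dim = len(next(iter(vectors.values())))
    out = [0.0] * dim
    for w in positive:
        out = [a + b for a, b in zip(out, vectors[w])]
    for w in negative:
        out = [a - b for a, b in zip(out, vectors[w])]
    return out

def nearest(vectors, target, exclude):
    """Return the word whose vector is most cosine-similar to target."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    return max((w for w in vectors if w not in exclude),
               key=lambda w: cos(vectors[w], target))

# Invented 2-d vectors for illustration only.
toy = {
    "king": [1.0, 1.0], "man": [1.0, 0.0],
    "woman": [0.0, 1.1], "queen": [0.1, 2.0],
    "tobacco": [0.9, 0.1],
}
target = analogy_vector(toy, positive=["king", "woman"], negative=["man"])
print(nearest(toy, target, exclude={"king", "woman", "man"}))  # → queen
```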