Search for Pokemon haunting information from Twitter

Since Pokemon GO has been released, I definitely want to see twitter information efficiently. Try word2vec with Python3 and search for word-of-mouth data to reach the information you want.

Advance preparation

Click here for Python Installation Click here for Installation of Mecab + neologd

Install Meacab
Confirm that python3 works on the command line
Add add-on to Eclipse (STS)
Module installation with pip3
HelloWold
Get information on Twitter
Training model creation and data extraction

let's try it!

2. Confirm that python3 works on the command line

[murotanimari]$  python3 --version
Python 3.5.2
[murotanimari]$ pip3 --version
pip 8.1.2 from /Library/Frameworks/Python.framework/Versions/3.5/lib/python3.5/site-packages (python 3.5)

3. Add add-on to Eclipse (STS)

http://pydev.org/updates/からPyDevをインストールすると、PyDevProjectが作成可能になるので、新規に作成します。

4. Module installation with pip3

pip3 install gensim
pip3 install argparse
pip3 install prettyprint

pip3 install word2vec
pip3 install print
pip3 install pp
pip3 install nltk　#I don't need it in Japanese
pip3 install tweepy
pip3 install scipy

# for japanese
brew install mecab
brew install mecab-ipadic
pip3 install mecab-python3

HelloWold

`HelloWorld.py`


import nltk
nltk.download('all');

import argparse
from gensim.models import word2vec

print("Hello, World!")

6. Get information on Twitter

`ParseJP.py`


#!/usr/bin/env python
# -*- coding: utf-8 -*- 

import nltk
import sys
import tweepy
import json
import subprocess
import datetime
import MeCab

from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
from numpy.core.multiarray import empty

#Variables that contains the user credentials to access Twitter API 
access_token = "＊＊＊＊＊＊＊＊＊＊＊＊＊＊＊＊＊"
access_token_secret = "＊＊＊＊＊＊＊＊＊＊＊＊＊＊＊＊＊"
consumer_key = "＊＊＊＊＊＊＊＊＊＊＊＊＊＊＊＊＊"
consumer_secret = "＊＊＊＊＊＊＊＊＊＊＊＊＊＊＊＊＊"

#This is a basic listener that just prints received tweets to stdout.
class StdOutListener(StreamListener):
    def on_data(self, data):
        jsondata = json.loads(data)
        sentence = jsondata["text"]
        
        try:
            #print(sentence)
            t = MeCab.Tagger("-Ochasen")
            tagged = t.parse(sentence)
            #print(tagged)
            out = "";
            for item in tagged.split('\n'):
                item = str(item).strip()
                if item is '':
                    continue
                
                fields = item.split("\t")
                #print(fields)
                found = ""
                if 'EOS' not in item:
                    if "noun" in fields[3]:
                        found = fields[2]
                    if "verb" in fields[3]:
                        if "Auxiliary verb" not in fields[3]:
                            found = fields[2]
                    
                if("//" not in str(found).lower()):
                    if(found.lower() not in ["rt","@","sex","fuck","https","http","#",".",",","/"]):
                        if(len(found.strip()) != 0):
                            found = found.replace("'", "/'");
                            out += found + " "
                            
            today  = datetime.date.today()
            cmd  = "echo '"+ out + "' >> /tmp/JP" + today.isoformat() +".txt"
            #print(cmd)
            subprocess.check_output(cmd, shell=True)
                    
            return True
        except:
            print("Unexpected error:",found, sys.exc_info()[0])
            return True
            
    def on_error(self, status):
        print(status)

#### main method
if __name__ == '__main__':

    #This handles Twitter authetification and the connection to Twitter Streaming API
    l = StdOutListener()
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, l)

    #This line filter Twitter Streams to capture data by the keywords: 'python', 'javascript', 'ruby'
    #stream.filter(track=['#pokemongo','#PokemonGo', '#PokémonGo', '#Pokémon' ,'#Pokemon', '#pokemon'], languages=["en"])    
    stream.filter(track=['#Pokémon','#pokemongo','#PokemonGo', '#PokémonGo', '#Pokémon' ,'#Pokemon', '#pokemon'], languages=["ja"])    
    #stream.filter(track=['#pokemon'], languages=["en"])

7. Training model creation and data extraction

For the time being, let's check if the data can be obtained properly on the command line.

`python`


>>> # !/usr/bin/env python
... # -*- coding:utf-8 -*-
... from gensim.models import word2vec
>>>
>>> data = word2vec.Text8Corpus('/tmp/JP2016-07-23.txt')
>>> model = word2vec.Word2Vec(data, size=200)
>>> model.most_similar(positive=u'Pokemon')
[('Pokémon', 0.49616560339927673), ('ND', 0.47942256927490234), ('Yo-Kai Watch', 0.4783376455307007), ('I', 0.44967448711395264), ('9', 0.4415249824523926), ('j', 0.4309641122817993), ('B', 0.4284788966178894), ('CX', 0.42728638648986816), ('l', 0.42639225721359253), ('bvRxC', 0.41929835081100464)]
>>>
>>> model.most_similar(positive=u'Pikachu')
[('SolderingArt', 0.7791135311126709), ('61', 0.7604312896728516), ('Pokemon', 0.7314165830612183), ('suki', 0.7087007761001587), ('Chu', 0.6967192888259888), ('docchi', 0.6937340497970581), ('Latte art', 0.6864794492721558), ('EjPbfZEhIS', 0.6781727075576782), ('Soldering', 0.6571916341781616), ('latteart', 0.6411304473876953)]
>>>
>>> model.most_similar(positive=u'Pikachu')
[('tobacco', 0.9689614176750183), ('Create', 0.9548219442367554), ('Shibuya', 0.9207605123519897), ('EXCJ', 0.9159889221191406), ('Littering', 0.8906601667404175), ('Get trash', 0.7719830274581909), ('There is there', 0.6942187547683716), ('Thank you', 0.6873651742935181), ('Please', 0.6714405417442322), ('GET', 0.6686745285987854)]
>>>
>>> model.most_similar(positive=u'Rare Pokemon')
[('table', 0.8076062202453613), ('Hayami', 0.8065655827522278), ('Habitat', 0.7529213428497314), ('obtain', 0.7382372617721558), ('latest', 0.7039971351623535), ('Japanese version', 0.6925774216651917), ('base', 0.6455932855606079), ('300', 0.6433809995651245), ('YosukeYou', 0.6330702900886536), ('Enoshima', 0.6322115659713745)]
>>>
>>> model.most_similar(positive=u'Mass generation')
[('Area', 0.9162761569023132), ('chaos', 0.8581807613372803), ('Sakuragicho Station', 0.7103563547134399), ('EjPbfZEhIS', 0.702730655670166), ('Okura', 0.6720583438873291), ('Tonomachi', 0.6632444858551025), ('Imai Shoten', 0.6514744758605957), ('丿', 0.6451742649078369), ('Paris', 0.6437439918518066), ('entrance', 0.640221893787384)]

I'm curious about bases and Enoshima with rare Pokemon! What about Sakuragicho Station, Okura, Tonomachi, Imai Shoten, etc. due to the outbreak?

Postscript: Bonus

I started processing data with Deploy on EC2. If you don't have money, you can't publish it as an API w
note:The accuracy is still low, so please check your true intentions.!!!!

▼ Pokemon "Spot" word-of-mouth keyword ranking by twitter& word2vec
1.Kinshi Park
2.Aichi prefecture
3. gamespark 
4.Nagoya
5.park
6.Shopping street
7.Three places
8.Ohori Park
▼ Pokemon "mass outbreak" word-of-mouth keyword ranking by twitter& word2vec
1.Pokemon event collaboration
2.Sakuragicho Station
3.Okura
4.Nishi-Shinjuku
5.Shopping street
6.Paris
7.Central park
8.Fukushima
9.Imai Shoten
▼ Pokemon "Rare Pokemon" Review Keyword Ranking by twitter& word2vec
1.Legend
2.False rumor
3.Habitat
4.Midnight
5.Private house
6.east
7.Mewtwo
8.Hoax information
9.update
10.Evaluation
11.Mamizukamachi, Isesaki City, Gunma Prefecture

Postscript: neologd

https://github.com/neologd/mecab-ipadic-neologd If you read carefully, it seems that you can install the latest version with install-mecab-ipadic-neologd as shown below.

git clone --depth 1 https://github.com/neologd/mecab-ipadic-neologd.git
cd mecab-ipadic-neologd
/usr/local/lib/mecab/dic/mecab-ipadic-neologd
./bin/install-mecab-ipadic-neologd -n
echo `mecab-config --dicdir`"/mecab-ipadic-neologd"

vi /usr/local/etc/mecabrc
dicdir = /usr/local/lib/mecab/dic/mecab-ipadic-neologd

Postscript: Addition of user dictionary

Add a user dictionary by referring to here. Enter the station name list, parks in Tokyo, and monster names.

cd /usr/local/lib/mecab/dic/ipadic
# add pokemon list
/usr/local/libexec/mecab/mecab-dict-index -u pokemon.dic -f utf-8 -t utf-8 /mnt/s3/resources/pokemons.csv
# add station list
/usr/local/libexec/mecab/mecab-dict-index -u station.dic -f utf-8 -t utf-8 /mnt/s3/resources/stations.csv
/usr/local/libexec/mecab/mecab-dict-index -u park.dic -f utf-8 -t utf-8 /mnt/s3/resources/park.csv

# copy into dict folder
cp pokemon.dic /usr/local/lib/mecab/dic/mecab-ipadic-neologd/ 
cp station.dic /usr/local/lib/mecab/dic/mecab-ipadic-neologd/
cp park.dic /usr/local/lib/mecab/dic/mecab-ipadic-neologd/

bonus

I will add the result of trying with the data model of wikipedia in 2018.

A woman's life is "love"
Adding "marriage" to a woman's life is "affair"
Subtracting "marriage" from a woman's life is "wisdom"
The answer from the venerable WikiPedia data model

By the way, if you search by job hunting, success, or case, you will find Rokkasho reprocessing plant ...