I tried to easily visualize the tweets of JAWS DAYS 2017 with Python + ELK

While I was thinking about what to try with Elasticsearch, I decided to use the tweet logs of JAWS DAYS 2017, where I had been tweeting myself.

Prerequisites

  1. Python: 2.7 series
  2. ELK: 5.2.2

Fetching tweet logs from Twitter

The source code I created is below. You can use it by filling in your own access keys. (As advance preparation, you need to register an application with the Twitter API.)

https://github.com/kojiisd/TweetsSearch

- The source code itself is very simple.
- The cutoff of 3,000 was picked arbitrarily; I simply guessed there would be about that many tweets.
- Note that when you use the Twitter API, there is a rate limit on how many calls you can make (the count resets after a while), so be careful.

from twitter import Api
import os
import sys

# Python 2.7: force UTF-8 so Japanese tweet text can be written to the file
reload(sys)
sys.setdefaultencoding('utf-8')

search_word = "#XXXXXX"

api = Api(base_url="https://api.twitter.com/1.1",
          consumer_key='XXXXXXXX',
          consumer_secret='XXXXXXXX',
          access_token_key='XXXXXXXX',
          access_token_secret='XXXXXXXX')

count = 0
maxid = 0
out = open('../data/result.json', 'w')

# First page: up to 100 Japanese tweets posted before the given end date
found = api.GetSearch(term=search_word, count=100, lang="ja",
                      result_type='mixed', until="yyyy-mm-dd")
while count < 3000:
    if not found:
        # No more results; stop instead of looping forever
        break
    for result in found:
        # str() on a Status object yields the tweet as a one-line JSON string
        out.write(str(result) + os.linesep)
        count += 1
        maxid = result.id
    # Page backwards: fetch tweets older than the last one we saw
    found = api.GetSearch(term=search_word, count=100, lang="ja",
                          result_type='mixed', max_id=maxid - 1)

out.close()
print "TweetsNum: " + str(count)

Trying a search with the JAWS DAYS 2017 hashtag

- I searched for "#jawsdays" with the program above. The tweets I got back looked like the following.
- About 1,900 hits.
- The period was 2017/03/11 00:00:00 to 2017/03/12 09:00:00, chosen roughly to run from when tweets about JAWS DAYS started in earnest until they settled down.

{
    "created_at": "Sat Mar 11 04:57:29 +0000 2017", 
    "favorited": false, 
    "hashtags": [
        "jd2017_b", 
        "jawsdays"
    ], 
    "id": XXXXXXXXXXXXXXXXXX, 
    "lang": "ja", 
    "retweeted": false, 
    "source": "<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>", 
    "text": "Full + standing. It's amazing.#jd2017_b #jawsdays", 
    "truncated": false, 
    "user": {
        :
        :
        :
        "name": "Koji, 
        :
        :
        :
        "screen_name": "kojiisd", 
        :
        :
        :
    }
}

For Logstash, I wanted to see results right away, so I just imported the file as JSON and barely wrote any processing in the configuration file...
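For reference, a minimal Logstash configuration along those lines might look like the sketch below. This is my own sketch rather than the exact file used here; the file path, the index name "tweets", and a local Elasticsearch on localhost:9200 are all assumptions.

input {
  file {
    path => "/path/to/data/result.json"
    start_position => "beginning"
    sincedb_path => "/dev/null"    # forget the read position, re-read on every run
    codec => "json"                # each line of the file is one tweet as JSON
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "tweets"
  }
}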

As a result, the index ended up rather messy, but I'll keep going with it as-is for now.

[Screenshot 2017-04-06 7.26.23.png]

Anyway, ELK is convenient no matter how many times I use it. Once the data is poured into Elasticsearch, Kibana gives you a quick visualization, which alone makes me happy. I tried ranking the number of tweets per user among the tweets collected this time. The tweeting user can be obtained from "user.screen_name".

[Screenshot 2017-04-04 7.14.22.png]

Hmm, I may have put up a decent fight, but I didn't even reach 100 tweets, did I? I'll have to do better.

The most retweeted person was "nakayama_san", the one who always posts easy-to-understand illustrations at study sessions. That makes sense. Retweets can be narrowed down with the "retweeted_status.user.screen_name" field, though I'm not sure this is the right way to filter...

[Screenshot 2017-04-04 7.26.32.png]
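For reference, the same rankings can be pulled straight out of Elasticsearch with a terms aggregation, without going through Kibana. A minimal sketch, assuming the elasticsearch-py client, an index named "tweets", and the ES 5.x default dynamic mapping (which gives every string field a ".keyword" sub-field):

from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# Top 20 tweeters; swap the field to
# "retweeted_status.user.screen_name.keyword" for the retweet ranking
body = {
    "size": 0,
    "aggs": {
        "by_user": {
            "terms": {"field": "user.screen_name.keyword", "size": 20}
        }
    }
}
res = es.search(index="tweets", body=body)
for bucket in res["aggregations"]["by_user"]["buckets"]:
    print bucket["key"], bucket["doc_count"]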

By the way, when I display a tag cloud as things stand, it comes out like this. It's no good at all (^^; Which is only natural: I haven't configured anything, such as how the text field is analyzed.

[Screenshot 2017-04-06 7.33.21.png]

JAWS DAYS itself doesn't have that many tweets, but it seems I need some preparation: analyzing the text strings properly, thinking through the Logstash settings, adding a template to the target index, and so on.
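As one example of such preparation, an index template could map the "text" field with a Japanese analyzer before loading the data. A minimal sketch, assuming the analysis-kuromoji plugin is installed and that the index name matches "tweets*" (both are my assumptions, not settings from this article):

from elasticsearch import Elasticsearch

es = Elasticsearch(["localhost:9200"])

# ES 5.x index template, applied to matching indices created afterwards
template = {
    "template": "tweets*",
    "mappings": {
        "_default_": {
            "properties": {
                # Tokenize tweet text with the kuromoji Japanese analyzer
                "text": {"type": "text", "analyzer": "kuromoji"}
            }
        }
    }
}
es.indices.put_template(name="tweets_template", body=template)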

Still, if I prepare properly and then analyze, I should be able to spot trending keywords at re:Invent.

Summary

For the time being, I built a flow from data acquisition to visualization. This time I built it assuming local execution, but if the analysis could run on AWS as-is with a published URL, it might even be possible to display trending keywords in a tag cloud in real time. If I feel like it, I'll give it a try before re:Invent.
