I'm an engineer at a mid-sized company. Somewhat on a whim, I've decided to write up what I've been learning on my own here on Qiita. For now I'll be summarizing the machine learning I'm studying. (If I get burned out, I'll take a break with another, easier-to-write topic.) I plan to use Python as the main language.
- I prefer getting a feel for things by actually touching them rather than starting from theory, so this will be fairly rough and unsystematic.
- I'm not much of a writer, so parts of it may be hard to follow.
- There will probably be plenty of points you'll want to push back on, but please bear with me!
- I'll study machine learning using Twitter data.
- This post covers preparing the data and the environment.
- Next time, I plan to cluster the tweets and compute their similarity.
When I first started studying, I worked through the "iris" and "Titanic" datasets that appear in reference books and tutorials as they are, but I had no personal interest in that data, so nothing really stuck ...
So I changed tack and decided to use **Twitter** data, which seemed much more fun to analyze. To make it even more interesting (?), this time I'll focus on tweets about "**Perfume**". (For the record, I'm team Nocchi.)
By the way, when I actually pulled tweets containing "Perfume" and looked at them, there were fewer than I expected (about 250 per hour), and even fewer once I excluded tweets that looked like bots. Maybe a lot of fans just don't tweet publicly. (I wonder if private tweets could be published with the user names stripped ... no, impossible ...)
As a result the dataset is a bit small, so I'm thinking of adding other artists for comparison later. (Candidates: **CAPSULE**, **Sakanaction** ...)
As for what exactly I'll do with machine learning ... **I'm figuring it out as I go**.
I've already built an environment that collects tweets about Perfume via the Streaming API and accumulates them in Elasticsearch. This time, to speed up the study cycle, I'll save a chunk of that data from Elasticsearch to a file (**es.log**) and do the machine learning with a local script (**tw_ml.py**).
The overall flow looks like this: Streaming API → Elasticsearch → es.log → tw_ml.py.
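Roughly speaking, the collection side is just a Streaming API listener that indexes each incoming tweet into Elasticsearch. A minimal sketch of that idea (assuming tweepy's pre-4.x StreamListener API and the official elasticsearch client; credentials, reconnection, and error handling are omitted, and this is not the actual collector code, just its shape):

```python
# -*- coding: utf-8 -*-
# Sketch of the collection side: index each incoming tweet into Elasticsearch.
# Assumes tweepy (StreamListener API) and the official elasticsearch client.
import tweepy
from elasticsearch import Elasticsearch

es = Elasticsearch()

class TweetListener(tweepy.StreamListener):
    def on_status(self, status):
        # Store a few of the fields that show up in the es.log sample below
        doc = {
            "text": status.text,
            "user": status.user.screen_name,
            "date": status.created_at.isoformat(),
            "retweet_count": status.retweet_count,
            "favorite_count": status.favorite_count,
        }
        es.index(index="tweet", doc_type="raw", body=doc)
        return True

# auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
# auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
# tweepy.Stream(auth=auth, listener=TweetListener()).filter(track=[u"Perfume"])
```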
The data format is as follows. (It is saved as a Python dictionary literal, so the Unicode objects carry the "u" prefix.)
es.log (one record excerpted; the data is dummy)
[{u'_score': 1.0, u'_type': u'raw', u'_id': u'AVZkL6ZevipIIzTJxrL7', u'_source': {u'retweeted_status': u'True', u'text': u'Perfume\u306e\u597d\u304d\u306a\u6b4c', u'user': u'xxxxxxxxxx', u'date': u'2016-08-07T08:45:27', u'retweet_count': u'0', u'geo': u'None', u'favorite_count': u'0'}, u'_index': u'tweet'}]
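For reference, exporting a batch of hits from Elasticsearch into es.log only takes a few lines. A rough sketch (assuming the official elasticsearch Python client; the query and size here are just examples, not necessarily how the export was actually done):

```python
# -*- coding: utf-8 -*-
# Sketch: fetch some "Perfume" tweets from Elasticsearch and dump the raw hits
# to es.log as a Python literal, so tw_ml.py can read it back with ast.
import codecs
from elasticsearch import Elasticsearch

es = Elasticsearch()
res = es.search(index="tweet",
                body={"query": {"match": {"text": u"Perfume"}}},
                size=300)
hits = res["hits"]["hits"]  # list of dicts shaped like the sample above

with codecs.open("es.log", "w", "utf-8") as f:
    f.write(unicode(hits))  # repr of the list; ast.literal_eval can parse it back
```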
To read es.log from Python, I open it with codecs and use ast to turn it back into a list of dictionaries. For now there are 265 records; if that turns out to be too little data, I'll pull more from Elasticsearch later.
tw_ml.py(Excerpt)
import codecs
import ast
with codecs.open("es.log", "r", "utf-8") as f:
    es_dict = ast.literal_eval(f.read())  # parse the saved repr back into Python objects
print "doc:%d" % len(es_dict)  # doc:265
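The tweet body itself sits under `_source`, so for the clustering and similarity steps later it's easy to pull out just the texts. A small illustrative snippet (not necessarily the final tw_ml.py code):

```python
# Sketch: collect just the tweet bodies for the clustering / similarity steps.
texts = [doc["_source"]["text"] for doc in es_dict]
print "texts:%d" % len(texts)  # should also be 265, one per document
```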
- Script execution environment
- Main Python libraries
By the way, I use MeCab for the Japanese morphological analysis, with "**mecab-ipadic-neologd**" as the dictionary. Without it, even "Kashiyuka", "Ah-chan", and "Nocchi" wouldn't be recognized as words ... lol
tw_ml.py(Excerpt)
import MeCab as mc
# ChaSen output format plus the mecab-ipadic-neologd dictionary
MECAB_OPT = "-Ochasen -d C:\\tmp\\mecab-ipadic-neologd\\"
t = mc.Tagger(MECAB_OPT)
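As a quick sanity check that mecab-ipadic-neologd really does treat names like "Nocchi" as single words, one can parse a tweet with the Tagger above and pull out the nouns. A sketch (with -Ochasen the output is tab-separated and the part of speech is the fourth column; MeCab's Python binding on Python 2 wants UTF-8 byte strings, hence the encode/decode):

```python
# -*- coding: utf-8 -*-
# Sketch: extract nouns from one tweet using the Tagger `t` defined above.
def extract_nouns(text):
    nouns = []
    for line in t.parse(text.encode("utf-8")).splitlines():
        if line == "EOS":
            break
        cols = line.split("\t")
        # -Ochasen columns: surface, reading, base form, POS, ...
        if len(cols) >= 4 and cols[3].startswith("名詞"):
            nouns.append(cols[0].decode("utf-8"))
    return nouns

print " / ".join(extract_nouns(u"Perfumeののっちが好き"))
# with mecab-ipadic-neologd, "のっち" should come out as a single noun
```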
... or so I planned, but this has gotten long, so the actual machine learning starts next time! lol For now, the plan is to try clustering the tweets and computing their similarity.