This article is the day-22 entry of the Python Advent Calendar 2014 on Qiita.
Last week, on my company's Advent Calendar, I wrote a post called "BigQuery x Perfume x tweet analysis". There, I analyzed tweets about Perfume, collected over the week from Friday 12/12 to Thursday 12/18, by loading them into BigQuery.
This time, as a follow-up, I will analyze those tweets with natural language processing, using MeCab and CaboCha m(_ _)m
My main reference is mima_ita's Qiita article "So I looked into the tweets about the House of Representatives election" and its code at https://github.com/mima3/stream_twitter.
Calling it a "reference" is generous: most of what I do here is borrowed wholesale from mima_ita ...
Environment:
・Mac OS X 10.9.5
・Python 2.7.8
This time, I used a service called Mention. With Mention, you can easily pull data from social media using keywords configured on its management screen. Export the collected data to CSV from the management screen, and you are ready to go.
The conditions specified this time are as follows.
・Search keywords: Perfume || prfm || perfume_um (prfm and perfume_um are hashtags used in posts about Perfume)
・Negative keywords: RT
・Target SNS: Twitter
・Language: Japanese
・Period: Friday 12/12 to Thursday 12/18
For installing MeCab and CaboCha, I followed these articles:
・MeCab: http://salinger.github.io/blog/2013/01/17/1/
・CaboCha: http://qiita.com/ShingoOikawa/items/ef4ac2929ec19599a3cf
Following them, everything installed without a problem (`・ω・´)ゞ
https://github.com/mima3/stream_twitter
However, that repository collects Twitter data with the Streaming API and stores it in a DB. This time we instead need to store what Mention collected and exported as CSV. So, load it into SQLite with the following script.
create_database.py
```python
#!/usr/bin/python
# -*- coding: utf-8 -*-
import sqlite3
import csv

if __name__ == '__main__':
    con = sqlite3.connect("twitter_stream.sqlite")
    c = con.cursor()
    # Create table and indexes
    c.execute('''CREATE TABLE "twitte" ("id" INTEGER NOT NULL PRIMARY KEY, "createAt" DATETIME NOT NULL, "idStr" VARCHAR(255) NOT NULL, "contents" VARCHAR(255) NOT NULL);''')
    c.execute('''CREATE INDEX "twitte_createAt" ON "twitte" ("createAt");''')
    c.execute('''CREATE INDEX "twitte_idStr" ON "twitte" ("idStr")''')
    # Read the CSV exported from Mention and collect the rows
    i = 0
    data = []
    reader = csv.reader(open("./perfume_tweet.csv"))
    for row in reader:
        id = i + 1
        createAt = row[4]
        idStr = unicode(row[0], 'utf-8')
        contents = unicode(row[1], 'utf-8')
        data.append((id, createAt, idStr, contents))
        i += 1
    # Insert data
    con.executemany(u"insert into twitte values(?,?,?,?)", data)
    # Save (commit) the changes
    con.commit()
    con.close()
```
This creates twitter_stream.sqlite in the current directory.
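To sanity-check the import, you can query the new database directly. This is my own minimal sketch, not part of mima_ita's scripts; it rebuilds the same `twitte` schema in memory with two dummy rows so it runs standalone (point `connect` at twitter_stream.sqlite to check the real file):

```python
import sqlite3

# Build a tiny in-memory DB with the same schema as create_database.py,
# then run the kind of sanity checks you would run on twitter_stream.sqlite.
con = sqlite3.connect(":memory:")  # use "twitter_stream.sqlite" for the real file
c = con.cursor()
c.execute('CREATE TABLE "twitte" ("id" INTEGER NOT NULL PRIMARY KEY, '
          '"createAt" DATETIME NOT NULL, "idStr" VARCHAR(255) NOT NULL, '
          '"contents" VARCHAR(255) NOT NULL)')
c.executemany("INSERT INTO twitte VALUES (?,?,?,?)", [
    (1, "2014-12-17 10:00:00", "user_a", "Perfume!"),
    (2, "2014-12-17 23:30:00", "user_b", "#prfm live was great"),
])
con.commit()

# Row count and date range tell you whether the CSV import looks sane.
total = c.execute("SELECT COUNT(*) FROM twitte").fetchone()[0]
first, last = c.execute("SELECT MIN(createAt), MAX(createAt) FROM twitte").fetchone()
print(total, first, last)
```

If the count matches the CSV's row count and the date range matches the collection period, the import worked.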
Here, I will pick out Wednesday 12/17 and count the number of tweets per hour over that day.
python twitter_db_hist.py "2014/12/16 15:00" "2014/12/17 15:00" 3600
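twitter_db_hist.py comes from mima_ita's repository; I have not reproduced its actual query here, but the hourly bucketing it performs can be sketched directly in SQLite with `strftime` (my own sketch, using an in-memory table with a few dummy rows so it runs standalone):

```python
import sqlite3

# Sketch: count tweets per hour with strftime(), assuming the twitte table
# from create_database.py. Timestamps are whatever timezone the CSV used.
con = sqlite3.connect(":memory:")  # "twitter_stream.sqlite" for the real data
c = con.cursor()
c.execute('CREATE TABLE twitte (id INTEGER PRIMARY KEY, createAt DATETIME, '
          'idStr VARCHAR, contents VARCHAR)')
c.executemany("INSERT INTO twitte VALUES (?,?,?,?)", [
    (1, "2014-12-17 05:10:00", "a", "zzz"),
    (2, "2014-12-17 23:01:00", "b", "live!"),
    (3, "2014-12-17 23:45:00", "c", "encore!"),
])

# Group tweets in the requested window into one bucket per hour.
hist = c.execute(
    "SELECT strftime('%H', createAt) AS hour, COUNT(*) "
    "FROM twitte WHERE createAt BETWEEN ? AND ? "
    "GROUP BY hour ORDER BY hour",
    ("2014-12-17 00:00:00", "2014-12-17 23:59:59")).fetchall()
print(hist)  # e.g. [('05', 1), ('23', 2)]
```

The 3600 argument to the real script is the bucket width in seconds; grouping on `strftime('%H', ...)` is the fixed one-hour special case of that.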
Looking at the histogram, we can read off the following:
・The 05:00-06:00 hour has the fewest tweets. Counts stay low from 02:00 to 06:00, so presumably everyone is asleep during those hours.
・The 23:00-24:00 hour has the most. Tweet counts are generally high from 18:00 to 24:00, peaking in the last hour of the day. Perhaps many people tweet about Perfume just before going to sleep?
Rough as it is, you can get a feel for the pattern.
Next, we perform morphological analysis with MeCab, running it over the full week of collected tweets.
python twitter_db_mecab.py "2014/12/11 15:00" "2014/12/17 15:00" > mecab.txt
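twitter_db_mecab.py runs each tweet through MeCab and tallies word frequencies. Since MeCab needs a local install, here is a minimal sketch of just the tallying step, with a stand-in whitespace tokenizer where the real script would call MeCab (the function name `tokenize` and the sample tweets are my own, for illustration):

```python
from collections import Counter

def tokenize(text):
    # Stand-in tokenizer for this sketch. In the real script, each tweet
    # would instead go through MeCab to get proper Japanese word segments.
    return text.split()

tweets = [
    "Perfume live saiko",
    "Perfume new single",
    "prfm Perfume live",
]

# Tally every token across all tweets.
counts = Counter()
for tweet in tweets:
    counts.update(tokenize(tweet))

# Print a top-N list, like the top-100 table below.
for word, n in counts.most_common(3):
    print(word, n)
```

Swapping the stub for MeCab's output is the only change needed to get the real word-frequency table.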
Below is a list of the top 100.
word | count |
---|---|
Perfume | 10935 |
prfm | 2739 |
perfume | 2136 |
Follow | 1553 |
~ | 1478 |
Chiru | 1462 |
Noru | 1448 |
Regular | 1410 |
- | 1347 |
Like | 1256 |
Video | 1218 |
Chan | 1204 |
Perfume | 1056 |
Man | 996 |
YouTube | 945 |
Mutual | 889 |
During ~ | 850 |
Summary | 837 |
sougofollow | 775 |
live | 737 |
um | 726 |
Teru | 568 |
Tame | 555 |
www | 552 |
Breaking news | 552 |
male | 551 |
Oomoto | 544 |
Ayano | 544 |
Circle | 543 |
marriage | 540 |
En | 537 |
loose the temper | 535 |
General | 534 |
♪ | 493 |
bid | 480 |
storm | 433 |
Absent | 431 |
Day | 420 |
Ku | 419 |
Yahoo auction | 413 |
Year | 408 |
you | 401 |
Current | 398 |
price | 377 |
Time | 374 |
Date and time | 374 |
Song | 364 |
View | 356 |
Cling | 350 |
One | 345 |
Black | 342 |
End | 341 |
Eye | 341 |
Pafukura | 335 |
number | 332 |
listen | 331 |
of | 330 |
thing | 324 |
Please | 318 |
yauc | 308 |
DVD | 308 |
Limited | 307 |
Board | 300 |
ticket | 295 |
I love You | 285 |
Sheet | 282 |
love | 276 |
First time | 271 |
Yuka | 270 |
natural | 263 |
Month | 263 |
:-, | 262 |
Sa | 261 |
Sakanaction | 254 |
Give me | 250 |
!: | 250 |
Peaches | 247 |
Hope | 240 |
nowplaying | 240 |
Music | 239 |
FC | 237 |
Rank | 230 |
mask | 229 |
Chowder | 228 |
love | 219 |
soil | 217 |
come | 217 |
゜ | 217 |
Apple | 214 |
Player | 213 |
To be | 210 |
Sekaowa | 209 |
Kashi | 207 |
mp | 206 |
source | 205 |
dance | 201 |
Explosive sound | 201 |
so | 201 |
Shiina | 200 |
1 | 200 |
The top words all come from bot tweets, so they are not very meaningful. As a rough rule of thumb, meaningful words start to appear once you look at words with a total count of about 500 or less.
During the data collection period, there were many tweets about "Masked Chowder ~YAJIO CRAZY~ Chowder University International Collagen High School", an event held on Saturday, December 20th:
・Event name related: "Mask", "Chowder"
・Ticket related: "Bid", "Yahoo auction", "Price", "yauc", "Ticket", "Sheet" (I wasn't familiar with yauc, but it seems to be a Yahoo Auctions hashtag)
Beyond that, I could roughly pick out the following:
・Perfume song titles: "Cling" (Cling Cling), "Natural", "Love" (In love with Natural)
・Singer related: "Arashi", "Momo", "Kuro", "Sakanaction", "Sekaowa", "Ringo", "Shiina"
・Perfume Music Player related: "Listen", "Music", "Player", "mp"
Next, let's use CaboCha to aggregate dependency relations between clauses. Again, we analyze the full week of tweets.
python twitter_db_cabocha.py "2014/12/11 15:00" "2014/12/17 15:00" > cabocha.txt
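twitter_db_cabocha.py aggregates (modifier phrase, head phrase) pairs from CaboCha's parse. CaboCha also needs a local install, so here is a sketch of just the counting step, with each parsed sentence represented as a list of chunk texts plus head indices (the kind of structure CaboCha's chunk objects expose; the sample data below is made up to mirror the table that follows):

```python
from collections import Counter

# Each sentence: (chunk texts, head index per chunk; -1 marks the root).
# This mimics what you would read off CaboCha's chunks after parsing.
parsed = [
    (["Perfume", "Noru"], [1, -1]),
    (["Perfume", "Noru"], [1, -1]),
    (["natural", "I miss you"], [1, -1]),
]

# Count each (modifier, head) phrase pair across all sentences.
pairs = Counter()
for chunks, links in parsed:
    for i, head in enumerate(links):
        if head >= 0:  # the root chunk depends on nothing, so skip it
            pairs[(chunks[i], chunks[head])] += 1

for (p1, p2), n in pairs.most_common():
    print(p1, p2, n)
```

Sorting the resulting counter by count gives exactly the phrase1 / phrase2 / count table below.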
Let's look at the top 100 in the same way.
phrase1 | phrase2 | count |
---|---|---|
Perfume | Noru | 582 |
During live | Furious www http://t | 535 |
Noru | Furious www http://t | 535 |
●● | Furious www http://t | 535 |
~Chan | During live | 535 |
Ayano Omoto General male | marriage | 532 |
This | Ayano Omoto General male | 532 |
PerfumeMusicPlayer | listen | 138 |
RT#Mr. Pafukura | Connect | 137 |
Absent | http://t | 131 |
Kuu | http://t | 127 |
Mr. Saito | http://t | 127 |
- | Mr. Saito | 127 |
Connect | #Pafukura http://t | 125 |
___VAMPS | ___One Ok | 97 |
Black | ___L'Arc | 97 |
___glee | ___UVERworld | 97 |
___UVERworld | ___VAMPS | 97 |
___Peaches | Black | 97 |
___Bz | ___One Ok | 97 |
___L'Arc | ___Bz | 97 |
___Ringo Sheena___GLAY | ___glee | 93 |
『[BGM for work]perfumemix』 | http://t | 91 |
Love | Keru | 90 |
#Perfume#I like Perfume | Connect | 89 |
[Diffusion hope] ● Explosive event schedule ● Details___http | ://t | 89 |
Nino | Much | 87 |
"secret | Arashi-chan | 86 |
Just just | Keru | 86 |
radio | Keru | 86 |
thing | Kariru | 86 |
"Dengeki Marriage~perfumeoflove~___Episode 1" | http://t | 84 |
~Chan | Keru | 83 |
Ste | ___20131115』http://t | 77 |
___MUSIC | STATION | 77 |
STATION | Ste | 77 |
FC2 video: | Keru | 75 |
___Perfume | ___One Ok | 74 |
natural | I miss you | 68 |
Hama Okamoto | (OKAMOTO'S) | 63 |
Like | thing | 62 |
One song | Vote | 60 |
this year | One song | 60 |
word | 『Perfume』 | 58 |
- | -#prfm#perfume_um | 56 |
Ah | ~Chan | 55 |
Back story Talk | http://t | 50 |
12/20(soil)Osaka Castle Hall | ticket | 50 |
2008-4-5 GAME release | Back story Talk | 50 |
[3 princesses | Back story Talk | 50 |
Pleasant | Back story Talk | 50 |
『PerfumeTalk | 2008-4-5 GAME release | 50 |
PerfumeMusicPlayer | listen | 48 |
Kashino Yuka | To inform | 47 |
~Chan | To inform | 47 |
Perfume | ~Chan | 47 |
Like | Man | 45 |
Noru | To inform | 45 |
natural | I love you | 45 |
hand | connect | 43 |
Sekaowa | Momokuro | 43 |
nice to meet you | please | 43 |
thing | is there | 43 |
Like | One | 41 |
Perfume | Kashino Yuka | 40 |
member | Draw | 38 |
now | Check http://t | 36 |
soon | Check http://t | 36 |
Perfume(Perfume) | ticket | 36 |
Fit | One | 36 |
co/ | 1IoZn9U583 | 35 |
Momokuro | Perfume | 35 |
Qi | Become | 34 |
Perfume | GLAY | 33 |
co/ | 7CRGN21Brf) | 33 |
One | Follow me | 33 |
winter | Era | 33 |
___http | ://t | 32 |
(#Perfume | Kuu | 32 |
Guess | Extreme Maiden | 32 |
H | ___GLAY\720 | 31 |
Noru | #prfm | 31 |
/ | Sandaime J Soul Brothers | 31 |
Kariru | ___GLAY\720 | 31 |
Sandaime J Soul Brothers | ___GLAY\720 | 31 |
(Watts Inn) | 2015 | 31 |
Two persons | Is | 31 |
2015 | January issue[magazine]http://t | 31 |
___TEAM | H | 31 |
Feel free | To follow | 30 |
heart | Sports | 29 |
12/20(soil)Details of Osaka Castle Hall | Here | 29 |
you | 28 | |
Perfume | member | 28 |
word | "Perfume Cosplay" | 28 |
Draw | ☆ Ultimate work ☆(> | 28 |
[Price or less] ★ Transfer ★ Masked Chowder YAJIO CRAZY Chowder University International Collagen High School | 12/20(soil)Osaka Castle Hall | 28 |
Fit | Man | 28 |
thing | Karu | 27 |
Like | Artist | 27 |
Again, dependency pairs with total counts of 500 or more are noisy bot output, so let's skip past them. By the way, the bot tweets look like the following:
Perfume's Nocchi A-chan During the live ●● and rage www http://t.co/4Q0fmhel2l
The full dependency table is hard to read through, so I grep it with the names of the three members. Dependency pairs where either phrase includes "A-chan", "Kashiyuka", or "Nocchi" are shown below. (Only pairs whose phrases were cleanly segmented are extracted.)
phrase1 | phrase2 | count |
---|---|---|
Kashino Yuka | finger | 2 |
Kashino Yuka | hair | 2 |
Kashino Yuka | Shake | 2 |
Kashino Yuka | cute | 1 |
Kashino Yuka | cute | 1 |
Kashino Yuka | Head | 1 |
Kashino Yuka | skirt | 1 |
divine | Kashino Yuka | 1 |
Kashino Yuka | left hand | 1 |
Kashino Yuka | voice | 1 |
Nocchi | Beautiful | 2 |
A-chan | angel | 2 |
A-chan | One piece / collar | 1 |
A-chan's | Dumplings | 1 |
A-chan's | Smile | 1 |
As you can see, Kashiyuka (Kashino Yuka) turned up in by far the most dependency pairs! Looking at these pairs, the individual character of each of the three members comes through, which is wonderful ...
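The grep step over the dependency table can be sketched as a simple substring filter on the (phrase1, phrase2, count) rows. The rows below are a hand-picked sample in the spirit of the tables above, just to make the sketch runnable:

```python
# Sample (phrase1, phrase2, count) rows, mirroring the dependency tables.
rows = [
    ("Kashiyuka", "hair", 2),
    ("Nocchi", "Beautiful", 2),
    ("A-chan", "angel", 2),
    ("Perfume", "Noru", 582),
]
members = ("A-chan", "Kashiyuka", "Nocchi")

# Keep only rows where either phrase mentions one of the three members.
hits = [r for r in rows
        if any(m in r[0] or m in r[1] for m in members)]
for p1, p2, n in hits:
    print(p1, p2, n)
```

Doing this in Python rather than shell grep makes it easy to match against either column only, or to add the "cleanly segmented phrases" condition mentioned above.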
So, with much help from mima_ita, I analyzed tweets about Perfume. In retrospect, it is a pity that the collection conditions I used this time pulled in such a large volume of bot tweets that are useless for analysis ...
My data analysis and Python skills still have a long way to go, so I will buy this book coming out next week and study (`・ω・´)ゞ "Introduction to Programming Using Python: the world-standard MIT textbook"