- [x] Program ready
- [x] Fixed bugs in the library
- [x] Started execution!
- [x] Any problems around the first peak?
- [ ] Do some aggregation for now
- [ ] Take a peek at the data from the outside
About a week has passed, and so far the collector seems to be running without trouble. What I want next is the current number of stored tweets, or better yet, an hourly total.
>>> db.xxx.count()
This shows that the count itself is increasing, but it tells me nothing precise. Somehow I need to turn this into an hourly tweet count.
MongoDB apparently offers several aggregation methods, but for now I will try aggregating with MapReduce, following a thin reference book. ...This is more or less my first time touching JavaScript, so I googled my way through it.
check.js
db.xxxxx.mapReduce(
    // map: emit one count per tweet, keyed by the hour it was created (converted to JST)
    function() {
        // created_at is a UTC string; add 9 hours (32,400,000 ms) to shift it to JST
        var d = new Date(Date.parse(this.created_at) + 32400000);
        var p = d.getFullYear() + "/" + ("0" + (d.getMonth() + 1)).slice(-2) + "/" + ("0" + d.getDate()).slice(-2) + " " + ("0" + d.getHours()).slice(-2) + ":00:00";
        emit(p, 1);
    },
    // reduce: sum the counts for each hour
    function(key, values) {
        return Array.sum(values);
    },
    {
        query: {},
        out: "TweetsCount"
    }
)
Something like this. (Is building the date-time string by hand really the only way to format dates in JavaScript?)
# mongo localhost/TwitterDB check.js
Run it like this and the result is stored in a separate collection, "TweetsCount".
(Excerpt around the point where the peak occurred)
{ "_id" : "2016/10/28 13:00:00", "value" : 67 }
{ "_id" : "2016/10/28 14:00:00", "value" : 47 }
{ "_id" : "2016/10/28 15:00:00", "value" : 102 }
{ "_id" : "2016/10/28 16:00:00", "value" : 103 }
{ "_id" : "2016/10/28 17:00:00", "value" : 2850 }
{ "_id" : "2016/10/28 18:00:00", "value" : 5317 }
{ "_id" : "2016/10/28 19:00:00", "value" : 4324 }
So the hourly counts can be obtained. The totals feel a bit small overall, though (I would have expected the peak to be about an order of magnitude larger... maybe because of the search keyword). Running the job at this scale does not seem to be a significant load, although that may change once 100 days' worth of data has piled up...
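As an aside, since MongoDB offers more than one way to aggregate: a minimal client-side sketch in Python that produces the same hourly totals without MapReduce might look like the following. The host, DB, and collection names are placeholders taken from the snippets in this post, and it assumes created_at is Twitter's usual string format.

```python
#!/usr/bin/env python
# -*- coding:utf-8 -*-
# Hypothetical alternative: compute the hourly totals on the client instead of MapReduce.
from collections import Counter
from datetime import datetime, timedelta
from pymongo import MongoClient

col = MongoClient('mongo', 27017)['TwitterDB']['xxxxxx']   # placeholder names

counts = Counter()
for doc in col.find({}, {'created_at': 1}):
    # created_at is a UTC string such as "Fri Oct 28 08:15:30 +0000 2016"
    d = datetime.strptime(doc['created_at'], '%a %b %d %H:%M:%S +0000 %Y') + timedelta(hours=9)
    counts[d.strftime('%Y/%m/%d %H:00:00')] += 1           # bucket by JST hour

for hour, n in sorted(counts.items()):
    print(hour, n)
```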
The collection program currently running prints nothing to standard output, so there is no way to see what is being stored. You can look directly into MongoDB, but that does not show the acquisition status in real time, and the raw JSON is full of fields I do not need, so frankly it is hard to read.
What I want is a script that lets me check the content of incoming tweets in real time (or close to it).
There are surely various ways to do this, but a quick, unpretentious script that shows only what I want to see looks like this (about 2 hours of implementation).
InsertChecker.py
#!/usr/bin/env python
# -*- coding:utf-8 -*-
from pymongo import MongoClient
import pymongo
import sys
import time

# Variables for the MongoDB connection
HOST = 'mongo'          # host
PORT = 27017            # port (default: 27017)
DB_NAME = 'TwitterDB'   # DB name
COL_NAME = 'xxxxxx'     # collection name

# ------ Main processing starts here ------
try:
    Client = MongoClient(HOST, PORT)
    db = Client[DB_NAME]
    Collection = db[COL_NAME]
    print('DB ready')
    twCnt = Collection.count()  # get the initial document count
except pymongo.errors.PyMongoError:
    # connection error: nothing more we can do, so bail out
    print('DB connection error')
    sys.exit(1)

while True:  # infinite loop
    try:
        if Collection.count() > twCnt:
            # format and print only the documents added since the last check
            for doc in Collection.find(skip=twCnt):
                if "retweeted_status" in doc:
                    if "extended_tweet" in doc["retweeted_status"]:
                        # RT of an extended (long) tweet
                        print("RT @" + doc["retweeted_status"]["user"]["screen_name"] + ": "
                              + doc["retweeted_status"]["extended_tweet"]["full_text"])
                    else:
                        # RT of a standard (short) tweet
                        print("RT @" + doc["retweeted_status"]["user"]["screen_name"] + ": "
                              + doc["retweeted_status"]["text"])
                else:
                    if "extended_tweet" in doc:
                        # extended (long) tweet
                        print(doc["extended_tweet"]["full_text"])
                    else:
                        # standard (short) tweet
                        print(doc["text"])
                print(u"{name}({screen}) {created} via {src}".format(
                    name=doc["user"]["name"], screen=doc["user"]["screen_name"],
                    created=doc["created_at"], src=doc["source"]))
                print(u"--------------------------------------------------")
            twCnt = Collection.count()
        # check roughly every 3 seconds (10 seconds might be enough, depending on the flow)
        time.sleep(3)
    except KeyboardInterrupt:
        # terminated with CTRL+C
        print('Terminated with CTRL+C.')
        break
(The style of the print() calls changes midway because I copy-pasted from a mix of reference materials.) The approach is simple: check the collection's document count periodically, and if it has grown, format and print the documents that were added.
Because of the extended-tweet specification change (longer tweet text), the if statements above branch into four patterns (a condensed sketch follows this list):
- a retweet of an extended (long) tweet
- a retweet of a standard (short) tweet
- a normal extended (long) tweet
- a normal standard (short) tweet
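To make that branching a bit more compact, here is a minimal sketch of the same logic pulled out into a helper function (the function name is mine, not part of the script above):

```python
def tweet_text(doc):
    """Return the display text of a stored tweet, covering the four patterns above."""
    if "retweeted_status" in doc:
        rt = doc["retweeted_status"]
        # retweet: prefer the extended (long) text, fall back to the short text
        body = rt.get("extended_tweet", {}).get("full_text", rt["text"])
        return "RT @" + rt["user"]["screen_name"] + ": " + body
    # normal tweet: prefer the extended (long) text, fall back to the short text
    return doc.get("extended_tweet", {}).get("full_text", doc["text"])
```

Calling print(tweet_text(doc)) inside the loop would then replace the whole if/else block.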
After that it prints the timestamp and the user name. Even with this rough-and-ready approach, the output scrolls by like "tail -f". Wonderful!
This already does what I set out to do, but it is human nature to keep chasing convenience.
For instance, how many times has the tweet being retweeted actually been retweeted? I want to see that.
print("\n<RT: " + str(doc["retweeted_status"]["retweet_count"]) + " cnt, Fav: " + str(doc["retweeted_status"]["favorite_count"]) + "cnt >")
Insert something like this into the branches that handle retweets (matching the indentation), and you can watch the count go up each time the tweet is retweeted.
The timestamp in the JSON sent by Twitter is hard to read for someone in Japan (it is a UTC string), so I converted it.
d = datetime.strptime(doc["created_at"],'%a %b %d %H:%M:%S +0000 %Y') + timedelta(hours=9)
After converting it to a datetime, replace the part that fills in the time:
created=d.strftime("%Y/%m/%d %H:%M:%S")
Swap it in like this and you are done. Don't forget "from datetime import datetime, timedelta".
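Put together, the conversion looks like this (a standalone sketch; the sample timestamp is made up):

```python
from datetime import datetime, timedelta

created_at = "Fri Oct 28 08:15:30 +0000 2016"   # example of Twitter's created_at string (UTC)
d = datetime.strptime(created_at, '%a %b %d %H:%M:%S +0000 %Y') + timedelta(hours=9)  # shift to JST
print(d.strftime("%Y/%m/%d %H:%M:%S"))          # -> 2016/10/28 17:15:30
```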
The source field shows which client the tweet was sent from, but it comes wrapped in HTML tags, which makes it hard to read. Add "import re" and apply a regular expression where doc["source"] is used, and the tags vanish nicely.
src=re.sub("<.+?>", "", doc["source"])
(In principle the field could end up containing nothing but tags, but it has worked fine so far.)
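For reference, a quick check of what the substitution does to a typical source value (the example string is just an illustration):

```python
import re

source = '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>'
print(re.sub("<.+?>", "", source))   # -> Twitter for iPhone
```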
For now, this gives me:
- hourly tweet counts
- a real-time display of the tweets being collected

It is mentally reassuring to actually see the tweets flowing in.
Next goals (a rough sketch for the first one follows below):
- extract the tweets that are currently being retweeted the most
- gather statistics on Twitter client usage
- extract store names from the tweets and count how often each is mentioned

I will chip away at these when I find the time.
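For the first item, a possible starting point (purely my own sketch, not something already implemented here) would be to scan the stored retweets and keep the highest retweet_count seen for each original tweet:

```python
from collections import Counter
from pymongo import MongoClient

col = MongoClient('mongo', 27017)['TwitterDB']['xxxxxx']   # placeholder names

best = Counter()
for doc in col.find({"retweeted_status": {"$exists": True}},
                    {"retweeted_status.id_str": 1, "retweeted_status.retweet_count": 1}):
    rt = doc["retweeted_status"]
    # keep the largest retweet_count observed for each original tweet ID
    best[rt["id_str"]] = max(best[rt["id_str"]], rt["retweet_count"])

for tweet_id, count in best.most_common(5):
    print(tweet_id, count)
```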
(To be continued, persistently.)