My idea is that social media data such as Twitter can be used to monitor the risk of new coronavirus infection clusters, so I would like to collect tweets with keyword searches such as "today's drinking party". Since the free search API can only reach tweets from the past week, I built a mechanism that automatically collects the data every day, with possible use in future research in mind.
If you have good ideas for search words that could be used to evaluate the risk of new coronavirus infection clusters, please leave a comment!
Official documentation: https://developer.twitter.com/en/docs/twitter-api
A very easy-to-understand commentary site: https://gaaaon.jp/blog/twitterapi
It is a little sad that this article has not even reached the level of the "code I tried for self-satisfaction" mentioned in the link above, so it may eventually be made private.
Execute the following nomikai_tweets.py:
# coding: utf-8
# nomikai_tweets.py

import json
import datetime
from datetime import timedelta
from time import sleep

import pandas as pd
import pytz
import schedule
from requests_oauthlib import OAuth1Session


def convert_to_datetime(datetime_str):
    """Convert Twitter's created_at string to a timezone-aware datetime."""
    return datetime.datetime.strptime(datetime_str, '%a %b %d %H:%M:%S %z %Y')


def job():
    """Collection job repeated daily from main()."""
    # Search keyword ("today" + "drinking party"); exclude:retweets drops retweets
    keyword = "今日 飲み会 exclude:retweets"
    # Save directory; create it beforehand
    DIR = 'nomikai/'
    # Credentials obtained through API developer registration
    Consumer_key = 'bT*****************'
    Consumer_secret = 've*****************'
    Access_token = '25*****************'
    Access_secret = 'NT*****************'
    url = "https://api.twitter.com/1.1/search/tweets.json"
    twitter = OAuth1Session(Consumer_key, Consumer_secret, Access_token, Access_secret)

    # Parameters used for collection
    max_id = -1
    count = 100
    params = {'q': keyword, 'count': count, 'max_id': max_id, 'lang': 'ja', 'tweet_mode': 'extended'}

    # Day boundaries in Japan time (JST), to compare against tweet times (UTC)
    today = datetime.datetime.now(pytz.timezone('Asia/Tokyo'))
    today_beginning_of_day = today.replace(hour=0, minute=0, second=0, microsecond=0)
    yesterday_beginning_of_day = today_beginning_of_day - timedelta(days=1)
    yesterday_str = datetime.datetime.strftime(yesterday_beginning_of_day, '%Y-%m-%d')

    # One record per tweet; the DataFrame is assembled after the while loop
    columns = ['time', 'user.id', 'user.location', 'full_text',
               'user.followers_count', 'user.friends_count', 'user.description', 'id']
    records = []

    while True:
        if max_id != -1:  # Resume below the lowest tweet id collected so far
            params['max_id'] = max_id - 1
        req = twitter.get(url, params=params)

        if req.status_code == 200:  # Fetched normally
            search_timeline = json.loads(req.text)
            if search_timeline['statuses'] == []:  # All tweets consumed
                break
            for tweet in search_timeline['statuses']:
                # Tweet time, timezone-aware in UTC
                tweet_datetime = convert_to_datetime(tweet['created_at'])
                # Skip tweets that are not from yesterday in JST
                in_jst_yesterday = today_beginning_of_day > tweet_datetime >= yesterday_beginning_of_day
                if not in_jst_yesterday:
                    continue
                records.append([tweet_datetime,
                                tweet['user']['id'],
                                tweet['user']['location'],
                                tweet['full_text'],
                                tweet['user']['followers_count'],
                                tweet['user']['friends_count'],
                                tweet['user']['description'],
                                tweet['id']])
            max_id = search_timeline['statuses'][-1]['id']
        else:  # Wait 15 minutes when the rate limit is hit
            print("Total", len(records), "tweets were extracted", sep=" ")
            print('waiting for 15 min ...')
            sleep(15 * 60)

    # Save
    df = pd.DataFrame(records, columns=columns).set_index("time")
    df.index = df.index.tz_convert('Asia/Tokyo')
    df.to_pickle(DIR + yesterday_str + keyword + ".pkl")
    df.to_csv(DIR + yesterday_str + keyword + ".csv")
    print(today, "Total", df.shape[0], "tweets were extracted!\nnext start at 01:00 tomorrow")


def main():
    print("start at 01:00 tomorrow")
    # Run at 01:00 every day
    schedule.every().day.at("01:00").do(job)
    while True:
        schedule.run_pending()
        sleep(1)


if __name__ == '__main__':
    main()
I want to start analyzing as soon as the data is collected. With only one week of past data, it is not even possible to evaluate periodic changes by day of the week.
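For example, once a few days of pickles have accumulated, a daily tweet count can be produced in a few lines. This is a minimal sketch of my own, assuming the nomikai/ file layout used by the script above:

import glob
import pandas as pd

# Concatenate every daily pickle written by nomikai_tweets.py
df = pd.concat(pd.read_pickle(p) for p in sorted(glob.glob('nomikai/*.pkl')))

# Tweets per day in JST; the index was tz-converted to Asia/Tokyo on save
daily_counts = df.resample('D').size()
print(daily_counts)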
I learned that you have to be careful about the difference between Japan time (JST) and UTC when collecting data every day.
Note that datetime.datetime.now() depends on the environment in which the program runs, so running this source on a machine in another country would not work properly. The same applies to schedule.every().day.at("01:00").do(job).
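As a defensive sketch (my own addition, not part of the original script): passing an explicit timezone to now() removes the dependence on the machine's local clock, although schedule still interprets at("01:00") in the machine's local time.

import datetime
import pytz

JST = pytz.timezone('Asia/Tokyo')

# Naive local time: result depends on the OS timezone of the machine
print(datetime.datetime.now())

# Timezone-aware JST time: the same wall-clock value on any machine
print(datetime.datetime.now(JST))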
Of the tweets containing "today" and "drinking party" that could be extracted for past dates, about 10% also contained "online". It also seems that many Twitter users dislike company drinking parties.
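A share like the 10% above can be reproduced with a simple text match over the collected tweets. A sketch; using オンライン ("online") as the match keyword is my assumption:

import glob
import pandas as pd

# Load all collected days (file layout assumed from the script above)
df = pd.concat(pd.read_pickle(p) for p in glob.glob('nomikai/*.pkl'))

# Share of collected tweets whose text mentions "online"
online_share = df['full_text'].str.contains('オンライン').mean()
print(f"{online_share:.1%}")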