My idea is that social media data such as Twitter can be used to monitor the risk of new coronavirus infection clusters, so I would like to collect tweets with keyword searches such as "today's drinking party". Since the free search API can only reach tweets from the past week, I built a mechanism that automatically collects the data every day, with possible use in future research in mind.
If you have good ideas for search words that could be used to evaluate the risk of new coronavirus infection clusters, please leave a comment!
Official documentation: https://developer.twitter.com/en/docs/twitter-api
A very easy-to-understand commentary site: https://gaaaon.jp/blog/twitterapi
It is a little sad that this article has not even reached the level of the "code I tried for self-satisfaction" mentioned in the link above, so it may eventually be made private.
Execute the following nomikai_tweets.py:
# coding: utf-8
# nomikai_tweets.py

import json
import datetime
from datetime import timedelta
from time import sleep

import pandas as pd
import pytz
import schedule
from requests_oauthlib import OAuth1Session


def convert_to_datetime(datetime_str):
    """Convert Twitter's created_at string to a timezone-aware datetime."""
    return datetime.datetime.strptime(datetime_str, '%a %b %d %H:%M:%S %z %Y')


def job():
    """Collection job repeated daily from main()."""
    # Search keyword ("today" + "drinking party"); exclude:retweets drops retweets
    keyword = "今日 飲み会 exclude:retweets"
    # Save directory; create it beforehand
    DIR = 'nomikai/'
    # Credentials obtained through API developer registration
    Consumer_key = 'bT*****************'
    Consumer_secret = 've*****************'
    Access_token = '25*****************'
    Access_secret = 'NT*****************'
    url = "https://api.twitter.com/1.1/search/tweets.json"
    twitter = OAuth1Session(Consumer_key, Consumer_secret, Access_token, Access_secret)

    # Parameters used for collection
    max_id = -1
    count = 100
    params = {'q': keyword, 'count': count, 'max_id': max_id, 'lang': 'ja', 'tweet_mode': 'extended'}

    # Day boundaries in Japan time (JST), to compare against tweet times (UTC)
    today = datetime.datetime.now(pytz.timezone('Asia/Tokyo'))
    today_beginning_of_day = today.replace(hour=0, minute=0, second=0, microsecond=0)
    yesterday_beginning_of_day = today_beginning_of_day - timedelta(days=1)
    yesterday_str = datetime.datetime.strftime(yesterday_beginning_of_day, '%Y-%m-%d')

    # One record per tweet; the DataFrame is assembled after the while loop
    columns = ['time', 'user.id', 'user.location', 'full_text',
               'user.followers_count', 'user.friends_count', 'user.description', 'id']
    records = []

    while True:
        if max_id != -1:  # Resume below the lowest tweet id collected so far
            params['max_id'] = max_id - 1
        req = twitter.get(url, params=params)

        if req.status_code == 200:  # Fetched normally
            search_timeline = json.loads(req.text)
            if search_timeline['statuses'] == []:  # All tweets consumed
                break
            for tweet in search_timeline['statuses']:
                # Tweet time, timezone-aware in UTC
                tweet_datetime = convert_to_datetime(tweet['created_at'])
                # Skip tweets that are not from yesterday in JST
                in_jst_yesterday = today_beginning_of_day > tweet_datetime >= yesterday_beginning_of_day
                if not in_jst_yesterday:
                    continue
                records.append([tweet_datetime,
                                tweet['user']['id'],
                                tweet['user']['location'],
                                tweet['full_text'],
                                tweet['user']['followers_count'],
                                tweet['user']['friends_count'],
                                tweet['user']['description'],
                                tweet['id']])
            max_id = search_timeline['statuses'][-1]['id']
        else:  # Wait 15 minutes when the rate limit is hit
            print("Total", len(records), "tweets were extracted", sep=" ")
            print('waiting for 15 min ...')
            sleep(15 * 60)

    # Save
    df = pd.DataFrame(records, columns=columns).set_index("time")
    df.index = df.index.tz_convert('Asia/Tokyo')
    df.to_pickle(DIR + yesterday_str + keyword + ".pkl")
    df.to_csv(DIR + yesterday_str + keyword + ".csv")
    print(today, "Total", df.shape[0], "tweets were extracted!\nnext start at 01:00 tomorrow")


def main():
    print("start at 01:00 tomorrow")
    # Run at 01:00 every day
    schedule.every().day.at("01:00").do(job)
    while True:
        schedule.run_pending()
        sleep(1)


if __name__ == '__main__':
    main()
I want to start analyzing as soon as the data is collected. With only one week of past data, it is not even possible to evaluate periodic changes by day of the week.
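For example, once a few days of pickles have accumulated, a daily tweet count can be produced in a few lines. This is a minimal sketch of my own, assuming the nomikai/ file layout used by the script above:

import glob
import pandas as pd

# Concatenate every daily pickle written by nomikai_tweets.py
df = pd.concat(pd.read_pickle(p) for p in sorted(glob.glob('nomikai/*.pkl')))

# Tweets per day in JST; the index was tz-converted to Asia/Tokyo on save
daily_counts = df.resample('D').size()
print(daily_counts)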
I learned that you have to be careful about the difference between Japan time (JST) and UTC when collecting data every day.
Note that datetime.datetime.now() depends on the environment in which the program runs, so running this source on a machine in another country would not work properly. The same applies to schedule.every().day.at("01:00").do(job).
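As a defensive sketch (my own addition, not part of the original script): passing an explicit timezone to now() removes the dependence on the machine's local clock, although schedule still interprets at("01:00") in the machine's local time.

import datetime
import pytz

JST = pytz.timezone('Asia/Tokyo')

# Naive local time: result depends on the OS timezone of the machine
print(datetime.datetime.now())

# Timezone-aware JST time: the same wall-clock value on any machine
print(datetime.datetime.now(JST))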
Of the tweets containing "today" and "drinking party" that could be extracted for past dates, about 10% also contained "online". It also seems that many Twitter users dislike company drinking parties.
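A share like the 10% above can be reproduced with a simple text match over the collected tweets. A sketch; using オンライン ("online") as the match keyword is my assumption:

import glob
import pandas as pd

# Load all collected days (file layout assumed from the script above)
df = pd.concat(pd.read_pickle(p) for p in glob.glob('nomikai/*.pkl'))

# Share of collected tweets whose text mentions "online"
online_share = df['full_text'].str.contains('オンライン').mean()
print(f"{online_share:.1%}")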