Be careful when retrieving tweets at regular intervals with the Twitter API

In hindsight it's quite obvious, but it tripped me up badly, so I'm leaving this here as a note.

What happened

I wrote a Python script that uses the Twitter API to fetch tweets every 15 minutes and save them to a database, intending to use them in a deep learning project. However, the number of tweets fetched would sometimes spike abnormally. Normally it averages about 20 tweets per 15-minute window, but occasionally a single run would pull in 600 to 700 tweets. This happened almost every day, but never at a fixed time, and an unpredictable number of times per day.

The tweet-fetching script I was using

The script saved the ID of the newest tweet from the previous run, and on the next run it walked backwards from the current newest tweet until it reached that saved tweet.

-------------------- Current run --------------------
Tweet 1   ID 899673612013064192  <- walk backwards from here
Tweet 2   ID 899673575619141633
Tweet 3   ID 899673508619276288
  .                .
  .                .
  .                .
Tweet n   ID 899669914251796480
-------------------- Previous run --------------------
Tweet 1'  ID 899669914251796480  <- stop when this ID is reached
Tweet 2'  ID 899669747448414209
Tweet 3'  ID 899669628170911750
  .                .
  .                .
  .                .
Tweet n'  ID 899668363969941506

As shown above, up to 100 tweets are fetched starting from the newest one, and iteration stops as soon as a tweet's ID matches the newest ID saved in the previous run. In Python it looked like this:

fetch.py


for tweet in fetched:
    if tweet["id_str"] == last_time_id:  # last_time_id is saved as a string
        break  # stop once we reach the newest tweet from the previous run
    else:
        tweets.append(tweet)
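For context, fetched above is just the list of statuses returned by the API. A minimal sketch of how it might be obtained, assuming the v1.1 statuses/user_timeline endpoint called through the requests library with OAuth 1.0a; the credentials, screen_name, and the fetch_timeline helper are illustrative placeholders, not from the original script:

fetch_timeline.py (hypothetical sketch)


import requests
from requests_oauthlib import OAuth1

# Placeholder credentials -- substitute your own app's keys.
auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")

def fetch_timeline(count=100):
    # Fetch up to `count` of the newest tweets, newest first.
    resp = requests.get(
        "https://api.twitter.com/1.1/statuses/user_timeline.json",
        params={"screen_name": "example_user", "count": count},
        auth=auth,
    )
    resp.raise_for_status()
    return resp.json()  # each status carries both "id" (int) and "id_str" (str)

fetched = fetch_timeline()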

Cause and fix

The cause was that the tweet with last_time_id had been deleted before the next run (or, if it was a retweet, un-retweeted). Once that happens, no tweet carries that ID anymore, so the comparison against tweet["id_str"] never succeeds and the loop keeps appending older and older tweets without ever breaking. id_str in the API response is the string version of id, which is natively a 64-bit integer; the string form exists because some languages (notably JavaScript) cannot represent integers that large exactly. Relying on exact string equality with id_str was the root of the bug. Tweet IDs are numbers that increase over time, so comparing them numerically makes the bug disappear:

fetch_fixed.py


for tweet in fetched:
    if tweet["id"] <= int(last_time_id):  # compare as numbers, not strings
        break  # everything at or below the saved ID was fetched last time
    else:
        tweets.append(tweet)

All I changed was comparing both sides as numbers and using <= instead of ==. Now, even if the tweet with ID last_time_id has been deleted, the tweet just before it has a smaller ID, so the loop still breaks at the right point.
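Incidentally, tweet IDs increase over time because they come from Twitter's Snowflake scheme (a millisecond timestamp occupies the ID's high bits), and the v1.1 endpoints also accept a since_id parameter that performs this cut-off server-side: only tweets with an ID strictly greater than since_id are returned, so a deleted boundary tweet causes no trouble. A sketch of that variant, reusing the hypothetical auth object and endpoint from the earlier snippet:

fetch_since.py (hypothetical sketch)


def fetch_since(last_time_id, count=100):
    # Let the API itself drop everything at or below the saved ID.
    resp = requests.get(
        "https://api.twitter.com/1.1/statuses/user_timeline.json",
        params={
            "screen_name": "example_user",  # placeholder account
            "count": count,
            "since_id": int(last_time_id),  # only IDs greater than this come back
        },
        auth=auth,  # OAuth1 object from the previous sketch
    )
    resp.raise_for_status()
    return resp.json()

tweets = fetch_since(last_time_id)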
