While collecting tweets with the Twitter Search API, I ran into a pitfall in an unexpected place. This post uses the Search API to fetch tweets, so first, a brief summary of the Search API specifications.
There are three types of Search API:

- Standard Search API (free)
- Premium Search API (paid)
- Enterprise Search API (paid)
This time, we will use the Standard Search API, which can be used for free.
- Free to use
- There is a limit on the number of requests (a small sketch for checking the remaining quota follows below)
  - 180 requests / 15 minutes with user auth (OAuth 1.0a)
  - 450 requests / 15 minutes with app auth (OAuth 2.0)
- **You can only get tweets from the last 7 days**
  - The paid APIs let you retrieve older tweets
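If you want to check how many requests you have left in the current window, the v1.1 endpoints report it in the response headers (`x-rate-limit-remaining` and `x-rate-limit-reset`). A minimal sketch, assuming the same environment variables as the script later in this post:

```python
from requests_oauthlib import OAuth1Session
import os

# Assumes the API keys are stored in these environment variables (names are an example)
twitter = OAuth1Session(os.environ['CONSUMER_KEY'], os.environ['CONSUMER_SECRET'],
                        os.environ['ACCESS_TOKEN'], os.environ['ACCESS_SECRET'])

res = twitter.get('https://api.twitter.com/1.1/search/tweets.json',
                  params={'q': '#Qiita', 'count': 1})

# Twitter v1.1 reports the remaining quota for the endpoint in these headers
print(res.headers.get('x-rate-limit-remaining'))  # requests left in the 15-minute window
print(res.headers.get('x-rate-limit-reset'))      # UNIX time when the window resets
```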
The main request parameters:

Parameters | Description | Remarks |
---|---|---|
q | Search query (required) | You can search just as in Twitter's tweet search; only strings are accepted |
geocode | Location of the tweet | Specified by latitude, longitude, and radius |
lang | Language of the tweets | |
locale | Language of the query | Currently only `ja` (Japanese) is valid |
result_type | Type of tweets to retrieve | `recent` returns the latest tweets, `popular` returns popular tweets, `mixed` returns both |
count | Number of tweets to retrieve | Default is 15, maximum is 100 |
until | Tweet date | Returns tweets from before YYYY-MM-DD (cannot go back more than 7 days) |
since_id | ID value | Returns tweets with IDs larger than the specified value |
max_id | ID value | Returns tweets with IDs smaller than the specified value |
include_entities | Whether to include entities | If `false`, tweets are returned without the entities information |
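As a concrete illustration of how these are used (the values are just an example), a request for the 100 most recent Japanese-language tweets matching a hashtag could be built like this:

```python
# Example parameter set: 100 most recent Japanese-language tweets for '#Qiita'
params = {
    'q': '#Qiita',            # search query (required)
    'lang': 'ja',             # language of the tweets
    'result_type': 'recent',  # 'recent', 'popular', or 'mixed'
    'count': 100,             # default 15, maximum 100
}
# The dict is passed straight to requests, e.g. twitter.get(SEARCH_URL, params=params)
```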
The response contains two top-level fields:

Parameters | Description | Remarks |
---|---|---|
statuses | List of tweets | The tweet objects are stored in a list |
search_metadata | Search metadata | Metadata about the search itself |

An example response body:
```json
{
"statuses": [
(Omitted because it is a tweet object),
...
],
"search_metadata": {
"max_id": 250126199840518145,
"since_id": 24012619984051000,
"refresh_url": "?since_id=250126199840518145&q=%23freebandnames&result_type=mixed&include_entities=1",
"next_results": "?max_id=249279667666817023&q=%23freebandnames&count=4&include_entities=1&result_type=mixed",
"count": 4,
"completed_in": 0.035,
"since_id_str": "24012619984051000",
"query": "%23freebandnames",
"max_id_str": "250126199840518145"
}
}
```
I was trying to collect a large number of tweets with the hashtag #Qiita.
However, the Standard Search API can only retrieve up to 100 tweets in a single request.
So I tried to collect 1,000 tweets by calling the API repeatedly, using the response field `next_results`.
`next_results` holds a query string, and executing that query returns the 101st and subsequent tweets.
In other words:

request → response → parse `next_results` → use it as the next request's parameters → request → ...

and this repeats until 1,000 tweets have been collected.
(Reference: Get more than 100 tweets with Twitter API search / tweets (PHP))
However, **the request only ran three times**, and I could only get about 200 tweets! (The number of tweets in the third response was 0.) Even though there are clearly more than 200 tweets with that hashtag...
The code was written in Python. In addition, various API keys are registered in environment variables.
```python:get_tweet.py
from requests_oauthlib import OAuth1Session
import os
import json

# API keys (read from environment variables; the variable names are an example)
CONSUMER_KEY = os.environ['CONSUMER_KEY']        # API key
CONSUMER_SECRET = os.environ['CONSUMER_SECRET']  # API secret
ACCESS_TOKEN = os.environ['ACCESS_TOKEN']
ACCESS_SECRET = os.environ['ACCESS_SECRET']

# Endpoint for searching tweets
SEARCH_URL = 'https://api.twitter.com/1.1/search/tweets.json'


def search(params):
    twitter = OAuth1Session(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_SECRET)
    req = twitter.get(SEARCH_URL, params=params)
    tweets = json.loads(req.text)
    return tweets


# Substitute for PHP's parse_str function: turn 'a=1&b=2' into {'a': '1', 'b': '2'}
def parseToParam(parse_str, parse=None):
    if parse is None:
        parse = '&'
    return_params = {}
    parsed_str = parse_str.split(parse)
    for param_string in parsed_str:
        param, value = param_string.split('=', 1)
        return_params[param] = value
    return return_params


def main():
    search_word = '#Qiita'
    tweet_data = []

    # Tweet search
    params = {
        'q': search_word,
        'count': 100,
    }
    tweet_count = 0
    while tweet_count < 1000:
        tweets = search(params)
        for tweet in tweets['statuses']:
            tweet_data.append(tweet)
        # Parse tweets['search_metadata']['next_results'] into the next request's params
        if 'next_results' in tweets['search_metadata'].keys():
            next_results = tweets['search_metadata']['next_results']
            next_results = next_results.lstrip('?')  # strip the leading '?'
            params = parseToParam(next_results)
            tweet_count += len(tweets['statuses'])
        else:
            break


if __name__ == '__main__':
    main()
```
Since the response field `next_results` is used to build the next request's parameters, let's check two things at each step (a small debugging sketch for printing them follows below):

- the request parameters
- the response field `next_results`
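A simple way to see both values yourself (debugging only, not part of the final script) is to print them inside the while loop, right after the `search()` call:

```python
# Inside the while loop, right after tweets = search(params)  (debugging only)
print('request params:', params)
print('next_results  :', tweets['search_metadata'].get('next_results'))
```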
First request: parameters
{
'q' : '#Qiita',
'count': 100
}
Response `next_results`:
?max_id=1250763045871079425&q=%23Qiita&count=100&include_entities=1
Second request: parameters
{
'max_id': '1250763045871079425',
'q' : '%23Qiita',
'count': 100,
'include_entities': '1'
}
Response `next_results`:
?max_id=1250673475351572480&q=%2523Qiita&count=100&include_entities=1
Third request: parameters
{
'max_id': '1250673475351572480',
'q' : '%2523Qiita',
'count': 100,
'include_entities': '1'
}
Response `next_results`:
None
Originally, the same query should be carried over from request to request, but apparently it changes each time:

`#Qiita` → `%23Qiita` → `%2523Qiita`

`#Qiita` and `%23Qiita` are interchangeable (the latter is just the URL-encoded form), but `%2523Qiita` is a completely different query.
(You can confirm this with the URL encode/decode tool linked in the references.)

In other words, the cause of the problem is that the `%` itself gets **URL-encoded a second time** (`%` → `%25`) somewhere between `%23Qiita` and `%2523Qiita`.
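You can reproduce the same progression with the standard library's `urllib.parse` (just an illustration of the encoding behaviour, not part of the collection script):

```python
from urllib.parse import quote, unquote

print(quote('#Qiita'))        # '%23Qiita'  : '#' is encoded once
print(quote('%23Qiita'))      # '%2523Qiita': encoding again turns '%' into '%25'
print(unquote('%2523Qiita'))  # '%23Qiita'  : decoding once only gets back the encoded form
```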
So, after parsing the response field `next_results`, **replace `%25` in the request parameter `q` with `%`**.

The replacement is added inside the while loop:
```python:get_tweet.py
    while tweet_count < 1000:
        tweets = search(params)
        for tweet in tweets['statuses']:
            tweet_data.append(tweet)
        # Parse tweets['search_metadata']['next_results'] into the next request's params
        if 'next_results' in tweets['search_metadata'].keys():
            next_results = tweets['search_metadata']['next_results']
            next_results = next_results.lstrip('?')  # strip the leading '?'
            params = parseToParam(next_results)
            # Undo the extra encoding: put '%25' back to '%'
            params['q'] = params['q'].replace('%25', '%')
            tweet_count += len(tweets['statuses'])
        else:
            break
```
In the query `q` contained in the response field `next_results`, the already URL-encoded `%` was encoded one more time.
As a result, the query was not carried over correctly and the tweet collection stopped early.
The fix was to restore the doubly encoded `%25` back to `%` with a simple string replacement.
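As an aside, an alternative I did not try here: the standard library's `urllib.parse.parse_qsl` both splits a query string and percent-decodes the values, so it should sidestep the double encoding without the manual `%25` replacement (a sketch under that assumption):

```python
from urllib.parse import parse_qsl

next_results = '?max_id=1250763045871079425&q=%23Qiita&count=100&include_entities=1'
# parse_qsl percent-decodes while parsing, so 'q' comes back as '#Qiita'
params = dict(parse_qsl(next_results.lstrip('?')))
print(params['q'])  # '#Qiita'; requests re-encodes it correctly on the next call
```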
References:

- Twitter API official documentation: https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets
- Twitter developer documentation (Japanese translation): http://westplain.sakuraweb.com/translate/twitter/Documentation/REST-APIs/Public-API/GET-search-tweets.cgi
- Get more than 100 tweets with Twitter API search/tweets (PHP): https://blog.apar.jp/php/3007/
- URL encode/decode tool: https://tech-unlimited.com/urlencode.html