While collecting tweets with the Twitter Search API, I ran into a pitfall in an unexpected place. This post uses the Search API to fetch tweets, so first, a brief summary of the Search API specifications.
There are three types of Search API:

- Standard Search API (free)
- Premium Search API (paid)
- Enterprise Search API (paid)
This time, we will use the Standard Search API, which can be used for free.
- Free to use
- There is a limit on the number of requests (a small sketch for checking the remaining quota follows below)
  - 180 requests / 15 minutes with user auth (OAuth 1.0a)
  - 450 requests / 15 minutes with app auth (OAuth 2.0)
- **You can only get tweets from the last 7 days**
  - The paid APIs let you retrieve older tweets
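If you want to check how many requests you have left in the current window, the v1.1 endpoints report it in the response headers (`x-rate-limit-remaining` and `x-rate-limit-reset`). A minimal sketch, assuming the same environment variables as the script later in this post:

```python
from requests_oauthlib import OAuth1Session
import os

# Assumes the API keys are stored in these environment variables (names are an example)
twitter = OAuth1Session(os.environ['CONSUMER_KEY'], os.environ['CONSUMER_SECRET'],
                        os.environ['ACCESS_TOKEN'], os.environ['ACCESS_SECRET'])

res = twitter.get('https://api.twitter.com/1.1/search/tweets.json',
                  params={'q': '#Qiita', 'count': 1})

# Twitter v1.1 reports the remaining quota for the endpoint in these headers
print(res.headers.get('x-rate-limit-remaining'))  # requests left in the 15-minute window
print(res.headers.get('x-rate-limit-reset'))      # UNIX time when the window resets
```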
The main request parameters:

Parameters | Description | Remarks |
---|---|---|
q | Search query (required) | You can search just as in Twitter's tweet search; only strings are accepted |
geocode | Location of the tweet | Specified by latitude, longitude, and radius |
lang | Language of the tweets | |
locale | Language of the query | Currently only `ja` (Japanese) is valid |
result_type | Type of tweets to retrieve | `recent` returns the latest tweets, `popular` returns popular tweets, `mixed` returns both |
count | Number of tweets to retrieve | Default is 15, maximum is 100 |
until | Tweet date | Returns tweets from before YYYY-MM-DD (cannot go back more than 7 days) |
since_id | ID value | Returns tweets with IDs larger than the specified value |
max_id | ID value | Returns tweets with IDs smaller than the specified value |
include_entities | Whether to include entities | If `false`, tweets are returned without the entities information |
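As a concrete illustration of how these are used (the values are just an example), a request for the 100 most recent Japanese-language tweets matching a hashtag could be built like this:

```python
# Example parameter set: 100 most recent Japanese-language tweets for '#Qiita'
params = {
    'q': '#Qiita',            # search query (required)
    'lang': 'ja',             # language of the tweets
    'result_type': 'recent',  # 'recent', 'popular', or 'mixed'
    'count': 100,             # default 15, maximum 100
}
# The dict is passed straight to requests, e.g. twitter.get(SEARCH_URL, params=params)
```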
The response contains two top-level fields:

Parameters | Description | Remarks |
---|---|---|
statuses | List of tweets | The tweet objects are stored in a list |
search_metadata | Search metadata | Metadata about the search itself |

An example response body:
```json
{
"statuses": [
(Omitted because it is a tweet object),
...
],
"search_metadata": {
"max_id": 250126199840518145,
"since_id": 24012619984051000,
"refresh_url": "?since_id=250126199840518145&q=%23freebandnames&result_type=mixed&include_entities=1",
"next_results": "?max_id=249279667666817023&q=%23freebandnames&count=4&include_entities=1&result_type=mixed",
"count": 4,
"completed_in": 0.035,
"since_id_str": "24012619984051000",
"query": "%23freebandnames",
"max_id_str": "250126199840518145"
}
}
```
I was trying to collect a large number of tweets with the hashtag #Qiita.
However, the Standard Search API can only retrieve up to 100 tweets in a single request.
So I tried to collect 1,000 tweets by calling the API repeatedly, using the response field `next_results`.
`next_results` holds a query string, and executing that query returns the 101st and subsequent tweets.
In other words:

request → response → parse `next_results` → use it as the next request's parameters → request → ...

and this repeats until 1,000 tweets have been collected.
(Reference: Get more than 100 tweets with Twitter API search / tweets (PHP))
However, **the request only ran three times**, and I could only get about 200 tweets! (The number of tweets in the third response was 0.) Even though there are clearly more than 200 tweets with that hashtag...
The code was written in Python. In addition, various API keys are registered in environment variables.
```python:get_tweet.py
from requests_oauthlib import OAuth1Session
import os
import json

# API keys (read from environment variables; the variable names are an example)
CONSUMER_KEY = os.environ['CONSUMER_KEY']        # API key
CONSUMER_SECRET = os.environ['CONSUMER_SECRET']  # API secret
ACCESS_TOKEN = os.environ['ACCESS_TOKEN']
ACCESS_SECRET = os.environ['ACCESS_SECRET']

# Endpoint for searching tweets
SEARCH_URL = 'https://api.twitter.com/1.1/search/tweets.json'


def search(params):
    twitter = OAuth1Session(CONSUMER_KEY, CONSUMER_SECRET, ACCESS_TOKEN, ACCESS_SECRET)
    req = twitter.get(SEARCH_URL, params=params)
    tweets = json.loads(req.text)
    return tweets


# Substitute for PHP's parse_str function: turn 'a=1&b=2' into {'a': '1', 'b': '2'}
def parseToParam(parse_str, parse=None):
    if parse is None:
        parse = '&'
    return_params = {}
    parsed_str = parse_str.split(parse)
    for param_string in parsed_str:
        param, value = param_string.split('=', 1)
        return_params[param] = value
    return return_params


def main():
    search_word = '#Qiita'
    tweet_data = []

    # Tweet search
    params = {
        'q': search_word,
        'count': 100,
    }
    tweet_count = 0
    while tweet_count < 1000:
        tweets = search(params)
        for tweet in tweets['statuses']:
            tweet_data.append(tweet)
        # Parse tweets['search_metadata']['next_results'] into the next request's params
        if 'next_results' in tweets['search_metadata'].keys():
            next_results = tweets['search_metadata']['next_results']
            next_results = next_results.lstrip('?')  # strip the leading '?'
            params = parseToParam(next_results)
            tweet_count += len(tweets['statuses'])
        else:
            break


if __name__ == '__main__':
    main()
```
Since the response field `next_results` is used to build the next request's parameters, let's check two things at each step (a small debugging sketch for printing them follows below):

- the request parameters
- the response field `next_results`
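A simple way to see both values yourself (debugging only, not part of the final script) is to print them inside the while loop, right after the `search()` call:

```python
# Inside the while loop, right after tweets = search(params)  (debugging only)
print('request params:', params)
print('next_results  :', tweets['search_metadata'].get('next_results'))
```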
First request: parameters
{
'q' : '#Qiita',
'count': 100
}
Response `next_results`:
?max_id=1250763045871079425&q=%23Qiita&count=100&include_entities=1
Second request: parameters
{
'max_id': '1250763045871079425',
'q' : '%23Qiita',
'count': 100,
'include_entities': '1'
}
Response `next_results`:
?max_id=1250673475351572480&q=%2523Qiita&count=100&include_entities=1
Third request: parameters
{
'max_id': '1250673475351572480',
'q' : '%2523Qiita',
'count': 100,
'include_entities': '1'
}
Response `next_results`:
None
Originally, the same query should be carried over from request to request, but apparently it changes each time:

`#Qiita` → `%23Qiita` → `%2523Qiita`

`#Qiita` and `%23Qiita` are interchangeable (the latter is just the URL-encoded form), but `%2523Qiita` is a completely different query.
(You can confirm this with the URL encode/decode tool linked in the references.)

In other words, the cause of the problem is that the `%` itself gets **URL-encoded a second time** (`%` → `%25`) somewhere between `%23Qiita` and `%2523Qiita`.
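You can reproduce the same progression with the standard library's `urllib.parse` (just an illustration of the encoding behaviour, not part of the collection script):

```python
from urllib.parse import quote, unquote

print(quote('#Qiita'))        # '%23Qiita'  : '#' is encoded once
print(quote('%23Qiita'))      # '%2523Qiita': encoding again turns '%' into '%25'
print(unquote('%2523Qiita'))  # '%23Qiita'  : decoding once only gets back the encoded form
```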
So, after parsing the response field `next_results`, **replace `%25` in the request parameter `q` with `%`**.

The replacement is added inside the while loop:
```python:get_tweet.py
    while tweet_count < 1000:
        tweets = search(params)
        for tweet in tweets['statuses']:
            tweet_data.append(tweet)
        # Parse tweets['search_metadata']['next_results'] into the next request's params
        if 'next_results' in tweets['search_metadata'].keys():
            next_results = tweets['search_metadata']['next_results']
            next_results = next_results.lstrip('?')  # strip the leading '?'
            params = parseToParam(next_results)
            # Undo the extra encoding: put '%25' back to '%'
            params['q'] = params['q'].replace('%25', '%')
            tweet_count += len(tweets['statuses'])
        else:
            break
```
In the query `q` contained in the response field `next_results`, the already URL-encoded `%` was encoded one more time.
As a result, the query was not carried over correctly and the tweet collection stopped early.
The fix was to restore the doubly encoded `%25` back to `%` with a simple string replacement.
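As an aside, an alternative I did not try here: the standard library's `urllib.parse.parse_qsl` both splits a query string and percent-decodes the values, so it should sidestep the double encoding without the manual `%25` replacement (a sketch under that assumption):

```python
from urllib.parse import parse_qsl

next_results = '?max_id=1250763045871079425&q=%23Qiita&count=100&include_entities=1'
# parse_qsl percent-decodes while parsing, so 'q' comes back as '#Qiita'
params = dict(parse_qsl(next_results.lstrip('?')))
print(params['q'])  # '#Qiita'; requests re-encodes it correctly on the next call
```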
References:

- Twitter API official documentation: https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets
- Twitter developer documentation (Japanese translation): http://westplain.sakuraweb.com/translate/twitter/Documentation/REST-APIs/Public-API/GET-search-tweets.cgi
- Get more than 100 tweets with Twitter API search/tweets (PHP): https://blog.apar.jp/php/3007/
- URL encode/decode tool: https://tech-unlimited.com/urlencode.html