Environment: macOS / Linux, Python v3.4.3, PHP v2.5.0
Tweet datasets are often distributed as lists of tweet IDs, but crawling Twitter is rate-limited and often cumbersome. This time I'll show how to collect the actual tweets from those tweet IDs. I use the officially documented endpoint GET statuses/lookup, which returns up to 100 tweets per request. (I don't know the finer details.)
As many of you probably know, you need a Twitter account to crawl. In addition, you need to obtain the following four credentials from the Twitter developer site: a consumer key, a consumer secret, an access token, and an access token secret. Guides on getting these are all over the web, so please collect them yourself.
Please refer to the scripts published on GitHub: Twitter crawl script.
If needed, grab them with:

git clone https://github.com/ace12358/twitter/

so that you have the necessary scripts in place. Below is an example of using the code in the src/ directory.
Next, fill in the four credentials you obtained in tweetid2json.php.
Once that's done, running

php tweetid2json.php 418033807850496002

crawls the tweet in JSON format. Piping it through the reader as well,
php tweetid2json.php 418033807850496002 | python json_reader3.4.3.py
produces tab-delimited output such as

418033807850496002	Happy New Year!

By the way, a single request can take up to 100 tweet IDs like 418033807850496002, separated by commas.
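I won't walk through json_reader3.4.3.py itself here, but assuming statuses/lookup returns a JSON array of tweet objects with id_str and text fields (which the v1.1 API does), a minimal reader in the same spirit could look like the sketch below. The file name is mine, not the repository's actual script:

```python
# tab_reader.py -- a minimal sketch in the spirit of json_reader3.4.3.py
# (illustrative; not the repository's actual script). Reads a
# statuses/lookup JSON array from stdin and prints "tweetid<TAB>text".
import json
import sys

for tweet in json.load(sys.stdin):
    # id_str avoids 64-bit integer precision issues; flatten newlines and
    # tabs so each tweet stays on one tab-delimited line.
    text = tweet["text"].replace("\n", " ").replace("\t", " ")
    print(tweet["id_str"] + "\t" + text)
```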
The repository also includes a shell script that bundles these steps. Running

bash make_tweet.sh ../data/tweet_id_list.txt

reads the file one line (one or more comma-separated tweet IDs) at a time and issues one crawl every 6 seconds. The 6-second wait is what keeps you under the API's rate limit.
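For reference, here is a rough Python equivalent of the make_tweet.sh + tweetid2json.php pipeline as a whole. It assumes the (since retired) v1.1 statuses/lookup endpoint and the third-party requests and requests_oauthlib packages; the file name and credential placeholders are mine, not the repository's:

```python
# crawl_tweets.py -- a rough Python equivalent of make_tweet.sh +
# tweetid2json.php (illustrative only; assumes the since-retired v1.1
# statuses/lookup endpoint and the requests / requests_oauthlib packages).
import sys
import time

import requests
from requests_oauthlib import OAuth1

# The four credentials from the Twitter developer site (placeholders).
AUTH = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET",
              "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
URL = "https://api.twitter.com/1.1/statuses/lookup.json"

with open(sys.argv[1]) as f:          # e.g. ../data/tweet_id_list.txt
    for line in f:
        ids = line.strip()            # up to 100 comma-separated tweet IDs
        if not ids:
            continue
        resp = requests.get(URL, params={"id": ids}, auth=AUTH)
        for tweet in resp.json():     # lookup returns a JSON array
            text = tweet["text"].replace("\n", " ").replace("\t", " ")
            print(tweet["id_str"] + "\t" + text)
        time.sleep(6)                 # one request per 6 s stays under the limit
```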
That is all for the explanation. To collect most efficiently, prepare a file in which each line is 100 tweet IDs joined with ',', and run

bash make_tweet.sh ../data/tweet_id_list.txt

on it.
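If your ID list starts out as one tweet ID per line, a few lines of Python can pack it into that 100-per-line, comma-joined format. This is a hypothetical helper of my own, not part of the repository:

```python
# make_id_list.py -- hypothetical helper (not in the repository): packs a
# one-ID-per-line file into lines of 100 comma-joined tweet IDs.
import sys

with open(sys.argv[1]) as fin, open(sys.argv[2], "w") as fout:
    ids = [line.strip() for line in fin if line.strip()]
    for i in range(0, len(ids), 100):  # statuses/lookup caps at 100 IDs
        fout.write(",".join(ids[i:i + 100]) + "\n")
```

For example: python make_id_list.py raw_ids.txt ../data/tweet_id_list.txt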
At this pace it takes about a day to collect roughly 1 million tweets: 1,000,000 tweets at 100 per request is 10,000 requests, and at one request per 6 seconds that is 60,000 seconds, or just under 17 hours. On a server or the like, it is convenient to leave it running with

nohup bash make_tweet.sh ../data/tweet_id_list.txt > tweetid_tweet.txt &

If you are in a hurry, you can create multiple accounts and process in parallel.
If you have any questions, please contact @ace12358. I should be able to reply fairly quickly.