How to collect tweets from tweet IDs as fast as possible (72000 tweets / hour)

Environment

macOS or Linux, Python v3.4.3, PHP v2.5.0

Overview

Datasets are often distributed as tweet IDs only, but crawling Twitter is rate-limited, so collecting the actual tweets tends to be cumbersome. This article shows how to collect tweets from their tweet IDs. I use the official GET statuses/lookup endpoint, which returns up to 100 tweets per request. (I have not looked into the details beyond that.)

Twitter account required for crawling

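As many of you may know, you need a Twitter account to crawl Twitter. In addition, you need to obtain four pieces of authentication information from the Twitter developer site (for the Twitter API these are normally the consumer key, consumer secret, access token, and access token secret). Explanations of how to do this are easy to find online, so please obtain them yourself.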

Code and usage

Please refer to the Twitter crawl script published on GitHub.

If needed, clone the repository from GitHub:

git clone https://github.com/ace12358/twitter/

This gives you all the necessary scripts. The examples below use the code in the src/ directory of the repository.

Next, add the four pieces of information you obtained to tweetid2json.php.

Once that is done, running

php tweetid2json.php 418033807850496002

fetches the tweet in JSON format. Piping the result into the reader script,

php tweetid2json.php 418033807850496002 | python json_reader3.4.3.py

prints tab-delimited output such as

418033807850496002	Happy New Year!

By the way, you can request up to 100 tweet IDs (such as 418033807850496002) at once by separating them with commas. The repository also includes a shell script, make_tweet.sh, that wraps these steps.
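
The conversion from JSON to tab-delimited text can be sketched as follows. This is a minimal sketch and not the actual json_reader3.4.3.py from the repository; it assumes the input is a JSON array of tweet objects carrying `id_str` and `text` fields, and the function name `tweets_to_tsv` is made up for illustration.

```python
import json


def tweets_to_tsv(raw_json):
    """Turn a JSON array of tweet objects into 'id<TAB>text' lines."""
    lines = []
    for tweet in json.loads(raw_json):
        # Assumed field names: "id_str" for the tweet ID, "text" for the body.
        # Whitespace runs (including newlines and tabs) are collapsed to single
        # spaces so that each tweet stays on one line.
        text = " ".join(tweet["text"].split())
        lines.append(tweet["id_str"] + "\t" + text)
    return lines


# Sample input built from the example tweet used in this article.
sample = '[{"id_str": "418033807850496002", "text": "Happy New Year!"}]'
for line in tweets_to_tsv(sample):
    print(line)
```

To use something like this in a pipe, the JSON would be read from standard input with `sys.stdin.read()` instead of the hard-coded sample.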

bash make_tweet.sh ../data/tweet_id_list.txt

Running this command reads one line of the file (one or more tweet IDs) every 6 seconds and crawls it. The 6-second interval keeps the requests under the rate limit.
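
The pacing logic can be sketched in Python as below. This is only an illustration of the idea, not a port of make_tweet.sh: `crawl_file`, `fetch`, and the injectable `sleep` parameter are hypothetical names, and `fetch` stands in for whatever performs the actual request (for example, a call to tweetid2json.php).

```python
import time


def crawl_file(path, fetch, interval=6.0, sleep=time.sleep):
    """Call fetch() on each non-empty line of the file, pausing between requests."""
    results = []
    with open(path) as f:
        # Each non-empty line holds one batch of comma-separated tweet IDs.
        batches = [line.strip() for line in f if line.strip()]
    for i, batch in enumerate(batches):
        if i > 0:
            # Wait before every request except the first to stay under the rate limit.
            sleep(interval)
        results.append(fetch(batch))
    return results
```

Passing `sleep` as a parameter is a design choice for this sketch: it lets the loop be tried out with a dummy function instead of actually waiting 6 seconds per line.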

That is all for the explanation. To collect tweets most efficiently, prepare a file in which each line contains 100 tweet IDs joined with ',', and then run the following.

bash make_tweet.sh ../data/tweet_id_list.txt
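
A file with 100 comma-joined tweet IDs per line can be produced from a plain list of tweet IDs with a few lines of Python. This is a minimal sketch; the function name `batch_ids` and the idea of starting from one ID per item are assumptions for illustration.

```python
def batch_ids(tweet_ids, size=100):
    """Join tweet IDs into comma-separated lines of at most `size` IDs each."""
    ids = [i.strip() for i in tweet_ids if i.strip()]
    return [",".join(ids[start:start + size]) for start in range(0, len(ids), size)]


# Example: 250 IDs become three lines holding 100, 100, and 50 IDs.
lines = batch_ids(str(n) for n in range(250))
print([len(line.split(",")) for line in lines])  # [100, 100, 50]
```

Writing the returned lines to a file, one per line, gives input in the format described above.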

Collecting about 1 million tweets takes about a day. On a server, it is convenient to leave the job running in the background, for example:

nohup bash make_tweet.sh ../data/tweet_id_list.txt > tweetid_tweet.txt &

If you are in a hurry, you can create multiple accounts and process the list in parallel.
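
The collection time can be checked with simple arithmetic: at one request of 100 tweet IDs every 6 seconds, a single account handles 60,000 tweet IDs per hour (the 72,000 per hour in the title would correspond to one request every 5 seconds). The sketch below works this out; the function name `estimate_hours` and the `accounts` parameter are illustrative, and it assumes every request is full and that parallel accounts scale linearly.

```python
def estimate_hours(total_ids, interval_sec=6.0, ids_per_request=100, accounts=1):
    """Estimate crawl time in hours, assuming full batches and linear scaling."""
    requests = -(-total_ids // ids_per_request)  # ceiling division
    return requests * interval_sec / accounts / 3600


print(round(estimate_hours(1000000), 1))              # 16.7
print(round(estimate_hours(1000000, accounts=4), 1))  # 4.2
```

Under these assumptions, 1 million tweet IDs need 10,000 requests, or roughly 17 hours with one account.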

If you get the error "Call to undefined function curl_init()" after installing PHP

References

If you have any trouble

Please contact @Ace12358. I think I can reply to you soon.
