Environment: macOS / Linux, Python v3.4.3, PHP v2.5.0
Tweet datasets are often distributed as lists of tweet IDs, but crawling Twitter is rate-limited and often cumbersome. This time I'll show how to collect the actual tweets from those tweet IDs. I use the officially documented endpoint GET statuses/lookup, which returns up to 100 tweets per request. (I don't know the finer details.)
As many of you probably know, you need a Twitter account to crawl. In addition, you need to obtain the following four credentials from the Twitter developer site: a consumer key, a consumer secret, an access token, and an access token secret. Guides on getting these are all over the web, so please collect them yourself.
Please refer to the scripts published on GitHub: Twitter crawl script.
If needed, grab them with:

git clone https://github.com/ace12358/twitter/

so that you have the necessary scripts in place. Below is an example of using the code in the src/ directory.
Next, fill in the four credentials you obtained in tweetid2json.php.
Once that's done, running

php tweetid2json.php 418033807850496002

crawls the tweet in JSON format. Piping it through the reader as well,
php tweetid2json.php 418033807850496002 | python json_reader3.4.3.py
produces tab-delimited output such as

418033807850496002	Happy New Year!

By the way, a single request can take up to 100 tweet IDs like 418033807850496002, separated by commas.
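I won't walk through json_reader3.4.3.py itself here, but assuming statuses/lookup returns a JSON array of tweet objects with id_str and text fields (which the v1.1 API does), a minimal reader in the same spirit could look like the sketch below. The file name is mine, not the repository's actual script:

```python
# tab_reader.py -- a minimal sketch in the spirit of json_reader3.4.3.py
# (illustrative; not the repository's actual script). Reads a
# statuses/lookup JSON array from stdin and prints "tweetid<TAB>text".
import json
import sys

for tweet in json.load(sys.stdin):
    # id_str avoids 64-bit integer precision issues; flatten newlines and
    # tabs so each tweet stays on one tab-delimited line.
    text = tweet["text"].replace("\n", " ").replace("\t", " ")
    print(tweet["id_str"] + "\t" + text)
```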
The repository also includes a shell script that bundles these steps. Running

bash make_tweet.sh ../data/tweet_id_list.txt

reads the file one line (one or more comma-separated tweet IDs) at a time and issues one crawl every 6 seconds. The 6-second wait is what keeps you under the API's rate limit.
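For reference, here is a rough Python equivalent of the make_tweet.sh + tweetid2json.php pipeline as a whole. It assumes the (since retired) v1.1 statuses/lookup endpoint and the third-party requests and requests_oauthlib packages; the file name and credential placeholders are mine, not the repository's:

```python
# crawl_tweets.py -- a rough Python equivalent of make_tweet.sh +
# tweetid2json.php (illustrative only; assumes the since-retired v1.1
# statuses/lookup endpoint and the requests / requests_oauthlib packages).
import sys
import time

import requests
from requests_oauthlib import OAuth1

# The four credentials from the Twitter developer site (placeholders).
AUTH = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET",
              "ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
URL = "https://api.twitter.com/1.1/statuses/lookup.json"

with open(sys.argv[1]) as f:          # e.g. ../data/tweet_id_list.txt
    for line in f:
        ids = line.strip()            # up to 100 comma-separated tweet IDs
        if not ids:
            continue
        resp = requests.get(URL, params={"id": ids}, auth=AUTH)
        for tweet in resp.json():     # lookup returns a JSON array
            text = tweet["text"].replace("\n", " ").replace("\t", " ")
            print(tweet["id_str"] + "\t" + text)
        time.sleep(6)                 # one request per 6 s stays under the limit
```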
That is all for the explanation. To collect most efficiently, prepare a file in which each line is 100 tweet IDs joined with ',', and run

bash make_tweet.sh ../data/tweet_id_list.txt

on it.
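If your ID list starts out as one tweet ID per line, a few lines of Python can pack it into that 100-per-line, comma-joined format. This is a hypothetical helper of my own, not part of the repository:

```python
# make_id_list.py -- hypothetical helper (not in the repository): packs a
# one-ID-per-line file into lines of 100 comma-joined tweet IDs.
import sys

with open(sys.argv[1]) as fin, open(sys.argv[2], "w") as fout:
    ids = [line.strip() for line in fin if line.strip()]
    for i in range(0, len(ids), 100):  # statuses/lookup caps at 100 IDs
        fout.write(",".join(ids[i:i + 100]) + "\n")
```

For example: python make_id_list.py raw_ids.txt ../data/tweet_id_list.txt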
At this pace it takes about a day to collect roughly 1 million tweets: 1,000,000 tweets at 100 per request is 10,000 requests, and at one request per 6 seconds that is 60,000 seconds, or just under 17 hours. On a server or the like, it is convenient to leave it running with

nohup bash make_tweet.sh ../data/tweet_id_list.txt > tweetid_tweet.txt &

If you are in a hurry, you can create multiple accounts and process in parallel.
If you have any questions, please contact @ace12358. I should be able to reply fairly quickly.