I wanted a large amount of Japanese text for machine learning, so I decided to collect tweets via Twitter's Streaming API.
I would happily have used an existing dump if someone had already collected the same kind of data somewhere, but after a few minutes of Googling I couldn't find anything suitable. So, as my first piece of code for the new year (2017), I wrote the collector myself.
I'm using a library called Twython. (I used to use Tweepy, but Twython seems to be the more popular choice these days.)
- Tweets with images
- Tweets containing URLs
- Tweets with hashtags
- Retweets
- Tweets with reply mentions
I excluded these kinds of tweets because I didn't think they were suitable as corpus material; the check is sketched just below.
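Concretely, each status object received from the Streaming API is filtered roughly like this (a sketch only; the real logic is in `tweetcorpus.py` further down):

```python
def is_corpus_candidate(tweet):
    """Rough sketch of the filtering rules implemented in tweetcorpus.py below."""
    if 'text' not in tweet:              # not a tweet at all (delete notices, etc.)
        return False
    if 'retweeted_status' in tweet:      # retweet
        return False
    if any(tweet['entities'].values()):  # URLs, media, hashtags, mentions, symbols
        return False
    return True
```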
The output format is:

- one tweet per line, delimited by LF
- line breaks inside a tweet are replaced with CR

This preserves the line-break information within each tweet while keeping the file easy to handle programmatically as "one line = one tweet".
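For example, a file in this format can be read back with the original line breaks restored like this (a minimal sketch; `tweet.txt` is just an example filename):

```python
# Read the corpus back: one tweet per line (LF), line breaks inside a tweet stored as CR.
with open('tweet.txt', encoding='utf-8') as f:
    for line in f:
        tweet = line.rstrip('\n').replace('\r', '\n')  # restore the original line breaks
        print(tweet)
```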
Run it like this and the collected tweets are written to standard output.
In this example it stops after collecting 10 valid tweets (you can set this number with the `-n` option).
$ python tweetcorpus.py -n 10
With a shell loop like the one below, collection keeps going more or less indefinitely, even if an error occurs along the way.
Piping through `tee /dev/tty` lets you watch the progress on the terminal.
Piping into `gzip` compresses the output as it is written, which makes collecting a very large number of tweets a little safer.
$ while true; do python -u tweetcorpus.py -n 500 | tee /dev/tty | gzip -cn >> tweet.gz ; sleep 1 ; done
(For why gzip outputs can simply be appended like this, see "Gzip-compressed text files can be concatenated with cat".)
(For the Python `-u` option used above, see "Option to disable the stdout / stderr buffer in Python".)
Personally, rather than using each language's gzip module, I prefer the style of keeping the program itself simple and connecting things with pipes.
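That said, if you do want to consume the accumulated `tweet.gz` directly from Python, the standard `gzip` module reads the concatenated gzip members produced by the loop above transparently (a minimal sketch, assuming the file was created as shown):

```python
import gzip

# gzip.open handles a file made of multiple concatenated gzip members,
# which is exactly what repeated `>> tweet.gz` appends produce.
with gzip.open('tweet.gz', 'rt', encoding='utf-8') as f:
    tweets = [line.rstrip('\n').replace('\r', '\n') for line in f]
print(len(tweets))
```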
The OAuth credentials for the Twitter API are read from the environment variables `APP_KEY`, `APP_SECRET`, `OAUTH_TOKEN`, and `OAUTH_TOKEN_SECRET`.
Create an application on Twitter and prepare a file like the following:
.env
#!/bin/sh
export APP_KEY='XXXXXXXXXXXXX'
export APP_SECRET='XXXXXXXXXXXXXXXXXXXX'
export OAUTH_TOKEN='XXXXX-XXXXXXXXXX'
export OAUTH_TOKEN_SECRET='XXXXXXXXXX'
source ./.env
Run this beforehand so the variables are exported before launching the script.
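To fail fast when one of the variables is missing, a quick check like this can help (just a convenience, not part of the script; it only verifies that all four variables are set):

```python
import os

REQUIRED = ('APP_KEY', 'APP_SECRET', 'OAUTH_TOKEN', 'OAUTH_TOKEN_SECRET')
missing = [name for name in REQUIRED if not os.environ.get(name)]
if missing:
    raise SystemExit('missing environment variables: ' + ', '.join(missing))
```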
If you have a Python environment, install Twython and you're good to go.
$ pip3 install twython==3.4.0
tweetcorpus.py
import argparse
import html
import os
import sys
from twython import TwythonStreamer
class CorpusStreamer(TwythonStreamer):

    def __init__(self, *args,
                 max_corpus_tweets=100,
                 write_file=sys.stdout):
        super().__init__(*args)
        self.corpus_tweets = 0
        self.max_corpus_tweets = max_corpus_tweets
        self.write_file = write_file

    def exit_when_corpus_tweets_exceeded(self):
        if self.corpus_tweets >= self.max_corpus_tweets:
            self.disconnect()

    def write(self, text):
        # One tweet per line: internal line breaks become CR, the delimiter is LF.
        corpus_text = text.replace('\n', '\r')
        self.write_file.write(corpus_text + '\n')
        self.corpus_tweets += 1

    def on_success(self, tweet):
        if 'text' not in tweet:
            # Exclude stream messages that are not tweets (delete notices, etc.)
            return
        if 'retweeted_status' in tweet:
            # Exclude retweets
            return
        if any(tweet['entities'].values()):
            # Exclude tweets with any entities (urls, media, hashtags,
            # user_mentions, symbols): they contain information that cannot
            # be handled by natural language processing of the text alone.
            return
        text = html.unescape(tweet['text'])
        self.write(text)
        self.exit_when_corpus_tweets_exceeded()


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('-n', '--number-of-corpus-tweets',
                        type=int, default=100)
    parser.add_argument('-o', '--outfile',
                        type=argparse.FileType('w', encoding='UTF-8'),
                        default=sys.stdout)
    parser.add_argument('-l', '--language', type=str, default='ja')
    app_key = os.environ['APP_KEY']
    app_secret = os.environ['APP_SECRET']
    oauth_token = os.environ['OAUTH_TOKEN']
    oauth_token_secret = os.environ['OAUTH_TOKEN_SECRET']
    args = parser.parse_args()
    stream = CorpusStreamer(app_key, app_secret,
                            oauth_token, oauth_token_secret,
                            max_corpus_tweets=args.number_of_corpus_tweets,
                            write_file=args.outfile)
    stream.statuses.sample(language=args.language)


if __name__ == '__main__':
    main()
I tested it with the then-current Python 3.6, but it should work on any Python 3 as long as twython can be installed.
Python 3.6.0 (default, Dec 29 2016, 18:49:32) [GCC 4.2.1 Compatible Apple LLVM 8.0.0 (clang-800.0.42.1)] on darwin
twython==3.4.0