I usually use C# and Java for work, but I've always been interested in Python. With data analysis and machine learning being so popular these days, I decided to take this opportunity to finally study it!
For Python itself, I'm reading and learning from the recently released [Introduction to Python 3](http://www.amazon.co.jp/%E5%85%A5%E9%96%80-Python-3-Bill-Lubanovic/dp/4873117380).
Many people have already written about this topic, but this time I'd like to share a program I wrote that saves Twitter search results to MongoDB. I'd be very happy to hear suggestions for improving it!
First, the configuration. Please adjust the values below to match your environment.
config.py

```python
# coding=utf-8

# MongoDB connection settings
HOST = 'localhost'
PORT = 27017
DB_NAME = 'twitter-archive'
COLLECTION_NAME = 'tweets'

# Twitter API credentials (fill in your own keys)
CONSUMER_KEY = ''
CONSUMER_SECRET = ''
ACCESS_TOKEN_KEY = ''
ACCESS_TOKEN_SECRET = ''
```
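To check that these settings work, I run a quick throwaway script like the one below (a minimal sketch assuming a local MongoDB is already running; the variable names are just for illustration).

```python
# Quick sanity check for config.py: connect to MongoDB and look at the collection.
from pymongo import MongoClient

import config

client = MongoClient(config.HOST, config.PORT)
collection = client[config.DB_NAME][config.COLLECTION_NAME]
print(collection.find_one())  # None until at least one tweet has been archived
```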
I decided to manage the keywords used for the Twitter search in a YAML file.
keywords.yml

```yaml
# Define the Twitter search keywords as a list.
# The following is an example.
- 'hamburger'
- 'baseball'
- 'Christmas'
```
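For reference, PyYAML reads this file as a plain Python list, which is later joined into the OR query string used in the search (a small sketch; `safe_load` is enough here since the file is just a list of strings).

```python
import yaml

with open('keywords.yml', 'r') as f:
    keywords = yaml.safe_load(f)

print(keywords)               # ['hamburger', 'baseball', 'Christmas']
print(' OR '.join(keywords))  # 'hamburger OR baseball OR Christmas'
```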
While investigating how to use the logging module, I created a simple wrapper class. There is still a lot I don't understand, and I'm still studying the finer settings, but I've confirmed that it can write logs.
logger.py

```python
# coding=utf-8
import logging
from logging.handlers import TimedRotatingFileHandler


class Logger:
    def __init__(self, log_type):
        logger = logging.getLogger(log_type)
        logger.setLevel(logging.DEBUG)
        # I want to rotate the log every day, but I haven't quite gotten that working yet...
        handler = TimedRotatingFileHandler(filename='archive.log', when='D', backupCount=30)
        formatter = logging.Formatter('[%(asctime)s] %(name)s %(levelname)s %(message)s')
        handler.setFormatter(formatter)
        logger.addHandler(handler)
        self.logger = logger

    def info(self, msg, *args, **kwargs):
        self.logger.info(msg, *args, **kwargs)

    def debug(self, msg, *args, **kwargs):
        self.logger.debug(msg, *args, **kwargs)

    def error(self, msg, *args, **kwargs):
        self.logger.error(msg, *args, **kwargs)

    def exception(self, msg, *args, exc_info=True, **kwargs):
        self.logger.exception(msg, *args, exc_info=exc_info, **kwargs)
```
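Using the wrapper looks like this (the logger name and messages are just for illustration; output goes to archive.log in the current directory).

```python
from logger import Logger

logger = Logger('example')
logger.info('Processing started.')
try:
    1 / 0
except ZeroDivisionError:
    # exception() records the message together with the stack trace.
    logger.exception('Something went wrong.')
```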
I'm thinking of running this as a batch job once a week to accumulate tweets on a regular basis. The Twitter API specification was harder to understand than I expected. I tried to control since_id and max_id so that I don't fetch duplicate tweets, but I'm not sure it was the best way to do it (I mention one alternative I found after the code).
archive.py

```python
# coding: UTF-8
import sys

import yaml
from tweepy import *
from tweepy.parsers import JSONParser
from pymongo import *

import config
from logger import Logger


def archive():
    # Read the list of search keywords from the YAML file and build a string for an OR search.
    with open('keywords.yml', 'r') as file:
        keywords = yaml.load(file)
    query_string = ' OR '.join(keywords)

    # Initialize the logging object.
    logger = Logger('archive')

    # Create the client for Twitter search.
    auth = OAuthHandler(config.CONSUMER_KEY, config.CONSUMER_SECRET)
    auth.set_access_token(config.ACCESS_TOKEN_KEY, config.ACCESS_TOKEN_SECRET)
    # I want to receive the results as JSON, so set JSONParser.
    # Even if the rate limit is reached, the library should handle the waiting for us.
    twitter_client = API(auth, parser=JSONParser(), wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
    if twitter_client is None:
        logger.error('Authentication failed.')
        sys.exit(-1)

    # Initialize the MongoDB collection that stores the tweets.
    client = MongoClient(config.HOST, config.PORT)
    tweet_collection = client[config.DB_NAME][config.COLLECTION_NAME]

    # Find the newest tweet already stored and use its id so that only newer tweets are fetched.
    last_tweet = tweet_collection.find_one(sort=[('id', DESCENDING)])
    since_id = None if last_tweet is None else last_tweet['id']

    # On the first search max_id is not yet known, so initialize it to -1.
    max_id = -1

    # The search ends when tweet_count reaches max_tweet_count.
    # Set max_tweet_count to a sufficiently large value.
    tweet_count = 0
    max_tweet_count = 100000
    logger.info('Collecting up to {0} tweets.'.format(max_tweet_count))

    while tweet_count < max_tweet_count:
        try:
            params = {
                'q': query_string,
                'count': 100,
                'lang': 'ja'
            }
            # Pass max_id and since_id as parameters only when they are set.
            if max_id > 0:
                params['max_id'] = str(max_id - 1)
            if since_id is not None:
                params['since_id'] = since_id
            search_result = twitter_client.search(**params)
            statuses = search_result['statuses']
            # Check whether we have reached the end of the results.
            if statuses is None or len(statuses) == 0:
                logger.info('No more tweets were found.')
                break
            tweet_count += len(statuses)
            logger.debug('Fetched {0} tweets so far.'.format(tweet_count))
            result = tweet_collection.insert_many(statuses)
            logger.debug('Saved to MongoDB. Inserted IDs: {0}'.format(result.inserted_ids))
            # Update max_id with the ID of the last tweet fetched.
            max_id = statuses[-1]['id']
        except (TypeError, TweepError) as e:
            print(str(e))
            logger.exception(str(e))
            break


if __name__ == '__main__':
    archive()
```
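On the since_id / max_id question above: tweepy (the 3.x-style API used here) also has a Cursor helper that walks the max_id pagination internally, so something like the sketch below might be a simpler alternative. This is only a rough sketch under my assumptions: it uses the default model parser instead of JSONParser, takes the raw dict from status._json, and hard-codes the query string for brevity.

```python
import tweepy
from pymongo import MongoClient

import config

auth = tweepy.OAuthHandler(config.CONSUMER_KEY, config.CONSUMER_SECRET)
auth.set_access_token(config.ACCESS_TOKEN_KEY, config.ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)

collection = MongoClient(config.HOST, config.PORT)[config.DB_NAME][config.COLLECTION_NAME]

# Cursor pages through the search results with max_id for us.
# A since_id argument could be passed here as well, just like in archive.py,
# to skip tweets that are already stored.
for status in tweepy.Cursor(api.search, q='hamburger OR baseball OR Christmas',
                            count=100, lang='ja').items(10000):
    collection.insert_one(status._json)
```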
I haven't mastered Python at all yet, but it feels like a language where I can write exactly what I want to do. I'll keep studying. Next, I'd like to try analyzing the collected tweets with data analysis libraries!