I usually use C# and Java for work, but I've always been interested in Python. With data analysis and machine learning being so popular these days, I decided to take this opportunity to finally study it!
For Python itself, I'm reading and learning from the recently released [Introduction to Python 3](http://www.amazon.co.jp/%E5%85%A5%E9%96%80-Python-3-Bill-Lubanovic/dp/4873117380).
Many people have already written about this topic, but this time I'd like to share a program I wrote that saves Twitter search results to MongoDB. I'd be very happy to hear suggestions for improving it!
First, the configuration. Please adjust the values below to match your environment.
config.py

```python
# coding=utf-8

# MongoDB connection settings
HOST = 'localhost'
PORT = 27017
DB_NAME = 'twitter-archive'
COLLECTION_NAME = 'tweets'

# Twitter API credentials (fill in your own keys)
CONSUMER_KEY = ''
CONSUMER_SECRET = ''
ACCESS_TOKEN_KEY = ''
ACCESS_TOKEN_SECRET = ''
```
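To check that these settings work, I run a quick throwaway script like the one below (a minimal sketch assuming a local MongoDB is already running; the variable names are just for illustration).

```python
# Quick sanity check for config.py: connect to MongoDB and look at the collection.
from pymongo import MongoClient

import config

client = MongoClient(config.HOST, config.PORT)
collection = client[config.DB_NAME][config.COLLECTION_NAME]
print(collection.find_one())  # None until at least one tweet has been archived
```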
I decided to manage the keywords used for the Twitter search in a YAML file.
keywords.yml

```yaml
# Define the Twitter search keywords as a list.
# The following is an example.
- 'hamburger'
- 'baseball'
- 'Christmas'
```
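For reference, PyYAML reads this file as a plain Python list, which is later joined into the OR query string used in the search (a small sketch; `safe_load` is enough here since the file is just a list of strings).

```python
import yaml

with open('keywords.yml', 'r') as f:
    keywords = yaml.safe_load(f)

print(keywords)               # ['hamburger', 'baseball', 'Christmas']
print(' OR '.join(keywords))  # 'hamburger OR baseball OR Christmas'
```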
While investigating how to use the logging module, I created a simple wrapper class. There is still a lot I don't understand, and I'm still studying the finer settings, but I've confirmed that it can write logs.
logger.py

```python
# coding=utf-8
import logging
from logging.handlers import TimedRotatingFileHandler


class Logger:
    def __init__(self, log_type):
        logger = logging.getLogger(log_type)
        logger.setLevel(logging.DEBUG)
        # I want to rotate the log every day, but I haven't quite gotten that working yet...
        handler = TimedRotatingFileHandler(filename='archive.log', when='D', backupCount=30)
        formatter = logging.Formatter('[%(asctime)s] %(name)s %(levelname)s %(message)s')
        handler.setFormatter(formatter)
        logger.addHandler(handler)
        self.logger = logger

    def info(self, msg, *args, **kwargs):
        self.logger.info(msg, *args, **kwargs)

    def debug(self, msg, *args, **kwargs):
        self.logger.debug(msg, *args, **kwargs)

    def error(self, msg, *args, **kwargs):
        self.logger.error(msg, *args, **kwargs)

    def exception(self, msg, *args, exc_info=True, **kwargs):
        self.logger.exception(msg, *args, exc_info=exc_info, **kwargs)
```
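Using the wrapper looks like this (the logger name and messages are just for illustration; output goes to archive.log in the current directory).

```python
from logger import Logger

logger = Logger('example')
logger.info('Processing started.')
try:
    1 / 0
except ZeroDivisionError:
    # exception() records the message together with the stack trace.
    logger.exception('Something went wrong.')
```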
I'm thinking of running this as a batch job once a week to accumulate tweets on a regular basis. The Twitter API specification was harder to understand than I expected. I tried to control since_id and max_id so that I don't fetch duplicate tweets, but I'm not sure it was the best way to do it (I mention one alternative I found after the code).
archive.py

```python
# coding: UTF-8
import sys

import yaml
from tweepy import *
from tweepy.parsers import JSONParser
from pymongo import *

import config
from logger import Logger


def archive():
    # Read the list of search keywords from the YAML file and build a string for an OR search.
    with open('keywords.yml', 'r') as file:
        keywords = yaml.load(file)
    query_string = ' OR '.join(keywords)

    # Initialize the logging object.
    logger = Logger('archive')

    # Create the client for Twitter search.
    auth = OAuthHandler(config.CONSUMER_KEY, config.CONSUMER_SECRET)
    auth.set_access_token(config.ACCESS_TOKEN_KEY, config.ACCESS_TOKEN_SECRET)
    # I want to receive the results as JSON, so set JSONParser.
    # Even if the rate limit is reached, the library should handle the waiting for us.
    twitter_client = API(auth, parser=JSONParser(), wait_on_rate_limit=True, wait_on_rate_limit_notify=True)
    if twitter_client is None:
        logger.error('Authentication failed.')
        sys.exit(-1)

    # Initialize the MongoDB collection that stores the tweets.
    client = MongoClient(config.HOST, config.PORT)
    tweet_collection = client[config.DB_NAME][config.COLLECTION_NAME]

    # Find the newest tweet already stored and use its id so that only newer tweets are fetched.
    last_tweet = tweet_collection.find_one(sort=[('id', DESCENDING)])
    since_id = None if last_tweet is None else last_tweet['id']

    # On the first search max_id is not yet known, so initialize it to -1.
    max_id = -1

    # The search ends when tweet_count reaches max_tweet_count.
    # Set max_tweet_count to a sufficiently large value.
    tweet_count = 0
    max_tweet_count = 100000
    logger.info('Collecting up to {0} tweets.'.format(max_tweet_count))

    while tweet_count < max_tweet_count:
        try:
            params = {
                'q': query_string,
                'count': 100,
                'lang': 'ja'
            }
            # Pass max_id and since_id as parameters only when they are set.
            if max_id > 0:
                params['max_id'] = str(max_id - 1)
            if since_id is not None:
                params['since_id'] = since_id
            search_result = twitter_client.search(**params)
            statuses = search_result['statuses']
            # Check whether we have reached the end of the results.
            if statuses is None or len(statuses) == 0:
                logger.info('No more tweets were found.')
                break
            tweet_count += len(statuses)
            logger.debug('Fetched {0} tweets so far.'.format(tweet_count))
            result = tweet_collection.insert_many(statuses)
            logger.debug('Saved to MongoDB. Inserted IDs: {0}'.format(result.inserted_ids))
            # Update max_id with the ID of the last tweet fetched.
            max_id = statuses[-1]['id']
        except (TypeError, TweepError) as e:
            print(str(e))
            logger.exception(str(e))
            break


if __name__ == '__main__':
    archive()
```
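On the since_id / max_id question above: tweepy (the 3.x-style API used here) also has a Cursor helper that walks the max_id pagination internally, so something like the sketch below might be a simpler alternative. This is only a rough sketch under my assumptions: it uses the default model parser instead of JSONParser, takes the raw dict from status._json, and hard-codes the query string for brevity.

```python
import tweepy
from pymongo import MongoClient

import config

auth = tweepy.OAuthHandler(config.CONSUMER_KEY, config.CONSUMER_SECRET)
auth.set_access_token(config.ACCESS_TOKEN_KEY, config.ACCESS_TOKEN_SECRET)
api = tweepy.API(auth, wait_on_rate_limit=True)

collection = MongoClient(config.HOST, config.PORT)[config.DB_NAME][config.COLLECTION_NAME]

# Cursor pages through the search results with max_id for us.
# A since_id argument could be passed here as well, just like in archive.py,
# to skip tweets that are already stored.
for status in tweepy.Cursor(api.search, q='hamburger OR baseball OR Christmas',
                            count=100, lang='ja').items(10000):
    collection.insert_one(status._json)
```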
I haven't mastered Python at all yet, but it feels like a language where I can write exactly what I want to do. I'll keep studying. Next, I'd like to try analyzing the collected tweets with data analysis libraries!