I always owe Starbucks for letting me camp out for hours with my MacBook Air, so I'd like to give something back by analyzing some data. This article is about collecting a large number of tweets whose text contains "Starbucks" and seeing what data analysis can tell us about them. It's not stealth marketing, though it might count as such in the sense of giving back to Starbucks (・ω・)
Part 1: Collecting data with the Twitter REST APIs and storing it in mongoDB (this article) http://qiita.com/kenmatsu4/items/23768cbe32fe381d54a2
Part 2: Separation of spam from the acquired Twitter data http://qiita.com/kenmatsu4/items/8d88e0992ca6e443f446
Part 3: Why did the number of tweets increase after one day? http://qiita.com/kenmatsu4/items/02034e5688cc186f224b
Part 4: Visualization of location information hidden in Twitter http://qiita.com/kenmatsu4/items/114f3cff815aa5037535
If you ask Google-sensei for ["twitter api account"](https://www.google.co.jp/search?q=twitter+api+%E3%82%A2%E3%82%AB%E3%82%A6%E3%83%B3%E3%83%88), you will find many sites that clearly describe how to register, so refer to one of them to obtain the credentials for accessing the API (consumer_key, consumer_secret, access_token, access_secret).
It is assumed that a basic Python environment, including IPython, is already in place. If you have the libraries listed here, you should be mostly fine. In addition, install the authentication library used to access the Twitter REST APIs.
```
pip install requests_oauthlib
```
Also, since mongoDB is used to store the data, install it by referring to here and [here](http://qiita.com/hajimeni/items/3c93fd981e92f66a20ce). For an overview of mongoDB, see the "Thin Book of MongoDB".
To access mongoDB from Python, also install pymongo.
```
pip install pymongo
```
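As a quick sanity check that pymongo can reach mongoDB, something like the following should work. This is just a sketch; it assumes mongoDB is already running on the default port 27017, and uses the `starbucks` database name that appears later in this article.

```python
from pymongo import MongoClient

# Connect to the local mongoDB instance (default port assumed)
client = MongoClient('localhost', 27017)

# A cheap server command confirms the server is actually reachable
print(client.server_info()['version'])

# Databases and collections are created lazily on first insert,
# so referencing them before any data exists is fine
db = client.starbucks
print(db.tweetdata.count())  # 0 until tweets are inserted
```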
```python
from requests_oauthlib import OAuth1Session
from requests.exceptions import ConnectionError, ReadTimeout, SSLError
import json, datetime, time, pytz, re, sys, traceback, pymongo
#from pymongo import Connection  # Connection class is deprecated, so use MongoClient
from pymongo import MongoClient
from collections import defaultdict
import numpy as np

KEYS = { # List the keys obtained for your account below
        'consumer_key':    '**********',
        'consumer_secret': '**********',
        'access_token':    '**********',
        'access_secret':   '**********',
       }

twitter = None
connect = None
db = None
tweetdata = None
meta = None

def initialize():  # Twitter connection settings and connection to mongoDB
    global twitter, connect, db, tweetdata, meta
    twitter = OAuth1Session(KEYS['consumer_key'], KEYS['consumer_secret'],
                            KEYS['access_token'], KEYS['access_secret'])
    # connect = Connection('localhost', 27017)  # Connection class is deprecated, so use MongoClient
    connect = MongoClient('localhost', 27017)
    db = connect.starbucks
    tweetdata = db.tweetdata
    meta = db.metadata

initialize()
```
Use the code below to import tweets that include "Starbucks" in the text into mongoDB.
```python
# Get 100 tweets from the Twitter REST APIs for a given search word
def getTweetData(search_word, max_id, since_id):
    global twitter
    url = 'https://api.twitter.com/1.1/search/tweets.json'
    params = {'q': search_word,
              'count': '100',
              }
    # Set max_id if it is specified
    if max_id != -1:
        params['max_id'] = max_id
    # Set since_id if it is specified
    if since_id != -1:
        params['since_id'] = since_id

    req = twitter.get(url, params=params)  # Get the tweet data

    # Unpack the acquired data
    if req.status_code == 200:  # Success
        timeline = json.loads(req.text)
        metadata = timeline['search_metadata']
        statuses = timeline['statuses']
        limit = req.headers['x-rate-limit-remaining'] if 'x-rate-limit-remaining' in req.headers else 0
        reset = req.headers['x-rate-limit-reset'] if 'x-rate-limit-reset' in req.headers else 0
        return {"result": True, "metadata": metadata, "statuses": statuses,
                "limit": limit,
                "reset_time": datetime.datetime.fromtimestamp(float(reset)),
                "reset_time_unix": reset}
    else:  # Failure
        print("Error: %d" % req.status_code)
        return {"result": False, "status_code": req.status_code}
```
```python
# Convert a date string to a datetime in the Japan (JST) time zone
def str_to_date_jp(str_date):
    dts = datetime.datetime.strptime(str_date, '%a %b %d %H:%M:%S +0000 %Y')
    return pytz.utc.localize(dts).astimezone(pytz.timezone('Asia/Tokyo'))

# Return the current time as UNIX time
def now_unix_time():
    return time.mktime(datetime.datetime.now().timetuple())
```
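For example, Twitter returns `created_at` as a UTC string in the format parsed above, so the conversion to JST (UTC+9) can be checked like this (the sample timestamp is made up):

```python
# Twitter's created_at format, e.g. "Sun Mar 15 03:12:45 +0000 2015" (UTC)
d = str_to_date_jp("Sun Mar 15 03:12:45 +0000 2015")
print(d)  # 2015-03-15 12:12:45+09:00 -- shifted forward 9 hours to JST

print(now_unix_time())  # current time as a UNIX timestamp (float)
```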
Here is the tweet acquisition loop.
```python
#------------- Get tweet data repeatedly -------------#
sid = -1
mid = -1
count = 0
res = None
while True:
    try:
        count = count + 1
        sys.stdout.write("%d, " % count)
        res = getTweetData(u'Starbucks', max_id=mid, since_id=sid)
        if not res['result']:
            # Exit on failure
            print("status_code: %s" % res['status_code'])
            break
        if int(res['limit']) == 0:  # Rate limit reached, so take a break
            # Add a date-type field 'created_datetime' to the stored documents
            print("Adding created_datetime field.")
            for d in tweetdata.find({'created_datetime': {"$exists": False}},
                                    {'_id': 1, 'created_at': 1}):
                #print(str_to_date_jp(d['created_at']))
                tweetdata.update({'_id': d['_id']},
                                 {'$set': {'created_datetime': str_to_date_jp(d['created_at'])}})
            #remove_duplicates()

            # Calculate the wait time; resume 5 seconds after the limit resets
            diff_sec = int(res['reset_time_unix']) - now_unix_time()
            print("sleep %d sec." % (diff_sec + 5))
            if diff_sec > 0:
                time.sleep(diff_sec + 5)
        else:
            # Process the metadata
            if len(res['statuses']) == 0:
                sys.stdout.write("statuses is none. ")
            elif 'next_results' in res['metadata']:
                # Store the result in mongoDB
                meta.insert({"metadata": res['metadata'], "insert_date": now_unix_time()})
                for s in res['statuses']:
                    tweetdata.insert(s)
                # Extract max_id for the next (older) page from next_results
                next_url = res['metadata']['next_results']
                pattern = r".*max_id=([0-9]*)\&.*"
                ite = re.finditer(pattern, next_url)
                for i in ite:
                    mid = i.group(1)
                    break
            else:
                sys.stdout.write("next is none. finished.")
                break
    except SSLError as e:
        print("SSLError: {0}".format(e))
        print("waiting 5mins")
        time.sleep(5 * 60)
    except ConnectionError as e:
        print("ConnectionError: {0}".format(e))
        print("waiting 5mins")
        time.sleep(5 * 60)
    except ReadTimeout as e:
        print("ReadTimeout: {0}".format(e))
        print("waiting 5mins")
        time.sleep(5 * 60)
    except:
        print("Unexpected error: %s" % sys.exc_info()[0])
        print(traceback.format_exc())
        raise
    finally:
        info = sys.exc_info()
```
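Once the loop has been running for a while, the collected data can be inspected straight from pymongo. A minimal sketch, using the `tweetdata` and `meta` collections defined in `initialize()` above (note that `created_datetime` is only present on documents that have already been through the rate-limit housekeeping step):

```python
# How many tweets and metadata documents have been stored so far
print(tweetdata.count())
print(meta.count())

# Show the newest stored tweet by the created_datetime field
for d in tweetdata.find({'created_datetime': {"$exists": True}}) \
                  .sort('created_datetime', pymongo.DESCENDING).limit(1):
    print("%s: %s" % (d['created_datetime'], d['text']))
```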
Here is a description of the main items in the acquired tweet data.

| Item | Description |
|---|---|
| id | Tweet ID. Newer tweets have larger numbers, older tweets smaller ones. By specifying a value larger or smaller than this ID in a search, you can retrieve tweets from before or after it. |
| id_str | Apparently a string version of "id", but details are unclear since it is obtained as a string here in the first place. |
| user | User information. It has the following elements (only the typical ones are listed). |
| &emsp;id | User ID: a numeric ID not normally seen. |
| &emsp;name | The user's display name (the longer name). |
| &emsp;screen_name | The user name used when specifying a user with @ etc. |
| &emsp;description | User description; profile-like text. |
| &emsp;friends_count | Number of users this user follows |
| &emsp;followers_count | Number of followers |
| &emsp;statuses_count | Number of tweets (including retweets) |
| &emsp;favourites_count | Number of favorites |
| &emsp;location | Where the user lives |
| &emsp;created_at | Registration date of this user |
| text | Tweet body |
| retweeted_status | Whether it is a retweet (True: retweet / False: normal tweet) |
| retweeted | Whether it has been retweeted (True / False) |
| retweet_count | Number of retweets |
| favorited | Whether it has been favorited (True / False) |
| favorite_count | Number of favorites |
| coordinates | Latitude / longitude |
| entities | Additional information, shown below |
| &emsp;symbols | |
| &emsp;user_mentions | Information on users mentioned with @ in the body |
| &emsp;hashtags | Hashtags in the body |
| &emsp;urls | URL information in the body |
| source | Information about the app/site the tweet was posted from |
| lang | Language information |
| created_at | Tweet date and time |
| place | Location information related to the tweet |
| in_reply_to_screen_name | Screen name of the tweet being replied to, when the tweet is a reply |
| in_reply_to_status_id | Tweet ID of the tweet being replied to, when the tweet is a reply |
| in_reply_to_status_id_str | String version of in_reply_to_status_id |
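Because each status is stored in mongoDB as-is, the items in this table can be read back as ordinary dictionary keys. A small illustrative sketch (field names as in the table above):

```python
# Pull one stored tweet and read a few of the fields described above
t = tweetdata.find_one()
print(t['text'])                      # tweet body
print(t['user']['screen_name'])      # the @ user name
print(t['user']['followers_count'])  # number of followers
print('retweeted_status' in t)       # present only when the tweet is a retweet
```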
Here is a description of the metadata returned when searching via 'https://api.twitter.com/1.1/search/tweets.json'.

| Item | Description |
|---|---|
| query | The search word |
| count | Number of tweets retrieved in one search |
| completed_in | How many seconds the retrieval took to complete |
| max_id | Newest ID among the retrieved tweets |
| max_id_str | String version of max_id? (both appear to be strings, though...) |
| since_id | Oldest ID among the retrieved tweets |
| since_id_str | String version of since_id? (both appear to be strings, though...) |
| refresh_url | URL for retrieving newer tweets with the same search word |
| next_results | URL for retrieving older tweets with the same search word |
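For illustration, `next_results` comes back as a ready-made query string, so instead of the regular expression used in the loop above it could also be parsed with the standard library (a sketch; the sample string and ID are made up):

```python
import urlparse  # Python 2; in Python 3 use urllib.parse instead

# next_results looks something like this (hypothetical example)
next_results = "?max_id=577223119709048831&q=Starbucks&count=100&include_entities=1"

params = urlparse.parse_qs(next_results.lstrip('?'))
print(params['max_id'][0])  # "577223119709048831"
```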
When retrieval with GET search/tweets got somewhere past the 100,000-tweet mark, older tweets could no longer be fetched: the 'statuses' element came back empty, and the 'next_results' element stopped being returned at all. I haven't solved this at the moment, but since about 200,000 tweets were collected, I'll analyze this data starting from the next article. **Update:** As pointed out in a comment, the search API can only return tweets from the past week.
Continued in Part 2.

- The full code described on this page is here
- Access the Twitter API with Python
- Twitter official REST API document