Get a large amount of Starbucks Twitter data with python and try data analysis Part 1

I always bring my MacBook Air to Starbucks and overstay my welcome, so as a way of saying thanks I'd like to do some data analysis that might be useful to them. This is an article about collecting a large number of tweets containing "Starbucks" in the text and seeing what data analysis can tell us. It's not stealth marketing, though in the sense of giving back to Starbucks, maybe it is (・ω・)


Part 1: Get data with the Twitter REST API and import it into MongoDB (this article) http://qiita.com/kenmatsu4/items/23768cbe32fe381d54a2

Part 2: Separating spam from the acquired Twitter data http://qiita.com/kenmatsu4/items/8d88e0992ca6e443f446

Part 3: Why did the number of tweets spike after one day? http://qiita.com/kenmatsu4/items/02034e5688cc186f224b

Part 4: Visualizing location information hidden in Twitter http://qiita.com/kenmatsu4/items/114f3cff815aa5037535


1. Get account information to connect to Twitter API

If you ask Google for ["twitter api アカウント"](https://www.google.co.jp/search?q=twitter+api+%E3%82%A2%E3%82%AB%E3%82%A6%E3%83%B3%E3%83%88), you will find many sites that clearly describe how to register, so refer to them to obtain the credentials for accessing the API (consumer_key, consumer_secret, access_token, access_secret).

2. Installation of various required libraries

It is assumed that a basic Python environment, including IPython, is already in place. If you have the libraries listed here, you should be mostly fine. In addition, install the authentication library needed to use the Twitter REST API:

pip install requests_oauthlib

Also, since MongoDB is used to store the data, install it by referring to here and [here](http://qiita.com/hajimeni/items/3c93fd981e92f66a20ce). For an overview of MongoDB, see "The Little MongoDB Book".

To access MongoDB from Python, install pymongo as well:

pip install pymongo

3. Initialization process

from requests_oauthlib import OAuth1Session
from requests.exceptions import ConnectionError, ReadTimeout, SSLError
import json, datetime, time, pytz, re, sys,traceback, pymongo
#from pymongo import Connection     #Connection class is obsolete, so change to MongoClient
from pymongo import MongoClient
from collections import defaultdict
import numpy as np

KEYS = { #List the keys obtained for your account below
        'consumer_key':'**********',
        'consumer_secret':'**********',
        'access_token':'**********',
        'access_secret':'**********',
       }

twitter = None
connect = None
db      = None
tweetdata = None
meta    = None

def initialize(): #Initial setup: Twitter connection info and connection to mongoDB
    global twitter, connect, db, tweetdata, meta
    twitter = OAuth1Session(KEYS['consumer_key'],KEYS['consumer_secret'],
                            KEYS['access_token'],KEYS['access_secret'])
#   connect = Connection('localhost', 27017)     #Connection class is obsolete, so change to MongoClient
    connect = MongoClient('localhost', 27017)
    db = connect.starbucks
    tweetdata = db.tweetdata
    meta = db.metadata
    
initialize()
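If initialize() succeeds, a quick sanity check against the running MongoDB is possible. Below is a minimal sketch, assuming MongoDB is listening on localhost:27017; it is not part of the original script, and it uses pymongo 2.x-era method names to match the code above.

#Sanity check: confirm the connection and the (initially empty) collections
print(connect.server_info()['version'])   #mongoDB server version
print(db.collection_names())              #collections in the 'starbucks' database
print(tweetdata.count())                  #number of stored tweets (0 before the first run)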

4. Search for tweets

Use the code below to import tweets containing "Starbucks" in the text into MongoDB.

#Get 100 tweets from the Twitter REST API for the specified search word
def getTweetData(search_word, max_id, since_id):
    global twitter
    url = 'https://api.twitter.com/1.1/search/tweets.json'
    params = {'q': search_word,
              'count':'100',
    }
    # Set max_id if specified
    if max_id != -1:
        params['max_id'] = max_id
    # Set since_id if specified
    if since_id != -1:
        params['since_id'] = since_id
    
    req = twitter.get(url, params = params)   #Get Tweet data

    #Decomposition of acquired data
    if req.status_code == 200: #If successful
        timeline = json.loads(req.text)
        metadata = timeline['search_metadata']
        statuses = timeline['statuses']
        limit = req.headers['x-rate-limit-remaining'] if 'x-rate-limit-remaining' in req.headers else 0
        reset = req.headers['x-rate-limit-reset'] if 'x-rate-limit-reset' in req.headers else 0              
        return {"result":True, "metadata":metadata, "statuses":statuses, "limit":limit, "reset_time":datetime.datetime.fromtimestamp(float(reset)), "reset_time_unix":reset}
    else: #If it fails
        print ("Error: %d" % req.status_code)
        return{"result":False, "status_code":req.status_code}

#Convert a created_at string (UTC) into a date type localized to Japan time (JST)
def str_to_date_jp(str_date):
    dts = datetime.datetime.strptime(str_date,'%a %b %d %H:%M:%S +0000 %Y')
    return pytz.utc.localize(dts).astimezone(pytz.timezone('Asia/Tokyo'))

#Returns the current time in UNIX Time
def now_unix_time():
    return time.mktime(datetime.datetime.now().timetuple())
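As a quick check of these helpers, this is what the conversion looks like (a minimal sketch; the created_at string is a made-up example in Twitter's format):

#Example: convert Twitter's created_at format (UTC) into Japan time
print(str_to_date_jp('Sat Mar 14 09:29:33 +0000 2015'))
#-> 2015-03-14 18:29:33+09:00 (JST is UTC+9)

print(now_unix_time())   #e.g. 1426300000.0, seconds since the epoch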

Here is the tweet acquisition loop.

#-------------Get Tweet data repeatedly-------------#
sid=-1
mid = -1 
count = 0
 
res = None
while True:
    try:
        count = count + 1
        sys.stdout.write("%d, " % count)
        res = getTweetData(u'Starbucks', max_id=mid, since_id=sid)
        if res['result'] == False:
            #Exit on failure
            print("status_code", res['status_code'])
            break

        if int(res['limit']) == 0:    #Rate limit reached, so take a break
            #Add a date-type 'created_datetime' field to documents that lack it
            print("Adding created_at field.")
            for d in tweetdata.find({'created_datetime':{ "$exists": False }},{'_id':1, 'created_at':1}):
                #print(str_to_date_jp(d['created_at']))
                tweetdata.update({'_id' : d['_id']}, 
                     {'$set' : {'created_datetime' : str_to_date_jp(d['created_at'])}})
            #remove_duplicates()
            
            #Calculate the wait time and resume 5 seconds after the rate limit resets
            diff_sec = int(res['reset_time_unix']) - now_unix_time()
            print("sleep %d sec." % (diff_sec + 5))
            if diff_sec > 0:
                time.sleep(diff_sec + 5)
        else:
            #Process the metadata
            if len(res['statuses']) == 0:
                sys.stdout.write("statuses is none. ")
            elif 'next_results' in res['metadata']:
                #Store the results in mongoDB
                meta.insert({"metadata":res['metadata'], "insert_date": now_unix_time()})
                for s in res['statuses']:
                    tweetdata.insert(s)
                #Extract max_id for the next (older) page from the next_results URL
                next_url = res['metadata']['next_results']
                pattern = r".*max_id=([0-9]*)\&.*"
                ite = re.finditer(pattern, next_url)
                for i in ite:
                    mid = i.group(1)
                    break
            else:
                sys.stdout.write("next is none. finished.")
                break
    except SSLError as e:
        print("SSLError: {0}".format(e))
        print("waiting 5mins")
        time.sleep(5*60)
    except ConnectionError as e:
        print("ConnectionError: {0}".format(e))
        print("waiting 5mins")
        time.sleep(5*60)
    except ReadTimeout as e:
        print("ReadTimeout: {0}".format(e))
        print("waiting 5mins")
        time.sleep(5*60)
    except:
        print("Unexpected error:", sys.exc_info()[0])
        print(traceback.format_exc())
        raise
    finally:
        info = sys.exc_info()
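Once the loop has finished (or been interrupted), the stored tweets can be inspected directly with pymongo. A minimal sketch, not part of the collection script; the field names follow the structure described in the next section:

#Inspect what was stored
print(tweetdata.count())                                #total number of stored tweets
for d in tweetdata.find().sort('_id', -1).limit(3):     #three most recently inserted documents
    print(d['user']['screen_name'], d['created_at'])
    print(d['text'][:80])                               #first 80 characters of the body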


5. Twitter REST API data structure

The structure of the data obtained from the Twitter REST API's "[GET search/tweets](https://dev.twitter.com/rest/reference/get/search/tweets)" is as follows.

Structure of TwitterListResponse

A description of the main elements of the tweet information.

| Item | Description |
|:--|:--|
| id | Tweet ID. Newer tweets have larger numbers, older tweets smaller ones. Specifying greater or less than this ID when searching lets you retrieve tweets before or after it. |
| id_str | Apparently a string version of "id", but the details are unclear since it is obtained as a string in the first place. |
| user | User information. It has the following elements (only the main ones are listed). |
| user.id | User ID: a numeric ID that you don't normally see. |
| user.name | The user's display name (the longer name). |
| user.screen_name | The user name used when mentioning with @, etc. |
| user.description | User description: profile-like text. |
| user.friends_count | Number of accounts the user follows. |
| user.followers_count | Number of followers. |
| user.statuses_count | Number of tweets (including retweets). |
| user.favourites_count | Number of favorites. |
| user.location | The user's registered location. |
| user.created_at | Registration date of this user. |
| text | Tweet body. |
| retweeted_status | Present only when the tweet is a retweet; holds the original tweet. |
| retweeted | Whether the tweet has been retweeted (True/False). |
| retweet_count | Number of retweets. |
| favorited | Whether the tweet has been favorited (True/False). |
| favorite_count | Number of favorites. |
| coordinates | Latitude/longitude. |
| entities | Additional information, shown below. |
| entities.symbols | |
| entities.user_mentions | User information specified with @ in the text. |
| entities.hashtags | Hashtags in the body. |
| entities.urls | URL information in the text. |
| source | Information about the app or site that posted the tweet. |
| lang | Language information. |
| created_at | Tweet date and time. |
| place | Location information related to the tweet. |
| in_reply_to_screen_name | Screen name of the user being replied to, when the tweet is a reply. |
| in_reply_to_status_id | Tweet ID of the tweet being replied to, when the tweet is a reply. |
| in_reply_to_status_id_str | String version of in_reply_to_status_id. |
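For example, the elements above can be read off a stored document like this (a minimal sketch; whether optional fields such as entities are populated varies from tweet to tweet):

#Read the main elements off one stored tweet
d = tweetdata.find_one()
print(d['id_str'], d['user']['name'], '@' + d['user']['screen_name'])
print(d['text'])
print(d['retweet_count'], d['favorite_count'], d['lang'])
for h in d['entities']['hashtags']:   #hashtags in the body, if any
    print('#' + h['text'])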

Metadata structure

A description of the metadata returned when searching with 'https://api.twitter.com/1.1/search/tweets.json'.

| Item | Description |
|:--|:--|
| query | The search word. |
| count | How many tweets were retrieved in a single search. |
| completed_in | How many seconds the retrieval took to complete. |
| max_id | The newest ID among the retrieved tweets. |
| max_id_str | String version of max_id? (both come back as strings, though...) |
| since_id | The oldest ID among the retrieved tweets. |
| since_id_str | String version of since_id? (both come back as strings, though...) |
| refresh_url | URL for fetching newer tweets with the same search word. |
| next_results | URL for fetching older tweets with the same search word. |
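The acquisition loop pages backwards by pulling max_id out of next_results. The same extraction in isolation looks like this (a minimal sketch; the URL below is a made-up example of the next_results format):

#Extract max_id from a next_results URL
next_url = '?max_id=576763833254051840&q=Starbucks&count=100&include_entities=1'
m = re.search(r"max_id=([0-9]*)", next_url)
if m:
    print(m.group(1))   #-> 576763833254051840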

Summary of the data obtained this time

| Item | Value |
|:--|:--|
| Total number of tweets acquired | 227,599 |
| Acquisition period | 2015-03-11 04:43:42 to 2015-03-22 00:01:12 |
| Tweets per second | 4.101 tweets/sec |

Current issues

When paging back through results with GET search/tweets, somewhere past roughly the 100,000-tweet mark no earlier tweets can be retrieved: the 'statuses' element comes back empty, and the 'next_results' element is no longer returned at all. I haven't solved this yet, but since about 200,000 tweets were collected, I will analyze this data starting from the next article. **Update:** As pointed out in a comment, it seems the Search API only returns tweets from about the past week.

Continued in Part 2.

The full code described on this page is here

Referenced pages

- Access the Twitter API with Python
- Twitter official REST API documentation
