I always owe Starbucks for letting me camp out for hours with my MacBook Air, so I'd like to give something back by analyzing some data. This article is about collecting a large number of tweets whose text contains "Starbucks" and seeing what data analysis can tell us about them. It's not stealth marketing, though it might count as such in the sense of giving back to Starbucks (・ω・)
Part 1: Collecting data with the Twitter REST APIs and storing it in mongoDB (this article) http://qiita.com/kenmatsu4/items/23768cbe32fe381d54a2
Part 2: Separation of spam from the acquired Twitter data http://qiita.com/kenmatsu4/items/8d88e0992ca6e443f446
Part 3: Why did the number of tweets increase after one day? http://qiita.com/kenmatsu4/items/02034e5688cc186f224b
Part 4: Visualization of location information hidden in Twitter http://qiita.com/kenmatsu4/items/114f3cff815aa5037535
If you ask Google-sensei for ["twitter api account"](https://www.google.co.jp/search?q=twitter+api+%E3%82%A2%E3%82%AB%E3%82%A6%E3%83%B3%E3%83%88), you will find many sites that clearly describe how to register, so refer to one of them to obtain the credentials for accessing the API (consumer_key, consumer_secret, access_token, access_secret).
It is assumed that a basic Python environment, including IPython, is already in place. If you have the libraries listed here, you should be mostly fine. In addition, install the authentication library used to access the Twitter REST APIs.
```
pip install requests_oauthlib
```
Also, since mongoDB is used to store the data, install it by referring to here and [here](http://qiita.com/hajimeni/items/3c93fd981e92f66a20ce). For an overview of mongoDB, see the "Thin Book of MongoDB".
To access mongoDB from Python, also install pymongo.
```
pip install pymongo
```
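As a quick sanity check that pymongo can reach mongoDB, something like the following should work. This is just a sketch; it assumes mongoDB is already running on the default port 27017, and uses the `starbucks` database name that appears later in this article.

```python
from pymongo import MongoClient

# Connect to the local mongoDB instance (default port assumed)
client = MongoClient('localhost', 27017)

# A cheap server command confirms the server is actually reachable
print(client.server_info()['version'])

# Databases and collections are created lazily on first insert,
# so referencing them before any data exists is fine
db = client.starbucks
print(db.tweetdata.count())  # 0 until tweets are inserted
```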
```python
from requests_oauthlib import OAuth1Session
from requests.exceptions import ConnectionError, ReadTimeout, SSLError
import json, datetime, time, pytz, re, sys, traceback, pymongo
#from pymongo import Connection  # Connection class is deprecated, so use MongoClient
from pymongo import MongoClient
from collections import defaultdict
import numpy as np

KEYS = { # List the keys obtained for your account below
        'consumer_key':    '**********',
        'consumer_secret': '**********',
        'access_token':    '**********',
        'access_secret':   '**********',
       }

twitter = None
connect = None
db = None
tweetdata = None
meta = None

def initialize():  # Twitter connection settings and connection to mongoDB
    global twitter, connect, db, tweetdata, meta
    twitter = OAuth1Session(KEYS['consumer_key'], KEYS['consumer_secret'],
                            KEYS['access_token'], KEYS['access_secret'])
    # connect = Connection('localhost', 27017)  # Connection class is deprecated, so use MongoClient
    connect = MongoClient('localhost', 27017)
    db = connect.starbucks
    tweetdata = db.tweetdata
    meta = db.metadata

initialize()
```
Use the code below to import tweets that include "Starbucks" in the text into mongoDB.
```python
# Get 100 tweets from the Twitter REST APIs for a given search word
def getTweetData(search_word, max_id, since_id):
    global twitter
    url = 'https://api.twitter.com/1.1/search/tweets.json'
    params = {'q': search_word,
              'count': '100',
              }
    # Set max_id if it is specified
    if max_id != -1:
        params['max_id'] = max_id
    # Set since_id if it is specified
    if since_id != -1:
        params['since_id'] = since_id

    req = twitter.get(url, params=params)  # Get the tweet data

    # Unpack the acquired data
    if req.status_code == 200:  # Success
        timeline = json.loads(req.text)
        metadata = timeline['search_metadata']
        statuses = timeline['statuses']
        limit = req.headers['x-rate-limit-remaining'] if 'x-rate-limit-remaining' in req.headers else 0
        reset = req.headers['x-rate-limit-reset'] if 'x-rate-limit-reset' in req.headers else 0
        return {"result": True, "metadata": metadata, "statuses": statuses,
                "limit": limit,
                "reset_time": datetime.datetime.fromtimestamp(float(reset)),
                "reset_time_unix": reset}
    else:  # Failure
        print("Error: %d" % req.status_code)
        return {"result": False, "status_code": req.status_code}
```
```python
# Convert a date string to a datetime in the Japan (JST) time zone
def str_to_date_jp(str_date):
    dts = datetime.datetime.strptime(str_date, '%a %b %d %H:%M:%S +0000 %Y')
    return pytz.utc.localize(dts).astimezone(pytz.timezone('Asia/Tokyo'))

# Return the current time as UNIX time
def now_unix_time():
    return time.mktime(datetime.datetime.now().timetuple())
```
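For example, Twitter returns `created_at` as a UTC string in the format parsed above, so the conversion to JST (UTC+9) can be checked like this (the sample timestamp is made up):

```python
# Twitter's created_at format, e.g. "Sun Mar 15 03:12:45 +0000 2015" (UTC)
d = str_to_date_jp("Sun Mar 15 03:12:45 +0000 2015")
print(d)  # 2015-03-15 12:12:45+09:00 -- shifted forward 9 hours to JST

print(now_unix_time())  # current time as a UNIX timestamp (float)
```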
Here is the tweet acquisition loop.
```python
#------------- Get tweet data repeatedly -------------#
sid = -1
mid = -1
count = 0
res = None
while True:
    try:
        count = count + 1
        sys.stdout.write("%d, " % count)
        res = getTweetData(u'Starbucks', max_id=mid, since_id=sid)
        if not res['result']:
            # Exit on failure
            print("status_code: %s" % res['status_code'])
            break
        if int(res['limit']) == 0:  # Rate limit reached, so take a break
            # Add a date-type field 'created_datetime' to the stored documents
            print("Adding created_datetime field.")
            for d in tweetdata.find({'created_datetime': {"$exists": False}},
                                    {'_id': 1, 'created_at': 1}):
                #print(str_to_date_jp(d['created_at']))
                tweetdata.update({'_id': d['_id']},
                                 {'$set': {'created_datetime': str_to_date_jp(d['created_at'])}})
            #remove_duplicates()

            # Calculate the wait time; resume 5 seconds after the limit resets
            diff_sec = int(res['reset_time_unix']) - now_unix_time()
            print("sleep %d sec." % (diff_sec + 5))
            if diff_sec > 0:
                time.sleep(diff_sec + 5)
        else:
            # Process the metadata
            if len(res['statuses']) == 0:
                sys.stdout.write("statuses is none. ")
            elif 'next_results' in res['metadata']:
                # Store the result in mongoDB
                meta.insert({"metadata": res['metadata'], "insert_date": now_unix_time()})
                for s in res['statuses']:
                    tweetdata.insert(s)
                # Extract max_id for the next (older) page from next_results
                next_url = res['metadata']['next_results']
                pattern = r".*max_id=([0-9]*)\&.*"
                ite = re.finditer(pattern, next_url)
                for i in ite:
                    mid = i.group(1)
                    break
            else:
                sys.stdout.write("next is none. finished.")
                break
    except SSLError as e:
        print("SSLError: {0}".format(e))
        print("waiting 5mins")
        time.sleep(5 * 60)
    except ConnectionError as e:
        print("ConnectionError: {0}".format(e))
        print("waiting 5mins")
        time.sleep(5 * 60)
    except ReadTimeout as e:
        print("ReadTimeout: {0}".format(e))
        print("waiting 5mins")
        time.sleep(5 * 60)
    except:
        print("Unexpected error: %s" % sys.exc_info()[0])
        print(traceback.format_exc())
        raise
    finally:
        info = sys.exc_info()
```
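Once the loop has been running for a while, the collected data can be inspected straight from pymongo. A minimal sketch, using the `tweetdata` and `meta` collections defined in `initialize()` above (note that `created_datetime` is only present on documents that have already been through the rate-limit housekeeping step):

```python
# How many tweets and metadata documents have been stored so far
print(tweetdata.count())
print(meta.count())

# Show the newest stored tweet by the created_datetime field
for d in tweetdata.find({'created_datetime': {"$exists": True}}) \
                  .sort('created_datetime', pymongo.DESCENDING).limit(1):
    print("%s: %s" % (d['created_datetime'], d['text']))
```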
Here is a description of the main items in the acquired tweet data.

| Item | Description |
|---|---|
| id | Tweet ID. Newer tweets have larger numbers, older tweets smaller ones. By specifying a value larger or smaller than this ID in a search, you can retrieve tweets from before or after it. |
| id_str | Apparently a string version of "id", but details are unclear since it is obtained as a string here in the first place. |
| user | User information. It has the following elements (only the typical ones are listed). |
| &emsp;id | User ID: a numeric ID not normally seen. |
| &emsp;name | The user's display name (the longer name). |
| &emsp;screen_name | The user name used when specifying a user with @ etc. |
| &emsp;description | User description; profile-like text. |
| &emsp;friends_count | Number of users this user follows |
| &emsp;followers_count | Number of followers |
| &emsp;statuses_count | Number of tweets (including retweets) |
| &emsp;favourites_count | Number of favorites |
| &emsp;location | Where the user lives |
| &emsp;created_at | Registration date of this user |
| text | Tweet body |
| retweeted_status | Whether it is a retweet (True: retweet / False: normal tweet) |
| retweeted | Whether it has been retweeted (True / False) |
| retweet_count | Number of retweets |
| favorited | Whether it has been favorited (True / False) |
| favorite_count | Number of favorites |
| coordinates | Latitude / longitude |
| entities | Additional information, shown below |
| &emsp;symbols | |
| &emsp;user_mentions | Information on users mentioned with @ in the body |
| &emsp;hashtags | Hashtags in the body |
| &emsp;urls | URL information in the body |
| source | Information about the app/site the tweet was posted from |
| lang | Language information |
| created_at | Tweet date and time |
| place | Location information related to the tweet |
| in_reply_to_screen_name | Screen name of the tweet being replied to, when the tweet is a reply |
| in_reply_to_status_id | Tweet ID of the tweet being replied to, when the tweet is a reply |
| in_reply_to_status_id_str | String version of in_reply_to_status_id |
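Because each status is stored in mongoDB as-is, the items in this table can be read back as ordinary dictionary keys. A small illustrative sketch (field names as in the table above):

```python
# Pull one stored tweet and read a few of the fields described above
t = tweetdata.find_one()
print(t['text'])                      # tweet body
print(t['user']['screen_name'])      # the @ user name
print(t['user']['followers_count'])  # number of followers
print('retweeted_status' in t)       # present only when the tweet is a retweet
```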
Here is a description of the metadata returned when searching via 'https://api.twitter.com/1.1/search/tweets.json'.

| Item | Description |
|---|---|
| query | The search word |
| count | Number of tweets retrieved in one search |
| completed_in | How many seconds the retrieval took to complete |
| max_id | Newest ID among the retrieved tweets |
| max_id_str | String version of max_id? (both appear to be strings, though...) |
| since_id | Oldest ID among the retrieved tweets |
| since_id_str | String version of since_id? (both appear to be strings, though...) |
| refresh_url | URL for retrieving newer tweets with the same search word |
| next_results | URL for retrieving older tweets with the same search word |
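For illustration, `next_results` comes back as a ready-made query string, so instead of the regular expression used in the loop above it could also be parsed with the standard library (a sketch; the sample string and ID are made up):

```python
import urlparse  # Python 2; in Python 3 use urllib.parse instead

# next_results looks something like this (hypothetical example)
next_results = "?max_id=577223119709048831&q=Starbucks&count=100&include_entities=1"

params = urlparse.parse_qs(next_results.lstrip('?'))
print(params['max_id'][0])  # "577223119709048831"
```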
When retrieval with GET search/tweets got somewhere past the 100,000-tweet mark, older tweets could no longer be fetched: the 'statuses' element came back empty, and the 'next_results' element stopped being returned at all. I haven't solved this at the moment, but since about 200,000 tweets were collected, I'll analyze this data starting from the next article. **Update:** As pointed out in a comment, the search API can only return tweets from the past week.
Continued in Part 2.

- The full code described on this page is here
- Access the Twitter API with Python
- Twitter official REST API document