This is an explanation of how to collect Twitter data with Python and how to detect bursts in time-series text data.
Technically, it's similar to the previous article below.
Past articles: I collected tweets about "Princess Kuppa" with python and tried burst detection https://qiita.com/pocket_kyoto/items/de4b512b8212e53bbba3
To confirm the versatility of the method adopted previously, I collected Twitter data as of March 10, 2020 using "corona" as the keyword and tried to detect bursts of words that co-occur with "corona".
The collection method is basically the same as in the past article.
First, we prepare for tweet collection, for example by loading the required libraries.
# Login key information for collecting Twitter data
KEYS = {  # List the keys obtained with your account
    'consumer_key': '*********************',
    'consumer_secret': '*********************',
    'access_token': '*********************',
    'access_secret': '*********************',
}

# Collection of Twitter data (preparation for collection)
import json
from requests_oauthlib import OAuth1Session

twitter = OAuth1Session(KEYS['consumer_key'], KEYS['consumer_secret'],
                        KEYS['access_token'], KEYS['access_secret'])
For information on how to obtain the login keys needed to collect Twitter data, the site in Reference [1] is easy to follow.
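Before starting a long collection run, it may be worth checking that the keys actually work. A minimal sanity check (assuming the v1.1 account/verify_credentials endpoint is accessible with your keys) could look like this:

# Optional sanity check: should print 200 if the keys are valid
res = twitter.get("https://api.twitter.com/1.1/account/verify_credentials.json")
print(res.status_code)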
The function for collecting tweets is defined as follows. Since the tweet location is not used this time, the location arguments default to None. Also, since the Search API returns at most 100 tweets per request, requests have to be repeated in a loop; it turned out cleaner to manage that loop outside the data-acquisition function, so it is implemented that way. This part follows the approach of Reference [2].
# Twitter data acquisition function
def getTwitterData(key_word, latitude=None, longitude=None, radius=None, mid=-1):
    url = "https://api.twitter.com/1.1/search/tweets.json"
    params = {'q': key_word, 'count': '100', 'result_type': 'recent'}  # Acquisition parameters
    if latitude is not None:  # Only latitude is checked
        params['geocode'] = '%s,%s,%skm' % (latitude, longitude, radius)  # Add the location condition
    params['max_id'] = mid  # Get only tweets with IDs older than mid

    req = twitter.get(url, params=params)

    if req.status_code == 200:  # Normal response
        tweets = json.loads(req.text)['statuses']  # Get tweet information from the response

        # Work out the oldest tweet ID in this batch (there is probably a smarter way to write this)
        user_ids = []
        for tweet in tweets:
            user_ids.append(int(tweet['id']))
        if len(user_ids) > 0:
            min_user_id = min(user_ids)
        else:
            min_user_id = -1

        # Meta information on the rate limit
        limit = req.headers['x-rate-limit-remaining'] if 'x-rate-limit-remaining' in req.headers else 0
        reset = req.headers['x-rate-limit-reset'] if 'x-rate-limit-reset' in req.headers else 0

        return {'tweets': tweets, 'min_user_id': min_user_id, 'limit': limit, 'reset': reset}

    else:  # Error response
        print("Failed: %d" % req.status_code)
        return {}
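For example, a single request can be tested like this (a small usage sketch for the function above; it assumes the OAuth session created earlier):

# Fetch up to 100 recent tweets containing "corona" and inspect the rate-limit metadata
res = getTwitterData("corona")
if res:
    print(len(res['tweets']), "tweets, remaining requests:", res['limit'])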
To call the above function repeatedly, I created a control function (getTwitterDataRepeat). To avoid hitting the request limit, it automatically waits when the remaining request quota is about to run out.
# Continuous acquisition of Twitter data
import datetime, time

def getTwitterDataRepeat(key_word, latitude=None, longitude=None, radius=None, mid=-1, repeat=10):
    tweets = []
    for i in range(repeat):
        res = getTwitterData(key_word, latitude, longitude, radius, mid)
        if 'tweets' not in res:  # Stop if an error occurred
            break
        else:
            sub_tweets = res['tweets']
            for tweet in sub_tweets:
                tweets.append(tweet)
        if int(res['limit']) == 0:  # Take a break if the rate limit is reached
            # Calculate the waiting time and resume 5 seconds after the limit resets
            now_unix_time = time.mktime(datetime.datetime.now().timetuple())  # Current time
            diff_sec = int(res['reset']) - now_unix_time
            print("sleep %d sec." % (diff_sec + 5))
            if diff_sec > 0:
                time.sleep(diff_sec + 5)
        mid = res['min_user_id'] - 1

    print("Number of tweets acquired: %s" % len(tweets))
    return tweets
With this implementation, tweets can be collected automatically without worrying about the request limit. After that, I wanted to collect tweets separately for each time slot, so I ran the following script.
# Function borrowed from Reference [3]
import time, calendar

def YmdHMS(created_at):
    time_utc = time.strptime(created_at, '%a %b %d %H:%M:%S +0000 %Y')
    unix_time = calendar.timegm(time_utc)
    time_local = time.localtime(unix_time)  # Convert to local time (fixed on 2018/9/24)
    return time.strftime("%Y/%m/%d %H:%M:%S", time_local)
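For example (the exact output depends on the local time zone of the machine; a JST machine is assumed here):

# Convert Twitter's created_at format (UTC) into a local-time string
print(YmdHMS("Tue Mar 10 12:34:56 +0000 2020"))  # -> "2020/03/10 21:34:56" on a JST machine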
# Get tweets about corona every 6 hours, going back one week
tweet_corona = {}
mid = -1
for t in range(4*7):
    tweets = getTwitterDataRepeat("corona", mid=mid, repeat=10)
    old_tweet = tweets[-1]  # The oldest tweet collected in this batch
    key = YmdHMS(old_tweet["created_at"])  # Convert with the YmdHMS function
    tweet_corona[key] = tweets  # Save using the time of the oldest tweet as the key
    mid = old_tweet["id"] - 15099494400000*6  # Go back about 6 hours
I wanted to collect tweets going back roughly 6 hours at a time, so I subtract 15,099,494,400,000 × 6 from the ID of the oldest tweet. The value 15,099,494,400,000 is determined by the specification of Twitter's tweet IDs: a tweet ID packs a millisecond timestamp, the number of the machine that issued the ID, and a sequence number into 64 bits (Reference [4]). The timestamp occupies the upper bits (shifted left by 22 bits), so 15,099,494,400,000 = 3,600,000 ms × 2^22 corresponds to exactly one hour.
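As a quick check of that constant (a small illustrative snippet; only the timestamp portion of the ID layout matters here):

# One hour expressed in tweet-ID units: 3,600,000 ms shifted left by 22 bits
ONE_HOUR_IN_ID_UNITS = 3_600_000 * (1 << 22)
print(ONE_HOUR_IN_ID_UNITS)      # 15099494400000
print(ONE_HOUR_IN_ID_UNITS * 6)  # the offset used above to go back about 6 hours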
So far, we have collected tweets containing "corona" in chronological order. First, to get a feel for the data, I would like to visualize the word frequencies over time.
I defined the following function, which morphologically analyzes the tweets with janome and counts the frequency of each word.
# Morphological analysis of the tweet text and conversion to Bag of Words
from janome.tokenizer import Tokenizer
import collections
import re

def CountWord(tweets):
    tweet_list = [tweet["text"] for tweet in tweets]
    all_tweet = "\n".join(tweet_list)

    t = Tokenizer()

    # Use the base form, nouns only, drop single characters, and keep only runs of kanji, hiragana, and katakana
    c = collections.Counter(token.base_form for token in t.tokenize(all_tweet)
                            if token.part_of_speech.startswith('名詞')  # '名詞' is janome's POS tag for nouns
                            and len(token.base_form) > 1
                            and token.base_form.isalpha()
                            and not re.match('^[a-zA-Z]+$', token.base_form))

    freq_dict = {}
    mc = c.most_common()
    for elem in mc:
        freq_dict[elem[0]] = elem[1]

    return freq_dict
WordCloud was used as the visualization method. I implemented it as follows.
# Word Cloud visualization function
def color_func(word, font_size, position, orientation, random_state, font_path):
    return 'white'

from wordcloud import WordCloud
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')

from matplotlib.font_manager import FontProperties
fp = FontProperties(fname=r'C:\WINDOWS\Fonts\meiryo.ttc', size=50)  # Japanese font support

def DrawWordCloud(word_freq_dict, fig_title):
    # Change the default settings, e.g. switch the colormap to "rainbow"
    wordcloud = WordCloud(background_color='white', min_font_size=15,
                          font_path=r'C:\WINDOWS\Fonts\meiryo.ttc',
                          max_font_size=200, width=1000, height=500,
                          prefer_horizontal=1.0, relative_scaling=0.0, colormap="rainbow")
    wordcloud.generate_from_frequencies(word_freq_dict)
    plt.figure(figsize=[20, 20])
    plt.title(fig_title, fontproperties=fp)
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis("off")
    plt.show()
With these functions in place, we can visualize the word frequencies for each time slot in chronological order, as in the sketch below.
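The glue code that applies CountWord to each time slot and draws a word cloud per slot is not shown explicitly above; a minimal sketch (assuming the tweet_corona dict and the functions defined earlier) could look like this. The resulting datetime_freq_dicts list is reused in the burst-detection step later.

# Apply CountWord to each time slot in chronological order and draw a word cloud per slot
datetime_freq_dicts = []
for key in sorted(tweet_corona.keys()):        # keys are the timestamps of the oldest tweet in each slot
    freq_dict = CountWord(tweet_corona[key])   # word frequencies for this slot
    datetime_freq_dicts.append(freq_dict)
    DrawWordCloud(freq_dict, key)              # use the timestamp as the figure title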
Output: word cloud images for each time slot (omitted)
Words that naturally co-occur with "corona", such as "new type", "virus", and "infection", dominate the results. Since this visualization makes it hard to see which words became hot topics because of "corona", we will try to detect them automatically.
Using the data set collected this time and a method called burst detection, I would like to automatically detect words that became hot topics due to the influence of "corona". Burst detection is covered in the book "Machine Learning of Web Data (Machine Learning Professional Series)", but there are few explanatory articles about it on the web. This time, I implement and apply a burst detection method with reference to the commentary page of the Inui-Suzuki Laboratory at Tohoku University (Reference [5]), a laboratory well known for natural language processing research.
This time, I detected bursts using an indicator called Moving Average Convergence Divergence (MACD). The method proposed by Kleinberg in 2002 is often used as a baseline for burst detection, but MACD, proposed by He and Parker in 2010, appears to be simpler and computationally cheaper.
The following explanation of MACD is quoted directly from the Inui-Suzuki Laboratory page, as it is easy to follow.
[Explanation of MACD]
The MACD at a given time is defined as:

MACD = (exponential moving average of the time-series values over the past f periods) − (exponential moving average of the time-series values over the past s periods)
Signal = (exponential moving average of the MACD values over the past t periods)
Histogram = MACD − Signal

Here, f, s, and t are parameters (f < s), written collectively as MACD(f, s, t). In this experiment, MACD(4, 8, 5), which was also used in the experiments of He and Parker (2010), was adopted. When MACD is used as a technical indicator in finance, "Signal < MACD" is read as an upward trend, "MACD < Signal" as a downward trend, and the Histogram is said to indicate the strength of the trend. Here, 15-minute bins are used, and the number of occurrences of a word on Twitter within each bin divided by 15, i.e. the occurrence rate [times/minute], is taken as the observed value for trend analysis with MACD. Since the exponential moving averages needed for MACD can be computed incrementally, this trend analysis can be implemented as a streaming algorithm, which makes it well suited to trend analysis on big data.
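As the quoted explanation notes, the exponential moving average can be updated incrementally, which is what makes a streaming implementation possible. A minimal sketch of such an update (for illustration only; the MACDData class below uses simple moving averages over fixed windows rather than this incremental form):

def update_ema(prev_ema, new_value, n):
    """Incrementally update an n-period exponential moving average."""
    alpha = 2 / (n + 1)  # common smoothing factor for an n-period EMA
    return alpha * new_value + (1 - alpha) * prev_ema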
From the above explanation, MACD was implemented as follows.
# Moving Average Convergence Divergence (MACD) calculation
class MACDData():
    def __init__(self, f, s, t):
        self.f = f
        self.s = s
        self.t = t

    def calc_macd(self, freq_list):
        n = len(freq_list)
        self.macd_list = []
        self.signal_list = []
        self.histgram_list = []
        for i in range(n):
            if i < self.f:
                self.macd_list.append(0)
                self.signal_list.append(0)
                self.histgram_list.append(0)
            else:
                # Note: simple moving averages over fixed windows are used here in place of exponential moving averages
                macd = (sum(freq_list[i-self.f+1:i+1]) / len(freq_list[i-self.f+1:i+1])
                        - sum(freq_list[max(0, i-self.s):i+1]) / len(freq_list[max(0, i-self.s):i+1]))
                self.macd_list.append(macd)
                signal = sum(self.macd_list[max(0, i-self.t+1):i+1]) / len(self.macd_list[max(0, i-self.t+1):i+1])
                self.signal_list.append(signal)
                histgram = macd - signal
                self.histgram_list.append(histgram)
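A quick check with a made-up frequency series (hypothetical numbers, just to show how the class is used):

# Toy example: a word that suddenly spikes in the 6th period
macd_data = MACDData(4, 8, 5)
macd_data.calc_macd([1, 1, 2, 1, 1, 8, 9, 10, 3, 2, 1, 1])
print(max(macd_data.histgram_list))  # a large maximum Histogram value indicates a burst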
Using this program, I would like to automatically detect words that became hot topics due to the influence of corona between Wednesday, March 4, 2020 and Tuesday, March 10, 2020.
# Burst detection for terms ranked in the top 100 words of the tweets in each time slot
top_100_words = []
i = 0
for freq_dict in datetime_freq_dicts:
    for k, v in freq_dict.items():
        top_100_words.append(k)
        i += 1
        if i >= 100:
            i = 0
            break

top_100_words = list(set(top_100_words))  # Keep only unique words
print(len(top_100_words))
# Acquisition of the MACD calculation results
word_list_dict = {}
for freq_dict in datetime_freq_dicts:
    for word in top_100_words:
        if word not in word_list_dict:
            word_list_dict[word] = []
        if word in freq_dict:
            word_list_dict[word].append(freq_dict[word])
        else:
            word_list_dict[word].append(0)

# Normalization
word_av_list_dict = {}
for k, v in word_list_dict.items():
    word_av_list = [elem / sum(v) for elem in v]
    word_av_list_dict[k] = word_av_list

# Calculation (same parameters as He and Parker (2010))
f = 4
s = 8
t = 5
word_macd_dict = {}
for k, v in word_av_list_dict.items():
    word_macd_data = MACDData(f, s, t)
    word_macd_data.calc_macd(v)
    word_macd_dict[k] = word_macd_data

# Burst detection
word_burst_dict = {}
for k, v in word_macd_dict.items():
    burst = max(v.histgram_list)  # The Histogram shows the strength of the trend, so take the maximum within the period
    word_burst_dict[k] = burst
The result of inputting the data is as follows.
i = 1
for k, v in sorted(word_burst_dict.items(), key=lambda x: -x[1]):
    print(str(i) + "Rank: " + str(k))
    i += 1
Output:
1st place: Kuro
2nd place: Lotte Marines
3rd place: Ground
4th place: Ward office
5th place: Dignity
6th place: brim
7th place: Self-study
8th place: Deliveryman
9th place: Methanol
10th place: Kohoku
11th place: Serum
12th place: Eplus
13th place: Harassment
14th place: Equipment
15th place: Snack
16th place: Sagawa Express
17th place: Libero
18th place: Miyuki
19th place: Goddess
20th place: Psychedelic
21st place: Live
22nd place: Yokohama City University
23rd place: Depression
24th place: whole volume
25th place: Korohara
26th place: Epizootic
27th place: Refund
28th place: Appearance
29th place: Obligation
30th place: Display
(Remaining ranks omitted)
"Tsuba", "Kuro", "Lotte Marines", etc. were detected as words that became a hot topic due to the influence of "Corona". The results for the other words were generally convincing.
Next, I also tried to estimate when each word became a hot topic.
# Visualization of the results
import numpy as np
import matplotlib.pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')

from matplotlib.font_manager import FontProperties
fp = FontProperties(fname=r'C:\WINDOWS\Fonts\meiryo.ttc', size=10)  # Japanese font support

x = np.array(sorted(tweet_corona.keys()))
y1 = np.array(word_macd_dict["Lotte Marines"].histgram_list)
y2 = np.array(word_macd_dict["Self-study"].histgram_list)
y3 = np.array(word_macd_dict["Deliveryman"].histgram_list)
y4 = np.array(word_macd_dict["methanol"].histgram_list)
y5 = np.array(word_macd_dict["snack"].histgram_list)
y6 = np.array(word_macd_dict["harassment"].histgram_list)

plt.plot(x, y1, marker="o")
plt.plot(x, y2, marker="+", markersize=10, markeredgewidth=2)
plt.plot(x, y3, marker="s", linewidth=1)
plt.plot(x, y4, marker="o")
plt.plot(x, y5, marker="+", markersize=10, markeredgewidth=2)
plt.plot(x, y6, marker="s", linewidth=1)
plt.xticks(rotation=90)
plt.title("Burst detection result", fontproperties=fp)
plt.xlabel("Date and time", fontproperties=fp)
plt.ylabel("Burst detection result", fontproperties=fp)
plt.ylim([0, 0.2])
plt.legend(["Lotte Marines", "Self-study", "Deliveryman", "methanol", "snack", "harassment"], loc="best", prop=fp)
plt.show()
The visualization result is as follows.
The Yakult Swallows vs. Lotte Marines game played without spectators was held on Saturday, March 7, so the timing seems to have been estimated correctly. As of Tuesday, March 10, "methanol" appears to have been one of the hottest words.
The results of running the same analysis on data from Wednesday, March 11 to Wednesday, March 18 are as follows.
i = 1
for k, v in sorted(word_burst_dict.items(), key=lambda x: -x[1]):
    print(str(i) + "Rank: " + str(k))
    i += 1
Output:
1st place: Terms
2nd place: Saiyan
3rd place: Majestic Legon
4th place: tough
5th place: Civil
6th place: Earthling
7th place: Juan
8th place: City
9th place: Cannabis
10th place: Paraiso
11th place: Fighting conference
12th place: Ranbu
13th place: Laura Ashley
14th place: Musical
15th place: Impossible
16th place: Estimate
17th place: Honey
18th place: Chasing
19th place: Lemon
20th place: Performance
21st place: Receipt
22nd place: Sword
23rd place: Investigation
24th place: Macron
25th place: Crowdfunding
26th place: Okeya
27th place: Grandmother
28th place: Smile
29th place: Full amount
30th place: Owned
(Remaining ranks omitted)
These words were detected as ones that momentarily became hot topics.
The estimated times at which they became hot topics are as follows.
This time, I tried burst detection with the theme of "corona". Technically, it repeats the content of the past article, but I think reasonable analysis results were obtained. The past article's theme was "Princess Kuppa", and this time we were able to confirm that the method itself is highly versatile.
I would like to continue taking on the challenge of Twitter data analysis.
[1] [2019] Specific method to register with the Twitter API and obtain access keys and tokens: https://miyastyle.net/twitter-api
[2] Get a large amount of Starbucks Twitter data with Python and try data analysis, Part 1: https://qiita.com/kenmatsu4/items/23768cbe32fe381d54a2
[3] How to handle tweets acquired with the Streaming API: http://blog.unfindable.net/archives/4302
[4] Scalable numbering and snowflake: https://kyrt.in/2014/06/08/snowflake_c.html
[5] Tohoku University Inui-Suzuki Laboratory, Project 311 / Trend Analysis: http://www.cl.ecei.tohoku.ac.jp/index.php?Project%20311%2FTrend%20Analysis
[6] Dan He and D. Stott Parker (2010), "Topic Dynamics: An Alternative Model of 'Bursts' in Streams of Topics": https://dollar.biz.uiowa.edu/~street/HeParker10.pdf