I collected live baseball tweets for my university research, so this article summarizes the process. It mainly covers the scraping involved and how to collect large numbers of tweets with tweepy.
I kept collecting live tweets for the 2019 NPB (Nippon Professional Baseball) season via hashtags, running a Python script with cron every day for a year.
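For reference, a daily cron job for this might look like the entry below; the schedule and paths are placeholders, not the actual setup:

```
# placeholder schedule: run the collection script every day at 03:00
0 3 * * * /usr/bin/python3 /path/to/getLiveTweet_NPB.py
```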
The hashtags searched are listed below. There may be active hashtags I have not discovered yet; for Hanshin, would "#tiger van" be one?
| Central League | Hashtags | Pacific League | Hashtags |
|---|---|---|---|
| Giants | #kyojin, #giants | Nippon-Ham | #lovefighters |
| Chunichi | #dragons | SoftBank | #sbhawks |
| Hiroshima | #carp | Rakuten | #rakuteneagles |
| Yakult | #swallows, #yakultswallows | Seibu | #seibulions |
| Hanshin | #hanshin, #tigers | Lotte | #chibalotte |
| DeNA | #baystars | Orix | #Orix_Buffaloes |
Game schedules and results are obtained by scraping websites that provide breaking sports news.

Scraping destinations:
- SPORTS BULL (https://sportsbull.jp/stats/npb/)
- Sports Navi (by Yahoo! JAPAN) (https://baseball.yahoo.co.jp/npb/schedule/)
The hashtags are kept in a dictionary (tag_list) whose keys are team IDs I assigned myself and whose values are the hashtag queries used when searching for tweets.
```python
tag_list = {0: '#kyojin OR #giants', 1: '#dragons',
            2: '#carp', 3: '#swallows OR #yakultswallows', 4: '#hanshin OR #tigers',
            5: '#baystars', 6: '#lovefighters', 7: '#sbhawks', 8: '#rakuteneagles',
            9: '#seibulions', 10: '#chibalotte', 11: '#Orix_Buffaloes'}
```
The code is published on GitHub → here
The libraries used are as follows; install them as needed.
getLiveTweet_NPB.py
```python
import datetime
import sqlite3
import time
import urllib.request as req
from contextlib import closing

import tweepy
from bs4 import BeautifulSoup
```
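Of these, only tweepy and beautifulsoup4 are third-party packages (the rest are in the standard library), so a typical install would be:

```
pip install tweepy beautifulsoup4
```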
The games played on the specified date are scraped from SPORTS BULL (https://sportsbull.jp/stats/npb/). The same information is available from Sports Navi, but SPORTS BULL has a simpler HTML structure.
getLiveTweet_NPB.py
```python
def get_gameteamId(gamedate):
    url = 'https://sportsbull.jp/stats/npb/home/index/' + gamedate
    print(url)
    res = req.urlopen(url)
    soup = BeautifulSoup(res, 'html.parser')
    q = soup.select('.game-block a')
    gameId_list = []
    flag_list = [1 for i in range(12)]  # one flag per team slot; 0 = game cancelled
    i = 0
    for p in q:
        urls = p.get('href')
        # Handle cancelled games (the page shows cancellations in Japanese)
        p_ = p.select('.st-03')
        for p__ in p_:
            if '中止' in str(p__.text):  # '中止' = cancelled
                print('Cancelled')
                flag_list[i] = 0
                flag_list[i + 1] = 0
        if flag_list[i] == 1:
            print(urls[-10:])
            gameId_list.append(urls[-10:])  # the game ID is the last 10 characters of the URL
        i += 2
    print('flag_list: ', flag_list)

    q = soup.select('.game-block .play-box01 dt')
    teamId_list = []
    # Team names as displayed on the page (Japanese short forms)
    teamId_dict = {'巨人': 0, '中日': 1, '広島': 2, 'ヤクルト': 3, '阪神': 4, 'DeNA': 5,
                   '日本ハム': 6, 'ソフトバンク': 7, '楽天': 8, '西武': 9, 'ロッテ': 10, 'オリックス': 11}
    i = 0
    for p in q:
        if flag_list[i] == 1:
            teamId_list.append(teamId_dict[p.text])
        i += 1
    return gameId_list, teamId_list


# Date string n days ago, formatted as YYYYMMDD
def get_date(days_ago):
    date = datetime.date.today()
    date -= datetime.timedelta(days=days_ago)
    date_str = str(date)
    date_str = date_str[:4] + date_str[5:7] + date_str[8:10]
    return date_str


# Example --------------------------
n = 1
game_date = get_date(n)   # automatic: data from n days ago
game_date = '20200401'    # manual input (overrides the line above)
print('Get the data of', game_date)
# ----------------------------------

# Lists of game IDs and team IDs
gameId_list, teamId_list = get_gameteamId(game_date)
print('gameId_list:', gameId_list)
print('teamId_list:', teamId_list)
```
Example of the execution result:

```
Get the data of 20200401
https://sportsbull.jp/stats/npb/home/index/20200401
flag_list:  [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
gameId_list: ['2020040101', '2020040102']
teamId_list: [0, 1, 2, 3]
```
In this case, two games were played: Giants (home) vs. Chunichi (away) as gameId = 2020040101, and Hiroshima (home) vs. Yakult (away) as gameId = 2020040102.
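As the collection loop at the end of this article uses, teamId_list lines up with gameId_list so that for gameId_list[i] the home team is teamId_list[2*i] and the away team is teamId_list[2*i+1]. A quick check with the example values above:

```python
gameId_list = ['2020040101', '2020040102']
teamId_list = [0, 1, 2, 3]

# home team at index 2*i, away team at index 2*i+1
for i, game_id in enumerate(gameId_list):
    home, away = teamId_list[2 * i], teamId_list[2 * i + 1]
    print(game_id, 'home teamId:', home, 'away teamId:', away)
```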
Each game page on Yahoo! Sports Navi (https://baseball.yahoo.co.jp/npb/schedule/), at https://baseball.yahoo.co.jp/npb/game/[game_id]/top, provides the start time and the match duration, so adding them together gives the start and end times.
getLiveTweet_NPB.py
```python
# Get the start time and end time of the match by scraping
def gametime(game_id):
    url = 'https://baseball.yahoo.co.jp/npb/game/' + game_id + '/top'
    res = req.urlopen(url)
    soup = BeautifulSoup(res, 'html.parser')
    times = []
    # Start time (the last characters of the stadium text, e.g. '18:00')
    css_select = '#gm_match .gamecard .column-center .stadium'
    q = soup.select(css_select)
    times.append(q[0].text[-6:-4])  # hours
    times.append(q[0].text[-3:-1])  # minutes
    # Match duration (shown on the page as e.g. '3時間15分' = 3 hours 15 minutes)
    minutes = []
    while True:
        try:
            css_select = '#yjSNLiveDetaildata td'
            q = soup.select(css_select)
            minutes = q[1].text.split('時間')  # split hours from minutes
            minutes[1] = minutes[1][:-1]       # drop the trailing '分'
            break
        except IndexError:
            # The duration only appears once the game is over; re-fetch and retry
            time.sleep(60)
            res = req.urlopen(url)
            soup = BeautifulSoup(res, 'html.parser')
    return times + minutes
```
Output of this function (note that the values are strings taken from the scraped text; they are converted with int() later):

```python
# start time 18:00, match duration 3 hours 15 minutes
['18', '00', '3', '15']
```
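get_livetweet() below turns this into the search window by adding the duration plus a 5-minute margin to the start time. For the example above:

```python
times = ['18', '00', '3', '15']
eh = int(times[0]) + int(times[2])       # end hour:   18 + 3      = 21
em = int(times[1]) + int(times[3]) + 5   # end minute: 0 + 15 + 5  = 20
print('{0:02d}:{1:02d}'.format(eh, em))  # -> 21:20
```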
Tweets in that window are collected with the Twitter search API. One request returns up to 100 tweets, so requests are repeated while paginating with max_id; when the rate limit is hit, the script pauses for 15 minutes.
The target is every tweet from the start of the game until 5 minutes after it ends.
getLiveTweet_NPB.py
```python
# Twitter API credentials
APIK = 'consumer_key'
APIS = 'consumer_secret'
AT = 'access_token'
AS = 'access_token_secret'
auth = tweepy.OAuthHandler(APIK, APIS)
auth.set_access_token(AT, AS)
api = tweepy.API(auth)


# Twitter API search
def search_livetweet(team_num, api, game_id, query):
    print(query)  # start from the latest tweets
    print('Search page: 1')
    try:
        tweet_data = api.search(q=query, count=100)
    except tweepy.TweepError as e:
        print('Error: wait 15 minutes')
        time.sleep(60 * 15)
        tweet_data = api.search(q=query, count=100)
    if len(tweet_data) == 0:
        return
    table_name = 'team' + str(team_num)
    # This function saves the tweets to the database
    saveDB_tweet(table_name, 0, tweet_data, game_id)
    print('************************************************\n')
    next_max_id = tweet_data[-1].id
    page = 1
    while True:
        page += 1
        print('Search page: ' + str(page))
        try:
            tweet_data = api.search(q=query, count=100, max_id=next_max_id - 1)
            if len(tweet_data) == 0:
                break
            else:
                next_max_id = tweet_data[-1].id
                # This function saves the tweets to the database
                saveDB_tweet(table_name, page - 1, tweet_data, game_id)
        except tweepy.TweepError as e:
            print('Error: wait 15 minutes')
            print(datetime.datetime.now().strftime("%Y/%m/%d %H:%M:%S"))
            print(e.reason)
            time.sleep(60 * 15)
            continue
    print('*' * 40 + '\n')


# Specify the time -> build the query -> search tweets (search_livetweet())
def get_livetweet(team_id, game_id):
    date = game_id[:4] + '-' + game_id[4:6] + '-' + game_id[6:8]
    times = gametime(game_id)
    sh, sm = times[0], times[1]
    eh = int(times[0]) + int(times[2])
    em = int(times[1]) + int(times[3]) + 5  # until 5 minutes after the end
    while em >= 60:
        em -= 60
        eh += 1
    eh = '{0:02d}'.format(eh)
    em = '{0:02d}'.format(em)
    print(date, sh, sm, eh, em)
    tag_list = {0: '#kyojin OR #giants', 1: '#dragons',
                2: '#carp', 3: '#swallows OR #yakultswallows', 4: '#hanshin OR #tigers',
                5: '#baystars', 6: '#lovefighters', 7: '#sbhawks', 8: '#rakuteneagles',
                9: '#seibulions', 10: '#chibalotte', 11: '#Orix_Buffaloes'}
    tag = tag_list[team_id]
    query = tag + ' exclude:retweets exclude:replies' \
            + ' since:' + date + '_' + sh + ':' + sm + ':00_JST' \
            + ' until:' + date + '_' + eh + ':' + em + ':59_JST lang:ja'
    search_livetweet(team_id, api, game_id, query)
```
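For the Giants in the example above (a game on 2020-04-01 from 18:00 to about 21:15), the assembled query would look like this:

```
#kyojin OR #giants exclude:retweets exclude:replies since:2020-04-01_18:00:00_JST until:2020-04-01_21:20:59_JST lang:ja
```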
Using the gameId_list and teamId_list created above, get the tweets of both teams for each game.
getLiveTweet_NPB.py
```python
for i in range(len(gameId_list)):
    game_id = gameId_list[i]
    # away
    team_id = teamId_list[2 * i + 1]
    get_livetweet(team_id, game_id)
    print('=' * 60 + '\n')
    # home
    team_id = teamId_list[2 * i]
    get_livetweet(team_id, game_id)
    print('=' * 60 + '\n')
```
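The saveDB_tweet() function called above is not shown in this article; it saves the tweets with sqlite3, which is why sqlite3 and contextlib.closing are imported. Below is a minimal sketch of what such a function could look like; the database file name and table schema are my assumptions for illustration, not the actual implementation:

```python
# Hypothetical sketch: DB file name and schema are assumptions, not the original implementation
def saveDB_tweet(table_name, page, tweet_data, game_id):
    # 'page' (the search page number) is accepted for interface compatibility but unused here
    with closing(sqlite3.connect('livetweet.db')) as conn:
        c = conn.cursor()
        c.execute('CREATE TABLE IF NOT EXISTS {0} '
                  '(tweet_id INTEGER PRIMARY KEY, game_id TEXT, created_at TEXT, text TEXT)'
                  .format(table_name))
        for tweet in tweet_data:
            c.execute('INSERT OR IGNORE INTO {0} VALUES (?, ?, ?, ?)'.format(table_name),
                      (tweet.id, game_id, str(tweet.created_at), tweet.text))
        conn.commit()
```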
When a game is interrupted by rain, some tweets may be missed; that part still needs improvement.
The technique of collecting tweets within a specified time window can be used in any domain, so I hope this is helpful.