I've always been interested in scraping and Python. As an introduction to both, I'd like to use Scrapy to extract batting results from a professional baseball game information site.

Scraping raises copyright questions. After investigating, it appears there is no legal problem with collecting and analyzing publicly available information (for sites that do not require membership).
I referred to the following:

- Notes on legal validity issues related to crawling and web scraping | Welcome to Singularity
- Let's talk about the law of web scraping! - Qiita
The implementation is based on the following book; please refer to it for details.

- [Python Crawling & Scraping: Practical Development Guide for Data Collection and Analysis | Gihyo Digital Publishing, Gijutsu-Hyoronsha](https://gihyo.jp/dp/ebook/2016/978-4-7741-8684-9)
I chose a site with the following characteristics as the scraping target:

- It is not a membership site.
- The batting record shows where each ball was hit, down to the fielding position (e.g. single to right, groundout to third).
- The batting record is updated during the game.
- Its HTML structure is easy to process by scraping.
By the way, the structure of the selected site is as follows.

```
○ Monthly game list page
 ├─→ ○ Game on the 1st: G vs. De ──→ Score (batting record page)
 ├─→ ○ Game on the 1st: Ys vs. C ──→ Score (batting record page)
 ├─→ ○ Game on the 2nd: G vs. De ──→ Score (batting record page)
 └─→ ...
```
However, since I plan to include some scraping source code, the site name and URL are not listed, so as not to inconvenience the source of the information.
For environment setup, I referred to the following:

- Python environment construction on Mac (pyenv, virtualenv, anaconda, ipython notebook) - Qiita
- Install scrapy in a python anaconda environment - Qiita

With reference to the above, I made the following settings:

- Pinned the Python version to the development directory
- Installed Scrapy
After these settings, the environment is in the following state.

```
(baseballEnv) 11:34AM pyenv versions [~/myDevelopPj/baseball]
  system
  3.5.1
  anaconda3-4.2.0
  anaconda3-4.2.0/envs/baseballEnv
* baseballEnv (set by /Users/username/myDevelopPj/baseball/.python-version)

(baseballEnv) 11:34AM scrapy version [~/myDevelopPj/baseball]
Scrapy 1.3.3
```
Create a project with the startproject command.

```
scrapy startproject scrapy_baseball
```
The result is as follows.

```
(baseballEnv) 7:32AM tree [~/myDevelopPj/baseball]
.
├── readme.md
└── scrapy_baseball
    ├── scrapy.cfg
    └── scrapy_baseball
        ├── __init__.py
        ├── __pycache__
        ├── items.py
        ├── middlewares.py
        ├── pipelines.py
        ├── settings.py
        └── spiders
            ├── __init__.py
            └── __pycache__

5 directories, 8 files
```
Set the following in settings.py. If you don't, the download interval defaults to 0 seconds, which puts a heavy load on the target site.

```python
DOWNLOAD_DELAY = 1
```
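Beyond the delay, Scrapy ships a few other politeness-related settings that can be combined with it. The values below are a minimal sketch for illustration, not the original project's configuration:

```python
# settings.py -- illustrative values, not the original project's configuration
DOWNLOAD_DELAY = 1                   # wait 1 second between requests
ROBOTSTXT_OBEY = True                # respect the target site's robots.txt
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # at most one request in flight per domain

# AutoThrottle adapts the delay to the server's observed response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
```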
Make sure you are in the directory where scrapy.cfg exists.
```
(baseballEnv) 10:13AM ls scrapy.cfg [~/myDevelopPj/baseball/scrapy_baseball]
scrapy.cfg
```
Execute `scrapy genspider battingResult xxxxxxxxx` (where xxxxxxxxx is the domain name). This creates battingResult.py:
```python
# -*- coding: utf-8 -*-
import scrapy


class BattingresultSpider(scrapy.Spider):
    name = "battingResult"
    allowed_domains = ["xxxxxxxxx"]
    start_urls = ['http://xxxxxxxxx/']

    def parse(self, response):
        pass
```
Then, implement as follows.
```python
# -*- coding: utf-8 -*-
import scrapy


class BattingresultSpider(scrapy.Spider):
    name = "battingResult"
    allowed_domains = ["xxxxxxxxx"]
    start_urls = ['http://xxxxxxxxx/']

    xpath_team_name_home = '//*[@id="wrapper"]/div/dl[2]/dt'
    xpath_batting_result_home = '//*[@id="wrapper"]/div/div[6]/table/tr'
    xpath_team_name_visitor = '//*[@id="wrapper"]/div/dl[3]/dt'
    xpath_batting_result_visitor = '//*[@id="wrapper"]/div/div[9]/table/tr'

    def parse(self, response):
        """
        Extract the links to individual games from the monthly game
        list page specified in start_urls.
        """
        for url in response.css("table.t007 tr td a").re(r'../pastgame.*?html'):
            yield scrapy.Request(response.urljoin(url), self.parse_game)

    def parse_game(self, response):
        """
        Extract each player's batting record from a game page.
        """
        # Home team data
        teamName = self.parse_team_name(response, self.xpath_team_name_home)
        print(teamName)
        self.parse_batting_result(response, self.xpath_batting_result_home)
        # Visitor team data
        teamName = self.parse_team_name(response, self.xpath_team_name_visitor)
        print(teamName)
        self.parse_batting_result(response, self.xpath_batting_result_visitor)

    def parse_team_name(self, response, xpath):
        teamName = response.xpath(xpath).css('dt::text').extract_first()
        return teamName

    def parse_batting_result(self, response, xpath):
        for record in response.xpath(xpath):
            playerName = record.css('td.player::text').extract_first()
            if playerName is None:
                continue  # skip rows without a player name (e.g. header rows)
            outputData = ''
            for result in record.css('td.result'):
                outputData = ','.join([outputData, result.css('::text').extract_first()])
            print(playerName + outputData)
```
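The spider above only prints its results. As a minimal sketch of how the same loop could yield structured items instead (standard Scrapy usage, but not part of the original implementation; the field names are hypothetical):

```python
# Hypothetical variant of parse_batting_result (not the original code):
# yield dicts instead of printing, so the results can be exported with
# Scrapy's feed exports (e.g. scrapy crawl battingResult -o results.csv).
def parse_batting_result(self, response, xpath):
    for record in response.xpath(xpath):
        playerName = record.css('td.player::text').extract_first()
        if playerName is None:
            continue
        results = [r.css('::text').extract_first() for r in record.css('td.result')]
        yield {
            'player': playerName,   # hypothetical field name
            'results': results,     # one entry per at-bat column
        }
```

In this variant, parse_game would also need to re-yield the items, e.g. `yield from self.parse_batting_result(response, xpath)`, for them to reach Scrapy's output.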
The monthly game list page is specified in start_urls. From there, the spider moves to each game page and extracts the batting results.

.css() returns the list of nodes matching a CSS selector; on that list, .re() keeps only the parts matching a regular expression, and a pseudo-selector such as css('td.player::text') extracts just the text node.
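As a minimal, self-contained illustration of these selector APIs (the HTML snippet is invented for the example; it is not the target site's markup):

```python
from scrapy import Selector

# Invented HTML snippet for demonstration purposes only
html = '''
<table class="t007">
  <tr><td><a href="../pastgame0101.html">G vs. De</a></td></tr>
  <tr><td class="player">Player A</td><td class="result">single</td></tr>
</table>
'''
sel = Selector(text=html)

# .css() returns a SelectorList of matching nodes;
# .re() then keeps only the substrings matching the regular expression
print(sel.css("table.t007 tr td a").re(r'../pastgame.*?html'))
# -> ['../pastgame0101.html']

# The ::text pseudo-selector extracts the text node of the matched element
print(sel.css('td.player::text').extract_first())
# -> 'Player A'
```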
The phenomenon that rows under tbody cannot be selected with xpath() is resolved below: browsers insert a tbody element when building the DOM, but it is usually absent from the raw HTML that Scrapy actually receives, so tbody has to be left out of the XPath. Not able to extract text from the td tag/element using python scrapy - Stack Overflow
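A minimal sketch of the difference (again with invented HTML):

```python
from scrapy import Selector

# Raw HTML as a server typically sends it: no <tbody> tag
html = '<table><tr><td>runs</td></tr></table>'
sel = Selector(text=html)

# Matches nothing: the raw HTML has no tbody element,
# even though the browser's DOM inspector shows one
print(sel.xpath('//table/tbody/tr/td/text()').extract())
# -> []

# Works: address the rows without tbody
print(sel.xpath('//table/tr/td/text()').extract())
# -> ['runs']
```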
The following helped me obtain and verify XPath expressions: Take and verify XPath in Chrome - Qiita
Execute the following command.

```
scrapy crawl battingResult
```
As a result, the following data was obtained (the original output was captured as an image). Since both the team name and the player name are extracted, players who share a name but play for different teams can be distinguished.
```
DeNA / batter results
Player A,groundout to 3B,-,groundout to 3B,-,double down the LF line,-,walk,-,groundout to SS
Player B,strikeout swinging,-,HR to left-center,-,strikeout swinging,-,single to LF,-,-
Player C,groundout to 1B,-,strikeout looking,-,single to LF,-,DP grounder to 3B,-,-
Player D,-,walk,single to RF,-,groundout to SS,-,-,fly to LF,-
Player E,-,HR to LF,fly to LF,-,-,single to CF,-,single to RF,-
Player F,-,groundout to SS,-,walk,-,fly to CF,-,strikeout swinging,-
Player G,-,groundout to 2B,-,double to right-center,-,fly to LF,-,pop to 2B,-
Player H,-,fly to RF,-,strikeout,-,strikeout swinging,-,-,-
Player I,-,-,-,-,-,-,-,-,-
Player J,-,-,-,-,-,-,-,-,-
Player K,-,-,-,-,-,-,-,-,-
Player L,-,-,-,-,-,-,-,-,fly to CF
Player M,-,-,-,-,-,-,-,-,-
Player N,-,-,reached on error,fly to LF,-,-,strikeout swinging,-,single to RF

Giants / batter results
Player 1,strikeout swinging,-,fly to RF,-,-,reached on error,-,groundout to 2B,-
Player 2,strikeout,-,-,strikeout,-,fly to LF,-,single to CF,-
Player 3,single to LF,-,-,groundout to 3B,-,HR to left-center,-,single to LF,-
Player 4,pop to 2B,-,-,single to CF,-,groundout to 2B,-,foul pop to 3B,-
Player 5,-,strikeout,-,groundout to 1B,-,groundout to 3B,-,fly to LF,-
Player 6,-,strikeout swinging,-,-,fly to RF,-,fly to RF,-,fly out
Player 7,-,strikeout,-,-,double to right-center,-,single to LF,-,fly to RF
Player 8,-,-,fly out,-,groundout to SS,-,groundout to 3B,-,-
Player 9,-,-,-,-,-,-,-,-,strikeout swinging
Player 10,-,-,groundout to 1B,-,-,-,-,-,-
Player 11,-,-,-,-,-,-,-,-,-
Player 12,-,-,-,-,strikeout swinging,-,-,-,-
Player 13,-,-,-,-,-,-,-,-,-
Player 14,-,-,-,-,-,-,strikeout,-,-
Player 15,-,-,-,-,-,-,-,-,-
```
As a result of this data acquisition, the following issues remain before achieving the goal of extracting data in a format that can record each player's batting results across a whole season.