I've always been interested in scraping and Python. As an introduction to both, I'd like to use Scrapy to extract batting results from a professional baseball game information site.

Scraping raises copyright questions. After investigating, it appears there is no legal problem with collecting and analyzing publicly available information (for sites that do not require membership).
I referred to the following:

- Notes on legal validity issues related to crawling and web scraping | Welcome to Singularity
- Let's talk about the law of web scraping! - Qiita
The implementation is based on the following book; please refer to it for details.

- [Python Crawling & Scraping: Practical Development Guide for Data Collection and Analysis | Gihyo Digital Publishing, Gijutsu-Hyoronsha](https://gihyo.jp/dp/ebook/2016/978-4-7741-8684-9)
I chose a site with the following characteristics as the scraping target:

- It is not a membership site.
- The batting record shows where each ball was hit, down to the fielding position (e.g. single to right, groundout to third).
- The batting record is updated during the game.
- Its HTML structure is easy to process by scraping.
By the way, the structure of the selected site is as follows.

```
○ Monthly game list page
 ├─→ ○ Game on the 1st: G vs. De ──→ Score (batting record page)
 ├─→ ○ Game on the 1st: Ys vs. C ──→ Score (batting record page)
 ├─→ ○ Game on the 2nd: G vs. De ──→ Score (batting record page)
 └─→ ...
```
However, since I plan to include some scraping source code, the site name and URL are not listed, so as not to inconvenience the source of the information.
For environment setup, I referred to the following:

- Python environment construction on Mac (pyenv, virtualenv, anaconda, ipython notebook) - Qiita
- Install scrapy in a python anaconda environment - Qiita

With reference to the above, I made the following settings:

- Pinned the Python version to the development directory
- Installed Scrapy
After these settings, the environment is in the following state.

```
(baseballEnv) 11:34AM pyenv versions [~/myDevelopPj/baseball]
  system
  3.5.1
  anaconda3-4.2.0
  anaconda3-4.2.0/envs/baseballEnv
* baseballEnv (set by /Users/username/myDevelopPj/baseball/.python-version)

(baseballEnv) 11:34AM scrapy version [~/myDevelopPj/baseball]
Scrapy 1.3.3
```
Create a project with the startproject command.

```
scrapy startproject scrapy_baseball
```
The result is as follows.

```
(baseballEnv) 7:32AM tree [~/myDevelopPj/baseball]
.
├── readme.md
└── scrapy_baseball
    ├── scrapy.cfg
    └── scrapy_baseball
        ├── __init__.py
        ├── __pycache__
        ├── items.py
        ├── middlewares.py
        ├── pipelines.py
        ├── settings.py
        └── spiders
            ├── __init__.py
            └── __pycache__

5 directories, 8 files
```
Set the following in settings.py. If you don't, the download interval defaults to 0 seconds, which puts a heavy load on the target site.

```python
DOWNLOAD_DELAY = 1
```
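Beyond the delay, Scrapy ships a few other politeness-related settings that can be combined with it. The values below are a minimal sketch for illustration, not the original project's configuration:

```python
# settings.py -- illustrative values, not the original project's configuration
DOWNLOAD_DELAY = 1                   # wait 1 second between requests
ROBOTSTXT_OBEY = True                # respect the target site's robots.txt
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # at most one request in flight per domain

# AutoThrottle adapts the delay to the server's observed response times
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
```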
Make sure you are in the directory where scrapy.cfg exists.
```
(baseballEnv) 10:13AM ls scrapy.cfg [~/myDevelopPj/baseball/scrapy_baseball]
scrapy.cfg
```
Execute `scrapy genspider battingResult xxxxxxxxx` (where xxxxxxxxx is the domain name). This creates battingResult.py:
```python
# -*- coding: utf-8 -*-
import scrapy


class BattingresultSpider(scrapy.Spider):
    name = "battingResult"
    allowed_domains = ["xxxxxxxxx"]
    start_urls = ['http://xxxxxxxxx/']

    def parse(self, response):
        pass
```
Then, implement as follows.
```python
# -*- coding: utf-8 -*-
import scrapy


class BattingresultSpider(scrapy.Spider):
    name = "battingResult"
    allowed_domains = ["xxxxxxxxx"]
    start_urls = ['http://xxxxxxxxx/']

    xpath_team_name_home = '//*[@id="wrapper"]/div/dl[2]/dt'
    xpath_batting_result_home = '//*[@id="wrapper"]/div/div[6]/table/tr'
    xpath_team_name_visitor = '//*[@id="wrapper"]/div/dl[3]/dt'
    xpath_batting_result_visitor = '//*[@id="wrapper"]/div/div[9]/table/tr'

    def parse(self, response):
        """
        Extract the links to individual games from the monthly game
        list page specified in start_urls.
        """
        for url in response.css("table.t007 tr td a").re(r'../pastgame.*?html'):
            yield scrapy.Request(response.urljoin(url), self.parse_game)

    def parse_game(self, response):
        """
        Extract each player's batting record from a game page.
        """
        # Home team data
        teamName = self.parse_team_name(response, self.xpath_team_name_home)
        print(teamName)
        self.parse_batting_result(response, self.xpath_batting_result_home)
        # Visitor team data
        teamName = self.parse_team_name(response, self.xpath_team_name_visitor)
        print(teamName)
        self.parse_batting_result(response, self.xpath_batting_result_visitor)

    def parse_team_name(self, response, xpath):
        teamName = response.xpath(xpath).css('dt::text').extract_first()
        return teamName

    def parse_batting_result(self, response, xpath):
        for record in response.xpath(xpath):
            playerName = record.css('td.player::text').extract_first()
            if playerName is None:
                continue  # skip rows without a player name (e.g. header rows)
            outputData = ''
            for result in record.css('td.result'):
                outputData = ','.join([outputData, result.css('::text').extract_first()])
            print(playerName + outputData)
```
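The spider above only prints its results. As a minimal sketch of how the same loop could yield structured items instead (standard Scrapy usage, but not part of the original implementation; the field names are hypothetical):

```python
# Hypothetical variant of parse_batting_result (not the original code):
# yield dicts instead of printing, so the results can be exported with
# Scrapy's feed exports (e.g. scrapy crawl battingResult -o results.csv).
def parse_batting_result(self, response, xpath):
    for record in response.xpath(xpath):
        playerName = record.css('td.player::text').extract_first()
        if playerName is None:
            continue
        results = [r.css('::text').extract_first() for r in record.css('td.result')]
        yield {
            'player': playerName,   # hypothetical field name
            'results': results,     # one entry per at-bat column
        }
```

In this variant, parse_game would also need to re-yield the items, e.g. `yield from self.parse_batting_result(response, xpath)`, for them to reach Scrapy's output.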
The monthly game list page is specified in start_urls. From there, the spider moves to each game page and extracts the batting results.

.css() returns the list of nodes matching a CSS selector; on that list, .re() keeps only the parts matching a regular expression, and a pseudo-selector such as css('td.player::text') extracts just the text node.
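As a minimal, self-contained illustration of these selector APIs (the HTML snippet is invented for the example; it is not the target site's markup):

```python
from scrapy import Selector

# Invented HTML snippet for demonstration purposes only
html = '''
<table class="t007">
  <tr><td><a href="../pastgame0101.html">G vs. De</a></td></tr>
  <tr><td class="player">Player A</td><td class="result">single</td></tr>
</table>
'''
sel = Selector(text=html)

# .css() returns a SelectorList of matching nodes;
# .re() then keeps only the substrings matching the regular expression
print(sel.css("table.t007 tr td a").re(r'../pastgame.*?html'))
# -> ['../pastgame0101.html']

# The ::text pseudo-selector extracts the text node of the matched element
print(sel.css('td.player::text').extract_first())
# -> 'Player A'
```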
The phenomenon that rows under tbody cannot be selected with xpath() is resolved below: browsers insert a tbody element when building the DOM, but it is usually absent from the raw HTML that Scrapy actually receives, so tbody has to be left out of the XPath. Not able to extract text from the td tag/element using python scrapy - Stack Overflow
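A minimal sketch of the difference (again with invented HTML):

```python
from scrapy import Selector

# Raw HTML as a server typically sends it: no <tbody> tag
html = '<table><tr><td>runs</td></tr></table>'
sel = Selector(text=html)

# Matches nothing: the raw HTML has no tbody element,
# even though the browser's DOM inspector shows one
print(sel.xpath('//table/tbody/tr/td/text()').extract())
# -> []

# Works: address the rows without tbody
print(sel.xpath('//table/tr/td/text()').extract())
# -> ['runs']
```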
The following helped me obtain and verify XPath expressions: Take and verify XPath in Chrome - Qiita
Execute the following command.

```
scrapy crawl battingResult
```
As a result, the following data was obtained (the original output was captured as an image). Since both the team name and the player name are extracted, players who share a name but play for different teams can be distinguished.
```
DeNA / batter results
Player A,groundout to 3B,-,groundout to 3B,-,double down the LF line,-,walk,-,groundout to SS
Player B,strikeout swinging,-,HR to left-center,-,strikeout swinging,-,single to LF,-,-
Player C,groundout to 1B,-,strikeout looking,-,single to LF,-,DP grounder to 3B,-,-
Player D,-,walk,single to RF,-,groundout to SS,-,-,fly to LF,-
Player E,-,HR to LF,fly to LF,-,-,single to CF,-,single to RF,-
Player F,-,groundout to SS,-,walk,-,fly to CF,-,strikeout swinging,-
Player G,-,groundout to 2B,-,double to right-center,-,fly to LF,-,pop to 2B,-
Player H,-,fly to RF,-,strikeout,-,strikeout swinging,-,-,-
Player I,-,-,-,-,-,-,-,-,-
Player J,-,-,-,-,-,-,-,-,-
Player K,-,-,-,-,-,-,-,-,-
Player L,-,-,-,-,-,-,-,-,fly to CF
Player M,-,-,-,-,-,-,-,-,-
Player N,-,-,reached on error,fly to LF,-,-,strikeout swinging,-,single to RF

Giants / batter results
Player 1,strikeout swinging,-,fly to RF,-,-,reached on error,-,groundout to 2B,-
Player 2,strikeout,-,-,strikeout,-,fly to LF,-,single to CF,-
Player 3,single to LF,-,-,groundout to 3B,-,HR to left-center,-,single to LF,-
Player 4,pop to 2B,-,-,single to CF,-,groundout to 2B,-,foul pop to 3B,-
Player 5,-,strikeout,-,groundout to 1B,-,groundout to 3B,-,fly to LF,-
Player 6,-,strikeout swinging,-,-,fly to RF,-,fly to RF,-,fly out
Player 7,-,strikeout,-,-,double to right-center,-,single to LF,-,fly to RF
Player 8,-,-,fly out,-,groundout to SS,-,groundout to 3B,-,-
Player 9,-,-,-,-,-,-,-,-,strikeout swinging
Player 10,-,-,groundout to 1B,-,-,-,-,-,-
Player 11,-,-,-,-,-,-,-,-,-
Player 12,-,-,-,-,strikeout swinging,-,-,-,-
Player 13,-,-,-,-,-,-,-,-,-
Player 14,-,-,-,-,-,-,strikeout,-,-
Player 15,-,-,-,-,-,-,-,-,-
```
As a result of this data acquisition, the following issues remain before achieving the goal of extracting data in a format that can record each player's batting results across a whole season.