Get data for all Premier League players by scraping with Ruby (Nokogiri)

Nice to meet you. My name is Tatsu. I'm currently a university student studying to become an engineer. I'm nervous because this is my first ever Qiita post, so thank you for reading.

The theme of this memorable first post is scraping with Nokogiri, a Ruby library. Let's get straight into the main subject.

Overview

The site targeted for scraping this time is Worldfootball.net, which covers soccer information from around the world.

A friend who does data analysis for a university research project asked me to help collect detailed data on all Premier League players in Excel. I thought, "If I write a bit of code, this will be done in an instant," and accepted. It was the first programming job of my life (although entirely volunteer work...).

This site consists of a list page (with pagination) that shows the basic data in table format, and a detail page with more detailed data for each player. The general procedure is to first get the basic data for all players, plus the link to each player's detail page, from the list page, and then get the data from each detail page. The basic data and detailed data consist of the following items:

- Basic data: name, team, birthday, height
- Detailed data: number of appearances, number of goals, number of yellow and red cards, number of substitutions, etc.

↓ Player list page
https://www.worldfootball.net/players_list/eng-premier-league-2019-2020/nach-name/1/
↓ Detail page (example: Takumi Minamino's page)
https://www.worldfootball.net/player_summary/takumi-minamino/

0. Preparation

Type the following command to install Nokogiri.

gem install nokogiri
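If you manage gems with Bundler, you could instead add Nokogiri to your Gemfile and run bundle install (just an alternative; the rest of this article assumes the plain gem install):

Gemfile

gem 'nokogiri'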

Next, load the libraries we need this time.

get_all.rb


require 'nokogiri'
require 'open-uri' # A library for accessing URLs.
require 'csv'      # A library for reading and writing CSV files.

1. Acquisition of basic data

The first target is the basic data, i.e. the data that can be obtained from the list page. The list page is paginated, so there are multiple URLs to fetch. Store each URL in an array as shown below.

get_all.rb


urls = []

(1..14).each do |num|
  urls.push("https://www.worldfootball.net/players_list/eng-premier-league-2019-2020/nach-mannschaft/#{num}/")
end

Now that we have the URLs for all 14 pages of the list, we are ready to get the basic data for every player in question!

From here on, I'm not sure whether my approach is the correct one, but since it achieved its purpose I'm happy to share it. I would be grateful for any pointers.

The next step is to collect each piece of basic data for every player into its own array. With Nokogiri, you can use CSS selectors to get the data you want from a page, like this:

sample


doc = Nokogiri::HTML(URI.open(url)) # Fetch and parse the HTML at the target URL.

target = doc.at_css(".container > div > a") # Get the first a element directly under the div directly under the .container element.
link = target[:href]     # Get the href attribute of the a element.
text = target.inner_html # Get the text inside the a element.
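Note the difference between css and at_css: css returns a NodeSet of every match, while at_css returns only the first matching node, so attribute access like [:href] works on it directly. A quick illustration (the HTML here is made up for the example):

sample

require 'nokogiri'

html = <<~HTML
  <div class="container"><div><a href="/a">A</a></div><div><a href="/b">B</a></div></div>
HTML
doc = Nokogiri::HTML(html)

p doc.css(".container > div > a").map { |a| a[:href] } # => ["/a", "/b"] (all matches)
p doc.at_css(".container > div > a")[:href]            # => "/a" (first match only)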

Based on this, we will acquire each player's basic data from the list pages. Since we already have the array of URLs, we iterate over it and collect the data from every page.

get_all.rb


players_pages = [] # Detail-page URL of each player
names = []         # Name of each player
teams = []         # Team each player belongs to
birthdays = []     # Birthday of each player
height_data = []   # Height of each player

urls.each do |url|
  doc = Nokogiri::HTML(URI.open(url))

  doc.css(".standard_tabelle td:nth-child(1) > a").each do |name| # 1st column: name, linking to the detail page
    players_pages.push(name[:href])
    names.push(name.inner_html)
  end

  doc.css(".standard_tabelle td:nth-child(3) > a").each do |team| # 3rd column: team
    teams.push(team.inner_html)
  end

  doc.css(".standard_tabelle td:nth-child(4)").each do |born| # 4th column: birthday
    birthdays.push(born.inner_html)
  end

  doc.css(".standard_tabelle td:nth-child(5)").each do |height| # 5th column: height
    height_data.push(height.inner_html)
  end
end

With this, we've acquired the basic data for all players. At this stage, if you add p names and run the script, the names of all the players will be printed!
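For example, you could spot-check the results like this (the comments are illustrative; the exact output depends on the scraped season):

sample

p names.length   # Total number of players collected (several hundred for a 20-team league)
p names.first(3) # The first three names, to spot-check the scrape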

The csv library lets you read and write CSV files. First, I'll export a CSV file containing the list of links to the player detail pages. This file will be loaded and used in the next step (getting the detailed data).

get_all.rb


CSV.open('players_pages.csv', 'w') do |csv|
  players_pages.each { |page| csv << [page] } # One URL per row, so CSV.read later yields one row per player.
end

Finally, let's export the retrieved basic data as a CSV file. It's very easy: just add the following five lines. The header row is optional, but I think it's worth having because you can then open the file directly in Excel or a spreadsheet.

get_all.rb


CSV.open('data_1.csv', 'w') do |csv|
  headers = %w(name team born height)
  csv << headers
  names.zip(teams, birthdays, height_data).each { |data| csv << data }
end
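If Array#zip is new to you, it pairs up elements from several arrays by index, which is exactly what turns our per-item arrays into one CSV row per player. A tiny illustration with made-up values:

sample

names     = ["Player A", "Player B"]
teams     = ["Team X", "Team Y"]
birthdays = ["01/01/1990", "02/02/1992"]
heights   = ["180 cm", "175 cm"]

p names.zip(teams, birthdays, heights)
# => [["Player A", "Team X", "01/01/1990", "180 cm"], ["Player B", "Team Y", "02/02/1992", "175 cm"]]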

This completes a file containing the basic data for all players. The CSV files are created automatically in your working directory. With that, acquiring the basic data is done! Here is the completed get_all.rb in full:

get_all.rb


require 'nokogiri'
require 'open-uri'
require 'csv'

urls = []

(1..14).each do |num|
  urls.push("https://www.worldfootball.net/players_list/eng-premier-league-2019-2020/nach-mannschaft/#{num}/")
end

players_pages = []
names = []
teams = []
birthdays = []
height_data = []

urls.each do |url|
  doc = Nokogiri::HTML(URI.open(url))
  
  doc.css(".standard_tabelle td:nth-child(1) > a").each do |name|
    players_pages.push(name[:href])
    names.push(name.inner_html)
  end

  doc.css(".standard_tabelle td:nth-child(3) > a").each do |team|
    teams.push(team.inner_html)
  end

  doc.css(".standard_tabelle td:nth-child(4)").each do |born|
    birthdays.push(born.inner_html)
  end

  doc.css(".standard_tabelle td:nth-child(5)").each do |height|
    height_data.push(height.inner_html)
  end

end

CSV.open('players_pages.csv', 'w') do |csv|
  players_pages.each { |page| csv << [page] }
end

CSV.open('data_1.csv', 'w') do |csv|
  headers = %w(name team born height)
  csv << headers
  names.zip(teams, birthdays, height_data).each { |data| csv << data }
end

2. Get detailed data

This is where I had a hard time. The reason is that the data can't be fetched the same way on every page. Each detail page contains information about every season of a player's career, including national-team matches and other leagues, so I couldn't simply access the same position on each page and repeat that for everyone. For example, on player A's detail page the target data (the 19/20 season Premier League row) might be in the first row of the table, while on player B's page the first row might belong to a different season, so a fixed position would collect the wrong data. I struggled with this, but managed to implement it with conditional branching, which I'd like to introduce.

I created a separate file this time, but the same file would be fine. First, load the necessary libraries as before, then load the URL of each player's detail page from the CSV file created in the previous step.

get_detail.rb


require 'nokogiri'
require 'open-uri'
require 'csv'

urls = CSV.read('players_pages.csv') # Read the CSV file into an array of rows.

You now have an array (urls) containing the URLs of all player detail pages. We'll fetch the data for all of them with an each loop; the tricky part is the processing inside that loop. This time I decided to store each player's data as a hash, one key per item, and collect those hashes into an array. First, define the array that will hold all the data.

get_detail.rb


details = []
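For reference, the goal is for details to end up as an array of hashes shaped like this (the values here are made up; the scraped values come back as strings, while players without data get the integer 0):

sample

details = [
  { appearances: "31", scores: "22", yellow: "1", red_with_2yellow: "0", red: "0" }, # one hash per player
  { appearances: 0, scores: 0, yellow: 0, red_with_2yellow: 0, red: 0 }             # a player with no 19/20 data
]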

The processing from here on involves conditional branching. The branch conditions are, first, whether the player has any Premier League data at all, and second, whether they have Premier League data for the 19/20 season.

If you're not familiar with soccer this may need explaining: the players targeted this time include everyone registered to a squad, even players who never appeared in a single match, and such players may have no Premier League data at all. In addition, some players are registered but have no data for the 19/20 season, so the target table doesn't exist on their pages. That's why these branches are necessary.

First, here is the branch on whether there is any Premier League data.

get_detail.rb


urls.each do |u|

  url = u[0] # Each row from CSV.read is an array, so take its first element. Printing u itself would give something like ["https://~~~~.com"].

  doc = Nokogiri::HTML(URI.open(url))

  prLeagues = []

  doc.css("td > a").each do |a| #Processes all a elements below td.
    if a.text == "Pr. League"
      prLeagues.push(a)
    end
  end

  tds = []

  if prLeagues.any?
    prLeagues.each { |league| tds.push(league.parent) } # Get each td element that is the parent of a "Pr. League" a element.
  else # No Premier League data at all: create a hash with every value set to 0.
    data = { appearances: 0, scores: 0, yellow: 0, red_with_2yellow: 0, red: 0 }
    details.push(data)
    next # Finish this player and move on to the next one.
  end

# appearances = number of matches played, scores = number of goals

Next comes the branch on whether there is Premier League data for the 19/20 season. The array tds now contains the td elements holding the string "Pr. League". The td immediately after the one showing the league shows the season, so we access that and keep only the valid row. At most one row can be valid here. To walk horizontally across the row, I chained +, the CSS selector meaning "immediately following sibling", many times.
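As a quick illustration of the + combinator (the HTML is a made-up, simplified version of a season row):

sample

require 'nokogiri'

html = <<~HTML
  <table><tr>
    <td><a>Pr. League</a></td><td><a>2019/2020</a></td><td>Liverpool</td><td><a>31</a></td><td>22</td>
  </tr></table>
HTML
row_doc = Nokogiri::HTML(html)

league_td = row_doc.css("td > a").find { |a| a.text == "Pr. League" }.parent
p league_td.css("+ td > a").text       # => "2019/2020" (the td immediately after)
p league_td.css("+ td + td + td").text # => "31" (three tds to the right)

Back to the actual script: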

get_detail.rb


  valid_table = nil # Will hold the league td of the 19/20 row, if one exists.

  tds.each do |td|
    valid_table = td if td.css("+ td > a").text == "2019/2020" # The td immediately after the league shows the season.
  end

  if valid_table
    appearances = valid_table.css("+ td + td + td > a").text
    scores = valid_table.css("+ td + td + td + td").text
    yellow_cards = valid_table.css("+ td + td + td + td + td + td + td + td").text
    red_cards_with_2yellow = valid_table.css("+ td + td + td + td + td + td + td + td + td").text
    red_cards = valid_table.css("+ td + td + td + td + td + td + td + td + td + td").text

    data = { appearances: appearances, scores: scores, yellow: yellow_cards, red_with_2yellow: red_cards_with_2yellow, red: red_cards }
    details.push(data)
  else # No table for the 19/20 season (valid_table is nil): again create the data with all 0s.
    data = { appearances: 0, scores: 0, yellow: 0, red_with_2yellow: 0, red: 0 }
    details.push(data)
  end

end # End of the urls.each loop

This completes the acquisition of all the data! It took real effort and the result feels quite messy, but I managed to get the data I was aiming for. Instead of chaining + so many times to mean "immediately after...", it would probably be cleaner to get the parent element (the whole row) first and then use nth-child, and there is likely a better way to branch too. Please take this as the struggle record of a beginner.
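To make that concrete, here is a rough sketch of the nth-child idea (untested, and the column numbers are assumptions for illustration, counted as if the league were the row's first cell):

sample

tds.each do |td|
  row = td.parent # The tr element containing the whole season row.
  next unless row.css("td:nth-child(2) > a").text == "2019/2020" # Season in the second cell.

  appearances = row.css("td:nth-child(4) > a").text
  scores      = row.css("td:nth-child(5)").text
  # ...one nth-child index per column, instead of a long chain of +.
end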

Finally, we export to a CSV file again. The values of a hash can be retrieved as follows.

get_detail.rb


CSV.open('data_2.csv', 'w') do |csv|
  headers = %w(appearances score yellow red(2yellow) red(only) )
  csv << headers
  details.each { |detail| csv << detail.values }
end

This completes the process! Here is the completed get_detail.rb in full:

get_detail.rb


require 'nokogiri'
require 'open-uri'
require 'csv'

urls = CSV.read('players_pages.csv')

details = []

urls.each do |u|
  p details.count # Progress indicator: prints how many players have been processed so far.

  url = u[0]

  doc = Nokogiri::HTML(URI.open(url))

  prLeagues = []

  doc.css("td > a").each do |a|
    if a.text == "Pr. League"
      prLeagues.push(a)
    end
  end

  tds = []

  if prLeagues.any?
    prLeagues.each { |league| tds.push(league.parent) }
  else
    data = { appearances: 0, scores: 0, yellow: 0, red_with_2yellow: 0, red: 0 }
    details.push(data)
    next
  end

  valid_table = nil

  tds.each do |td|
    valid_table = td if td.css(" + td > a").text == "2019/2020"
  end

  if valid_table
    appearances = valid_table.css("+ td + td + td > a").text
    scores = valid_table.css("+ td + td + td + td").text
    yellow_cards = valid_table.css("+ td + td + td + td + td + td + td + td").text
    red_cards_with_2yellow = valid_table.css("+ td + td + td + td + td + td + td + td + td").text
    red_cards = valid_table.css("+ td + td + td + td + td + td + td + td + td + td").text
    
    data = { appearances: appearances, scores: scores, yellow: yellow_cards, red_with_2yellow: red_cards_with_2yellow, red: red_cards  }
    details.push(data)
  else
    data = { appearances: 0, scores: 0, yellow: 0, red_with_2yellow: 0, red: 0 }
    details.push(data)
  end

end

CSV.open('data_2.csv', 'w') do |csv|
  headers = %w(appearances score yellow red(2yellow) red(only) )
  csv << headers
  details.each { |detail| csv << detail.values }
end

You can create a table like the one in the screenshot by opening the two finished CSV files in Excel or a spreadsheet.

[Screenshot: the resulting player data table]
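If you'd rather combine the two files in Ruby instead of in a spreadsheet, a minimal sketch could look like this (it assumes the two CSVs line up row for row, which is the case when they are produced by the scripts above; data_all.csv is just a name I picked):

sample

require 'csv'

basic  = CSV.read('data_1.csv')
detail = CSV.read('data_2.csv')

CSV.open('data_all.csv', 'w') do |csv|
  basic.zip(detail).each { |b, d| csv << b + d } # Concatenate each basic row with its detail row (header rows included).
end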

Finally

This was my first attempt at scraping, and despite the struggles it was a lot of fun. It took a while, but it was still much faster than doing the work by hand. Few people will want to do exactly the same thing, but I hope this helps anyone who wants to try scraping. (When scraping, please be careful not to violate any site's rules!)

I'm still a fledgling, inexperienced engineer, but I want to keep improving my skills and sharing more useful information! Thank you for reading this article. I look forward to your continued support!
