I had been curious about scraping and wanted to collect some data, so I tried it while following the site below. https://www.atmarkit.co.jp/ait/articles/1910/18/news015_2.html I am writing this up as a review, so I hope it is helpful for those who are new to scraping! Everything was written in Google Colab using Python, so there may be some differences from running it locally.
I scraped with urllib.request and Beautiful Soup: urllib.request fetches the specified web page (or other file), and Beautiful Soup extracts the desired information from what was fetched. As on the reference site, the program retrieves the J.League standings; I have also added the step of saving them to a CSV file. The full code is shown below.
from bs4 import BeautifulSoup
from urllib import request

url = 'https://www.jleague.jp/standings/j1/'

# Fetch the page and decode it using the charset declared in the response headers
response = request.urlopen(url)
content = response.read()
charset = response.headers.get_content_charset()
response.close()
html = content.decode(charset, 'ignore')

# Parse the HTML (naming the parser explicitly avoids a BeautifulSoup warning)
soup = BeautifulSoup(html, 'html.parser')
# Each <tr> element is one row of the standings table
table = soup.find_all('tr')

standing = []
for row in table:
    tmp = []
    for item in row.find_all('td'):
        if item.a:
            # Linked cells on this page contain the club name twice,
            # so keep only the first half of the text
            tmp.append(item.text[0:len(item.text) // 2])
        else:
            tmp.append(item.text)
    # Drop the empty first and last cells of each row
    del tmp[0]
    del tmp[-1]
    standing.append(tmp)

for item in standing:
    print(item)
import pandas as pd
from google.colab import drive, files

# The first row of the table is the header, so drop it before building the DataFrame
del standing[0]

df = pd.DataFrame(standing, columns=['Rank', 'Club', 'Points', 'Played', 'Won', 'Drawn', 'Lost', 'Goals For', 'Goals Against', 'Goal Difference'])

# Mount Google Drive so the CSV can be saved under My Drive
drive.mount('/content/drive')

filename = 'j1league.csv'
path = '/content/drive/My Drive/' + filename
with open(path, 'w', encoding='utf-8-sig') as f:
    df.to_csv(f, index=False)
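After saving, it is worth confirming that the file was written as expected. The snippet below is a minimal check, assuming the same path variable as above; the files.download call is optional and simply downloads the CSV from Colab to your local machine.

# Read the CSV back to confirm the contents (utf-8-sig strips the BOM)
check = pd.read_csv(path, encoding='utf-8-sig')
print(check.head())

# Optional: download the file from Colab to your local machine
files.download(path)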
While working through this I checked the intermediate results in detail, so I had print() calls scattered throughout, but here I have shown the whole flow, up to saving the file, in one go.
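For reference, the intermediate checks mentioned above looked something like the following; these two print calls are just illustrations of that process, not part of the final script.

# One raw <tr> element as parsed by Beautiful Soup, before any cleanup
print(table[1])
# The corresponding cleaned row after extraction
print(standing[0])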