I needed to scrape a table from a web page for my research, so I will introduce the Python program I used at that time. Since I had no prior scraping experience, I put it together while researching various things, but I found almost no explanation of how to convert the table part of the HTML to CSV once the web page had been fetched as HTML. That is why I wrote this article.
Please see the following URL for notes on scraping. https://qiita.com/Azunyan1111/items/b161b998790b1db2ff7a
The entire program can be found here.
Import
import csv
import urllib.request
from bs4 import BeautifulSoup
Description of the imported libraries:
-csv is part of the Python standard library and is used here to write the CSV file.
-urllib.request is used to access the web page and fetch its data (HTML).
-BeautifulSoup is used to extract the target data from the HTML.
url = "https://en.wikipedia.org/wiki/List_of_cities_in_Japan"
html = urllib.request.urlopen(url)
soup = BeautifulSoup(html, 'html.parser')
#Get every table (table tag) from the HTML
table = soup.find_all("table")
This time, I will scrape the Wikipedia table that summarizes Japanese cities.
In the program, urllib.request.urlopen fetches the HTML of the specified URL. Beautiful Soup then parses it into a form that is easy to work with, and soup.find_all("table") collects every part of the HTML wrapped in a table tag, which completes the preparation.
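As a quick sanity check (a small sketch of my own, not part of the original program), you can print how many table tags were found before going any further:

# Optional check: how many table tags did find_all return?
print(len(table))  # the exact count depends on the current state of the page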
If you are using the Chrome browser, you can open the developer tools (the black panel in the screenshot) by pressing F12 (command + option + I on Mac). From the Elements tab you can see the HTML source code, so search for the table tag you want to scrape. This time, I want to get the table selected in blue. In fact, it can be obtained by simply picking out, from all the table tags, the one whose class name is "wikitable".
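As an aside, BeautifulSoup can also filter by class directly, so a sketch like the following would grab the first wikitable without a loop (this is an alternative to the loop the article uses below, not the original method):

# Alternative: let BeautifulSoup select by class name directly
wikitable = soup.find("table", class_="wikitable")  # returns the first matching table, or None if there is none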
for tab in table:
    table_className = tab.get("class")
    print(table_className)
    # Guard against tables without a class attribute, then stop at the first wikitable
    if table_className and table_className[0] == "wikitable":
        break
#Output result when there is no break statement
# ['vertical-navbox', 'nowraplinks', 'hlist']
# ['wikitable'] <- here, the break statement exits the loop
# ['wikitable', 'sortable']
# ['wikitable', 'sortable']
# ['wikitable']
# ['nowraplinks', 'mw-collapsible', 'autocollapse', 'navbox-inner']
-table_className[0] is used because "wikitable" comes first in the class list of the table I want. -Also, there are several other tables on the page whose classes contain "wikitable", but since the table I want is always the first wikitable, the break statement exits the loop as soon as the if condition matches for the first time.
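If the class order were ever to change, a membership test would be slightly more robust; the following is only a sketch of that variant, not what the program above uses:

for tab in table:
    # "in" matches "wikitable" anywhere in the class list, not only at position 0
    if "wikitable" in (tab.get("class") or []):
        break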
Finally, add the CSV save function to the above program.
for tab in table:
    table_className = tab.get("class")
    # Skip tables without a class attribute and stop at the first wikitable
    if table_className and table_className[0] == "wikitable":
        #CSV save part
        # newline='' prevents blank lines from being inserted between rows on Windows
        with open("test.csv", "w", encoding='utf-8', newline='') as file:
            writer = csv.writer(file)
            rows = tab.find_all("tr")
            for row in rows:
                csvRow = []
                for cell in row.find_all(['td', 'th']):
                    csvRow.append(cell.get_text())
                writer.writerow(csvRow)
        break
The CSV save part walks the table row by row using the tr tags, takes out each row's cells using the td and th tags, appends them to a list, and writes that list out as one CSV row. Once you can pick out the table tag you want, this part can be reused almost as-is on other tables.
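As a side note, pandas (used below to check the result) can also parse HTML tables directly. This is just an alternative sketch, and it assumes a parser that pandas.read_html relies on (such as lxml) is installed:

import pandas as pd

# read_html returns a list of DataFrames, one per matching table;
# attrs restricts the search to tables whose class attribute is "wikitable"
dfs = pd.read_html(url, attrs={"class": "wikitable"})
dfs[0].to_csv("test.csv", index=False)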
import pandas as pd
pd.read_csv("test.csv")
The saved CSV could be loaded and displayed with pandas without any problems!
It depends on the site you want to scrape, but I think you can get a table in CSV format this way! Thank you for reading this far!