Tables that are updated frequently, or that are awkward to copy and paste, made me wonder whether I could collect the data more efficiently. This time I wrote Python code that scrapes a table and writes it to CSV.
Environment: MacBook Air (13-inch, Mid 2011), Processor: 1.8 GHz Intel Core i7, Memory: 4 GB 1333 MHz DDR3, OS X 10.11.5, Python 3.6.2
Install Beautiful Soup. Beautiful Soup is a library for extracting data from HTML and XML.
This time I installed it with pip.
$ pip3 install beautifulsoup4
Collecting beautifulsoup4
Downloading beautifulsoup4-4.6.0-py3-none-any.whl (86kB)
100% |████████████████████████████████| 92kB 1.8MB/s
Installing collected packages: beautifulsoup4
Successfully installed beautifulsoup4-4.6.0
Other options include easy_install, apt-get, and downloading the code and installing it directly. For details, see "Installing Beautiful Soup" in the official documentation below.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Once beautifulsoup4 is installed, let's grab O'Reilly's new ebook listings in one go.
**2019/03/20 update**: The output file is now opened with a `with` statement.
scraping_table.py
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Work around SSL certificate verification errors
ssl._create_default_https_context = ssl._create_unverified_context

# Specify the URL
html = urlopen("https://www.oreilly.co.jp/ebook/")
bsObj = BeautifulSoup(html, "html.parser")

# Specify the table
table = bsObj.findAll("table", {"class": "tablesorter"})[0]
rows = table.findAll("tr")

# Write each row of the table to CSV
with open("ebooks.csv", "w", encoding="utf-8") as file:
    writer = csv.writer(file)
    for row in rows:
        csvRow = []
        for cell in row.findAll(["td", "th"]):
            csvRow.append(cell.get_text())
        writer.writerow(csvRow)
The exported CSV looks like this. Run it regularly and you won't miss any new publications! Note that because the code extracts each cell with get_text(), the "Add to cart" column, which contains only an image link, comes out empty.
ISBN,Title,Price,Issue month,Add to cart
978-4-87311-755-3,Design design to improve performance,"2,073",2016/06,
978-4-87311-700-3,Network security through data analysis,"3,110",2016/06,
978-4-87311-754-6,UX strategy,"2,592",2016/05,
978-4-87311-768-3,An introduction to mathematics starting with Python,"2,419",2016/05,
978-4-87311-767-6,What is the software doing without your knowledge?,"2,246",2016/05,
978-4-87311-763-8,Fermentation technique,"3,110",2016/04,
978-4-87311-765-2,First Ansible,"2,764",2016/04,
978-4-87311-764-5,Kanban work technique,"3,110",2016/03,
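If you do want something in a link-only column such as "Add to cart", you can pull the `<a>` tag's href instead of the text. A minimal sketch of that idea, using inline sample HTML rather than the live O'Reilly page:

```python
from bs4 import BeautifulSoup

# Inline sample HTML standing in for one table row of a real page
html = '<table><tr><td><a href="/cart?id=1">Add</a></td><td>Plain text</td></tr></table>'
row = BeautifulSoup(html, "html.parser").find("tr")

cells = []
for cell in row.find_all("td"):
    link = cell.find("a")
    # Use the link's href when the cell contains one, else fall back to its text
    cells.append(link["href"] if link else cell.get_text())
```

Here `cells` ends up as `["/cart?id=1", "Plain text"]`; the same fallback could replace the `csvRow.append(cell.get_text())` line in the script above.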
Basically, you can grab tables from other sites just by modifying the following part of the code.
# Specify the table
table = bsObj.findAll("table", {"class": "tablesorter"})[0]
rows = table.findAll("tr")
Since I'm using a Mac, the exported CSV is UTF-8. If you open it in Excel as-is, the characters will be garbled, so it's convenient to convert the character encoding before formatting it. For how to convert, see [here](http://help.peatix.com/customer/portal/articles/530797-%E3%83%80%E3%82%A6%E3%83%B3%E3%83%AD%E3%83%BC%E3%83%89%E3%81%97%E3%81%9F%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E3%81%AE%E6%96%87%E5%AD%97%E5%8C%96%E3%81%91%E3%81%AB%E3%81%A4%E3%81%84%E3%81%A6-for-mac) (another site).
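The re-encoding can also be done in Python. A minimal sketch that converts the exported UTF-8 CSV to cp932 (the Shift_JIS variant used by Japanese Excel); the file names are just examples:

```python
src_path, dst_path = "ebooks.csv", "ebooks_sjis.csv"

# For illustration, create a small UTF-8 CSV like the one exported above
with open(src_path, "w", encoding="utf-8") as f:
    f.write("ISBN,Title\n978-4-87311-765-2,First Ansible\n")

# Re-encode: read as UTF-8, write as cp932 so Excel opens it cleanly.
# errors="replace" substitutes any character cp932 cannot represent.
with open(src_path, encoding="utf-8") as src, \
     open(dst_path, "w", encoding="cp932", errors="replace") as dst:
    dst.write(src.read())
```

Alternatively, writing the CSV with `encoding="utf-8-sig"` in the first place adds a BOM that Excel recognizes, avoiding the conversion step entirely.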