Tables that are updated frequently, or that are awkward to copy and paste, made me wonder whether I could collect the data more efficiently. This time I wrote Python code that scrapes a table and writes it to CSV.
Environment: MacBook Air (13-inch, Mid 2011), Processor: 1.8 GHz Intel Core i7, Memory: 4 GB 1333 MHz DDR3, OS X 10.11.5, Python 3.6.2
Install Beautiful Soup. Beautiful Soup is a library for extracting data from HTML and XML.
This time I installed it with pip.
$ pip3 install beautifulsoup4
Collecting beautifulsoup4
Downloading beautifulsoup4-4.6.0-py3-none-any.whl (86kB)
100% |████████████████████████████████| 92kB 1.8MB/s
Installing collected packages: beautifulsoup4
Successfully installed beautifulsoup4-4.6.0
Other options include easy_install, apt-get, and downloading the code and installing it directly. For details, see "Installing Beautiful Soup" in the official documentation below.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Once beautifulsoup4 is installed, let's grab O'Reilly's new ebook listings in one go.
**2019/03/20 update**: The output file is now opened with a `with` statement.
scraping_table.py
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup
import ssl

# Work around SSL certificate verification errors
ssl._create_default_https_context = ssl._create_unverified_context

# Specify the URL
html = urlopen("https://www.oreilly.co.jp/ebook/")
bsObj = BeautifulSoup(html, "html.parser")

# Specify the table
table = bsObj.findAll("table", {"class": "tablesorter"})[0]
rows = table.findAll("tr")

# Write each row of the table to CSV
with open("ebooks.csv", "w", encoding="utf-8") as file:
    writer = csv.writer(file)
    for row in rows:
        csvRow = []
        for cell in row.findAll(["td", "th"]):
            csvRow.append(cell.get_text())
        writer.writerow(csvRow)
The exported CSV looks like this. Run it regularly and you won't miss any new publications! Note that because the code extracts each cell with get_text(), the "Add to cart" column, which contains only an image link, comes out empty.
ISBN,Title,Price,Issue month,Add to cart
978-4-87311-755-3,Design design to improve performance,"2,073",2016/06,
978-4-87311-700-3,Network security through data analysis,"3,110",2016/06,
978-4-87311-754-6,UX strategy,"2,592",2016/05,
978-4-87311-768-3,An introduction to mathematics starting with Python,"2,419",2016/05,
978-4-87311-767-6,What is the software doing without your knowledge?,"2,246",2016/05,
978-4-87311-763-8,Fermentation technique,"3,110",2016/04,
978-4-87311-765-2,First Ansible,"2,764",2016/04,
978-4-87311-764-5,Kanban work technique,"3,110",2016/03,
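If you do want something in a link-only column such as "Add to cart", you can pull the `<a>` tag's href instead of the text. A minimal sketch of that idea, using inline sample HTML rather than the live O'Reilly page:

```python
from bs4 import BeautifulSoup

# Inline sample HTML standing in for one table row of a real page
html = '<table><tr><td><a href="/cart?id=1">Add</a></td><td>Plain text</td></tr></table>'
row = BeautifulSoup(html, "html.parser").find("tr")

cells = []
for cell in row.find_all("td"):
    link = cell.find("a")
    # Use the link's href when the cell contains one, else fall back to its text
    cells.append(link["href"] if link else cell.get_text())
```

Here `cells` ends up as `["/cart?id=1", "Plain text"]`; the same fallback could replace the `csvRow.append(cell.get_text())` line in the script above.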
Basically, you can grab tables from other sites just by modifying the following part of the code.
# Specify the table
table = bsObj.findAll("table", {"class": "tablesorter"})[0]
rows = table.findAll("tr")
Since I'm using a Mac, the exported CSV is UTF-8. If you open it in Excel as-is, the characters will be garbled, so it's convenient to convert the character encoding before formatting it. For how to convert, see [here](http://help.peatix.com/customer/portal/articles/530797-%E3%83%80%E3%82%A6%E3%83%B3%E3%83%AD%E3%83%BC%E3%83%89%E3%81%97%E3%81%9F%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E3%81%AE%E6%96%87%E5%AD%97%E5%8C%96%E3%81%91%E3%81%AB%E3%81%A4%E3%81%84%E3%81%A6-for-mac) (another site).
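The re-encoding can also be done in Python. A minimal sketch that converts the exported UTF-8 CSV to cp932 (the Shift_JIS variant used by Japanese Excel); the file names are just examples:

```python
src_path, dst_path = "ebooks.csv", "ebooks_sjis.csv"

# For illustration, create a small UTF-8 CSV like the one exported above
with open(src_path, "w", encoding="utf-8") as f:
    f.write("ISBN,Title\n978-4-87311-765-2,First Ansible\n")

# Re-encode: read as UTF-8, write as cp932 so Excel opens it cleanly.
# errors="replace" substitutes any character cp932 cannot represent.
with open(src_path, encoding="utf-8") as src, \
     open(dst_path, "w", encoding="cp932", errors="replace") as dst:
    dst.write(src.read())
```

Alternatively, writing the CSV with `encoding="utf-8-sig"` in the first place adds a BOM that Excel recognizes, avoiding the conversion step entirely.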