Scraping with Beautiful Soup in 10 minutes

Introduction

Hello. I work as an accountant at an IT company. Since I work in IT anyway, I figured I should put programming to use in my back-office work, and with that feeling I decided to try Python.

Who this article is for

- People who want to know what scraping is
- People looking for something they can do with a little bit of programming
- People in back-office or management roles who are considering automating their work

Table of contents

- What can scraping do?
- How does scraping work?
- Sample code
- Application example

What can scraping do?

From the acquired HTML, **any information you like can be extracted and processed**. Concretely, scraping lets you do things like this:

- Automate the daily task of visiting the same website and writing its information to a CSV file
- Download every image that matches a given keyword on social media
- Build a sales prospect list by extracting companies that match a given keyword

How does scraping work?

Roughly speaking, there are three steps: **fetch the information on the website → extract the information → output the result**.

In this article, I will show how to scrape with Python. This time I used the following two tools.

- The BeautifulSoup library (extracting information, outputting results)
- The urllib module (fetching information from the website)

Beautiful Soup can pull arbitrary information out of HTML and XML (this article focuses on HTML). Specifically, the processing goes as follows:

  1. Send a request to the URL to be scraped and receive the HTML.
  2. Create a BeautifulSoup object from the received HTML.
  3. Specify the HTML element that contains the information you want to extract.
  4. Extract the value contained in that element.

For specifying HTML elements, Beautiful Soup provides the find methods and the select methods (CSS selectors). This article focuses on the find methods.

- find(): returns only the first matching element
- find_all() (also available under its older name findAll()): returns a list of every element that matches the conditions
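As a quick illustration of the difference (using a small inline HTML string rather than a live page), find() returns the first match while find_all() returns every match:

```python
from bs4 import BeautifulSoup

html = """
<ul>
  <li class="item">apple</li>
  <li class="item">banana</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

print(soup.find("li").text)                     # first match only -> apple
print([li.text for li in soup.find_all("li")])  # every match -> ['apple', 'banana']
```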

Let's look at a sample right away.

Preparation

① Install Python (download guide: https://www.javadrive.jp/python/install/index1.html)

② Install Beautiful Soup. You can install it with pip.

$ pip install beautifulsoup4

③ urllib needs no installation: it is part of the Python standard library. (The similarly named urllib3, which pip can install, is a separate third-party package and is not used here.)
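To confirm that both tools are available, a quick import check works (urllib ships with Python itself, so only Beautiful Soup had to be installed):

```python
import urllib.request   # standard library: nothing to install
import bs4              # installed via: pip install beautifulsoup4

print(bs4.__version__)  # show which Beautiful Soup version is installed
```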

Sample code

As an example, I will extract the title from my company's homepage.

sample.py


```python
import urllib.request           # makes urllib available
from bs4 import BeautifulSoup   # makes Beautiful Soup available

r = urllib.request.urlopen('https://www.is-tech.co.jp/')  # fetch the HTML of the page to scrape
html = r.read().decode('utf-8', 'ignore')                 # read the HTML, decoding it as UTF-8
parser = "html.parser"                                    # parser to apply to the HTML
soup = BeautifulSoup(html, parser)                        # the BeautifulSoup object is ready

title = soup.find("title").text  # target the text inside the <title> tag

print(title)  # display the text contained in the title tag
```

Result


I was able to extract it.

Application example

Suppose there is a fictitious website like the following: a site where each page summarizes the financial information of a fictitious listed company. For every company, we will extract the sales for the last three years and output them to CSV. The URLs take the form https://www.example.com/NNNN, where NNNN is a four-digit securities code.


example.com/NNNN.html


```html
<!DOCTYPE html>
<html>
<head>
   <meta charset="utf-8">
   <title>Financial information</title>
</head>

<body>
   <h1 class="company">Hogehoge Co., Ltd.</h1>
   <table class="information">
       <tr>
       <td> </td>
       <td>Previous term</td>
       <td>2 terms ago</td>
       <td>3 terms ago</td>
       </tr>

       <tr>
       <td>Sales</td>
       <td>1,200 million yen</td>
       <td>1,100 million yen</td>
       <td>1,000 million yen</td>
       </tr>

       <tr>
       <td>Ordinary income</td>
       <td>240 million yen</td>
       <td>220 million yen</td>
       <td>200 million yen</td>
       </tr>

       <tr>
       <td>Net income</td>
       <td>120 million yen</td>
       <td>110 million yen</td>
       <td>100 million yen</td>
       </tr>
   </table>
</body>
</html>
```

We extend the earlier title-extraction sample with the following changes:

- Loop over the 4-digit securities codes (1000 to 9999)
- Do nothing if a securities code does not exist
- Also use the find_all() method, so an element can be specified as "the Nth △△ inside ○○"
- Pause for 5 seconds after each request (to reduce the load on the server being accessed)
- Finally, output everything to CSV
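A side note on the "do nothing if a securities code does not exist" step: the fictitious site here returns a page saying the company was not found, but on a real server a missing page usually comes back as an HTTP 404, which makes urllib.request.urlopen raise an HTTPError instead of returning a page. A minimal sketch of handling that case:

```python
import urllib.error
import urllib.request

def fetch_html(url):
    """Return the page's HTML, or None if the page does not exist (HTTP 404)."""
    try:
        with urllib.request.urlopen(url) as r:
            return r.read().decode('utf-8', 'ignore')
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return None  # no company with this securities code
        raise  # let other HTTP errors (403, 500, ...) propagate

# usage against the fictitious site from this article:
# html = fetch_html("https://www.example.com/1234")
# if html is None:
#     pass  # skip this securities code
```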

sample2.py


```python
import urllib.request
from bs4 import BeautifulSoup
import csv   # CSV reading and writing
import time  # lets us pause between requests

class Scraper:
    def __init__(self, site, code):  # store the site URL and the securities code
        self.site = site
        self.code = code

    def scrape(self):  # scrape one company's page
        url = str(self.site) + str(self.code)  # the company's URL

        r = urllib.request.urlopen(url)
        html = r.read().decode('utf-8', 'ignore')
        parser = "html.parser"
        soup = BeautifulSoup(html, parser)  # the soup object is ready

        company = soup.find("h1")  # get the company name

        if "Company information page not found" in company.text:  # no company with this securities code: do nothing
            pass

        else:
            table = soup.find("table")   # the table with the last 3 terms of financial information
            data = table.find_all("td")  # every td cell in the table
            sales = data[5].text.split(' million yen')[0]                  # previous term's sales (6th td in data)
            sales_two_years_ago = data[6].text.split(' million yen')[0]    # sales 2 terms ago (7th td in data)
            sales_three_years_ago = data[7].text.split(' million yen')[0]  # sales 3 terms ago (8th td in data)

            row = [self.code, company.text, sales, sales_two_years_ago, sales_three_years_ago]  # one row for the CSV

            with open('test.csv', 'a', newline='') as f:  # append the row to the CSV file
                writer = csv.writer(f)
                writer.writerow(row)

        time.sleep(5)  # pause for 5 seconds after every request

source = "https://www.example.com/"  # common part of the URLs
for i in range(1000, 10000):         # loop over securities codes 1000 to 9999
    Scraper(source, i).scrape()
```

Result

In this way, every company's information can be output to CSV.

Caution

I once wrote code that scraped Yahoo! Finance to automate a task, only to find that scraping is prohibited by its terms of use... Before scraping, please **check the terms of use of the target website in advance**. Some sites also offer a public API, so using the API for your processing is another option.
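Alongside the terms of use, many sites publish a robots.txt file stating what automated clients may fetch, and the standard library can evaluate it. Here is a small sketch using an inline robots.txt (for a live site you would call rp.set_url(".../robots.txt") and rp.read() instead of rp.parse()):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

# can_fetch(user_agent, url): may this agent crawl this URL?
print(rp.can_fetch("*", "https://www.example.com/1234.html"))       # True
print(rp.can_fetch("*", "https://www.example.com/private/a.html"))  # False
```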
