Learning record (day 2): Scraping with #BeautifulSoup

What I studied

Scraping with Beautiful Soup

Beautiful Soup is a library that parses HTML and XML and extracts information from them. It has no download function of its own, so it is used together with `urllib`.

The basic usage of Beautiful Soup is shown below.

# Library import
from bs4 import BeautifulSoup

html1 = """
<html><body>
 <h1>Scraping</h1>
 <p>Web page analysis</p>
 <p>Extraction of arbitrary parts</p>
</body></html>
"""

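# Parse the HTML
soup = BeautifulSoup(html1, 'html.parser')

# Extract specific elements by walking down the tree
h1 = soup.html.body.h1
p1 = soup.html.body.p
# The first next_sibling is the whitespace text between the tags,
# so a second next_sibling is needed to reach the next <p>
p2 = p1.next_sibling.next_sibling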

print(h1.string)
print(p1.string)
print(p2.string)

Execution result

Scraping
Web page analysis
Extraction of arbitrary parts
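
The chained `next_sibling.next_sibling` is needed because the line break between the two `<p>` tags is itself a text node. As a minimal sketch of an alternative, `find_all()` and `find_next_sibling()` look only at tags and skip that whitespace. The variable names here are just for illustration.

```python
from bs4 import BeautifulSoup

html1 = """
<html><body>
 <h1>Scraping</h1>
 <p>Web page analysis</p>
 <p>Extraction of arbitrary parts</p>
</body></html>
"""

soup = BeautifulSoup(html1, "html.parser")

# find_all() returns every matching tag and ignores whitespace text nodes
paragraphs = [p.string for p in soup.find_all("p")]
print(paragraphs)

# find_next_sibling() jumps straight to the next <p> tag
first_p = soup.find("p")
second_p = first_p.find_next_sibling("p")
print(second_p.string)
```

This form may be easier to keep working if the indentation or line breaks of the HTML change.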

Scraping with Beautiful Soup and `urllib` together

# Library import
import urllib.request as req
from bs4 import BeautifulSoup

url = "https://api.aoikujira.com/zip/xml/1500042"

res = req.urlopen(url)

# Parse the data returned by urlopen()
soup = BeautifulSoup(res, 'html.parser')

ken = soup.find("ken").string
shi = soup.find("shi").string
cho = soup.find("cho").string

print(ken, shi, cho)
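
Note that `find()` returns `None` when a tag is not present, so calling `.string` on the result would raise an `AttributeError`. The sketch below guards against that case. The helper name `extract_address` and the sample XML string are hypothetical, made up for illustration; the actual response from the API may be structured differently.

```python
from bs4 import BeautifulSoup


def extract_address(xml_text):
    """Return the text of the ken/shi/cho tags, or None for any missing tag."""
    soup = BeautifulSoup(xml_text, "html.parser")
    result = {}
    for name in ("ken", "shi", "cho"):
        tag = soup.find(name)  # None when the tag does not exist
        result[name] = tag.string if tag is not None else None
    return result


# Hypothetical sample data, not an actual API response
sample = "<address><ken>A</ken><shi>B</shi></address>"
print(extract_address(sample))
```

Keeping the parsing in its own function also makes it possible to try it on a local string first, before passing it the data downloaded with `urllib`.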

References

The code follows the samples published on GitHub for the book I referred to: Python Scraping & Machine Learning Development Techniques (revised and expanded edition).
