I have started learning Python, and I want to deepen my understanding of web scraping, so this post summarizes it in my own way.
I will not cover the underlying Web technologies (HTTP and so on) in this article, but if you are developing a distributed system you need to understand them to some extent. Personally, I recommend this book for learning: [Technologies that Support the Web: HTTP, URI, HTML, and REST (WEB+DB PRESS plus)](https://www.amazon.co.jp/dp/4774142042)
Now for the main subject. In books and elsewhere, BeautifulSoup is described as a library that parses HTML. The official site is also worth checking.
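As a quick illustration of what "parsing HTML" means, here is a minimal sketch (the HTML string is made up for this example, and it assumes the library is already installed; installation is covered next). BeautifulSoup turns markup into a tree of tags that can be searched.

from bs4 import BeautifulSoup

# A tiny, made-up HTML snippet just for illustration
html = '<ul><li><a href="/a">First</a></li><li><a href="/b">Second</a></li></ul>'
soup = BeautifulSoup(html, "html.parser")

# find_all() returns every matching tag; get_text() and get() read its contents
for a in soup.find_all("a"):
    print(a.get_text(), a.get("href"))
# Output: "First /a" and "Second /b"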
Install the BeautifulSoup library.
--Since I'm using macOS, I use the "pip3" command.
--The latest version of BeautifulSoup is 4.9.1 (as of May 23, 2020).
Run the following command in a terminal.
> pip3 install BeautifulSoup4
If the import below succeeds, the installation worked. Note that the package is imported as bs4, and the class is named BeautifulSoup (not BeautifulSoup4).
>>> from bs4 import BeautifulSoup
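To double-check which version was installed, you can also look at the package's version string (the value shown assumes the 4.9.1 release mentioned above):

>>> import bs4
>>> bs4.__version__
'4.9.1'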
This time, we will extract the titles and URLs from the news list on YAHOO! JAPAN.
--Use requests to get the site's HTML.
--Use BeautifulSoup to parse the elements.
--Use re to pick out items with a regular expression.
--Identify the tag structure to extract using the browser's developer tools.
--This time, the items can be obtained by matching the href attribute against "news.yahoo.co.jp/pickup".
--Import the re module, which is part of the standard library, to use regular expressions.
--Check the official documentation later.
--Extract the text and the href attribute from the matched items.
ScrapingSample.py
import requests
from bs4 import BeautifulSoup
import re
url = "https://www.yahoo.co.jp/"
# Get the site's HTML using requests
result = requests.get(url)
# Parse the elements
bs = BeautifulSoup(result.text, "html.parser")
# Get items whose href attribute matches "news.yahoo.co.jp/pickup"
news_list = bs.find_all(href=re.compile("news.yahoo.co.jp/pickup"))
# Extract the text and href attribute from the matched items
for news in news_list:
    print("{0} , {1}".format(news.getText(), news.get('href')))
3 prefectures released Mask shoppers, https://news.yahoo.co.jp/pickup/6360522
Rice discusses resumption of nuclear test US newspaper, https://news.yahoo.co.jp/pickup/6360527
Light and dark NEW at Subaru and Mitsubishi Corona, https://news.yahoo.co.jp/pickup/6360528
Antimalarial drug increased risk of death NEW, https://news.yahoo.co.jp/pickup/6360523
A woman in her 80s with a seismic intensity of 4 broke before dawn, https://news.yahoo.co.jp/pickup/6360529
Mask delivery in Iwate Voice of nowadays NEW, https://news.yahoo.co.jp/pickup/6360521
Equestrian club pinch I want to avoid culling, https://news.yahoo.co.jp/pickup/6360510
Rina Akiyama gives birth to a second baby boy NEW, https://news.yahoo.co.jp/pickup/6360531
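One small caveat about the regular expression: in re.compile("news.yahoo.co.jp/pickup") the dots are regex wildcards that match any character. That is harmless here, but if you want a strictly literal match, one possible alternative (not in the original script, reusing the bs object from above) is to escape the pattern:

import re

# re.escape() turns the dots into literal characters before compiling
pattern = re.compile(re.escape("news.yahoo.co.jp/pickup"))
news_list = bs.find_all(href=pattern)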
"NEW" has also been extracted, but I think it's okay to replace it if it's unnecessary (not included in this implementation).
This was a simple example, but I would like to deepen my understanding by reading the official documentation.