This is aimed at beginners, and I am a beginner myself. After working through some simple web scraping sample code, I wanted to try something of my own, so I experimented while looking things up. Running the reference code as written let me extract the page title and similar items. If that is level 1, I think this article is about level 2. I may have misunderstood some things, so please comment if you have any suggestions.
Environment: Python 3.7.3, developed in Visual Studio Code.
Python has a standard HTTP library called "urllib" ("urllib2" in Python 2), but it is not easy to use, so I use the "Requests" and "Beautiful Soup" libraries for web scraping. Requests fetches the web page, and Beautiful Soup parses its HTML so that elements can be extracted.
I will try to get the headlines and URLs of new articles from Nikkei Business Electronic Edition (https://business.nikkei.com/).
Open the site in Google Chrome and press F12 to open the developer tools (the Inspect view).
To find out which part of the HTML holds the new articles, press Ctrl + Shift + C and move the cursor over a headline.
This showed that the series name of each article is in a span whose class is category, and the article headline is in an h3 tag. The URL is in the a tag a little above them. I will explain how these elements relate to each other together with the program below.
code.py
import requests
from bs4 import BeautifulSoup

urlName = "https://business.nikkei.com"
# Fetch the top page over HTTP
response = requests.get(urlName)
# Parse the raw response body with Python's built-in HTML parser
soup = BeautifulSoup(response.content, "html.parser")
The requests library makes the HTTP request, and Beautiful Soup parses the returned HTML.
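The fetch step can fail quietly if the site is slow or returns an error page. Here is a minimal sketch of a more defensive version. The helper name `fetch_soup` and the 10-second default timeout are my own assumptions, not part of the original program.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=10):
    """Download a page and return it parsed (fetch_soup is a hypothetical helper)."""
    # Without a timeout, requests.get can wait indefinitely on a stalled server
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError on 4xx/5xx instead of parsing an error page
    response.raise_for_status()
    return BeautifulSoup(response.content, "html.parser")
```

With this helper, `soup = fetch_soup(urlName)` would do the same job as the fetch and parse lines above.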
code.py
elems = soup.find_all("span")
First, store all span elements in elems.
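As a possible alternative, find_all() can also filter by class directly, so only the relevant spans are collected. The HTML below is a made-up fragment that only mimics the structure described in this article. It is not the real markup of the site.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only
html = """
<a href="/atcl/sample/1/">
  <span class="category">Series A</span>
  <h3>Headline A</h3>
</a>
<span class="label">Not a series name</span>
"""

soup = BeautifulSoup(html, "html.parser")

# Every span on the page, as in the article
all_spans = soup.find_all("span")
# Only spans whose class list contains "category"
category_spans = soup.find_all("span", class_="category")

print(len(all_spans), len(category_spans))  # prints: 2 1
```

The keyword is `class_` with a trailing underscore because `class` is a reserved word in Python.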
code.py
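for elem in elems:
    # get("class") returns a list of class names, or None when the span has no class
    if "category" not in (elem.get("class") or []):
        continue
    # The headline h3 follows the span at the same depth
    title = elem.find_next_sibling("h3")
    # The nearest a tag before the span holds the article URL
    link = elem.find_previous("a")
    if title is None or link is None or not link.get("href"):
        continue  # skip spans that do not match the expected structure
    print(elem.string)
    print(title.text.replace('\n', ''))
    print(urlName + link.get('href'), '\n')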
Next, look at the class of each span element to determine whether it is category. If it is, the text of the series name is extracted with .string.
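One thing I found worth checking is what get("class") actually returns. This small sketch uses made-up HTML to show that it is a list (or None), and that testing with `in` against the string "category" is a substring test that can match the wrong class. The helper `has_category` is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Hypothetical spans: two classes, a look-alike class, and no class at all
html = (
    '<span class="category new">Series A</span>'
    '<span class="cat">x</span>'
    '<span>no class</span>'
)
soup = BeautifulSoup(html, "html.parser")
spans = soup.find_all("span")

print(spans[0].get("class"))  # ['category', 'new']
print(spans[2].get("class"))  # None

# "cat" is a substring of "category", so this is True even though the class differs
print("cat" in "category")    # True


def has_category(tag):
    """Return True only when "category" is one of the tag's class names."""
    return "category" in (tag.get("class") or [])


print([has_category(s) for s in spans])  # [True, False, False]
```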
The next step is to get the headline, which is in the h3 tag. The h3 tag is at the same depth as the span and comes right after it, so find_next_sibling() finds the h3 that follows the element at the same depth.
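To see why find_next_sibling() is used rather than the plain next_sibling attribute, here is a small sketch on a made-up fragment. The markup is an assumption that only mirrors the structure described above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only
html = """
<a href="/atcl/sample/1/">
  <span class="category">Series A</span>
  <h3>Headline A</h3>
</a>
"""
soup = BeautifulSoup(html, "html.parser")
span = soup.find("span", class_="category")

# The immediate next sibling is the whitespace between the tags, not the h3
print(type(span.next_sibling).__name__)  # NavigableString

# find_next_sibling("h3") skips text nodes and returns the first following h3
title = span.find_next_sibling("h3")
print(title.text)  # Headline A

# When no matching sibling exists, the result is None
print(span.find_next_sibling("h2"))  # None
```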
The h3 may also contain an image, so the extracted text may include line breaks; I remove any that are present.
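Removing only the newline characters can leave indentation spaces behind. As a hedged alternative, splitting on whitespace and joining with single spaces cleans up both. The h3 content below is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical h3 containing an image and text spread over several lines
html = '<h3>\n<img src="thumb.png" alt="">\n  Headline\n  over two lines\n</h3>'
soup = BeautifulSoup(html, "html.parser")
h3 = soup.find("h3")

# The approach used in the article: delete newline characters only
removed = h3.text.replace("\n", "")
# Alternative: collapse every run of whitespace into one space
collapsed = " ".join(h3.get_text().split())

print(repr(removed))    # '  Headline  over two lines'
print(repr(collapsed))  # 'Headline over two lines'
```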
Finally, I extract the URL. The h3 was at the same depth, but the a tag is one level higher. So I used find_previous() to look for the a tag, then used the get method to read the value of its href attribute.
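Joining the site address and the href with `+` works when the href is a path such as /atcl/..., but it would produce a broken URL if a link were already absolute. A small sketch using urljoin from the standard library handles both cases. The fragment and the example.com link are made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base = "https://business.nikkei.com"

# Hypothetical markup: one relative href and one absolute href
html = """
<a href="/atcl/sample/1/">
  <span class="category">Series A</span>
  <h3>Headline A</h3>
</a>
<a href="https://example.com/elsewhere/">
  <span class="category">Series B</span>
  <h3>Headline B</h3>
</a>
"""
soup = BeautifulSoup(html, "html.parser")

urls = []
for span in soup.find_all("span", class_="category"):
    # find_previous walks backwards through the document, so it reaches the enclosing a tag
    link = span.find_previous("a")
    # urljoin resolves relative paths against base and leaves absolute URLs unchanged
    urls.append(urljoin(base, link.get("href")))

print(urls)
```

Note that find_previous() returns the nearest earlier a tag in document order, which is the enclosing one here only because the span sits inside it.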
Below are some of the execution results.
Yuka Ikematsu's direct flight from New York
A huge hospital ship of the US Navy enters NY. Still not enough beds
https://business.nikkei.com/atcl/gen/19/00119/033100011/
Yohei Ichishima's Silicon Valley Insai ...
Living the "20% Demand Economy" Post-Corona Thinking and Moving US Food Service Industry
https://business.nikkei.com/atcl/gen/19/00137/033100002/
Muneaki Hashimoto looks ahead of medicine and medical care
Shionogi, President Teshiroki's conspiracy refrain from partnering with Ping An Insurance
https://business.nikkei.com/atcl/gen/19/00110/033100012/
In this way, I was able to get the series names, headlines, and URLs.
I'm still studying, so there may be misunderstandings or better ways to do this. I would like to keep practicing while deepening my understanding little by little.