This is a continuation of the article [For beginners] Trying web scraping with Python. Last time, I got the headlines and URLs of the new articles from the electronic version of Nikkei Business (https://business.nikkei.com/).
However, that alone tells you nothing you could not find out by simply visiting the site yourself.
When you browse a news site and find a story that interests you, you click it to see the details. Many Nikkei Business articles, though not all, open with an introduction of about 150 characters, placed before the body, that is meant to make you want to read on. Displaying this introduction next to each headline gives you a basis for deciding whether to read the article. Opening every article one by one just to read its introduction is tedious, and this is where web scraping shows its value.
code.py
import requests
from bs4 import BeautifulSoup
import re
urlName = "https://business.nikkei.com"
url = requests.get(urlName)
soup = BeautifulSoup(url.content, "html.parser")
elems = soup.find_all("span")
for elem in elems:
    try:
        # First class name of the span (spans without a class raise and are skipped)
        string = elem.get("class").pop(0)
        if string == "category":
            print(elem.string)
            title = elem.find_next_sibling("h3")
            print(title.text.replace('\n', ''))
            r = elem.find_previous('a')
            # Get the URL of the article
            print(urlName + r.get('href'), '\n')
            # The code that fetches the introduction from the linked page goes here
    except (AttributeError, TypeError):
        # Skip spans that do not have the expected structure
        pass
See the previous article for more details. Last time we stopped at printing the URL that each headline links to. This time, we access that URL and get its contents.
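The article URL in the code above is built by concatenating the site root with the link's href. As a minimal sketch of a slightly more robust alternative, the standard library's urllib.parse.urljoin can resolve the href against the site root, which also copes with hrefs that are already absolute. The helper name build_article_url and the constant BASE_URL are hypothetical names used only for this illustration.

```python
from urllib.parse import urljoin

# Site root used in this article
BASE_URL = "https://business.nikkei.com"


def build_article_url(href, base=BASE_URL):
    """Resolve an href taken from an <a> tag against the site root.

    build_article_url is a hypothetical helper, not part of the original code.
    """
    return urljoin(base, href)


# A site-relative path, like the one in the output shown in this article
print(build_article_url("/atcl/gen/19/00101/040100009/"))

# An href that is already absolute is returned unchanged
print(build_article_url("https://business.nikkei.com/atcl/gen/19/00101/040100009/"))
```

Both calls print the same full URL, so the rest of the program does not need to care which form the site uses in its links.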
First of all, we move the requests and BeautifulSoup steps into a function.
subFunc.py
import requests
from bs4 import BeautifulSoup
def setup(url):
    # Fetch the page and parse it; return both the response and the soup
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
    return response, soup
main.py
import re
import subFunc
urlName = "https://business.nikkei.com"
url, soup = subFunc.setup(urlName)
elems = soup.find_all("span")
for elem in elems:
    try:
        string = elem.get("class").pop(0)
        if string == "category":
            print('\n', elem.string)
            title = elem.find_next_sibling("h3")
            print(title.text.replace('\n', ''))
            r = elem.find_previous('a')
            nextPage = urlName + r.get('href')
            print(nextPage)
            # New part starts here: fetch the linked article page
            nextUrl, nextSoup = subFunc.setup(nextPage)
            abst = nextSoup.find('p', class_="bplead")
            # find() returns None when the article has no introduction
            if abst is not None:
                print(abst.get_text().replace('\n', ''))
    except Exception:
        # Skip spans without the expected structure and pages that fail to load
        pass
To be honest, the approach is the same as before: get the contents of the linked URL using requests and BeautifulSoup. The introduction of each article is in a p element whose class is bplead. Some articles have no introduction, so the text is displayed only when that element exists.
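BeautifulSoup's find returns None when no element matches, so the "display it only if it exists" check can be written as an explicit None test. The sketch below tries this on two small hand-written HTML snippets rather than on the real site; the function name extract_lead and both snippets are made up for this illustration.

```python
from bs4 import BeautifulSoup


def extract_lead(soup):
    """Return the text of <p class="bplead">, or None if the page has none.

    extract_lead is a hypothetical helper, not part of the original code.
    """
    lead = soup.find("p", class_="bplead")
    if lead is None:
        return None
    return lead.get_text().replace("\n", "")


# Hand-written stand-ins for an article with and without an introduction
with_lead = BeautifulSoup(
    '<div><p class="bplead">\nIntro text\n</p><p>Body</p></div>', "html.parser"
)
without_lead = BeautifulSoup("<div><p>Body only</p></div>", "html.parser")

print(extract_lead(with_lead))
print(extract_lead(without_lead))
```

Returning None for the missing case lets the caller decide what to do, for example skipping the print or showing a placeholder.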
The execution result is as follows (abridged).
Co-creation / competition / startup
The new corona is a long-term battle xxxxxxxxxxx
https://business.nikkei.com/atcl/gen/19/00101/040100009/
He complained of the epidemic of the new coronavirus xxxxxxxxxxxx.
When I looked into it, I found several other methods, but here I used a simple one to get the contents of the linked pages.