[For beginners] Web scraping with Python "Access the URL in the page to get the contents"

Introduction

Recap of last time

This article continues from [For beginners] Trying web scraping with Python. Last time, I got the headlines and URLs of new articles from the electronic edition of Nikkei Business (https://business.nikkei.com/).

However, that much is something you can find out just by visiting the site yourself.

Goal for this time

When you browse a news site and find a story that interests you, you click it to see the details. Many Nikkei Business articles, though not all, open with an introduction of about 150 characters that makes you want to read on. Displaying this introduction alongside the headline gives you a basis for deciding whether to read the article. Opening every article one by one just to read its introduction is tedious, so this is where web scraping shows its strength.

Review of the previous code

`code.py`


import requests
from bs4 import BeautifulSoup
import re

urlName = "https://business.nikkei.com"
url = requests.get(urlName)
soup = BeautifulSoup(url.content, "html.parser")

elems = soup.find_all("span")

for elem in elems: 
  try:
    string = elem.get("class").pop(0)
    if string == "category":
      print(elem.string)
      title = elem.find_next_sibling("h3")
      print(title.text.replace('\n',''))
      r = elem.find_previous('a')
      # Get the URL of the article
      print(urlName + r.get('href'), '\n')

      # The code that fetches the article's introductory text from the linked URL goes here

  except:
    pass

See the previous article for details. Last time ended with printing the URL that each headline links to. This time, we access that URL and fetch its content.
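
The code above builds the article URL by joining urlName and the href with +, which works as long as every href is a site-relative path. As a small sketch of the same step, the standard library's urllib.parse.urljoin can be used instead. The helper name article_url and the example href values are assumptions for illustration, not values read from the live page.

```python
from urllib.parse import urljoin

urlName = "https://business.nikkei.com"


def article_url(base, href):
  # Resolve href against base; an href that is already absolute keeps its own host
  return urljoin(base, href)


# Example href values for illustration (assumed, not read from the live page)
print(article_url(urlName, "/atcl/gen/19/00101/040100009/"))
print(article_url(urlName, "https://business.nikkei.com/atcl/gen/19/00101/040100009/"))
```

Both calls print the same URL, whereas plain concatenation would put the host twice in the second case.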

Programming

First, we move the requests and BeautifulSoup setup into a function.

`subFunc.py`


import requests
from bs4 import BeautifulSoup

def setup(url):
  # Fetch the page and parse it; returns the response and the parsed soup
  response = requests.get(url)
  soup = BeautifulSoup(response.content, "html.parser")
  return response, soup
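
Because this function is now called once per article, a slow or failing page can stall or confuse the whole loop. The following is a minimal sketch of a more defensive variant, not the code used in this article. The name setup_checked, the get parameter, and the 10-second default are assumptions for illustration; timeout and raise_for_status are standard parts of requests.

```python
import requests
from bs4 import BeautifulSoup


def setup_checked(url, get=requests.get, timeout=10):
  # Hypothetical variant of setup(); `get` can be swapped out to try it offline
  response = get(url, timeout=timeout)
  # Raise requests.HTTPError for 4xx/5xx responses instead of parsing an error page
  response.raise_for_status()
  soup = BeautifulSoup(response.content, "html.parser")
  return response, soup
```

It returns the same pair as setup, so it could be called in the same way if this behavior is wanted.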

`main.py`


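import subFunc

urlName = "https://business.nikkei.com"
url, soup = subFunc.setup(urlName)

elems = soup.find_all("span")

for elem in elems:
  try:
    # Take the first class name of the span; spans without a class raise and are skipped
    string = elem.get("class").pop(0)
    if string == "category":
      print('\n', elem.string)

      # The headline is in the h3 that follows the category span
      title = elem.find_next_sibling("h3")
      print(title.text.replace('\n',''))

      # The link to the article is the nearest preceding a element
      r = elem.find_previous('a')
      nextPage = urlName + r.get('href')
      print(nextPage)

      # Newly written part from here:
      # fetch the article page and print its introductory text, if any
      nextUrl, nextSoup = subFunc.setup(nextPage)
      abst = nextSoup.find('p', class_="bplead")
      if abst is not None:
        print(abst.get_text().replace('\n',''))
  except Exception:
    # Skip elements that lack a class, heading, or link, and pages that fail to load
    pass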

To be honest, the approach is the same as before: fetch the linked URL with requests and parse it with BeautifulSoup. The article's introductory text was in a p element with the class bplead. Some articles have no introductory text, so the text is printed only when it exists.

The output looks like this (excerpt).
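
BeautifulSoup's find returns None when nothing matches, so the "print only when it exists" check can be written as a comparison with None. Here is a minimal sketch that can be run offline. The two HTML strings and the helper name get_lead are made up for illustration; only the bplead class comes from the article.

```python
from bs4 import BeautifulSoup

# Two made-up article pages: one with an introductory paragraph, one without
with_lead = '<html><body><p class="bplead">\nIntro text\n</p><p>Body</p></body></html>'
without_lead = "<html><body><p>Body only</p></body></html>"


def get_lead(html):
  # Return the introductory text, or None when the page has no p.bplead element
  soup = BeautifulSoup(html, "html.parser")
  abst = soup.find("p", class_="bplead")
  if abst is None:
    return None
  return abst.get_text().replace("\n", "")


print(get_lead(with_lead))
print(get_lead(without_lead))
```

The first call prints the introductory text with its newlines removed, and the second prints None.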

Co-creation / competition / startup
The new corona is a long-term battle xxxxxxxxxxx
https://business.nikkei.com/atcl/gen/19/00101/040100009/    
He complained of the epidemic of the new coronavirus xxxxxxxxxxxx.

In closing

When I looked into it, I found several other methods as well, but here I fetched the content of the linked page in a simple way.