Since my university days, I have been working on collecting stock prices and news articles with a PC in my laboratory. Recently, however, I have also needed to collect and accumulate **"English"** news articles at work.
So, in this article, I try to implement the process of retrieving "English" news articles as a Python program. This time, the news source is **Reuters**.
- Obtaining headlines (titles and summaries) from Reuters
- Obtaining the article text from Reuters
Based on the code described in the links below, I added code to retrieve the article text that each entry in "NEWS HEADLINES" links to.
- How to scrape news headlines from Reuters?
- Business News Headlines
In addition, I have confirmed that the code works with the following versions.
- How to install and use the Python library
For installing Selenium, I referred to the following article: [For Selenium] How to install ChromeDriver with pip (no need to add it to PATH, and the version can be specified).
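As a quick way to check which versions are actually installed in your own environment, here is a minimal sketch, assuming selenium and chromedriver-binary were installed with pip as in the article above:

```python
import selenium
import chromedriver_binary  # importing this adds the bundled chromedriver to PATH

print(selenium.__version__)
# chromedriver-binary documents a chromedriver_filename attribute holding the
# path of the bundled driver; treat the attribute name as an assumption if
# your version of the package differs.
print(chromedriver_binary.chromedriver_filename)
```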
Since the amount of code is not large, I will present the entire script. There are two points to note.
First, it is essential to implement wait processing (sleep), **if only to avoid putting a load on the site being accessed**. It is also better to implement a wait on the assumption that it takes time for a URL (page) to finish loading in the web browser; a small sketch follows the reference links below.
I referred to the following articles:
- [Python] Selenium usage memo
- Story of standby processing with Selenium
- Three settings for stable operation of Selenium (Headless mode also supported)
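Here is a minimal sketch of both kinds of waiting, using the same Reuters archive page and "control-nav-next" class name that the full script below relies on:

```python
import time

import chromedriver_binary  # puts the bundled chromedriver on PATH
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.implicitly_wait(10)  # implicit wait applied to every element lookup

driver.get("https://www.reuters.com/news/archive/businessnews")

# Explicit wait: block for up to 10 seconds until the "next page" button exists.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "control-nav-next"))
)

# Fixed sleep before the next request so the site is not hit in rapid succession.
time.sleep(10)

driver.quit()
```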
Second, it is essential to look at the source of each page, identify the elements while taking the tag structure into account, and extract the information with Selenium or BeautifulSoup4. This time, the headlines are obtained with Selenium and the article text with BeautifulSoup4.
The part that uses Selenium is almost the same as the reference code; on top of that, it adds the processing to obtain the link (href attribute) of each article and the processing to retrieve the article body, as sketched below.
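A condensed sketch of those two additions, assuming the class names ("story-content", "ArticleBodyWrapper", "Paragraph-paragraph-2Bgue") found in the page source at the time of writing; they will change whenever Reuters redesigns its pages:

```python
import chromedriver_binary  # puts the bundled chromedriver on PATH
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.reuters.com/news/archive/businessnews")

# Selenium side: take the first headline block and read its link target.
headline = driver.find_elements_by_class_name("story-content")[0]
href = headline.find_element_by_tag_name("a").get_attribute("href")

# BeautifulSoup4 side: fetch the linked article and collect its paragraphs.
soup = BeautifulSoup(requests.get(href).content, "lxml")
wrapper = soup.find("div", class_="ArticleBodyWrapper")
paragraphs = [p.text for p in wrapper.find_all("p", class_="Paragraph-paragraph-2Bgue")]

print(href)
print(paragraphs[:2])

driver.quit()
```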
When you run the code, CSV files are output to the folder specified by **outputdirpath** (one CSV file per page). I am a little concerned that I did not implement error handling and character-code handling very seriously; one possible way to tighten that is sketched after the script.
crawler_reuters.py
```python
import chromedriver_binary
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import dateutil.parser
import time
import os
import datetime
import csv
import io
import codecs
import requests
from bs4 import BeautifulSoup

'''
# Below, for a workplace or internal network (proxy environment). (2020/11/02 update)
os.environ["HTTP_PROXY"] = "http://${Proxy server IP address}:${Proxy server port number}/"
os.environ["HTTPS_PROXY"] = "http://${Proxy server IP address}:${Proxy server port number}/"
'''

def createOutputDirpath():
    workingdirpath = os.getcwd()
    outputdirname = 'article_{0:%Y%m%d}'.format(datetime.datetime.now())
    outputdirpath = "..\\data\\%s" % (outputdirname)
    if not os.path.exists(os.path.join(workingdirpath, outputdirpath)):
        os.mkdir(os.path.join(workingdirpath, outputdirpath))
    return os.path.join(workingdirpath, outputdirpath)

def getArticleBody(url):
    html = requests.get(url)
    #soup = BeautifulSoup(html.content, "html.parser")
    soup = BeautifulSoup(html.content, "lxml")
    wrapper = soup.find("div", class_="ArticleBodyWrapper")
    paragraph = [element.text for element in wrapper.find_all("p", class_="Paragraph-paragraph-2Bgue")]
    #paragraph = []
    #for element in wrapper.find_all("p", class_="Paragraph-paragraph-2Bgue"):
    #    paragraph.append(element.text)
    return paragraph

outputdirpath = createOutputDirpath()

driver = webdriver.Chrome()
driver.implicitly_wait(10)
driver.get('https://www.reuters.com/news/archive/businessnews?view=page&page=5&pageSize=10')

count = 0
for x in range(5):
    try:
        print("=====")
        print(driver.current_url)
        print("-----")
        #f = open(os.path.join(outputdirpath, "reuters_news.csv"), "w", newline="")
        f = codecs.open(os.path.join(outputdirpath, "reuters_news_%s.csv" % (x)), "w", "UTF-8")
        writer = csv.writer(f, delimiter=',', quoting=csv.QUOTE_ALL, quotechar="\"")
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "control-nav-next")))
        loadMoreButton = driver.find_element_by_class_name("control-nav-next")  # or "control-nav-prev"
        #driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        #news_headlines = driver.find_elements_by_class_name("story-content")
        news_headlines = driver.find_elements_by_class_name("news-headline-list")[0].find_elements_by_class_name("story-content")
        for headline in news_headlines:
            #print(headline.text)
            #print(headline.get_attribute("innerHTML"))
            href = headline.find_element_by_tag_name("a").get_attribute("href")
            title = headline.find_element_by_class_name("story-title").text
            smry = headline.find_element_by_tag_name("p").text
            stmp = headline.find_element_by_class_name("timestamp").text
            body = getArticleBody(href)
            print(href)
            #print(title)
            #print(smry)
            #print(stmp)
            #print(body)
            writer.writerow([href, title, smry, stmp, '\r\n'.join(body)])
            time.sleep(1)
        f.close()
        count += 1
        loadMoreButton.click()
        time.sleep(10)
    except Exception as e:
        print(e)
        break
```
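Regarding the error and character-code handling mentioned above, here is a minimal sketch of one way it could be tightened; the function name write_rows and the variable rows are hypothetical and only for illustration:

```python
import csv

def write_rows(path, rows):
    # Hypothetical helper: newline="" is what the csv module recommends, and
    # errors="replace" keeps a stray unencodable character from aborting the
    # whole page. The file is closed even if a row fails.
    with open(path, "w", encoding="utf-8", errors="replace", newline="") as f:
        writer = csv.writer(f, delimiter=",", quoting=csv.QUOTE_ALL, quotechar='"')
        for row in rows:
            try:
                writer.writerow(row)
            except (csv.Error, ValueError) as e:
                print("skipped a row:", e)
```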
Python really is convenient after all. I will change the Reuters URL parameters (the page number and the number of articles per page) and use this at work.
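For reference, a tiny, hypothetical helper for that URL tweak; as in the script above, the archive URL simply takes the page number and the page size as query parameters:

```python
def build_archive_url(page, page_size):
    # Hypothetical helper: same archive URL as in the script above, with the
    # page number and articles-per-page passed in as parameters.
    return ("https://www.reuters.com/news/archive/businessnews"
            "?view=page&page=%d&pageSize=%d" % (page, page_size))

# e.g. driver.get(build_archive_url(1, 20))
```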
But is the Java version of Selenium easier to use...?
In this article, I introduced how to crawl news articles (Reuters articles) using Selenium and BeautifulSoup4.