Since my university days, I have been working on collecting stock prices and news articles with a PC in my laboratory. Recently, however, I have also needed to collect and accumulate **"English"** news articles at work.
So, in this article, I try to implement the process of retrieving "English" news articles as a Python program. This time, the news source is **Reuters**.
- Obtaining headlines (titles and summaries) from Reuters
- Obtaining the article text from Reuters
Based on the code described in the links below, I added code to retrieve the article text that each entry in "NEWS HEADLINES" links to.
- How to scrape news headlines from Reuters?
- Business News Headlines
In addition, I have confirmed that the code works with the following versions.
- How to install and use the Python library
For installing Selenium, I referred to the following article: [For Selenium] How to install ChromeDriver with pip (no need to add it to PATH, and the version can be specified).
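As a quick way to check which versions are actually installed in your own environment, here is a minimal sketch, assuming selenium and chromedriver-binary were installed with pip as in the article above:

```python
import selenium
import chromedriver_binary  # importing this adds the bundled chromedriver to PATH

print(selenium.__version__)
# chromedriver-binary documents a chromedriver_filename attribute holding the
# path of the bundled driver; treat the attribute name as an assumption if
# your version of the package differs.
print(chromedriver_binary.chromedriver_filename)
```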
Since the amount of code is not large, I will present the entire script. There are two points to note.
First, it is essential to implement wait processing (sleep), **if only to avoid putting a load on the site being accessed**. It is also better to implement a wait on the assumption that it takes time for a URL (page) to finish loading in the web browser; a small sketch follows the reference links below.
I referred to the following articles:
- [Python] Selenium usage memo
- Story of standby processing with Selenium
- Three settings for stable operation of Selenium (Headless mode also supported)
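Here is a minimal sketch of both kinds of waiting, using the same Reuters archive page and "control-nav-next" class name that the full script below relies on:

```python
import time

import chromedriver_binary  # puts the bundled chromedriver on PATH
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.implicitly_wait(10)  # implicit wait applied to every element lookup

driver.get("https://www.reuters.com/news/archive/businessnews")

# Explicit wait: block for up to 10 seconds until the "next page" button exists.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CLASS_NAME, "control-nav-next"))
)

# Fixed sleep before the next request so the site is not hit in rapid succession.
time.sleep(10)

driver.quit()
```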
Second, it is essential to look at the source of each page, identify the elements while taking the tag structure into account, and extract the information with Selenium or BeautifulSoup4. This time, the headlines are obtained with Selenium and the article text with BeautifulSoup4.
The part that uses Selenium is almost the same as the reference code; on top of that, it adds the processing to obtain the link (href attribute) of each article and the processing to retrieve the article body, as sketched below.
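A condensed sketch of those two additions, assuming the class names ("story-content", "ArticleBodyWrapper", "Paragraph-paragraph-2Bgue") found in the page source at the time of writing; they will change whenever Reuters redesigns its pages:

```python
import chromedriver_binary  # puts the bundled chromedriver on PATH
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://www.reuters.com/news/archive/businessnews")

# Selenium side: take the first headline block and read its link target.
headline = driver.find_elements_by_class_name("story-content")[0]
href = headline.find_element_by_tag_name("a").get_attribute("href")

# BeautifulSoup4 side: fetch the linked article and collect its paragraphs.
soup = BeautifulSoup(requests.get(href).content, "lxml")
wrapper = soup.find("div", class_="ArticleBodyWrapper")
paragraphs = [p.text for p in wrapper.find_all("p", class_="Paragraph-paragraph-2Bgue")]

print(href)
print(paragraphs[:2])

driver.quit()
```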
When you run the code, CSV files are output to the folder specified by **outputdirpath** (one CSV file per page). I am a little concerned that I did not implement error handling and character-code handling very seriously; one possible way to tighten that is sketched after the script.
crawler_reuters.py
```python
import chromedriver_binary
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import dateutil.parser
import time
import os
import datetime
import csv
import io
import codecs
import requests
from bs4 import BeautifulSoup

'''
# Below, for a workplace or internal network (proxy environment). (2020/11/02 update)
os.environ["HTTP_PROXY"] = "http://${Proxy server IP address}:${Proxy server port number}/"
os.environ["HTTPS_PROXY"] = "http://${Proxy server IP address}:${Proxy server port number}/"
'''

def createOutputDirpath():
    workingdirpath = os.getcwd()
    outputdirname = 'article_{0:%Y%m%d}'.format(datetime.datetime.now())
    outputdirpath = "..\\data\\%s" % (outputdirname)
    if not os.path.exists(os.path.join(workingdirpath, outputdirpath)):
        os.mkdir(os.path.join(workingdirpath, outputdirpath))
    return os.path.join(workingdirpath, outputdirpath)

def getArticleBody(url):
    html = requests.get(url)
    #soup = BeautifulSoup(html.content, "html.parser")
    soup = BeautifulSoup(html.content, "lxml")
    wrapper = soup.find("div", class_="ArticleBodyWrapper")
    paragraph = [element.text for element in wrapper.find_all("p", class_="Paragraph-paragraph-2Bgue")]
    #paragraph = []
    #for element in wrapper.find_all("p", class_="Paragraph-paragraph-2Bgue"):
    #    paragraph.append(element.text)
    return paragraph

outputdirpath = createOutputDirpath()

driver = webdriver.Chrome()
driver.implicitly_wait(10)
driver.get('https://www.reuters.com/news/archive/businessnews?view=page&page=5&pageSize=10')

count = 0
for x in range(5):
    try:
        print("=====")
        print(driver.current_url)
        print("-----")
        #f = open(os.path.join(outputdirpath, "reuters_news.csv"), "w", newline="")
        f = codecs.open(os.path.join(outputdirpath, "reuters_news_%s.csv" % (x)), "w", "UTF-8")
        writer = csv.writer(f, delimiter=',', quoting=csv.QUOTE_ALL, quotechar="\"")
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CLASS_NAME, "control-nav-next")))
        loadMoreButton = driver.find_element_by_class_name("control-nav-next")  # or "control-nav-prev"
        #driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        #news_headlines = driver.find_elements_by_class_name("story-content")
        news_headlines = driver.find_elements_by_class_name("news-headline-list")[0].find_elements_by_class_name("story-content")
        for headline in news_headlines:
            #print(headline.text)
            #print(headline.get_attribute("innerHTML"))
            href = headline.find_element_by_tag_name("a").get_attribute("href")
            title = headline.find_element_by_class_name("story-title").text
            smry = headline.find_element_by_tag_name("p").text
            stmp = headline.find_element_by_class_name("timestamp").text
            body = getArticleBody(href)
            print(href)
            #print(title)
            #print(smry)
            #print(stmp)
            #print(body)
            writer.writerow([href, title, smry, stmp, '\r\n'.join(body)])
            time.sleep(1)
        f.close()
        count += 1
        loadMoreButton.click()
        time.sleep(10)
    except Exception as e:
        print(e)
        break
```
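Regarding the error and character-code handling mentioned above, here is a minimal sketch of one way it could be tightened; the function name write_rows and the variable rows are hypothetical and only for illustration:

```python
import csv

def write_rows(path, rows):
    # Hypothetical helper: newline="" is what the csv module recommends, and
    # errors="replace" keeps a stray unencodable character from aborting the
    # whole page. The file is closed even if a row fails.
    with open(path, "w", encoding="utf-8", errors="replace", newline="") as f:
        writer = csv.writer(f, delimiter=",", quoting=csv.QUOTE_ALL, quotechar='"')
        for row in rows:
            try:
                writer.writerow(row)
            except (csv.Error, ValueError) as e:
                print("skipped a row:", e)
```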
Python really is convenient after all. I will change the Reuters URL parameters (the page number and the number of articles per page) and use this at work.
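For reference, a tiny, hypothetical helper for that URL tweak; as in the script above, the archive URL simply takes the page number and the page size as query parameters:

```python
def build_archive_url(page, page_size):
    # Hypothetical helper: same archive URL as in the script above, with the
    # page number and articles-per-page passed in as parameters.
    return ("https://www.reuters.com/news/archive/businessnews"
            "?view=page&page=%d&pageSize=%d" % (page, page_size))

# e.g. driver.get(build_archive_url(1, 20))
```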
But is the Java version of Selenium easier to use...?
In this article, I introduced how to crawl news articles (Reuters articles) using Selenium and BeautifulSoup4.