If you search Google News for keywords or phrases you are interested in, it displays about 100 articles organized by relevance and release date. To find out how hit food products emerged, you can search for keywords and phrases likely to be related to them, look through past news, and check on Google Trends how much interest rose around the time those articles were published. This lets you trace the process leading up to a hit, and it can also be used to catch topics that may lead to new hits. In a previous report, "Scraping Google News in Python and editing in R," I introduced how to parse the Google News RSS feed in Python with feedparser. With that method, however, the summary text has been identical to the title text since around October 2019.
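For reference, here is a minimal sketch of that earlier RSS approach (the RSS endpoint pattern and the loop are my reconstruction, not code from the original report):

```python
# Minimal sketch of the earlier feedparser approach (assumed RSS URL pattern).
import feedparser

url = "https://news.google.com/rss/search?q=Tapiru&hl=ja&gl=JP&ceid=JP:ja"
feed = feedparser.parse(url)
for entry in feed.entries[:5]:
    # Since around October 2019, entry.summary just mirrors entry.title here.
    print(entry.title, entry.link, entry.published)
```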
Therefore, this time I will introduce a script that uses Beautiful Soup to acquire article information from the Google News search result page. Unlike feedparser, which returns article information in an organized form, this approach requires searching the search result web page for where the article information is located and specifying what to extract by tags, elements, and attributes.
Here, I will show how to locate the article information you want to retrieve using Google Chrome, and a script that extracts that information from the discovered page structure using the requests and Beautiful Soup libraries.
The search word used was "Tapiru," which was selected as one of the top ten in the 2019 New Words and Buzzwords Awards. The search displayed the results shown below.
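The search result page URL itself can be built by percent-encoding the keyword, just as the full script in section 2 does:

```python
# Build the Google News search URL (same pattern as the script in section 2).
import urllib.parse

s = "Tapiru"
s_quote = urllib.parse.quote(s)  # percent-encode the keyword
url_b4 = "https://news.google.com/search?q=" + s_quote + "&hl=ja&gl=JP&ceid=JP%3Aja"
print(url_b4)
```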
To examine the structure of this page, hover over an article title, right-click, and choose "Inspect" at the bottom of the menu that appears.
The HTML element structure of the page appears in the upper right. In this window, locate the article information and work out the tags and attributes needed to extract it.
The displayed HTML code can look daunting at first, but the information you need is always near the light blue highlighted zone, so it is important to search carefully and persistently. Clicking ▶ just below the light blue zone opens the layer underneath, where the title text "#Tapiru's English, do you know? ..." is displayed. This confirms that the first article's information is written near the light blue zone.
So, looking in the gray part for the grouping tag div (see the reference at the end of this article for the div tag) to find the top-level tag that contains this article's information, we arrive at:
▼<div class="xrnccd"
The article information we want is in the layers below this tag, so we can roughly select the information of all (about 100) articles by passing its identifying class, "xrnccd", to Beautiful Soup as a selector. The script below assigns the information of every matched article to articles.
articles = soup.select(".xrnccd")
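As a minimal illustration with hypothetical HTML (not the actual Google News markup), select(".xrnccd") returns every element carrying that class:

```python
from bs4 import BeautifulSoup

# Toy markup: two article blocks sharing the class used as the selector.
html = '''
<div class="xrnccd"><h3>Article one</h3></div>
<div class="xrnccd"><h3>Article two</h3></div>
'''
soup = BeautifulSoup(html, "html.parser")
articles = soup.select(".xrnccd")   # CSS class selector
print(len(articles))                # 2
print(articles[0].find("h3").text)  # Article one
```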
Next, find and get the parts describing each article's title, summary, original-article URL, and release date. The title text "#Tapiru's English ..." is just below the light blue zone. Clicking ▶ just below it opens the lower layer, and just under a <span class=... tag the first few lines of the article body appeared. This text is not shown on the search result page itself, but it was hidden here; I will call it the summary.
The script that gets this text is:
summary = entry.find("span").text
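Continuing the toy example (the entry structure here is assumed from the inspected page, not guaranteed to match the live markup):

```python
from bs4 import BeautifulSoup

# Hypothetical entry: the first <h3> holds the title, the first <span> the summary.
entry = BeautifulSoup(
    '<div class="xrnccd"><h3>#Tapiru English</h3>'
    '<span>First few lines of the article body...</span></div>',
    "html.parser")
title = entry.find("h3").text
summary = entry.find("span").text
print(title, summary)
```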
As for the article's release date, clicking ▶ of the <div class="Qmr... just below opens the lower layer, where datetime="2019-12-13... sat directly under <time class=.
The script to get this datetime is:
time_elm = entry.find("time")
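Here is a standalone sketch of the date extraction, including the fallback the full script uses: when an entry has no <time> tag, find() returns None and calling .get() on it raises AttributeError, so "0000-00-00" is substituted:

```python
from bs4 import BeautifulSoup

entry = BeautifulSoup(
    '<div><time datetime="2019-12-13T01:23:45Z">3 days ago</time></div>',
    "html.parser")
time_elm = entry.find("time")
try:
    ymd = time_elm.get("datetime")
except AttributeError:              # no <time> tag in this entry
    ymd = "0000-00-00"
print(ymd[0:10].replace("-", "/"))  # 2019/12/13
```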
Finally, there is the URL of the article page, which is in the light blue part of the inspector. The link information turns out to be attached to the article's title:
<a class="VDXfz" jsname="hXuDdf" jslog="85008; 2:https://prtimes.jp/main/thml/rd/p/000001434.000011710.html;
The URL is the https://... part of the jslog attribute. I first tried url_elm = entry.find("a") and url_elm = entry.find("a", class_="VDXfz"), but settled on the following two lines:

url_elm = entry.find("article")
link = url_elm.get("jslog")
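As a standalone sketch with a hypothetical URL, the fixed prefix and suffix around the link can be removed like this. Note that str.lstrip()/str.rstrip() treat their argument as a set of characters rather than a literal string, so rstrip("; track:click") would also eat the trailing "l" of ".html"; removing the prefix and suffix explicitly avoids that pitfall:

```python
# Hypothetical jslog value; the real one ends with "; track:click".
jslog = "85008; 2:https://example.com/article.html; track:click"

prefix, suffix = "85008; 2:", "; track:click"
link = jslog[len(prefix):] if jslog.startswith(prefix) else jslog
link = link[:-len(suffix)] if link.endswith(suffix) else link
print(link)  # https://example.com/article.html
```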
Now let's go through the whole script. Unnecessary characters at either end of the acquired link are deleted: lstrip() for the leading prefix, while the fixed trailing "; track:click" is removed with replace(), since rstrip() treats its argument as a character set, as noted above. If there is no release date information, "0000-00-00" is substituted via exception handling. The acquired information is converted into a data frame with the pandas library and saved to a csv file.
2. Google News search result scraping script
Environment: Windows 10, Python 3.6.2
Script: google_news
```python
# Import the required libraries
import pandas as pd             # Save the scraping result to a csv file in data frame format
import pprint                   # Display part of a data frame
from bs4 import BeautifulSoup   # Parse and extract the acquired web page information
import requests                 # Fetch web pages
import urllib.parse             # URL-encode the search keyword

# Percent-encode the search word "Tapiru" and insert it into the search result page URL
s = "Tapiru"
s_quote = urllib.parse.quote(s)
url_b4 = 'https://news.google.com/search?q=' + s_quote + '&hl=ja&gl=JP&ceid=JP%3Aja'

# Get the search result page
res = requests.get(url_b4)
soup = BeautifulSoup(res.content, "html.parser")

# Select the information of all articles
articles = soup.select(".xrnccd")

# Loop over the articles and collect each one's information
news = list()  # Empty list to hold the results
for i, entry in enumerate(articles, 1):
    title = entry.find("h3").text
    summary = entry.find("span").text
    summary = title + "。" + summary
    # url_elm = entry.find("a")  -- changed to entry.find("article")
    url_elm = entry.find("article")
    link = url_elm.get("jslog")
    link = link.lstrip("85008; 2:")           # Delete the prefix on the left edge
    link = link.replace("; track:click", "")  # Delete the suffix on the right edge
    # (replace() instead of rstrip(): rstrip() treats its argument as a
    #  character set and would also trim the trailing "l" of ".html")
    time_elm = entry.find("time")
    try:  # Exception handling: some entries have no <time> tag
        ymd = time_elm.get("datetime")
    except AttributeError:
        ymd = "0000-00-00"
    ymd = ymd[0:10]
    ymd = ymd.replace("-", "/")                 # Normalize the date separator
    sortkey = ymd[0:4] + ymd[5:7] + ymd[8:10]   # For sorting by date
    tmp = {  # Store as a dictionary
        "title": title,
        "summary": summary,
        "link": link,
        "published": ymd,
        "sortkey": sortkey
    }
    news.append(tmp)  # Add each article's information to the list

# Convert to a data frame and save as a csv file
news_df = pd.DataFrame(news)
pprint.pprint(news_df.head())  # Display the first 5 rows to check the data
filename = s + ".csv"
news_df.to_csv(filename, encoding='utf-8-sig', index=False)
```
This Google News search script is used in the following articles:
[Find the seeds of food hits in data science! (1) -- The secret of Lawson's Basque hit](https://blog.hatena.ne.jp/yamtakumol/yamtakumol.hatenablog.com/edit?entry=26006613407003507)
[Let's find the seeds of food hits! (2) -- "Complete meal" and "Weathering with You recipe" from June to August 2019](https://blog.hatena.ne.jp/yamtakumol/yamtakumol.hatenablog.com/edit?entry=26006613422742161)
[Let's find the seeds of food hits! (3) -- September 2019 is the food from Taiwan following bubble tea, especially "cheese tea"](https://blog.hatena.ne.jp/yamtakumol/yamtakumol.hatenablog.com/edit?entry=26006613447159392)
Let's find the seeds of food hits! -- Sweet potato pie in October 2019
**Seeds of food hits expected in 2020 -- Cheese balls**
References:
What is HTML? If you read this, even beginners can definitely write HTML!
What is an HTML div class? Explained with examples in 5 minutes