If you search Google News for keywords or phrases you are interested in, it displays about 100 articles organized by relevance and release date. To find out how hit food products emerged, you can search for keywords and phrases likely to be related to them, look through past news, and check on Google Trends how much interest rose around the time those articles were published. This lets you trace the process leading up to a hit, and it can also be used to catch topics that may lead to new hits. In a previous report, "Scraping Google News in Python and editing in R," I introduced how to parse the Google News RSS feed in Python with feedparser. With that method, however, the summary text has been identical to the title text since around October 2019.
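For reference, here is a minimal sketch of that earlier RSS approach (the RSS endpoint pattern and the loop are my reconstruction, not code from the original report):

```python
# Minimal sketch of the earlier feedparser approach (assumed RSS URL pattern).
import feedparser

url = "https://news.google.com/rss/search?q=Tapiru&hl=ja&gl=JP&ceid=JP:ja"
feed = feedparser.parse(url)
for entry in feed.entries[:5]:
    # Since around October 2019, entry.summary just mirrors entry.title here.
    print(entry.title, entry.link, entry.published)
```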
Therefore, this time I will introduce a script that uses Beautiful Soup to acquire article information from the Google News search result page. Unlike feedparser, which returns article information in an organized form, this approach requires searching the search result web page for where the article information is located and specifying what to extract by tags, elements, and attributes.
Here, I will show how to locate the article information you want to retrieve using Google Chrome, and a script that extracts that information from the discovered page structure using the requests and Beautiful Soup libraries.
The search word used was "Tapiru," which was selected as one of the top ten in the 2019 New Words and Buzzwords Awards. The search displayed the results shown below.
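The search result page URL itself can be built by percent-encoding the keyword, just as the full script in section 2 does:

```python
# Build the Google News search URL (same pattern as the script in section 2).
import urllib.parse

s = "Tapiru"
s_quote = urllib.parse.quote(s)  # percent-encode the keyword
url_b4 = "https://news.google.com/search?q=" + s_quote + "&hl=ja&gl=JP&ceid=JP%3Aja"
print(url_b4)
```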
To examine the structure of this page, hover over an article title, right-click, and choose "Inspect" at the bottom of the menu that appears.
The HTML element structure of the page appears in the upper right. In this window, locate the article information and work out the tags and attributes needed to extract it.
The displayed HTML code can look daunting at first, but the information you need is always near the light blue highlighted zone, so it is important to search carefully and persistently. Clicking ▶ just below the light blue zone opens the layer underneath, where the title text "#Tapiru's English, do you know? ..." is displayed. This confirms that the first article's information is written near the light blue zone.
So, looking in the gray part for the grouping tag div (see the reference at the end of this article for the div tag) to find the top-level tag that contains this article's information, we arrive at:
▼<div class="xrnccd"
The article information we want is in the layers below this tag, so we can roughly select the information of all (about 100) articles by passing its identifying class, "xrnccd", to Beautiful Soup as a selector. The script below assigns the information of every matched article to articles.
articles = soup.select(".xrnccd")
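As a minimal illustration with hypothetical HTML (not the actual Google News markup), select(".xrnccd") returns every element carrying that class:

```python
from bs4 import BeautifulSoup

# Toy markup: two article blocks sharing the class used as the selector.
html = '''
<div class="xrnccd"><h3>Article one</h3></div>
<div class="xrnccd"><h3>Article two</h3></div>
'''
soup = BeautifulSoup(html, "html.parser")
articles = soup.select(".xrnccd")   # CSS class selector
print(len(articles))                # 2
print(articles[0].find("h3").text)  # Article one
```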
Next, find and get the parts describing each article's title, summary, original-article URL, and release date. The title text "#Tapiru's English ..." is just below the light blue zone. Clicking ▶ just below it opens the lower layer, and just under a <span class=... tag the first few lines of the article body appeared. This text is not shown on the search result page itself, but it was hidden here; I will call it the summary.
The script that gets this text is:
summary = entry.find("span").text
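Continuing the toy example (the entry structure here is assumed from the inspected page, not guaranteed to match the live markup):

```python
from bs4 import BeautifulSoup

# Hypothetical entry: the first <h3> holds the title, the first <span> the summary.
entry = BeautifulSoup(
    '<div class="xrnccd"><h3>#Tapiru English</h3>'
    '<span>First few lines of the article body...</span></div>',
    "html.parser")
title = entry.find("h3").text
summary = entry.find("span").text
print(title, summary)
```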
As for the article's release date, clicking ▶ of the <div class="Qmr... just below opens the lower layer, where datetime="2019-12-13... sat directly under <time class=.
The script to get this datetime is:
time_elm = entry.find("time")
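Here is a standalone sketch of the date extraction, including the fallback the full script uses: when an entry has no <time> tag, find() returns None and calling .get() on it raises AttributeError, so "0000-00-00" is substituted:

```python
from bs4 import BeautifulSoup

entry = BeautifulSoup(
    '<div><time datetime="2019-12-13T01:23:45Z">3 days ago</time></div>',
    "html.parser")
time_elm = entry.find("time")
try:
    ymd = time_elm.get("datetime")
except AttributeError:              # no <time> tag in this entry
    ymd = "0000-00-00"
print(ymd[0:10].replace("-", "/"))  # 2019/12/13
```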
Finally, there is the URL of the article page, which is in the light blue part of the inspector. The link information turns out to be attached to the article's title:
<a class="VDXfz" jsname="hXuDdf" jslog="85008; 2:https://prtimes.jp/main/thml/rd/p/000001434.000011710.html;
The URL is the https://... part of the jslog attribute. I first tried url_elm = entry.find("a") and url_elm = entry.find("a", class_="VDXfz"), but settled on the following two lines:

url_elm = entry.find("article")
link = url_elm.get("jslog")
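As a standalone sketch with a hypothetical URL, the fixed prefix and suffix around the link can be removed like this. Note that str.lstrip()/str.rstrip() treat their argument as a set of characters rather than a literal string, so rstrip("; track:click") would also eat the trailing "l" of ".html"; removing the prefix and suffix explicitly avoids that pitfall:

```python
# Hypothetical jslog value; the real one ends with "; track:click".
jslog = "85008; 2:https://example.com/article.html; track:click"

prefix, suffix = "85008; 2:", "; track:click"
link = jslog[len(prefix):] if jslog.startswith(prefix) else jslog
link = link[:-len(suffix)] if link.endswith(suffix) else link
print(link)  # https://example.com/article.html
```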
Now let's go through the whole script. Unnecessary characters at either end of the acquired link are deleted: lstrip() for the leading prefix, while the fixed trailing "; track:click" is removed with replace(), since rstrip() treats its argument as a character set, as noted above. If there is no release date information, "0000-00-00" is substituted via exception handling. The acquired information is converted into a data frame with the pandas library and saved to a csv file.
2. Google News search result scraping script
Environment: Windows 10, Python 3.6.2
Script: google_news
```python
# Import the required libraries
import pandas as pd             # Save the scraping result to a csv file in data frame format
import pprint                   # Display part of a data frame
from bs4 import BeautifulSoup   # Parse and extract the acquired web page information
import requests                 # Fetch web pages
import urllib.parse             # URL-encode the search keyword

# Percent-encode the search word "Tapiru" and insert it into the search result page URL
s = "Tapiru"
s_quote = urllib.parse.quote(s)
url_b4 = 'https://news.google.com/search?q=' + s_quote + '&hl=ja&gl=JP&ceid=JP%3Aja'

# Get the search result page
res = requests.get(url_b4)
soup = BeautifulSoup(res.content, "html.parser")

# Select the information of all articles
articles = soup.select(".xrnccd")

# Loop over the articles and collect each one's information
news = list()  # Empty list to hold the results
for i, entry in enumerate(articles, 1):
    title = entry.find("h3").text
    summary = entry.find("span").text
    summary = title + "。" + summary
    # url_elm = entry.find("a")  -- changed to entry.find("article")
    url_elm = entry.find("article")
    link = url_elm.get("jslog")
    link = link.lstrip("85008; 2:")           # Delete the prefix on the left edge
    link = link.replace("; track:click", "")  # Delete the suffix on the right edge
    # (replace() instead of rstrip(): rstrip() treats its argument as a
    #  character set and would also trim the trailing "l" of ".html")
    time_elm = entry.find("time")
    try:  # Exception handling: some entries have no <time> tag
        ymd = time_elm.get("datetime")
    except AttributeError:
        ymd = "0000-00-00"
    ymd = ymd[0:10]
    ymd = ymd.replace("-", "/")                 # Normalize the date separator
    sortkey = ymd[0:4] + ymd[5:7] + ymd[8:10]   # For sorting by date
    tmp = {  # Store as a dictionary
        "title": title,
        "summary": summary,
        "link": link,
        "published": ymd,
        "sortkey": sortkey
    }
    news.append(tmp)  # Add each article's information to the list

# Convert to a data frame and save as a csv file
news_df = pd.DataFrame(news)
pprint.pprint(news_df.head())  # Display the first 5 rows to check the data
filename = s + ".csv"
news_df.to_csv(filename, encoding='utf-8-sig', index=False)
```
This Google News search script is used in the following articles:
[Find the seeds of food hits in data science! (1) -- The secret of Lawson's Basque hit](https://blog.hatena.ne.jp/yamtakumol/yamtakumol.hatenablog.com/edit?entry=26006613407003507)
[Let's find the seeds of food hits! (2) -- "Complete meal" and "Weathering with You recipe" from June to August 2019](https://blog.hatena.ne.jp/yamtakumol/yamtakumol.hatenablog.com/edit?entry=26006613422742161)
[Let's find the seeds of food hits! (3) -- September 2019 is the food from Taiwan following bubble tea, especially "cheese tea"](https://blog.hatena.ne.jp/yamtakumol/yamtakumol.hatenablog.com/edit?entry=26006613447159392)
Let's find the seeds of food hits! -- Sweet potato pie in October 2019
**Seeds of food hits expected in 2020 -- Cheese balls**
References:
What is HTML? If you read this, even beginners can definitely write HTML!
What is an HTML div class? Explained with examples in 5 minutes