[Python] Get the update date of news articles from HTML
(Update) I have put what I made into a class.
For static sites, examining the response headers may reveal the last-modified date.
get_lastmodified.py
import requests
res = requests.head('https://www.kantei.go.jp')
print(res.headers['Last-Modified'])
#Mon, 17 Feb 2020 08:27:02 GMT
(Previous article) [Python] Get the last updated date of the website
This works for some news sites and many Japanese government sites, but most sites don't provide the header, and the lookup fails:
KeyError: 'last-modified'
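To keep the lookup from raising, it is safer to read the header with .get(), which returns None when the key is absent. A minimal sketch:
get_lastmodified.py
import requests

res = requests.head('https://www.kantei.go.jp')
# .get() returns None instead of raising KeyError when the header is missing
last_modified = res.headers.get('Last-Modified')
if last_modified is not None:
    print(last_modified)
else:
    print('No Last-Modified header')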
In that case, there seem to be two main methods: reading the date from the URL, or reading it from the HTML contents.
The URL may contain a string such as 2019/05/01 or 2019-05-01. Extracting this is a simple and reliable method.
The other method is to read the date from the HTML contents, and this is what you will ultimately rely on.
So, by combining these techniques, the update date can be extracted automatically from the news sites I usually read. Below, the BeautifulSoup object obtained from each article page is called soup, the extracted update date is converted to datetime, and regular expressions are used to pull out and normalize the strings.
get_lastmodified.py
import bs4
import datetime
import re
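The snippets below assume soup has already been built from each article page, roughly as follows (a minimal sketch; the URL is a placeholder):
get_lastmodified.py
import requests
import bs4

URL = 'https://example.com/article'  # placeholder: any of the article URLs below
res = requests.get(URL)
res.encoding = res.apparent_encoding  # guard against mis-detected encodings on Japanese pages
soup = bs4.BeautifulSoup(res.text, 'html.parser')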
The sites covered: CNN, Bloomberg, BBC, Reuters, Wall Street Journal, Forbes Japan, Newsweek, Asahi Shimbun, Nikkei Shimbun, Sankei Shimbun, Yomiuri Shimbun, Mainichi Shimbun.
CNN
https://edition.cnn.com/2020/02/17/tech/jetman-dubai-trnd/index.html
get_lastmodified.py
print(soup.select('.update-time')[0].getText())
#Updated 2128 GMT (0528 HKT) February 17, 2020
timestamp_temp_hm = re.search(r'Updated (\d{4}) GMT', str(soup.select('.update-time')[0].getText()))
timestamp_temp_bdy = re.search(r'(January|February|March|April|May|June|July|August|September|October|November|December) (\d{1,2}), (\d{4})', str(soup.select('.update-time')[0].getText()))
print(timestamp_temp_hm.groups())
print(timestamp_temp_bdy.groups())
#('2128',)
#('February', '17', '2020')
timestamp_tmp = timestamp_temp_bdy.groups()[2]+timestamp_temp_bdy.groups()[1]+timestamp_temp_bdy.groups()[0]+timestamp_temp_hm.groups()[0]
news_timestamp = datetime.datetime.strptime(timestamp_tmp, "%Y%d%B%H%M")
print(news_timestamp)
#2020-02-17 21:28:00
#If it's just the date, you can get it from the URL
URL = "https://edition.cnn.com/2020/02/17/tech/jetman-dubai-trnd/index.html"
news_timestamp = re.search(r'\d{4}/\d{1,2}/\d{1,2}', URL)
print(news_timestamp.group())
#2020/02/17
news_timestamp = datetime.datetime.strptime(news_timestamp.group(), "%Y/%m/%d")
print(news_timestamp)
#2020-02-17 00:00:00
Comment: I have not verified whether the string 'Updated' is always present. CNN articles have the date in the URL except on summary pages, so relying on the URL looks like the safer bet.
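One way to combine the two sources for CNN is to try the .update-time element first and fall back to the URL when the 'Updated' text is missing; a minimal sketch (the function name is mine):
get_lastmodified.py
def cnn_timestamp(url, soup):
    # Try the HTML first; fall back to the date embedded in the URL
    try:
        text = soup.select('.update-time')[0].getText()
        hm = re.search(r'Updated (\d{4}) GMT', text)
        bdy = re.search(r'(January|February|March|April|May|June|July|August|September|October|November|December) (\d{1,2}), (\d{4})', text)
        tmp = bdy.group(3) + bdy.group(2) + bdy.group(1) + hm.group(1)
        return datetime.datetime.strptime(tmp, "%Y%d%B%H%M")
    except (IndexError, AttributeError):
        m = re.search(r'\d{4}/\d{1,2}/\d{1,2}', url)
        return datetime.datetime.strptime(m.group(), "%Y/%m/%d")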
Bloomberg
https://www.bloomberg.co.jp/news/articles/2020-02-17/Q5V6BO6JIJV101
get_lastmodified.py
print(soup.select('time')[0].string)
#
#2020年2月18日 7:05 JST
#
timestamp_tmp = re.sub(' ','',str(soup.select('time')[0].string))
timestamp_tmp = re.sub('\n','',timestamp_tmp)
news_timestamp = datetime.datetime.strptime(timestamp_tmp, "%Y年%m月%d日%H:%MJST")
print(news_timestamp)
#2020-02-18 07:05:00
#The URL also gives the date (but not the time)
URL = "https://www.bloomberg.co.jp/news/articles/2020-02-17/Q5V6BO6JIJV101"
timestamp_tmp = re.search(r'\d{4}-\d{1,2}-\d{1,2}', URL)
print(timestamp_tmp.group())
#2020-02-17
news_timestamp = datetime.datetime.strptime(timestamp_tmp.group(), "%Y-%m-%d")
print(news_timestamp)
#2020-02-17 00:00:00
Comment: There are line breaks and spaces inside the tag, so it takes some extra cleanup.
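Several of the sites below need the same cleanup, so the two re.sub calls can be folded into a small helper that strips all whitespace (my own convenience, not part of the original code):
get_lastmodified.py
def strip_ws(text):
    # str.split() splits on any whitespace (spaces, newlines, tabs),
    # so joining with '' removes all of it
    return ''.join(str(text).split())

timestamp_tmp = strip_ws(soup.select('time')[0].string)
news_timestamp = datetime.datetime.strptime(timestamp_tmp, "%Y年%m月%d日%H:%MJST")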
BBC https://www.bbc.com/news/world-asia-china-51540981
get_lastmodified.py
print(soup.select("div.date.date--v2")[0].string)
#18 February 2020
news_timestamp = datetime.datetime.strptime(soup.select("div.date.date--v2")[0].string, "%d %B %Y")
print(news_timestamp)
#2020-02-18 00:00:00
Comment: I couldn't find where the detailed time is shown.
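Many news sites also embed the publication time in a <script type="application/ld+json"> block. I have not verified that this particular BBC page carries one, but it is worth checking before settling for the date alone; a sketch:
get_lastmodified.py
import json

for script in soup.find_all('script', type='application/ld+json'):
    try:
        data = json.loads(script.string)
    except (TypeError, ValueError):
        continue  # empty or malformed block
    if isinstance(data, dict) and 'datePublished' in data:
        print(data['datePublished'])  # usually an ISO 8601 string
        break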
Reuters
https://jp.reuters.com/article/apple-idJPKBN20C0GP
get_lastmodified.py
print(soup.select(".ArticleHeader_date")[0].string)
#February 18, 2020 / 6:11 AM / an hour ago updated
m1 = re.match(r'(January|February|March|April|May|June|July|August|September|October|November|December) \d{1,2}, \d{4}',str(soup.select(".ArticleHeader_date")[0].string))
print(m1.group())
#February 18, 2020
m2 = re.search(r'\d{1,2}:\d{1,2} [AP]M',str(soup.select(".ArticleHeader_date")[0].string))
print(m2.group())
#6:11 AM
#Keeping the AM/PM suffix and using %I with %p avoids being 12 hours off on afternoon articles
news_timestamp = datetime.datetime.strptime(m1.group()+' '+m2.group(), "%B %d, %Y %I:%M %p")
print(news_timestamp)
#2020-02-18 06:11:00
Wall Street Journal https://www.wsj.com/articles/solar-power-is-beginning-to-eclipse-fossil-fuels-11581964338
get_lastmodified.py
print(soup.select(".timestamp.article__timestamp")[0].string)
#
# Feb. 17, 2020 1:32 pm ET
#
news_timestamp = re.sub(' ','',str(soup.select(".timestamp.article__timestamp")[0].string))
news_timestamp = re.sub('\n','',news_timestamp)
print(news_timestamp)
#Feb.17,20201:32pmET
news_timestamp = re.match(r'(Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\.(\d{1,2}),(\d{4})(\d{1,2}):(\d{1,2})(am|pm)',news_timestamp)
print(news_timestamp.groups())
#('Feb', '17', '2020', '1', '32', 'pm')
tmp = news_timestamp.groups()
timestamp_tmp = tmp[0]+' '+ tmp[1].zfill(2)+' '+tmp[2]+' '+tmp[3].zfill(2)+' '+tmp[4].zfill(2)+' '+tmp[5]
print(timestamp_tmp)
#Feb 17 2020 01 32 pm
#The am/pm marker is kept so that afternoon times are not off by 12 hours
news_timestamp = datetime.datetime.strptime(timestamp_tmp, "%b %d %Y %I %M %p")
print(news_timestamp)
#2020-02-17 13:32:00
Forbes Japan https://forbesjapan.com/articles/detail/32418
get_lastmodified.py
print(soup.select("time")[0].string)
#2020/02/18 12:00
news_timestamp = datetime.datetime.strptime(soup.select("time")[0].string, "%Y/%m/%d %H:%M")
print(news_timestamp)
#2020-02-18 12:00:00
Newsweek https://www.newsweek.com/fears-rise-over-coronavirus-american-cruise-passenger-diagnosed-after-previously-showing-no-1487668
get_lastmodified.py
print(soup.select('time')[0].string)
# On 2/17/20 at 12:11 PM EST
m = re.search(r'(\d{1,2})/(\d{1,2})/(\d{1,2}) at (\d{1,2}:\d{1,2}) ', str(soup.select('time')[0].string))
print(m.groups())
#('2', '17', '20', '12:11')
tmp = m.groups()
timestamp_tmp = tmp[0].zfill(2)+' '+ tmp[1].zfill(2)+' '+'20'+tmp[2].zfill(2)+' '+tmp[3]
print(timestamp_tmp)
#02 17 2020 12:11
news_timestamp = datetime.datetime.strptime(timestamp_tmp, "%m %d %Y %H:%M")
print(news_timestamp)
#2020-02-17 12:11:00
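Since strptime already has directives for two-digit years (%y) and 12-hour clocks (%I with %p), the manual zero-filling and '20' prefix can be skipped; this also handles PM times, which the %H version above only gets right by luck here (12 PM happens to equal 12:00):
get_lastmodified.py
m = re.search(r'(\d{1,2}/\d{1,2}/\d{1,2}) at (\d{1,2}:\d{1,2} [AP]M)', str(soup.select('time')[0].string))
news_timestamp = datetime.datetime.strptime(m.group(1)+' '+m.group(2), "%m/%d/%y %I:%M %p")
print(news_timestamp)
#2020-02-17 12:11:00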
Asahi Shimbun
https://www.asahi.com/articles/ASN2K7FQKN2KUHNB00R.html
get_lastmodified.py
print(soup.select('time')[0].string)
#2020年2月18日12時25分
news_timestamp = datetime.datetime.strptime(soup.select('time')[0].string, "%Y年%m月%d日%H時%M分")
print(news_timestamp)
#2020-02-18 12:25:00
Comment: Static and easy to understand. At a glance the notation doesn't vary even across categories, which is helpful.
Nikkei Shimbun
https://r.nikkei.com/article/DGXMZO5556760013022020TL1000
get_lastmodified.py
print(soup.select('time')[1].string)
#2020年2月18日11:00
news_timestamp = datetime.datetime.strptime(soup.select('time')[1].string, "%Y年%m月%d日%H:%M")
print(news_timestamp)
#2020-02-18 11:00:00
https://www.nikkei.com/article/DGXLASFL18H2S_Y0A210C2000000
get_lastmodified.py
print(soup.select('.cmnc-publish')[0].string)
#2020/2/18 7:37
news_timestamp = datetime.datetime.strptime(soup.select('.cmnc-publish')[0].string, "%Y/%m/%d %H:%M")
print(news_timestamp)
#2020-02-18 07:37:00
https://www.nikkei.com/article/DGXKZO55678940V10C20A2MM8000
get_lastmodified.py
print(soup.select('.cmnc-publish')[0].string)
#2020/2/15付
news_timestamp = datetime.datetime.strptime(soup.select('.cmnc-publish')[0].string, "%Y/%m/%d付")
print(news_timestamp)
#2020-02-15 00:00:00
Comment: There are various notations. I spotted three at a glance, but there may be more.
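With several notations on a single site, one common pattern is to try each known format in turn until one parses; a minimal sketch covering the three Nikkei variants seen above:
get_lastmodified.py
def parse_nikkei(text):
    # Try each known Nikkei notation; strptime raises ValueError on mismatch
    for fmt in ("%Y年%m月%d日%H:%M", "%Y/%m/%d %H:%M", "%Y/%m/%d付"):
        try:
            return datetime.datetime.strptime(str(text), fmt)
        except ValueError:
            continue
    return None  # unknown notation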
Sankei Shimbun
https://www.sankei.com/world/news/200218/wor2002180013-n1.html
get_lastmodified.py
print(soup.select('#__r_publish_date__')[0].string)
#2020.2.18 13:10
news_timestamp = datetime.datetime.strptime(soup.select('#__r_publish_date__')[0].string, "%Y.%m.%d %H:%M")
print(news_timestamp)
#2020-02-18 13:10:00
Comment: If you look closely, the URL contains not just the date but even the time.
Yomiuri Shimbun
https://www.yomiuri.co.jp/national/20200218-OYT1T50158/
get_lastmodified.py
print(soup.select('time')[0].string)
#2020/02/18 14:16
news_timestamp = datetime.datetime.strptime(soup.select('time')[0].string, "%Y/%m/%d %H:%M")
print(news_timestamp)
#2020-02-18 14:16:00
Comment: From the URL you can only get the date.
Mainichi Shimbun
https://mainichi.jp/articles/20180803/ddm/007/030/030000c
get_lastmodified.py
print(soup.select('time')[0].string)
#2018年8月3日 東京朝刊
news_timestamp = datetime.datetime.strptime(soup.select('time')[0].string, "%Y年%m月%d日 東京朝刊")
print(news_timestamp)
#2018-08-03 00:00:00
https://mainichi.jp/articles/20200218/dde/012/030/033000c
get_lastmodified.py
print(soup.select('time')[0].string)
#2020年2月18日 東京夕刊
news_timestamp = datetime.datetime.strptime(soup.select('time')[0].string, "%Y年%m月%d日 東京夕刊")
print(news_timestamp)
#2020-02-18 00:00:00
https://mainichi.jp/articles/20200218/k00/00m/010/047000c
get_lastmodified.py
print(soup.select('time')[0].string)
#2020年2月18日09時57分
#The last-updated time can be obtained with soup.select('time')[1].string
news_timestamp = datetime.datetime.strptime(soup.select('time')[0].string, "%Y年%m月%d日%H時%M分")
print(news_timestamp)
#2020-02-18 09:57:00
https://mainichi.jp/premier/politics/articles/20200217/pol/00m/010/005000c
get_lastmodified.py
print(soup.select('time')[0].string)
#2020年2月18日
news_timestamp = datetime.datetime.strptime(soup.select('time')[0].string, "%Y年%m月%d日")
print(news_timestamp)
#2020-02-18 00:00:00
Comment: For the Mainichi Shimbun, only digital-edition articles give the time down to the minute. Articles from the morning and evening editions and from Mainichi Premier only give the date, which is the same as what the URL contains.
News site | From the response header | From the URL | From the HTML contents
---|---|---|---
CNN | | date | date and time
Bloomberg | | date | date and time
BBC | | | date
Reuters | | | date and time
Wall Street Journal | | | date and time
Forbes Japan | | | date and time
Newsweek | | | date and time
Asahi Shimbun | date and time | | date and time
Nikkei Shimbun | | | date and time
Sankei Shimbun | date and time | date and time | date and time
Yomiuri Shimbun | | date | date and time
Mainichi Shimbun | | date | date and time*

*Digital-edition articles only; morning/evening edition and Premier articles give just the date.
Date notation varies from site to site, to say nothing of the language. Even within a single news site the notation fluctuates, and I have not been able to confirm every variation. I have not found a site where the date can be read from the URL but not from the HTML, so although I called this a combination of techniques, the result would be the same if you obtained everything by scraping alone. With this method you have to inspect the tags and class names for each individual site, so covering every site, even just news sites, looks quite difficult. Please let me know if there is a better way.
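As mentioned in the update note at the top, these per-site rules fit naturally into a class that dispatches on the domain. A minimal sketch of the idea, not the follow-up article's actual implementation; the rule table here lists only two of the simpler sites:
get_lastmodified.py
import datetime
import urllib.parse

class NewsTimestamp:
    # domain -> (CSS selector, strptime format)
    RULES = {
        'forbesjapan.com': ('time', '%Y/%m/%d %H:%M'),
        'www.sankei.com': ('#__r_publish_date__', '%Y.%m.%d %H:%M'),
    }

    def extract(self, url, soup):
        domain = urllib.parse.urlparse(url).netloc
        selector, fmt = self.RULES[domain]
        return datetime.datetime.strptime(str(soup.select(selector)[0].string), fmt)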