This article shows how to extract papers on the novel coronavirus that were newly registered the previous day from a medical literature database and send them to LINE automatically. The core of the work is extracting documents that match given keywords from a database called PubMed.
When there are new papers, you receive a notification like this.
When there are none, the notification looks like this.
Python 3.6.5
beautifulsoup4==4.9.0
requests==2.23.0
urllib3==1.25.9
This time, we will use PubMed as the medical literature database. PubMed is a database created by the NCBI (National Center for Biotechnology Information) at the NLM (National Library of Medicine). It lets you search for documents published in major medical journals around the world.
Next, the keywords. When I looked into the novel coronavirus, the words "coronavirus" and "Covid-19" were used most often. Therefore, this time I decided to extract the literature that contains either the word "coronavirus" or the word "Covid-19".
I used the PubMed API to extract documents from PubMed. Several APIs are available, but I used ESearch and EFetch. For more information, please refer to the documentation.
ESearch returns a list of article IDs that match a search expression. It is based on this URL:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=
If you append a search expression after "term=", the IDs of the articles that match the expression are returned.
For example, try "coronavirus".
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=coronavirus
If you open the above URL in your browser, you will see a result like this.
The article IDs were retrieved successfully. Count is the total number of documents that match the search expression, and retmax is the number of IDs actually returned in the response. The default value of retmax is 20, but it can be raised up to 100,000.
For example, to change retmax to 100, add "&retmax=100" to the URL.
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=coronavirus&retmax=100
If you open the above URL in your browser, the number of IDs displayed increases to 100.
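To make the meaning of Count, retmax, and the ID list concrete, here is a minimal sketch that reads them out of an ESearch response with the standard library. The XML string is a made-up sample in the shape ESearch returns (the IDs are placeholders, not real articles), and the function name parse_esearch is my own, not part of any API.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the shape of an ESearch response.
# The IDs are placeholders, not real PubMed IDs.
SAMPLE_RESPONSE = """<eSearchResult>
  <Count>3</Count>
  <RetMax>2</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>11111111</Id>
    <Id>22222222</Id>
  </IdList>
</eSearchResult>"""


def parse_esearch(xml_text):
    """Return the total hit count, the number of IDs returned, and the IDs."""
    root = ET.fromstring(xml_text)
    return {
        # Direct children of the root: total matches and IDs in this response
        "count": int(root.findtext("Count")),
        "retmax": int(root.findtext("RetMax")),
        # Each matching article is one <Id> element inside <IdList>
        "ids": [id_el.text for id_el in root.findall("IdList/Id")],
    }


result = parse_esearch(SAMPLE_RESPONSE)
print(result["count"], result["retmax"], result["ids"])
```

In this sample Count is larger than the number of IDs, which is the situation where raising retmax is needed to see every match.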
Like "retmax", other parameters can be added to narrow down the literature that is extracted. This time, in addition to "retmax", we will use "field", "mindate", and "maxdate".
With "field", you can restrict the search to a specific field such as "title" or "abstract". With "mindate" and "maxdate", you can limit the results to a date range, based on the date the document was registered in PubMed. For example, to search titles only for literature from April 2019 to April 2020, add the following to the URL:
&field=title&mindate=2019/04/01&maxdate=2020/04/30
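As a rough illustration of how these parameters combine, the sketch below assembles an ESearch URL with urllib.parse.urlencode instead of joining strings by hand. The helper name build_esearch_url and its arguments are my own assumptions for this example. Note that urlencode percent-encodes the slashes in the dates and turns spaces into "+", which the server decodes back.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, field=None, mindate=None, maxdate=None, retmax=20):
    """Build an ESearch URL for PubMed (hypothetical helper for illustration)."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    if field is not None:
        params["field"] = field
    # mindate and maxdate are only added as a pair
    if mindate is not None and maxdate is not None:
        params["mindate"] = mindate
        params["maxdate"] = maxdate
    return ESEARCH_URL + "?" + urlencode(params)


url = build_esearch_url(
    "coronavirus OR covid-19",
    field="title",
    mindate="2019/04/01",
    maxdate="2020/04/30",
    retmax=100,
)
print(url)
```

Letting urlencode do the escaping means the search expression can be written with ordinary spaces rather than hand-inserted "+" signs.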
First, create the URL used to look up the IDs of the articles that match the search expression. This time, the search expression joins "coronavirus" and "covid-19" with OR.
URL creation for Esearch
import time


def make_url(yesterday, query):
    """
    Create the URL for ESearch
    Arguments: date (YYYY/MM/DD string), search expression
    Return value: URL as str
    """
    # ESearch base URL
    baseURL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term="
    # Limit the search scope to titles and abstracts
    field = "field=title/abstract"
    # Raise the maximum number of IDs returned to 1000
    retmax = "retmax=1000"
    # Only target yesterday's literature
    mindate = "mindate={}".format(yesterday)
    maxdate = "maxdate={}".format(yesterday)
    # Join the parts into one URL
    url = "&".join([baseURL + query, field, retmax, mindate, maxdate])
    # Wait before the ESearch request is sent
    time.sleep(5)
    return url
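make_url expects the date as a string. As a small sketch, yesterday's date can be produced in zero-padded YYYY/MM/DD form with strftime. The function name yesterday_string and its optional today argument are my own additions, included so the result can be checked against a fixed date.

```python
from datetime import date, timedelta


def yesterday_string(today=None):
    """Return the day before `today` as a zero-padded YYYY/MM/DD string."""
    if today is None:
        today = date.today()
    return (today - timedelta(days=1)).strftime("%Y/%m/%d")


# Month and year boundaries are handled by timedelta
print(yesterday_string(date(2020, 5, 1)))
print(yesterday_string())
```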
Once you have created the URL, use it to get a list of IDs. We use Beautiful Soup to make it easier to get an ID.
Obtain the article ID from Esearch
from bs4 import BeautifulSoup
import urllib.request


def get_id(url):
    """
    Get the article IDs
    Arguments: ESearch URL
    Return value: list of ID tags
    """
    # Request the list of IDs from ESearch
    article_id_list = urllib.request.urlopen(url)
    # Keep only the <Id> elements (html.parser lowercases tag names)
    bs = BeautifulSoup(article_id_list, "html.parser")
    ids = bs.find_all("id")
    return ids
Use EFetch to obtain information such as titles and abstracts from the article ID. This URL is the basis.
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=
You can get the information about a paper by putting its ID after "id=".
The article information is retrieved for each ID obtained with ESearch.
Get the title and URL of the paper
from bs4 import BeautifulSoup
import urllib.request
import time


def get_summary(id):
    """
    Get the title and URL of an article
    Arguments: id (ID tag returned by get_id)
    Return value: title, article URL
    """
    # EFetch base URL
    baseURL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id="
    search_url = baseURL + id.text + "&retmode=xml"
    summary = urllib.request.urlopen(search_url)
    summary_bs = BeautifulSoup(summary, "html.parser")
    # The article URL is built from the article ID
    article_URL = "https://pubmed.ncbi.nlm.nih.gov/{}/".format(id.text)
    # Extract the title (html.parser lowercases tag names)
    title = summary_bs.find("articletitle")
    # Guard against a record without an ArticleTitle element
    title = title.text if title is not None else "(no title)"
    # Wait between requests to avoid overloading the NCBI server
    time.sleep(5)
    return title, article_URL
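The function above sends one EFetch request per ID. EFetch also accepts several IDs separated by commas, so a possible variation is to fetch many articles in one request and pick out each title afterwards. The sketch below is only an illustration of that idea, not the method used in this article: the helper names are my own, and the XML is a made-up sample in the shape of an EFetch response, with placeholder IDs and titles.

```python
import xml.etree.ElementTree as ET

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Made-up sample in the shape of an EFetch response (placeholder IDs and titles)
SAMPLE_RESPONSE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">11111111</PMID>
      <Article><ArticleTitle>First sample title.</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">22222222</PMID>
      <Article><ArticleTitle>Sample title about <i>coronavirus</i> spread.</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def build_efetch_url(pmids):
    """Build one EFetch URL for several IDs (hypothetical helper)."""
    return "{}?db=pubmed&id={}&retmode=xml".format(EFETCH_URL, ",".join(pmids))


def parse_titles(xml_text):
    """Return a list of (id, title, article URL) tuples."""
    results = []
    for article in ET.fromstring(xml_text).iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        title_el = article.find("MedlineCitation/Article/ArticleTitle")
        # itertext() also collects text inside inline markup such as <i>
        title = "".join(title_el.itertext()).strip() if title_el is not None else ""
        results.append((pmid, title, "https://pubmed.ncbi.nlm.nih.gov/{}/".format(pmid)))
    return results


print(build_efetch_url(["11111111", "22222222"]))
for row in parse_titles(SAMPLE_RESPONSE):
    print(row)
```

Fetching in batches reduces the number of requests, at the cost of having to match each title back to its ID from the response.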
Finally, the information about the retrieved papers is sent out. LINE Notify lets you send a message from Python to LINE. I referred to this article.
Send to LINE
import requests


def output_line(line_access_token, message):
    """
    Send a notification to LINE
    Arguments: access token, notification content
    Return value: None
    """
    line_url = "https://notify-api.line.me/api/notify"
    line_headers = {'Authorization': 'Bearer ' + line_access_token}
    payload = {'message': message}
    # POST the message to the LINE Notify endpoint
    r = requests.post(line_url, headers=line_headers, params=payload)
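When many papers are found, sending one notification per paper can get noisy. One possible variation is to group several titles and URLs into a single message before calling output_line. The sketch below assumes a limit of 1000 characters per message; that figure is my assumption about LINE Notify and should be checked against its documentation. The function name build_messages is also my own.

```python
def build_messages(papers, limit=1000):
    """Group (title, url) pairs into messages of at most `limit` characters.

    `limit` defaults to 1000, which is an assumed per-message maximum.
    """
    messages = []
    current = ""
    for title, url in papers:
        # Shorten an overlong title so that the URL is never cut off
        title = title[: max(0, limit - len(url) - 1)]
        entry = "{}\n{}".format(title, url)
        # Entries within one message are separated by a blank line
        candidate = entry if not current else current + "\n\n" + entry
        if len(candidate) <= limit:
            current = candidate
        else:
            # The entry does not fit: close the current message, start a new one
            if current:
                messages.append(current)
            current = entry
    if current:
        messages.append(current)
    return messages


# Tiny limit used here only to show the splitting behaviour
papers = [("T1", "u1"), ("T2", "u2"), ("T3", "u3")]
for message in build_messages(papers, limit=12):
    print(repr(message))
```

Each returned string can then be passed to output_line in place of the single-paper message.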
Complete code
from bs4 import BeautifulSoup
from datetime import date, timedelta
import urllib.request
import requests
import time


def main():
    """
    Main processing
    """
    # LINE access token
    line_access_token = 'LINE access token'
    # Get yesterday's date as a YYYY/MM/DD string
    yesterday = date.today() - timedelta(days=1)
    yesterday = yesterday.strftime("%Y/%m/%d")
    # Search expression
    query = "coronavirus+OR+covid-19"
    # Build the ESearch URL
    URL = make_url(yesterday, query)
    # Get the article IDs
    ids = get_id(URL)
    # When there are no new papers
    if not ids:
        message = "There are no new Covid-19 papers"
        output_line(line_access_token, message)
    # When there are new papers
    else:
        for id in ids:
            # Get the title and URL of the paper
            title, article_URL = get_summary(id)
            # Send a notification to LINE
            message = "{}\n{}".format(title, article_URL)
            output_line(line_access_token, message)


def make_url(yesterday, query):
    """
    Create the URL for ESearch
    Arguments: date (YYYY/MM/DD string), search expression
    Return value: URL as str
    """
    # ESearch base URL
    baseURL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term="
    # Limit the search scope to titles and abstracts
    field = "field=title/abstract"
    # Raise the maximum number of IDs returned to 1000
    retmax = "retmax=1000"
    # Only target yesterday's literature
    mindate = "mindate={}".format(yesterday)
    maxdate = "maxdate={}".format(yesterday)
    # Join the parts into one URL
    url = "&".join([baseURL + query, field, retmax, mindate, maxdate])
    # Wait before the ESearch request is sent
    time.sleep(5)
    return url


def get_id(url):
    """
    Get the article IDs
    Arguments: ESearch URL
    Return value: list of ID tags
    """
    # Request the list of IDs from ESearch
    article_id_list = urllib.request.urlopen(url)
    # Keep only the <Id> elements (html.parser lowercases tag names)
    bs = BeautifulSoup(article_id_list, "html.parser")
    ids = bs.find_all("id")
    return ids


def get_summary(id):
    """
    Get the title and URL of an article
    Arguments: id (ID tag returned by get_id)
    Return value: title, article URL
    """
    # EFetch base URL
    baseURL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id="
    search_url = baseURL + id.text + "&retmode=xml"
    summary = urllib.request.urlopen(search_url)
    summary_bs = BeautifulSoup(summary, "html.parser")
    # The article URL is built from the article ID
    article_URL = "https://pubmed.ncbi.nlm.nih.gov/{}/".format(id.text)
    # Extract the title (html.parser lowercases tag names)
    title = summary_bs.find("articletitle")
    # Guard against a record without an ArticleTitle element
    title = title.text if title is not None else "(no title)"
    # Wait between requests to avoid overloading the NCBI server
    time.sleep(5)
    return title, article_URL


def output_line(line_access_token, message):
    """
    Send a notification to LINE
    Arguments: access token, notification content
    Return value: None
    """
    line_url = "https://notify-api.line.me/api/notify"
    line_headers = {'Authorization': 'Bearer ' + line_access_token}
    payload = {'message': message}
    # POST the message to the LINE Notify endpoint
    r = requests.post(line_url, headers=line_headers, params=payload)


if __name__ == "__main__":
    main()
Finally, by running this script with cron, you can have the titles and URLs of new papers sent to LINE automatically every day.