Overview

This article shows how to extract papers about the novel coronavirus that were newly registered the previous day from a medical literature database and send them to LINE automatically. The core of it is extracting papers that match certain keywords from a database called PubMed.

When there are new papers, you are notified like this. (Image: finished result, with papers)

When there are none, it looks like this. (Image: finished result, no papers)

Environment

Python 3.6.5
beautifulsoup4==4.9.0
requests==2.23.0
urllib3==1.25.9

Database and keyword selection

This time, we will use PubMed as the medical literature database. PubMed is a database created by the NCBI (National Center for Biotechnology Information), part of the NLM (National Library of Medicine). It lets you search papers published in major medical journals around the world.

Next, the keywords. When I looked into the novel coronavirus, the words "coronavirus" and "Covid-19" were used often. I therefore decided to extract papers that contain either "coronavirus" or "Covid-19".

PubMed API

I used the PubMed API to extract papers from PubMed. PubMed provides several APIs; I used ESearch and EFetch. For more information, please refer to the documentation.

ESearch overview

ESearch returns a list of IDs of the articles that match your search query. It is based on this URL:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=

If you put a search query after "term=", the IDs that match the query are returned.

For example, try "coronavirus".

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=coronavirus

If you enter the above URL in your browser, you will see a result like this. (Image: ESearch result with retmax=20)

The article IDs were retrieved successfully. Count is the number of papers that match the search query, and retmax is the maximum number of matching IDs included in the response. The default value of retmax is 20, but it can be raised to as many as 100,000.
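To make the shape of the response concrete, here is a minimal sketch that reads Count, RetMax, and the ID list from an ESearch-style XML string using only the standard library. The sample XML is shortened and its IDs are made up, and the function name `parse_esearch` is hypothetical; it is an illustration, not the code used in this article.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical ESearch response; the IDs are invented for illustration.
SAMPLE_ESEARCH_XML = """<eSearchResult>
  <Count>3</Count>
  <RetMax>2</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>11111111</Id>
    <Id>22222222</Id>
  </IdList>
</eSearchResult>
"""


def parse_esearch(xml_text):
    """Return (count, retmax, ids) read from an ESearch XML response."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count"))    # total number of matches
    retmax = int(root.findtext("RetMax"))  # number of IDs in this response
    ids = [node.text for node in root.findall("./IdList/Id")]
    return count, retmax, ids


count, retmax, ids = parse_esearch(SAMPLE_ESEARCH_XML)
print(count, retmax, ids)
```

In this sample, Count (3) is larger than the number of IDs returned (2), which is the situation where raising retmax becomes necessary.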

For example, to change retmax to 100, add "retmax=100" to the URL.

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=coronavirus&retmax=100

If you enter the above URL in your browser, the result looks like this. (Image: ESearch result with retmax=100) The number of IDs displayed has increased to 100.
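Instead of typing the URL by hand, the same URL can be assembled in Python. The following is a small sketch using `urllib.parse.urlencode`, which also takes care of encoding spaces in a query. The helper name `build_esearch_url` is hypothetical, and using HTTPS for the endpoint is an assumption.

```python
from urllib.parse import urlencode

# ESearch endpoint from the article; the HTTPS scheme is an assumption.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, **options):
    """Build an ESearch URL, percent-encoding every parameter value."""
    params = {"db": "pubmed", "term": term}
    params.update(options)  # e.g. retmax=100
    return ESEARCH_URL + "?" + urlencode(params)


url = build_esearch_url("coronavirus", retmax=100)
print(url)
```

Passing `"coronavirus OR covid-19"` as the term yields `term=coronavirus+OR+covid-19`, so spaces do not need to be replaced manually.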

Besides "retmax", you can add other conditions to narrow down the papers. This time, in addition to "retmax", we will use "field", "mindate", and "maxdate".

"field" selects where to search, for example "title" or "abstract". "mindate" and "maxdate" set the start and end of the period in which the paper was registered in PubMed. For example, to search only titles for papers from April 2019 to April 2020, add

&field=title&mindate=2019/04/01&maxdate=2020/04/30

to the URL.
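When the date range is computed rather than typed, it is easy to produce a day that does not exist, since months differ in length. The sketch below builds the same kind of query-string fragment from `datetime.date` values and uses `calendar.monthrange` to find the last day of a month. The helper name `date_range_params` is hypothetical.

```python
import calendar
from datetime import date


def date_range_params(start, end, field="title"):
    """Format field/mindate/maxdate as a query-string fragment."""
    return "&field={}&mindate={}&maxdate={}".format(
        field, start.strftime("%Y/%m/%d"), end.strftime("%Y/%m/%d")
    )


# April 2019 through the last day of April 2020
last_day = calendar.monthrange(2020, 4)[1]  # number of days in April 2020
fragment = date_range_params(date(2019, 4, 1), date(2020, 4, last_day))
print(fragment)  # &field=title&mindate=2019/04/01&maxdate=2020/04/30
```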

ESearch code

First, build the URL that retrieves the IDs of the articles matching the search query. This time, the query joins "coronavirus" and "covid-19" with OR.

`Building the URL for ESearch`


import time

def make_url(yesterday, query):
    """
    Build the URL for an ESearch request.
    Arguments: date (YYYY/MM/DD string), search query
    Return value: the URL as a str
    """
    # Base URL for ESearch
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term="

    # Limit the search scope to titles and abstracts
    field = "field=title/abstract"

    # Raise the maximum number of IDs returned to 1000
    retmax = "retmax=1000"

    # Only target papers registered yesterday
    mindate = "mindate={}".format(yesterday)
    maxdate = "maxdate={}".format(yesterday)

    # Join the pieces into a single URL
    url = "&".join([base_url + query, field, retmax, mindate, maxdate])

    # Pause before the request that follows, to go easy on the NCBI servers
    time.sleep(5)
    return url

Once the URL is built, use it to get the list of IDs. Beautiful Soup makes it easier to pull out the IDs.

`Getting the article IDs from ESearch`


from bs4 import BeautifulSoup
import urllib.request

def get_id(url):
    """
    Get the article IDs.
    Arguments: ESearch URL
    Return value: list of <Id> elements (read each ID with .text)
    """
    # Call ESearch and parse the XML response
    with urllib.request.urlopen(url) as response:
        bs = BeautifulSoup(response, "html.parser")

    # Keep only the <Id> elements
    # (html.parser lowercases tag names, hence "id")
    ids = bs.find_all("id")

    return ids

EFetch overview

EFetch retrieves information such as the title and abstract from an article ID. It is based on this URL:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=

You can get the information for a paper by putting its ID after "id=".
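EFetch also accepts a comma-separated list of IDs after "id=", so several papers can be requested in one call. The sketch below builds such URLs in batches. The helper names and the batch size of 200 are assumptions chosen for illustration, and `retmode=xml` asks for an XML response. This article's own code requests one ID at a time instead.

```python
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id="


def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def build_efetch_urls(ids, batch_size=200):
    """Return one EFetch URL per batch of IDs (batch_size is an assumed value)."""
    return [
        EFETCH_URL + ",".join(batch) + "&retmode=xml"
        for batch in chunked(ids, batch_size)
    ]


# Made-up IDs, split into batches of two for demonstration
urls = build_efetch_urls(["11111111", "22222222", "33333333"], batch_size=2)
for u in urls:
    print(u)
```

Batching reduces the number of requests, which matters when each request is followed by a pause.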

EFetch code

The paper information is retrieved for each ID obtained with ESearch.

`Getting the title and URL of the paper`


from bs4 import BeautifulSoup
import urllib.request
import time

def get_summary(id):
    """
    Get the summary of a paper.
    Arguments: <Id> element returned by ESearch
    Return value: title, article URL
    """
    # Base URL for EFetch
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id="

    search_url = base_url + id.text + "&retmode=xml"
    with urllib.request.urlopen(search_url) as summary:
        summary_bs = BeautifulSoup(summary, "html.parser")

    # The article URL is built from the article ID
    article_URL = "https://pubmed.ncbi.nlm.nih.gov/{}/".format(id.text)

    # Extract the title, guarding against a record without <ArticleTitle>
    title = summary_bs.find("articletitle")
    title = title.text if title is not None else "(no title)"

    # Wait between requests to avoid overloading the NCBI servers
    time.sleep(5)
    return title, article_URL
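The same extraction can be tried offline against a sample document. The following sketch uses the standard library's `xml.etree.ElementTree` instead of Beautiful Soup and reads the title and PMID of each record. The sample XML is heavily shortened and its contents are invented, and `extract_titles` is a hypothetical name. `itertext()` is used so that inline markup inside a title, such as italics, is flattened to plain text.

```python
import xml.etree.ElementTree as ET

# Heavily shortened, hypothetical EFetch response with invented records.
SAMPLE_EFETCH_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">11111111</PMID>
      <Article>
        <ArticleTitle>Spread of <i>SARS-CoV-2</i> in a hypothetical cohort.</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">22222222</PMID>
      <Article/>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def extract_titles(xml_text):
    """Return a list of (title, article URL) pairs, one per record."""
    root = ET.fromstring(xml_text)
    results = []
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("./MedlineCitation/PMID")
        title_node = article.find(".//ArticleTitle")
        if title_node is None:
            title = "(no title)"  # fallback for records without a title
        else:
            title = "".join(title_node.itertext()).strip()
        results.append((title, "https://pubmed.ncbi.nlm.nih.gov/{}/".format(pmid)))
    return results


for title, url in extract_titles(SAMPLE_EFETCH_XML):
    print(title, url)
```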

LINE Notify code

Finally, the retrieved paper information is sent out. With LINE Notify, you can send a message to LINE from Python. I referred to this article.

`Sending to LINE`


import requests

def output_line(line_access_token, message):
    """
    Send a notification to LINE.
    Arguments: access token, notification text
    Return value: None
    """
    line_url = "https://notify-api.line.me/api/notify"
    line_headers = {'Authorization': 'Bearer ' + line_access_token}
    payload = {'message': message}
    requests.post(line_url, headers=line_headers, params=payload)
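On a day with many new papers, sending one notification per paper can produce a long stream of messages. As one possible variation, the sketch below groups several "title + URL" entries into a single message while staying under a length limit. The limit of 1000 characters is an assumption about LINE Notify and should be checked against its documentation, and `batch_messages` is a hypothetical helper that is not part of this article's script.

```python
def batch_messages(entries, limit=1000):
    """Group (title, url) pairs into messages of at most `limit` characters.

    `limit` defaults to 1000, an assumed maximum message length.
    """
    messages = []
    current = ""
    for title, url in entries:
        entry = "{}\n{}".format(title, url)
        # Separate entries within one message by a blank line
        candidate = entry if not current else current + "\n\n" + entry
        if len(candidate) <= limit:
            current = candidate
        else:
            if current:
                messages.append(current)
            # A single oversized entry is truncated to fit the limit
            current = entry[:limit]
    if current:
        messages.append(current)
    return messages


# Made-up entries and a small limit, to show the grouping
entries = [("A" * 10, "u1"), ("B" * 10, "u2"), ("C" * 10, "u3")]
for message in batch_messages(entries, limit=30):
    print(repr(message))
```

Each returned string could then be passed to `output_line` in place of the per-paper message.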

Whole code

`python`


from bs4 import BeautifulSoup
from datetime import date, timedelta
import urllib.request
import requests
import time

def main():
    """
    Main processing
    """
    # LINE access token
    line_access_token = 'LINE access token'

    # Get yesterday's date in YYYY/MM/DD format
    yesterday = date.today() - timedelta(days=1)
    yesterday = yesterday.strftime("%Y/%m/%d")

    # Search query
    query = "coronavirus+OR+covid-19"

    # Build the ESearch URL
    URL = make_url(yesterday, query)

    # Get the article IDs
    ids = get_id(URL)

    # When there are no new papers
    if not ids:
        message = "There are no new COVID-19 papers"
        output_line(line_access_token, message)

    # When there are new papers
    else:
        for id in ids:
            # Get the title and URL of the paper
            title, article_URL = get_summary(id)

            # Send a notification to LINE
            message = "{}\n{}".format(title, article_URL)
            output_line(line_access_token, message)

def make_url(yesterday, query):
    """
    Build the URL for an ESearch request.
    Arguments: date (YYYY/MM/DD string), search query
    Return value: the URL as a str
    """
    # Base URL for ESearch
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term="

    # Limit the search scope to titles and abstracts
    field = "field=title/abstract"

    # Raise the maximum number of IDs returned to 1000
    retmax = "retmax=1000"

    # Only target papers registered yesterday
    mindate = "mindate={}".format(yesterday)
    maxdate = "maxdate={}".format(yesterday)

    # Join the pieces into a single URL
    url = "&".join([base_url + query, field, retmax, mindate, maxdate])

    # Pause before the request that follows, to go easy on the NCBI servers
    time.sleep(5)
    return url
    
def get_id(url):
    """
    Get the article IDs.
    Arguments: ESearch URL
    Return value: list of <Id> elements (read each ID with .text)
    """
    # Call ESearch and parse the XML response
    with urllib.request.urlopen(url) as response:
        bs = BeautifulSoup(response, "html.parser")

    # Keep only the <Id> elements
    # (html.parser lowercases tag names, hence "id")
    ids = bs.find_all("id")

    return ids

def get_summary(id):
    """
    Get the summary of a paper.
    Arguments: <Id> element returned by ESearch
    Return value: title, article URL
    """
    # Base URL for EFetch
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id="

    search_url = base_url + id.text + "&retmode=xml"
    with urllib.request.urlopen(search_url) as summary:
        summary_bs = BeautifulSoup(summary, "html.parser")

    # The article URL is built from the article ID
    article_URL = "https://pubmed.ncbi.nlm.nih.gov/{}/".format(id.text)

    # Extract the title, guarding against a record without <ArticleTitle>
    title = summary_bs.find("articletitle")
    title = title.text if title is not None else "(no title)"

    # Wait between requests to avoid overloading the NCBI servers
    time.sleep(5)
    return title, article_URL
        
def output_line(line_access_token, message):
    """
    Send a notification to LINE.
    Arguments: access token, notification text
    Return value: None
    """
    line_url = "https://notify-api.line.me/api/notify"
    line_headers = {'Authorization': 'Bearer ' + line_access_token}
    payload = {'message': message}
    requests.post(line_url, headers=line_headers, params=payload)

if __name__ == "__main__":
    main()

Finally, by running this script with cron, you can have the titles and URLs of new papers sent to LINE automatically every day.

I tried to automatically send the literature of the new coronavirus to LINE with Python