Many of you have probably wanted a quick overview of the literature (preferably linked to what you already know) when you are too busy to keep up with papers or when you are starting something new.
As a learning exercise, I will try to build something useful for that situation using RDF.
RDF stands for Resource Description Framework. It expresses data as a directed graph of triples, each made of three values: S (Subject), P (Predicate), and O (Object). Data sets can be linked to each other, and you can retrieve the information you want by querying.
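To make the S/P/O idea concrete, here is a minimal sketch that stores triples as plain Python tuples and answers a simple pattern query. It does not use an RDF library, and the identifiers (`paper:1`, `hasAuthor`, and so on) and the `match` helper are made up for illustration only.

```python
# A tiny in-memory triple store: each fact is a (subject, predicate, object) tuple.
# The identifiers and predicate names below are hypothetical.
triples = [
    ("paper:1", "hasTitle", "A study of X"),
    ("paper:1", "hasAuthor", "author:A"),
    ("paper:2", "hasAuthor", "author:A"),
    ("paper:2", "hasAuthor", "author:B"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that match the pattern; None acts as a wildcard."""
    return [
        (ts, tp, to)
        for ts, tp, to in triples
        if (s is None or ts == s)
        and (p is None or tp == p)
        and (o is None or to == o)
    ]


# "Which papers have author:A as an author?"
papers_by_a = [s for s, _, _ in match(triples, p="hasAuthor", o="author:A")]
print(papers_by_a)
```

A real RDF store would use URIs and a query language such as SPARQL, but the wildcard matching above is the same basic idea.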
Reference articles: Miscellaneous explanation about RDF (Qiita); Intuition RDF!! Part 2: Create an easy-to-use RDF and search (Qiita) (http://qiita.com/maoringo/items/0d48a3d967a35581cc24)
For paper information, PubMed (provided by NCBI) exposes an API, so I will try using it.
Four APIs are provided: ESearch, ESummary, EFetch, and ESpell.
First, it seems you need to get a list of paper IDs with ESearch and then get the details for each paper ID with ESummary or EFetch. ESpell doesn't seem to be needed this time.
Reference article: Summary of PubMed API
Use ESearch to get paper IDs. The base URL is:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=
If you append a search keyword after "term=", the IDs of the matching papers are returned.
For example, try the search keyword "cancer":
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=cancer
If you open the above URL in your browser, the result is returned as XML. It contains the list of paper IDs.
However, doing this by hand every time is tedious, and I only want the paper IDs without the surrounding markup, so I will write it in Python.
Environment
- Windows10
get_id.py
# coding: utf-8
import urllib.request

keyword = "cancer"
baseURL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term="


def get_id(url):
    # Request the ESearch result that contains the paper IDs
    result = urllib.request.urlopen(url)
    return result


def main():
    url = baseURL + keyword
    result = get_id(url)
    print(result.read())


if __name__ == "__main__":
    main()
Running this gives:
% python get_id.py
<?xml version="1.0" encoding="UTF-8" ?><!DOCTYPE eSearchResult
PUBLIC "-//NLM//DTD esearch20060628//EN""https://eutils.ncbi.n
lm.nih.gov/eutils/dtd/20060628/esearch.dtd"><eSearchResult><Cou
nt>3465235</Count><RetMax>20</RetMax><RetStart>0</RetStart><IdL
ist><Id>28420040</Id><Id>28420039</Id><Id>28420037</Id>
....
It is hard to read without line breaks, but this is the same information the browser showed earlier.
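The script above builds the URL by simple string concatenation, which works for a single word like "cancer" but not for a keyword containing spaces. The following is a minimal sketch that percent-encodes the parameters with `urllib.parse.urlencode`. The function name `build_esearch_url` is my own, and I am assuming the `retmax` parameter (which corresponds to the RetMax value of 20 in the output above) can be raised to request more IDs per call.

```python
from urllib.parse import urlencode

# Same ESearch endpoint as in the article, written with https as in the
# ESummary and EFetch URLs used later.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(keyword, retmax=20):
    """Build an ESearch URL with the query parameters percent-encoded."""
    params = {"db": "pubmed", "term": keyword, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


# Only builds the URL; no request is sent here.
print(build_esearch_url("breast cancer", retmax=100))
```

The resulting string can be passed to `urllib.request.urlopen` in place of `baseURL + keyword`.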
XML basically has a structure like this:
<element>Contents</element>
<element attribute="Attribute value">Contents</element>
For example, a paper ID looks like this in the acquired data:
<Id>Paper ID</Id>
So the task is to strip the unnecessary parts, such as the tags, and extract only the paper IDs.
Python's standard library includes ElementTree for handling XML, so I will use it.
get_id.py
from xml.etree.ElementTree import *
After importing, rewrite main as follows.
get_id.py
def main():
    url = baseURL + keyword
    result = get_id(url)
    element = fromstring(result.read())
    print(element.findtext(".//Id"))
First, create an Element object with fromstring(). The subsequent element.findtext() returns the text of the first element that matches the path. This time I want "Id", and the path is written as ".//Id", which means any Id element at any depth below the current element.
Running this gives:
% python get_id.py
28420040
I was able to extract only the first paper ID. To extract every matching element rather than just the first, use element.findall() and write as follows.
get_id.py
def main():
    url = baseURL + keyword
    result = get_id(url)
    element = fromstring(result.read())
    for e in element.findall(".//Id"):
        print(e.text)
Running this gives:
% python get_id.py
28420040
28420039
28420037
28420035
...
I was able to extract all the paper IDs and nothing else.
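To compare findtext() and findall() without calling the API, here is a small sketch that runs them on a hand-written XML string. The sample is only shaped like an ESearch response; the IDs in it are placeholders, not real output.

```python
from xml.etree.ElementTree import fromstring

# Hand-written sample shaped like an ESearch response (placeholder IDs).
sample = """<eSearchResult>
  <Count>3</Count>
  <IdList>
    <Id>111</Id>
    <Id>222</Id>
    <Id>333</Id>
  </IdList>
</eSearchResult>"""

root = fromstring(sample)

first_id = root.findtext(".//Id")                  # text of the first match only
all_ids = [e.text for e in root.findall(".//Id")]  # text of every match
missing = root.findtext(".//NoSuchTag")            # None when nothing matches

print(first_id, all_ids, missing)
```

Note that findtext() returns None rather than raising an error when nothing matches, so it is worth checking for that before using the result.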
With later processing in mind, save the acquired ID list to a file named "idlist_<search keyword>.txt".
get_id.py
def main():
    url = baseURL + keyword
    result = get_id(url)
    element = fromstring(result.read())
    filename = "idlist_" + keyword + ".txt"
    with open(filename, "w") as f:
        for e in element.findall(".//Id"):
            f.write(e.text)
            f.write("\n")
Reference article: How to process XML with ElementTree in Python (hikm's blog)
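Since the ID file is written here and read back in the later scripts, it can help to check the round trip in isolation. The following is a minimal sketch with two helper functions, `save_ids` and `load_ids`, which are names I made up; it writes to a temporary directory so that no real file is left behind.

```python
import os
import tempfile


def save_ids(ids, path):
    """Write one paper ID per line."""
    with open(path, "w") as f:
        for paper_id in ids:
            f.write(paper_id + "\n")


def load_ids(path):
    """Read the IDs back, skipping blank lines."""
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]


# Round trip through a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "idlist_cancer.txt")
    save_ids(["28420040", "28420039"], path)
    loaded = load_ids(path)

print(loaded)
```

Using `with open(...)` closes the file even if an error occurs partway through, which is the main reason to prefer it over calling close() by hand.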
Next, let's get the summary using the obtained paper IDs. The base URL of ESummary is:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=
Just as with ESearch, append the ID of the paper you want after "id=". For example, let's use the first paper ID obtained earlier, "28420040".
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id=28420040
Opening this URL in a browser returns information such as the publication date, author names, and article title.
Writing the steps so far in Python:
get_summary.py
# coding: utf-8
import urllib.request
from xml.etree.ElementTree import *

keyword = "cancer"
idfile = "idlist_" + keyword + ".txt"
baseURL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&id="


def get_xml(url):
    # Request the paper summary
    result = urllib.request.urlopen(url)
    return result


def main():
    # Read the paper IDs saved by get_id.py, one per line
    idlist = []
    with open(idfile, "r") as f:
        for i in f.readlines():
            idlist.append(i.strip())
    url = baseURL + idlist[0]
    result = get_xml(url)
    print(result.read())


if __name__ == "__main__":
    main()
The paper IDs are read from the file saved earlier. ElementTree is not used yet, but I import it here because it will be needed shortly. When you run this, you should see the same content as in the browser, just without line breaks, as when fetching the paper IDs.
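The script above requests only the first ID in the list. As far as I understand the E-utilities, the id parameter also accepts a comma-separated list, so several summaries can be requested in one call. The following is a sketch of building such a URL; `build_esummary_url` is a hypothetical helper name, and nothing is sent over the network here.

```python
from urllib.parse import urlencode

ESUMMARY_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"


def build_esummary_url(ids):
    """Build an ESummary URL for one or more paper IDs."""
    params = {"db": "pubmed", "id": ",".join(ids)}
    # safe="," keeps the commas readable instead of encoding them as %2C.
    return ESUMMARY_URL + "?" + urlencode(params, safe=",")


print(build_esummary_url(["28420040", "28420039"]))
```

With this approach the response contains one summary block per ID, so the extraction code needs to loop over them instead of assuming a single result.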
After that, just as with the paper IDs, extract only the parts you want. However, unlike the paper ID, Author and Title are not element names: they are values of the Name attribute on Item elements. Therefore, if you do the same as for the paper ID,
for e in element.findall(".//Item"):
    print(e.text)
every Item value in the summary is extracted, like this:
% python get_summary.py
28420040
2017 Apr 18
2017 Apr 18
J Surg Oncol
None
Duan W
Liu K
Fu X
Shen X
...
This is usable as is, but let's also see how to extract only what you want, such as Author and Title.
The Element object created from the XML text behaves like a list of its child elements, so each element can be reached by index. Here are some examples.
print(element[0][3].text)
print(element[0][4][2].text)
print(element[0][6].text)
Execution result
2017 Apr 18
Fu X
Semi-end-to-end esophagojejunostomy after laparoscopy-assisted total gastrectomy better reduces stricture and leakage than the conventional end-to-side procedure: A retrospective study.
If you want to extract the list of authors, it looks like this.
for author in element[0][4]:
    print(author.text)
Execution result
Duan W
Liu K
Fu X
Shen X
...
You can also get an element's tag, its attributes, and the attribute keys:
print(element[0][4].tag)
print(element[0][4].attrib)
print(element[0][4].keys())
Execution result
Item
{'Name': 'AuthorList', 'Type': 'List'}
['Name', 'Type']
These are handy to remember.
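Index-based access such as element[0][4] depends on the order of the items in the response. As an alternative, ElementTree's path syntax supports selecting by attribute value, so an Item can be picked by its Name. Here is a minimal sketch on a hand-written sample; the XML is only shaped like an ESummary response and its values are placeholders.

```python
from xml.etree.ElementTree import fromstring

# Hand-written sample shaped like an ESummary response (placeholder values).
sample = """<eSummaryResult>
  <DocSum>
    <Id>1</Id>
    <Item Name="PubDate" Type="Date">2017 Apr 18</Item>
    <Item Name="AuthorList" Type="List">
      <Item Name="Author" Type="String">Author A</Item>
      <Item Name="Author" Type="String">Author B</Item>
    </Item>
    <Item Name="Title" Type="String">Sample title</Item>
  </DocSum>
</eSummaryResult>"""

root = fromstring(sample)

# [@Name='...'] keeps only the Item elements whose Name attribute matches.
title = root.findtext(".//Item[@Name='Title']")
authors = [e.text for e in root.findall(".//Item[@Name='Author']")]

print(title)
print(authors)
```

This keeps working even if the position of an item shifts, at the cost of a slightly longer path expression.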
I was able to get the paper information, but the title and author names alone are not enough. It would have been nice if keywords related to the paper were included, but they are not. Therefore, I will use EFetch to obtain the abstract of the paper.
The base URL of EFetch is:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=
When I appended a paper ID and opened it in the browser, the result came back in a very unwieldy format. It seems that you can request XML with a parameter called retmode.
Reference article: The E-utilities In-Depth: Parameters, Syntax and More - Entrez Programming Utilities Help - NCBI Bookshelf
If you request the URL as follows,
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=28420040&retmode=xml
the result is now in XML format!
Up to this point, you can write in almost the same way as ESearch.
get_abstract.py
# coding: utf-8
import urllib.request
from xml.etree.ElementTree import *

keyword = "cancer"
idfile = "idlist_" + keyword + ".txt"
baseURL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id="


def get_xml(url):
    # Request the full paper record
    result = urllib.request.urlopen(url)
    return result


def main():
    # Read the paper IDs saved by get_id.py, one per line
    idlist = []
    with open(idfile, "r") as f:
        for i in f.readlines():
            idlist.append(i.strip())
    url = baseURL + idlist[0] + "&retmode=xml"
    result = get_xml(url)
    print(result.read())  # Display the acquired result as is


if __name__ == "__main__":
    main()
After that, since the abstract is held in AbstractText elements, extraction works the same way as for the paper IDs:
element = fromstring(result.read())
for e in element.findall(".//AbstractText"):
    print(e.text)  # Display the abstract
When I tried it, I was able to extract only the abstract of the paper.
% python get_abstract.py
Laparoscopy-assisted total gastrectomy (LATG) has not
gained popularity due to the technical difficulty of e
sophagojejunostomy (EJ) and the high incidence of EJ-r
elated complications. Herein, we compared two types of
EJ for Roux-en-Y reconstruction to determine whether
...
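One caveat with e.text is that it only returns the text before the first nested tag. Some abstracts are split into labeled sections or contain inline markup such as italics, so the following sketch joins all nested text with itertext() and reads the Label attribute when present. The sample XML is hand-written and heavily simplified; I am assuming the real EFetch output follows the same general shape.

```python
from xml.etree.ElementTree import fromstring

# Hand-written, simplified sample (not real EFetch output).
sample = """<PubmedArticleSet>
  <PubmedArticle>
    <Abstract>
      <AbstractText Label="BACKGROUND">Effect of <i>in vitro</i> treatment.</AbstractText>
      <AbstractText Label="RESULTS">It worked.</AbstractText>
    </Abstract>
  </PubmedArticle>
</PubmedArticleSet>"""

root = fromstring(sample)

sections = []
for e in root.findall(".//AbstractText"):
    label = e.get("Label")        # None if the attribute is absent
    text = "".join(e.itertext())  # includes text inside nested tags
    sections.append((label, text))

# For comparison: .text stops at the first nested tag.
truncated = root.find(".//AbstractText").text

print(sections)
print(repr(truncated))
```

Joining the section texts with a space or newline gives a single abstract string that is easier to process later.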
If I can process the abstracts obtained this way and convert the knowledge in them into RDF, it should be possible to build something interesting. Getting the paper information took longer than expected, so I will continue in the next article.
To process an XML file that you saved earlier instead, replace
element = fromstring(result.read())
with
tree = parse("efetch_result.xml")
element = tree.getroot()
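For completeness, here is a minimal sketch of the save-then-parse flow. It writes a small hand-written XML payload to a temporary file (standing in for the bytes returned by result.read()) and then loads it with parse(). The payload and the use of a temporary directory are my own choices for the example.

```python
import os
import tempfile
from xml.etree.ElementTree import parse

# Stand-in for the bytes that result.read() would return (placeholder content).
sample = b"<PubmedArticleSet><AbstractText>Saved abstract.</AbstractText></PubmedArticleSet>"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "efetch_result.xml")
    # The response is bytes, so the file is opened in binary mode.
    with open(path, "wb") as f:
        f.write(sample)
    # parse() reads the file and getroot() gives the same kind of Element
    # object that fromstring() returns.
    element = parse(path).getroot()
    abstract = element.findtext(".//AbstractText")

print(abstract)
```

Saving the response once and parsing the file afterwards avoids sending the same request to the server repeatedly while experimenting.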