Introduction

This article is a personal memo on how to use Python to read bibliographic data (XML format) retrieved from a PubMed search.

I would appreciate it if you could point out anything you notice.

Data you want to process

A single record looks like the following. Ultimately I want to process many records, but first I will make it possible to process them one at a time.

`001.xml`


<PubmedArticle>
    <MedlineCitation Status="Publisher" Owner="NLM">
        <PMID Version="1">12345678</PMID>
        <DateRevised>
            <Year>2020</Year>
            <Month>03</Month>
            <Day>27</Day>
        </DateRevised>
        <Article PubModel="Print-Electronic">
            <Journal>
                <ISSN IssnType="Electronic">1873-3700</ISSN>
                <JournalIssue CitedMedium="Internet">
                    <PubDate>
                        <Year>2020</Year>
                        <Month>Mar</Month>
                    </PubDate>
                </JournalIssue>
                <Title>Journal of XXX</Title>
            </Journal>
            <ArticleTitle>Identification of XXX.</ArticleTitle>
            <AuthorList CompleteYN="Y">
                <Author ValidYN="Y">
                    <LastName>Sendai</LastName>
                    <ForeName>Shiro</ForeName>
                    <Initials>S</Initials>
                    <AffiliationInfo>
                        <Affiliation>Sendai, Japan.</Affiliation>
                    </AffiliationInfo>
                </Author>
                <Author ValidYN="Y">
                    <LastName>Tohoku</LastName>
                    <ForeName>Taro</ForeName>
                    <Initials>T</Initials>
                    <AffiliationInfo>
                        <Affiliation>Miyagi, Japan.</Affiliation>
                    </AffiliationInfo>
                </Author>
            </AuthorList>
            <Language>eng</Language>
            <PublicationTypeList>
                <PublicationType UI="D016428">Journal Article</PublicationType>
            </PublicationTypeList>
            <ArticleDate DateType="Electronic">
                <Year>2020</Year>
                <Month>03</Month>
                <Day>23</Day>
            </ArticleDate>
        </Article>
        <CitationSubset>IM</CitationSubset>
    </MedlineCitation>
    <PubmedData>
        <PublicationStatus>aheadofprint</PublicationStatus>
        <ArticleIdList>
            <ArticleId IdType="pubmed">32213359</ArticleId>
            <ArticleId IdType="pii">S0031-9422(19)30971-9</ArticleId>
            <ArticleId IdType="doi">10.1016/j.phytochem.2020.112349</ArticleId>
        </ArticleIdList>
    </PubmedData>
</PubmedArticle>

Understanding basic usage

Load the library for reading xml.

`001.py`


import xml.etree.ElementTree as ET

Read the XML data from the file. The records appear to be separated by a blank line (two consecutive line breaks), so split the contents on '\n\n' to get a list with one record per item.

`002.py`


# Use a with block so the file is closed once it has been read.
with open("./xxxx/pubmed.xml", "r") as test_data:
    contents = test_data.read()
records = contents.split('\n\n')
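
The split step can be tried without a file. The following is a minimal sketch, assuming the records really are separated by a blank line as described above; the sample string and the `split_records` helper are made up for illustration and are not part of the original memo.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: two tiny records separated by a blank line.
sample = (
    "<PubmedArticle><PMID>1</PMID></PubmedArticle>\n"
    "\n"
    "<PubmedArticle><PMID>2</PMID></PubmedArticle>\n"
)


def split_records(contents):
    """Split on blank lines and drop chunks that are empty or whitespace only."""
    return [chunk for chunk in contents.split("\n\n") if chunk.strip()]


records = split_records(sample)
roots = [ET.fromstring(record) for record in records]
print(len(records))
print([root.find("PMID").text for root in roots])
```

Dropping whitespace-only chunks matters because ET.fromstring() raises a ParseError on an empty string, which can happen when the file ends with extra blank lines.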

The first record (records[0]) is parsed with ET.fromstring() and stored in the variable root. Checking root with type() shows that it is an Element object.

`003.py`


root = ET.fromstring(records[0])
type(root)
#<class 'xml.etree.ElementTree.Element'>

You can check an element's tag name with root.tag. Let's try it.

`004.py`


root.tag
#'PubmedArticle'

Roughly speaking, one record has the following form. root.tag gave access to the outermost tag.

`002.xml`


<PubmedArticle>
    <MedlineCitation>
    </MedlineCitation>
    <PubmedData>
    </PubmedData>
</PubmedArticle>

Inside <PubmedArticle> are two elements (MedlineCitation and PubmedData), which can be accessed by index. Let's access each one by index and check its type as well.

`005.py`


root[0]
#<Element 'MedlineCitation' at 0x10a9d5b38>
type(root[0])
#<class 'xml.etree.ElementTree.Element'>

root[1]
#<Element 'PubmedData' at 0x10aa78868>
type(root[1])
#<class 'xml.etree.ElementTree.Element'>

You can see that both are Element objects.

In short, every node appears to be an Element object. An Element object is iterable, so its child nodes can be retrieved and processed one by one.

for i in root:
    print(i.tag)

You can look up an Element's tag with .tag, and you can look up the attributes and attribute values attached to that tag with .attrib.

root[0].tag
#'MedlineCitation'

root[0].attrib
#{'Status': 'Publisher', 'Owner': 'NLM'}
# The opening tag of root[0] looks like this:
#    <MedlineCitation Status="Publisher" Owner="NLM">


type(root[0].attrib)
#<class 'dict'> #Dictionary class
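
Iteration, .tag and .attrib can be combined in a few lines. This is a small sketch on a cut-down copy of the sample record; the `children` dictionary is just an illustrative name.

```python
import xml.etree.ElementTree as ET

# Cut-down version of the sample record, just enough to show the structure.
xml_text = """
<PubmedArticle>
    <MedlineCitation Status="Publisher" Owner="NLM">
        <PMID Version="1">12345678</PMID>
    </MedlineCitation>
    <PubmedData>
        <PublicationStatus>aheadofprint</PublicationStatus>
    </PubmedData>
</PubmedArticle>
"""

root = ET.fromstring(xml_text)

# Map each direct child's tag to a plain copy of its attributes.
children = {child.tag: dict(child.attrib) for child in root}
print(children)
```

An element without attributes, such as PubmedData here, simply has an empty dictionary.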

How to access the Element object

There appear to be three ways. find() and findall() accept either a single tag or a path of several tags separated by slashes, written as one quoted string. iter() accepts a single tag name only, not a path.

find('tag1/tag2')
findall('tag1/tag2')
iter('tag')

find() returns a single Element object (the first match, or None if nothing matches), findall() returns a list of Element objects, and iter() returns an iterator. Let's check.

root.find('MedlineCitation/DateRevised/Year')
#<Element 'Year' at 0x10a9f8ae8>

root.findall('MedlineCitation')
#[<Element 'MedlineCitation' at 0x10a9d5b38>]

root.iter('Author')
#<_elementtree._element_iterator object at 0x10aa65990>

#Let's iterate with a for statement.
for i in root.iter('Author'):
    print(i)
#<Element 'Author' at 0x10aa6e9f8>
#<Element 'Author' at 0x10aa6ec28>

Given a bare tag name, findall() looks only at the direct children of the Element object, while iter() walks the Element object itself and all of its descendants (children, grandchildren, great-grandchildren, and so on).
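
The difference is easy to see side by side. The sketch below uses a trimmed record containing only the author list; the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

xml_text = """
<PubmedArticle>
    <MedlineCitation>
        <Article>
            <AuthorList>
                <Author><LastName>Sendai</LastName></Author>
                <Author><LastName>Tohoku</LastName></Author>
            </AuthorList>
        </Article>
    </MedlineCitation>
</PubmedArticle>
"""
root = ET.fromstring(xml_text)

# A bare tag name is matched against direct children only, so this finds nothing.
direct = root.findall("Author")

# A path walks down level by level.
by_path = root.findall("MedlineCitation/Article/AuthorList/Author")

# iter() takes a single tag name and searches the whole subtree.
by_iter = list(root.iter("Author"))

# The './/' prefix makes findall() search all descendants as well.
by_descendant = root.findall(".//Author")

print(len(direct), len(by_path), len(by_iter), len(by_descendant))
```

A full path is more explicit about where the element is expected, while iter() or './/' is convenient when the nesting depth is not important.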

Access to the value of the Element object

An Element object carries two kinds of values: attribute values and text data. An attribute value can be obtained with .get('attribute name') on the Element object. Alternatively, .attrib['attribute name'] also works.

# Using .get()
root.find('MedlineCitation').get('Status')
#'Publisher'

# Or using .attrib[...]
root.find('MedlineCitation').attrib['Status']
#'Publisher'
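
The two forms differ when the attribute does not exist. A short sketch on a single stand-alone element (the attribute name "Missing" is made up for the example):

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring('<MedlineCitation Status="Publisher" Owner="NLM"/>')

print(elem.get("Status"))          # existing attribute
print(elem.get("Missing"))         # None, no exception is raised
print(elem.get("Missing", "n/a"))  # caller-supplied default

# .attrib is a plain dict, so a missing key raises KeyError.
try:
    elem.attrib["Missing"]
except KeyError:
    print("KeyError for a missing attribute")
```

Since not every PubMed record carries every attribute, .get() is probably the safer choice when looping over many records.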

You can get the text data with .text on the Element object.

The text data here is the part enclosed by the tags: 2020 in the example <Year>2020</Year>.

Let's get the value by specifying the path to the Element object with find().

root.find('MedlineCitation/Article/Journal/JournalIssue/PubDate/Year').text
#'2020'
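
One caveat: find() returns None when the tag is absent, so chaining .text onto it raises AttributeError. The sketch below shows findtext(), which returns None (or a default of your choice) instead; the PubDate fragment deliberately has no Month.

```python
import xml.etree.ElementTree as ET

xml_text = """
<PubDate>
    <Year>2020</Year>
</PubDate>
"""
pub_date = ET.fromstring(xml_text)

year = pub_date.findtext("Year")                         # text of the first match
month = pub_date.findtext("Month")                       # None, the tag is absent
month_or_blank = pub_date.findtext("Month", default="")  # '' instead of None

print(year, month, repr(month_or_blank))
```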

To get information about multiple authors, iterate over the list obtained with findall().

Corrected in consideration of multiple author affiliations (March 31, 2020).

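for x in root.findall('MedlineCitation/Article/AuthorList/Author'):
    print(x.findtext('LastName'))   # Author's surname (None if the tag is absent)
    print(x.findtext('ForeName'))   # Author's given name (None if the tag is absent)
    for y in x.findall('AffiliationInfo'):
        print(y.findtext('Affiliation'))  # An author may have several affiliations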
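
Instead of looking at the values one at a time, the same loop can collect them into a structure. A minimal sketch, assuming one dictionary per author is a convenient shape; `extract_authors` and its key names are hypothetical, and the second affiliation of the first author is added only to exercise the inner loop.

```python
import xml.etree.ElementTree as ET

xml_text = """
<AuthorList>
    <Author>
        <LastName>Sendai</LastName>
        <ForeName>Shiro</ForeName>
        <AffiliationInfo><Affiliation>Sendai, Japan.</Affiliation></AffiliationInfo>
        <AffiliationInfo><Affiliation>Miyagi, Japan.</Affiliation></AffiliationInfo>
    </Author>
    <Author>
        <LastName>Tohoku</LastName>
        <ForeName>Taro</ForeName>
    </Author>
</AuthorList>
"""


def extract_authors(author_list):
    """Return one dict per Author with surname, given name and affiliations."""
    authors = []
    for author in author_list.findall("Author"):
        authors.append({
            "last_name": author.findtext("LastName"),
            "fore_name": author.findtext("ForeName"),
            "affiliations": [
                info.findtext("Affiliation")
                for info in author.findall("AffiliationInfo")
            ],
        })
    return authors


authors = extract_authors(ET.fromstring(xml_text))
for author in authors:
    print(author)
```

An author without any AffiliationInfo ends up with an empty list rather than an error.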

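The DOI (document identifier) is stored in the ELocationID tag. Because ELocationID can carry several kinds of identifier, take the text data only when the attribute EIdType="doi". Note that the sample record above has no ELocationID; there the DOI appears as an ArticleId with IdType="doi" under PubmedData/ArticleIdList.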

for x in root.findall('MedlineCitation/Article/ELocationID'):
    if x.get('EIdType') == 'doi':
        print(x.text)
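
Some records, including the sample record at the top of this memo, carry the DOI only in PubmedData/ArticleIdList. The following sketch checks both places; `find_doi` is a hypothetical helper, and the order of the two lookups is my own assumption rather than anything PubMed prescribes.

```python
import xml.etree.ElementTree as ET

xml_text = """
<PubmedArticle>
    <MedlineCitation>
        <Article>
            <ArticleTitle>Identification of XXX.</ArticleTitle>
        </Article>
    </MedlineCitation>
    <PubmedData>
        <ArticleIdList>
            <ArticleId IdType="pubmed">32213359</ArticleId>
            <ArticleId IdType="doi">10.1016/j.phytochem.2020.112349</ArticleId>
        </ArticleIdList>
    </PubmedData>
</PubmedArticle>
"""


def find_doi(root):
    """Look for a DOI in ELocationID first, then in ArticleIdList."""
    for elem in root.findall("MedlineCitation/Article/ELocationID"):
        if elem.get("EIdType") == "doi":
            return elem.text
    for elem in root.findall("PubmedData/ArticleIdList/ArticleId"):
        if elem.get("IdType") == "doi":
            return elem.text
    return None  # No DOI recorded in either place


doi = find_doi(ET.fromstring(xml_text))
print(doi)
```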

I need to distinguish whether a record is a Review or a Journal Article, which is recorded in PublicationType. However, a record usually has several PublicationType entries, and it seems the record should be treated as a review if any one of them has the value Review.

For example, if you look at the Review record, it looks like this:

`.xml`


<PublicationTypeList>
    <PublicationType UI="D016428">Journal Article</PublicationType>
    <PublicationType UI="D016454">Review</PublicationType>
    <PublicationType UI="D013485">Research Support, Non-U.S. Gov't</PublicationType>
</PublicationTypeList>

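So whether a record is a review can be determined as follows. The text sits on each PublicationType element, not on the enclosing PublicationTypeList, so the path has to go down to that level.

isReview = False
for x in root.findall('MedlineCitation/Article/PublicationTypeList/PublicationType'):
    if x.text == 'Review':
        isReview = True

I think this should work.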
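
The check can be verified against one record that is a review and one that is not. A small sketch, with `is_review` as a hypothetical helper name and both records trimmed down to the publication type list:

```python
import xml.etree.ElementTree as ET

review_xml = """
<PubmedArticle><MedlineCitation><Article>
    <PublicationTypeList>
        <PublicationType UI="D016428">Journal Article</PublicationType>
        <PublicationType UI="D016454">Review</PublicationType>
    </PublicationTypeList>
</Article></MedlineCitation></PubmedArticle>
"""

article_xml = """
<PubmedArticle><MedlineCitation><Article>
    <PublicationTypeList>
        <PublicationType UI="D016428">Journal Article</PublicationType>
    </PublicationTypeList>
</Article></MedlineCitation></PubmedArticle>
"""


def is_review(root):
    """True if any PublicationType element has the text 'Review'."""
    path = "MedlineCitation/Article/PublicationTypeList/PublicationType"
    return any(pub_type.text == "Review" for pub_type in root.findall(path))


print(is_review(ET.fromstring(review_xml)))
print(is_review(ET.fromstring(article_xml)))
```

Searching the PublicationTypeList element itself would not work, because its own .text is only the whitespace before its first child.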

Putting the above together, along with other fields you may want to extract:

import xml.etree.ElementTree as ET

# Use a with block so the file is closed once it has been read.
with open("./pubmed.xml", "r") as test_data:
    contents = test_data.read()
records = contents.split('\n\n')
root = ET.fromstring(records[0])  # For the time being, only the first record.

# Author information
for x in root.findall('MedlineCitation/Article/AuthorList/Author'):
    print(x.findtext('LastName'))   # Author's surname
    print(x.findtext('ForeName'))   # Author's given name
    for y in x.findall('AffiliationInfo'):
        print(y.findtext('Affiliation'))  # An author may have several affiliations

# Review check: the text lives on each PublicationType, not on the list element
isReview = False
for x in root.findall('MedlineCitation/Article/PublicationTypeList/PublicationType'):
    if x.text == 'Review':
        isReview = True

# DOI
for x in root.findall('MedlineCitation/Article/ELocationID'):
    if x.get('EIdType') == 'doi':
        print(x.text)

# PMID
print(root.find('MedlineCitation/PMID').text)
# Paper title
print(root.find('MedlineCitation/Article/ArticleTitle').text)
# Journal name
print(root.find('MedlineCitation/Article/Journal/Title').text)
# Year of publication
print(root.find('MedlineCitation/Article/Journal/JournalIssue/PubDate/Year').text)
# Month of publication
print(root.find('MedlineCitation/Article/Journal/JournalIssue/PubDate/Month').text)
# Language
print(root.find('MedlineCitation/Article/Language').text)

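I think this should do it. The code above handles only the first record; to process every record, loop over the list like this:

for record in records:
    if not record.strip():
        continue  # Skip empty chunks left over from the split
    root = ET.fromstring(record)
    # Process the record here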

Now, given the XML data, the necessary information can be extracted in one pass. All that remains is to decide what shape to put it in.
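
One possible shape is a table with one row per record. The following is only a sketch under that assumption: the two inline records, the `record_to_row` helper and the column names are made up for illustration, and the CSV is written to an in-memory buffer rather than a file.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical records, already split into one string per record.
records = [
    "<PubmedArticle><MedlineCitation><PMID>1</PMID>"
    "<Article><ArticleTitle>First title.</ArticleTitle></Article>"
    "</MedlineCitation></PubmedArticle>",
    "<PubmedArticle><MedlineCitation><PMID>2</PMID>"
    "<Article><ArticleTitle>Second title.</ArticleTitle></Article>"
    "</MedlineCitation></PubmedArticle>",
]


def record_to_row(record):
    """Parse one record and pick out the fields for one table row."""
    root = ET.fromstring(record)
    return {
        "pmid": root.findtext("MedlineCitation/PMID"),
        "title": root.findtext("MedlineCitation/Article/ArticleTitle"),
    }


rows = [record_to_row(record) for record in records]

# Write the rows as CSV into an in-memory buffer.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["pmid", "title"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

More columns (journal, year, DOI, review flag) can be added to the dictionary in the same way.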

That covers the basics of handling XML data.
