Introduction

This article is a memo for myself. However, I would appreciate any opinions or advice for improvement.

In the previous article, I learned how to use the library to process XML data. In this article, I create wrapper classes for processing PubMed article data.

The wrapper class lets you extract basic information such as the PubMed ID, DOI, year of publication, and title for each article record. In addition, it can return:

the total number of authors,
the list of corresponding authors (all of them, if there is more than one),
the position of a given author in the author list, and so on.

PubMed records often do not state who the corresponding author of a paper is. Because this is important information, I will handle it as carefully as possible.

It will also be possible to look up information about co-first authors, co-last authors, and equal contribution.

Corresponding author processing

Who the corresponding author is is not explicitly stated in the PubMed data. In other words, you need to look closely at the data to judge who the corresponding author is. I decided to judge as follows.

There are two types of authors, one representing a person and the other representing a research group. Since the goal is to find out whether a specific individual is the corresponding author of a paper, I consider only human authors.

Judgment flow (if an item cannot be confirmed, move on to the next item):

1. If there is only one author, that person is the corresponding author.
2. If an author's affiliation information (Author -> AffiliationInfo -> Affiliation text) contains an email address, that person is the corresponding author.
3. If there are multiple authors and only one of them has affiliation information, that author is the corresponding author (all authors are then considered to share that affiliation).
4. Otherwise, the corresponding author cannot be determined.

In other words, case 4 is where there are multiple authors, none has an email address, and several (perhaps all) have affiliation information. Usually, following the link from the PubMed page to the paper itself makes clear who the corresponding author is, but I do not go that far here.
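The judgment flow above can be sketched as a standalone function. This is only an illustration on a hand-written XML fragment: the function name guess_corresponding, the simplified email pattern, and the sample names are assumptions for this sketch and are not part of the classes defined later.

```python
import re
import xml.etree.ElementTree as ET

# Simplified, hypothetical email pattern used only for this sketch.
EMAIL = re.compile(r'[\w.+-]+@[\w-]+(\.[\w-]+)+')

def guess_corresponding(author_list):
    """Return the <Author> elements judged to be corresponding authors."""
    # Only human authors are considered (group authors carry <CollectiveName>).
    humans = [a for a in author_list.findall('Author')
              if a.find('CollectiveName') is None]
    # Rule 1: a single author is the corresponding author.
    if len(humans) == 1:
        return humans
    # Rule 2: any author whose affiliation text contains an email address.
    with_mail = [a for a in humans
                 if any(EMAIL.search(x.text or '')
                        for x in a.findall('AffiliationInfo/Affiliation'))]
    if with_mail:
        return with_mail
    # Rule 3: exactly one author carries affiliation information.
    with_affi = [a for a in humans
                 if a.find('AffiliationInfo/Affiliation') is not None]
    if len(with_affi) == 1:
        return with_affi
    # Rule 4: cannot be determined from the PubMed record alone.
    return []

sample = ET.fromstring("""
<AuthorList>
  <Author><LastName>Tohoku</LastName><ForeName>Taro</ForeName></Author>
  <Author><LastName>Sendai</LastName><ForeName>Hanako</ForeName>
    <AffiliationInfo><Affiliation>Example University. hanako@example.org</Affiliation></AffiliationInfo>
  </Author>
</AuthorList>
""")
names = [a.findtext('LastName') for a in guess_corresponding(sample)]
print(names)
```

An empty list corresponds to case 4, so the caller can tell "unknown" apart from "one or more candidates".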

Processing of equal-contribution information

According to PubMed's XML documentation, equal contribution is indicated by setting the EqualContrib attribute to Y, as in <Author EqualContrib="Y">.

When I looked at actual data, there were records where only one author has EqualContrib="Y". Beyond that problem, there are also quite a few records where equal contribution is described inside the affiliation information instead.

<Author>
    <AffiliationInfo>
        <Affiliation>###It is written here in your own way###</Affiliation>
    </AffiliationInfo>
</Author>

Examples containing "Equal":
Equal authorship.
Joint last author with equal contributions.
Equal contribution.
Equal contribution as senior author.
A.P. and A.J. are the equal senior authors.
Contribute equal to this work.
Co-first authors with equal contribution.
Equal contributor.
These authors are joint senior authors with equal contribution.
Joint last author with equal contributions.
* Equal contributors.

Examples containing "Equally":
These authors contributed equally to this article.
Both authors contributed equally.
These authors contributed equally to this work.
Contributed equally.
These authors contributed equally and are co-first authors.

In some cases, a statement about equal contribution is attached to an author who has nothing to do with it (example: 32281472). In other cases, "Equal" is simply part of an affiliation name:
Foundation for Research on Equal Opportunity
Social and Health Inequalities Network Pakistan
Center for Health Policy and Inequalities Research

The descriptions vary too widely to read and interpret automatically. So, for each author, I record

1 if <Author EqualContrib="Y"> is present,
2 if <Affiliation> contains a description with equal or equally,
0 if neither,

and keep the results as a list of ints. (Among the 63,686 records I checked, no author had both kinds of description, so I did not assign a separate value for that case; if it occurs, it is processed as 1.) This classification, and the patterns that actually occur, are discussed later.
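As a rough sketch of this encoding, the function below assigns 0, 1, or 2 to a single <Author> element. It uses a word-boundary regular expression rather than plain substring tests; that choice is my assumption for illustration. It skips words such as "Inequalities", but it would still flag an affiliation name such as "Equal Opportunity".

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical pattern: "equal" or "equally" as whole words, case-insensitive.
EQUAL_WORD = re.compile(r'\bequal(ly)?\b', re.IGNORECASE)

def equality_state(author):
    """Return 1 for EqualContrib="Y", 2 for an equal/equally remark, else 0."""
    if author.get('EqualContrib') == 'Y':
        return 1
    for affi in author.findall('AffiliationInfo/Affiliation'):
        if EQUAL_WORD.search(affi.text or ''):
            return 2
    return 0

authors = ET.fromstring("""
<AuthorList>
  <Author EqualContrib="Y"><LastName>A</LastName></Author>
  <Author><LastName>B</LastName>
    <AffiliationInfo><Affiliation>These authors contributed equally to this work.</Affiliation></AffiliationInfo>
  </Author>
  <Author><LastName>C</LastName>
    <AffiliationInfo><Affiliation>Center for Health Policy and Inequalities Research</Affiliation></AffiliationInfo>
  </Author>
</AuthorList>
""")
states = [equality_state(a) for a in authors.findall('Author')]
print(states)
```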

Class design policy

PubMed data contains the following three tags:

1. PubmedArticle, 2. Author, 3. Affiliation

I create a class to handle each of them. The class names are PubmedArticle, PubmedAuthor, and PubmedAffiliation, and each stores the corresponding ElementTree Element object.

These are wrapper classes that keep the Element object intact and encapsulate it for easy inspection. A PubmedArticle object holds references to PubmedAuthor objects, and a PubmedAuthor object holds references to PubmedAffiliation objects. Method calls should follow this direction and never go backwards. The explanation below starts from the most downstream class.

How many methods should be prepared?

I made the three classes above and defined the methods described below, but how many methods are really needed?

If the user is already familiar with the PubMed data format, a wrapper class is unnecessary. Conversely, I doubt that someone who knows nothing about the PubMed data format or its patterns could process the data fully with these classes alone. The reason is that records are written differently from one to another, so many fields cannot be processed uniformly (for example, the year of publication is written in several ways, the corresponding author is often unclear, and equal-contribution information is given in various forms).

So I aim for classes that are convenient for users who have some knowledge of what PubMed data looks like.

PubmedAffiliation class

Note that one author may have multiple affiliations. The numbers below correspond to the numbers assigned to the methods in the code.

1. Initialization method. Receives an xmlElement object.
2. A method that determines whether the affiliation information includes an email address.
3. A method that returns the affiliation information.
4. A method that checks whether the affiliation information includes all the strings in the specified list.

import xml.etree.ElementTree as ET
import re

class PubmedAffiliation():

    E_MAIL_PATTERN = re.compile(r'[0-9a-z_./?-]+@([0-9a-z-]+\.)+[0-9a-z-]+')

#1 Initialization method
    def __init__(self, xmlElement):
        self.xml = xmlElement
        
#2 Does the affiliation include an email address?: bool
#Note: an author whose affiliation includes an email address can be regarded as a corresponding author, but older records may not list any email address.
    def hasAnyMailAddress(self,):
        affiliation = self.xml.text or ""  #text is None for an empty <Affiliation>
        result = self.E_MAIL_PATTERN.search(affiliation)
        return result is not None

#3 Return the affiliation information as text: str
    def affiliation(self,):
        if self.xml is not None and self.xml.text is not None:
            return self.xml.text
        return "-"

#4 Does the affiliation contain all the words in the given list (words)?: bool
    def isAffiliatedTo(self,words):#True if all are included
        for word in words:
            if word not in self.affiliation():
                return False
        return True
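To get a feel for what E_MAIL_PATTERN accepts, it can be tried on a few affiliation strings. The sample strings below are invented. Note that the pattern lists only lowercase letters, so an address with uppercase letters in the domain is not matched. The case-insensitive variant shown here is my suggestion, not part of the class above.

```python
import re

# Same pattern as in PubmedAffiliation.
E_MAIL_PATTERN = re.compile(r'[0-9a-z_./?-]+@([0-9a-z-]+\.)+[0-9a-z-]+')
# Hypothetical case-insensitive variant of the same pattern.
E_MAIL_PATTERN_I = re.compile(E_MAIL_PATTERN.pattern, re.IGNORECASE)

samples = [
    "Graduate School of Medicine, Example University. taro@example.ac.jp",
    "Department of Biology, Example University, Sendai, Japan.",
    "Example Institute. Electronic address: Taro.Tohoku@Example.ORG.",
]

# One boolean per sample string: does it contain something that looks like an email address?
strict = [E_MAIL_PATTERN.search(s) is not None for s in samples]
relaxed = [E_MAIL_PATTERN_I.search(s) is not None for s in samples]
print(strict)
print(relaxed)
```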

PubmedAuthor class

The following numbers correspond to the numbers assigned to the methods in the code.

1. Initialization method. Receives an xmlElement.
2. A method that checks whether any affiliation information includes an email address.
3. A method that returns the last name.
4. A method that returns the fore name (first name).
5. A method that returns the initials.
6. A method that returns a list of affiliation information (PubmedAffiliation objects).
7. A method that checks whether any affiliation information includes all the strings in the specified list.

The variable singleCommonAffi is set to None at initialization and is filled in as needed when the PubmedArticle object is initialized. (In some PubMed records only one author has affiliation information; in that case I treat this affiliation as common to all authors.)


class PubmedAuthor():

#1 Initialization method
    def __init__(self, xmlElement):
        self.xml = xmlElement
        self.singleCommonAffi = None

#2 Is an email address listed in any affiliation?: bool    
    def withAnyMailAddress(self,):#Used to judge whether this is a corresponding author
        for x in self.xml.findall('AffiliationInfo/Affiliation'):#Affiliation
            pubmedAffiliation = PubmedAffiliation(x)
            if pubmedAffiliation.hasAnyMailAddress():
                return True
        return False

#3 returns last name: str    
    def lastName(self,):
        x = self.xml.find('LastName')
        if x is not None:
            return x.text
        return "-"

#4 returns fore name: str    
    def foreName(self,):
        x = self.xml.find('ForeName')
        if x is not None:
            return x.text
        return "-"

#5 Return initials: str    
    def initials(self,):
        x = self.xml.find('Initials')
        if x is not None:
            return x.text
        return "-"

#6 List of all affiliations (PubmedAffiliation objects) of this author: list
    def affiliations(self,):
        x = []
        for y in self.xml.findall('AffiliationInfo/Affiliation'):#Affiliation
            x.append(PubmedAffiliation(y))
        return x

#7 Does any affiliation include all the words in the given list?: bool
    def isAffiliatedTo(self,words):
        for x in self.xml.findall('AffiliationInfo/Affiliation'):#Affiliation
            pubmedAffiliation = PubmedAffiliation(x)
            if pubmedAffiliation.isAffiliatedTo(words):
                return True
        #Without singleCommonAffi, don't look any further
        if self.singleCommonAffi is None:
            return False

        #Check singleCommonAffi (a PubmedAffiliation object). True if all specified words are present
        return self.singleCommonAffi.isAffiliatedTo(words)

PubmedArticle class

On initialization, it receives an xmlElement object and collects the following items:

A list of human authors (PubmedAuthor objects),
A list of corresponding authors (PubmedAuthor objects),
For each author, the equal-contribution state (authorStates),
A list of equal-contribution statements found in affiliations (equalityStatements).

It has a large number of methods. The numbers correspond to the numbers assigned to the methods in the code.

1. Information about co-authorship: str
2. Whether it is a review: bool
3. Whether it is an erratum (corrected article): bool
4. Publication type: str
5. Document identifier (doi): str
6. PubMed ID (pmid): str
7. Title: str
8. Journal name: str
9. Publication year: str
10. Publication month: str
11. Language: str
12. Position of the person specified by the (ForeName, LastName) tuple in the author list: int
13. Whether the person specified by the tuple is in the specified author list: bool
14. Whether the person specified by the tuple is an author of this paper: bool
15. Whether the corresponding author is known: bool
16. Whether the person specified by the tuple is a corresponding author: bool

class PubmedArticle():

#0 Initialization method
    def __init__(self, xmlElement):
        self.xml = xmlElement
        self.humanAuthors = []
        self.corespondingAuthors = []
        self.collectiveNames = []   #In some cases a group name is included as an author (non-human author)
        self.singleCommonAffi = None #Set when exactly one author has affiliation information
        self.equalityStatements = [] #Descriptions of equal contribution
        self.authorStates = []

        #authorStates holds one value for each humanAuthor
        # 0: No description
        # 1: EqualContrib = Y is present.
        # 2: There is a description related to equal contribution in Affiliation.
        #Each author gets 0, 1, or 2.
        #Looking at the authors as a whole, there are several patterns
        #Pattern 1: All 1 .... everyone is co-first and co-last
        #Pattern 2: The first two or three are 1 .... co-first
        #Pattern 3: The last two are 1 .... co-last
        #Pattern 4: The first one is 2 .... there is some description of equal contribution that has to be read to be understood. The description is kept in equalityStatements.
        #Pattern 5: Other

        #Collect human authors.
        for x in self.xml.findall('MedlineCitation/Article/AuthorList/Author'):
            pubmedAuthor = PubmedAuthor(x)
            if x.find('CollectiveName') is not None:#An <Author> may contain a group name. Do not count it as a human author; manage it separately.
                self.collectiveNames.append(pubmedAuthor)
            else :
                self.humanAuthors.append(pubmedAuthor)
        
        #Collect corresponding authors. (Along the way, if only one author has affiliation information, keep that affiliation.)
        if len(self.humanAuthors) == 1:#Only one author: that person is the corresponding author.
            self.corespondingAuthors.append(self.humanAuthors[0])
        else:
            for author in self.humanAuthors:
                if author.withAnyMailAddress():#Corresponding author if an email address is written in the affiliation
                    self.corespondingAuthors.append(author)
            if len(self.corespondingAuthors) == 0:
                pubmedAffiliations = []
                humanAuthorsWithAffiliation =[]
                for author in self.humanAuthors:
                    x =  author.xml.find('AffiliationInfo/Affiliation')
                    if x is not None:#There is affiliation information
                        humanAuthorsWithAffiliation.append(author)
                        pubmedAffiliations.append(PubmedAffiliation(x))
                        
                if (len(humanAuthorsWithAffiliation) == 1):
                    self.corespondingAuthors.append(humanAuthorsWithAffiliation[0])
                    self.singleCommonAffi = pubmedAffiliations[0]
                    #Give all authors this information
                    for author in self.humanAuthors:
                        author.singleCommonAffi = self.singleCommonAffi
        
        #Determine whether the record includes co-first or co-last information (equal-contribution information)
        for author in self.humanAuthors:
            state = 0
            if 'EqualContrib' in author.xml.attrib:
                if author.xml.attrib['EqualContrib'] == 'Y':
                    state = 1
            else :
                for x in author.xml.findall('AffiliationInfo/Affiliation'):
                    text = x.text or ""  #text is None for an empty <Affiliation>
                    if ' equal ' in text or 'Equal ' in text or ' equally ' in text or 'Equally ' in text:
                        state = 2
                        self.equalityStatements.append(text)
                        break
            self.authorStates.append(state)

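#1 Returns information about co-authorship.
    def coauthorshipInfo(self,):
        if len(self.authorStates) == 0:#No human authors: nothing to report (all() of an empty list would be True)
            return None
        if all(map(lambda x: x == 1,self.authorStates)):#All 1
            return "All authors are equal contributors."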
        if any(map(lambda x: x == 2,self.authorStates)):#At least one is 2
            return "Specific descriptions on co-authorship."
        if self.authorStates[0] == 1 and self.authorStates[-1] == 1:#1 at the beginning and 1 at the end
            return "co-first and co-last authorships are described."
        if self.authorStates[0] == 1:#First is 1
            count = 0
            for x in self.authorStates:
                if x == 1:
                    count += 1
                else:
                    break
            return "co-first authorship is described. " + str(count) + " co-first authors"
        if self.authorStates[-1] == 1:#The last is 1
            count = 0
            for x in reversed(self.authorStates):
                if x == 1:
                    count += 1
                else:
                    break
            return "co-last authorship is described. " + str(count) + " co-last authors"
        return None

#2 review:bool value
    def isReview(self,):
        for x in self.xml.findall('MedlineCitation/Article/PublicationTypeList/PublicationType'):
            if (x.text == 'Review'):
                return True
        return False

#3 Whether it is a corrected article:bool value
    def isErratum(self,):
        for x in self.xml.findall('MedlineCitation/Article/PublicationTypeList/PublicationType'):
            if (x.text == 'Published Erratum'):
                return True
        return False

#4 Publishing type
    def PublicationType(self,):
        for x in self.xml.findall('MedlineCitation/Article/PublicationTypeList/PublicationType'):
            if x.text is not None:
                return x.text
        return "-"

#5 Document identifier(doi): str
    def doi(self,):
        for x in self.xml.findall('MedlineCitation/Article/ELocationID'):
            if(x.get('EIdType') == 'doi'):
                return x.text
        return "-"

#6 pubmed id(pmid): str
    def pmid(self,):
        element = self.xml.find('MedlineCitation/PMID')
        if element is not None:
            return element.text
        else:
            return "-"

#7 titles: str
    def title(self,):
        element = self.xml.find('MedlineCitation/Article/ArticleTitle')
        if element is not None:
            return element.text
        else:
            return "-"

#8 Journal name: str
    def journal(self,):
        element = self.xml.find('MedlineCitation/Article/Journal/Title')
        if element is not None:
            return element.text
        else:
            return "-"

#9 Year of publication: str
#Note: <MedlineDate> may contain a value such as "2019 Mar - Apr".
#Note: <MedlineDate> may contain a value such as "2012-2013".
    def year(self,flag="all"):
        element = self.xml.find('MedlineCitation/Article/Journal/JournalIssue/PubDate/Year')
        if element is not None:
            return element.text
        else:
            element = self.xml.find('MedlineCitation/Article/Journal/JournalIssue/PubDate/MedlineDate')
            if element is not None:
                if flag == "all":#Returns the entire string by default
                    return element.text
                else:#Otherwise, return the first 4 digit year
                    m = re.search(r'(\d{4})',element.text)
                    if m is not None:
                        return m.group(0)
                    else:
                        return "0"
            return "0"

#10 Publication month: str
    def month(self,):
        element = self.xml.find('MedlineCitation/Article/Journal/JournalIssue/PubDate/Month')
        if element is not None:
            return element.text
        else:
            element = self.xml.find('MedlineCitation/Article/Journal/JournalIssue/PubDate/MedlineDate')
            if element is not None and element.text is not None:
                parts = element.text.split(' ')
                if len(parts) > 1:#e.g. "2019 Mar - Apr" -> "Mar"; "2012-2013" has no month
                    return parts[1]
            return "-"

#11 Language of the article
    def language(self,):
        element = self.xml.find('MedlineCitation/Article/Language')
        if element is not None:
            return element.text
        else:
            return "-"



#################################################################################
##########Queries by author name (tuple).
#################################################################################
#12 Find the position of the author in the author list (0 if not an author): int
#query is a tuple (ForeName, LastName)
    def positionInAuthors(self,query):#For the first author the return value is 1 (not 0)
        for x in range( len(self.humanAuthors) ):
            if self.humanAuthors[x].foreName() == query[0] and self.humanAuthors[x].lastName() == query[1]:
                return x + 1
            if self.humanAuthors[x].initials() == query[0] and self.humanAuthors[x].lastName() == query[1]:
                return x + 1
        return 0            

#13 Returns whether the author is included in the specified author list: bool
#The specified author list is, for example, the list of corresponding authors.
    def isAuthorIn(self,query,authors):#query is a tuple (ForeName, LastName)
        for author in authors:
            if ( author.foreName() == query[0] and author.lastName() == query[1]):
                return True
            if ( author.initials() == query[0] and author.lastName() == query[1]):
                return True
        return False

#14 Check whether the person specified by the tuple is an author: bool
    def isAuthor(self,query):
        for author in self.humanAuthors:
            if author.foreName() == query[0] and author.lastName() == query[1]:
                return True
            if author.initials() == query[0] and author.lastName() == query[1]:
                return True
        return False

#15 Find out whether the corresponding author is known: bool
    def isCorrespondingAuthorDefined(self,):
        if len(self.corespondingAuthors) == 0:
            return False
        else:
            return True

#16 Find out whether the person specified by the tuple is a corresponding author: bool
    def isCorrespondingAuthor(self,query):
        for author in self.corespondingAuthors:
            if ( author.foreName() == query[0] and author.lastName() == query[1]):
                return True
            if ( author.initials() == query[0] and author.lastName() == query[1]):
                return True
        return False
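The year() and month() methods fall back to <MedlineDate> when <Year> or <Month> is absent. As a hedged sketch of how such free-form strings could be split, the hypothetical helper parse_medline_date below returns the first four-digit year and the first three-letter English month abbreviation, or None for whichever is missing. The two sample strings are the ones mentioned in the comments of the year() method.

```python
import re

MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
# Matches any of the month abbreviations as a whole word.
MONTH_PATTERN = re.compile(r'\b(' + '|'.join(MONTHS) + r')\b')

def parse_medline_date(text):
    """Return (year, month); each is a str, or None if not found."""
    year = re.search(r'\d{4}', text)
    month = MONTH_PATTERN.search(text)
    return (year.group(0) if year else None,
            month.group(1) if month else None)

print(parse_medline_date("2019 Mar - Apr"))
print(parse_medline_date("2012-2013"))
```

Returning None rather than a placeholder string leaves it to the caller to decide how a missing month should be displayed.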

Actual use

Let's read the data. pubmed_result.xml is an XML data file downloaded from the PubMed page. The file contains multiple PubMed records; we read the whole file and store the root of the element tree in the variable root.

with open("/Users/yoho/Downloads/pubmed_result.xml", "r", encoding="utf-8") as test_data:
    contents = test_data.read()
root = ET.fromstring(contents)
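Reading the whole file at once keeps the entire tree in memory. For large exports, a streaming approach with ET.iterparse may be preferable. The sketch below is an assumption about how that could look, using a tiny in-memory sample in place of the real file; the helper name iter_articles is hypothetical. Each <PubmedArticle> element is handed over once it is complete and is cleared afterwards.

```python
import io
import xml.etree.ElementTree as ET

def iter_articles(source):
    """Yield each completed <PubmedArticle> element from a path or file-like object."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "PubmedArticle":
            yield elem
            elem.clear()  # drop the children once the caller has moved on

# Tiny invented sample standing in for a downloaded file.
sample = io.BytesIO(b"""<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID>1</PMID></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID>2</PMID></MedlineCitation></PubmedArticle>
</PubmedArticleSet>""")

pmids = [e.findtext("MedlineCitation/PMID") for e in iter_articles(sample)]
print(pmids)
```

Because each element is cleared when the loop advances, anything needed from a record has to be extracted inside the loop body.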

How to access basic information:

for pubmedArticleElement in root.findall('PubmedArticle'):
    p = PubmedArticle(pubmedArticleElement)#Make one record a PubmedArticle object
    
    print(
        p.pmid(),# pubmed id
        p.doi(),# doi (document identifier)
        p.year(flag=1),#Year of publication, year only. Use flag="all" (the default) for the whole string
        p.month(),#Publication month
        p.title(),#Paper title
        p.language(),#Language
        p.PublicationType(),#Publication type
        sep = "\t",end="\n")

How to access other than basic information:

for pubmedArticleElement in root.findall('PubmedArticle'):
    p = PubmedArticle(pubmedArticleElement)#Make one record a PubmedArticle object
    
    #Number of human authors
    print (str(len(p.humanAuthors)))

    #Access to author name
    for x in p.humanAuthors:
        print(
            x.foreName(), # First Name
            x.lastName(), # Last Name
            sep="\t",end="\t")
    print("")

    #Find out whether the corresponding author has been identified
    if len(p.corespondingAuthors) != 0:
        print("The corresponding author can be determined from the PubMed information",end="\t")
    else :
        print("The corresponding author cannot be determined from the PubMed information",end="\t")

    #Access to the corresponding authors
    if len(p.corespondingAuthors) == 0:
        print("The corresponding author cannot be determined from the PubMed information",end="\t")
    else:
        print("Number of corresponding authors:"+str(len(p.corespondingAuthors)),end="\t")
        for x in p.corespondingAuthors:
           print(
            x.foreName(), # First Name
            x.lastName(), # Last Name
            sep=" ",end="\t")
    
    #Check whether a person is a corresponding author by giving the first name and last name as a tuple.
    author = ("Taro","Tohoku")

    if p.isAuthorIn(author,p.corespondingAuthors):
        print(author[0] + " " + author[1] + " is a corresponding author of this paper.",end="\t")
    else :
        print(author[0] + " " + author[1] + " is not a corresponding author of this paper.",end="\t")

    #Check whether the person is an author by giving the first name and last name as a tuple
    if p.isAuthor(author):
        print(author[0] + " " + author[1] + " is an author of this paper.",end="\t")
    else:
        print(author[0] + " " + author[1] + " is not an author of this paper.",end="\t")

    #Find the position of the person in the author list by giving the first name and last name as a tuple.
    position = p.positionInAuthors(author)
    if position != 0:
        print("Author position: " + str(position),end="\t")
    else: 
        print(author[0] + " " + author[1] + " is not an author",end="\t")

Description pattern for co-authorship

Here I analyzed all PubMed records that include AIDS in the title. Equal-contribution information can be found in the list of ints authorStates. The number of records is 63,686 (file size 500 MB).


for pubmedArticleElement in root.findall('PubmedArticle'):
    p = PubmedArticle(pubmedArticleElement)

    if any(p.authorStates):
        print(p.pmid(),end="\t")
        print("".join(map(str,p.authorStates)),end="\n")
        if p.authorStates[0] == 2:#When it is 2, there is some description about co-authorship.
            pass #Omitted: print(" ".join(p.equalityStatements),end="\t")
#output
# pubmed id      description state of co-authorship (one digit per author)
# 32209633	000000011
# 30914431	110000000000
# 30912750	100
# 30828360	11000000000000
# 30421884	1100
# 30467102	10000
# 30356992	1100000000
# 29563205	1100000011
# 29728344	111111111
# 29588307	110000000000
# 29254269	110000000000
# 27733330	10
# 26990633	200000000
# 26949197	111000000000000
# 26595543	200000000000
# 26825036	20000000000000
# 26691548	20000
# 26397046	01110000
# 26535055	110
# 26544576	2000000000000
# 26173930	110000000011
# 26166108	20000000000
# 26125144	20000
# 25949269	1111111
# 24906111	20000000
# 24401642	200
# 22350831	110000000000000
# 22192455	11000
# 22098625	1110
# 21129197	11
# 20540714	11

There are various cases: 1 for everyone, 2 only at the beginning, 1 for the first two and the last two, and so on. Since 1 is written according to the rules, I would like to rely on it, but there are cases where only the first author is 1 (equal to whom?) and cases where the first author is 0 and the second and third are 1 (incomplete data?).

Where 2 is assigned, the descriptions are of many kinds, and you cannot tell what they mean without reading them one by one. Therefore I decided to judge what kind of paper it is by referring to this list of ints as needed, and to consult the text of the statements for authors marked 2.

Find out about an entire research institute

Sometimes you want to find out who at a research institution is writing papers, and how many. This is the case when you search PubMed for a research institution and analyze the resulting XML. First, create a dictionary whose keys are (first name, last name) tuples and whose values are lists of PubmedArticle objects.

#Creating a dictionary
authorAndArticle = {}#dictionary
for pubmedArticleElement in root.findall('PubmedArticle'):
    p = PubmedArticle(pubmedArticleElement)

    for author in p.humanAuthors:
        if author.isAffiliatedTo(['Graduate School','Sciences']):
            authorFullName = (author.foreName(),author.lastName()) #Use tuple as key
            if authorFullName in authorAndArticle:#If the dictionary already has a key
                authorAndArticle[authorFullName].append(p)
            else:#If the key is not already in the dictionary
                authorAndArticle[authorFullName] = [p]
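The check for whether the key already exists can also be written with collections.defaultdict. The sketch below shows only the grouping step, on made-up (name tuple, PMID) pairs rather than real PubmedArticle objects.

```python
from collections import defaultdict

# Invented sample: (ForeName, LastName) tuples paired with a PMID string.
records = [
    (("Taro", "Tohoku"), "100"),
    (("Hanako", "Sendai"), "100"),
    (("Taro", "Tohoku"), "200"),
]

author_and_article = defaultdict(list)  # a missing key starts as an empty list
for full_name, pmid in records:
    author_and_article[full_name].append(pmid)

print(dict(author_and_article))
```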

Data is output for each person.

for authorFN in authorAndArticle:
    pubmedArticles = authorAndArticle[authorFN]
    print(authorFN[0] + " " + authorFN[1])
    for pma in pubmedArticles:
        
        print('            ',end= '')

        #Journal information
        print(pma.pmid(), pma.year(), pma.journal(), sep="\t",end='\t')

        # Co-authorship information
        print(pma.coauthorshipInfo(),end='\t')
        
        # Co-authorship state string. A leading ' is added so that Excel treats it as a character string
        print("'" + "".join(map(str,pma.authorStates)),end="\t")

        #What number of author
        print(str(pma.positionInAuthors(authorFN)),end='\t')
        
        #Number of authors
        print(str(len(pma.humanAuthors)),end='\t')
        
        #Find out if it's the first author
        if pma.positionInAuthors(authorFN) == 1:
            print("First Author",end='\t')
        else:
            print("____",end='\t')

        #Find out if it's a corresponding author
        if len(pma.corespondingAuthors) == 0:
            print("Corresponding author unknown",end="\t")
        elif pma.isAuthorIn(authorFN,pma.corespondingAuthors):
            if len(pma.corespondingAuthors) == 1:
                print("Corresponding author",end="\t")
            else:
                print("Corresponding author of total " + str(len(pma.corespondingAuthors)) + " corresponding authors",end='\t')
        else:
            print("",end="\t")
        
        #Find out if it is the last author.
        if pma.positionInAuthors(authorFN) == len(pma.humanAuthors):
            print("Last author",end='\t')
        else:
            print("",end='\t')

        print("")

Now the per-paper data can be converted into per-person data showing who wrote which papers. In old papers the author's first name may be missing from the data, and a name may change through marriage and so on; in these cases the same person is treated as different people. Likewise, different people with the same first and last names cannot be distinguished. In new papers ORCID can be used to tell people apart, but unless ORCIDs are assigned to authors retroactively, identifying authors uniformly seems very difficult.
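When an ORCID is recorded, PubMed XML carries it as an <Identifier Source="ORCID"> child of <Author>. The sketch below is a hypothetical helper, author_key, that prefers the ORCID and falls back to the name tuple. The normalization step (pulling the hyphenated 16-character ID out of a possible URL) is an assumption about how the values are written; values in other forms simply fall back to the name. The ORCID in the sample is a placeholder, not a real identifier.

```python
import re
import xml.etree.ElementTree as ET

# Hyphenated ORCID form: four groups of four, last character may be X.
ORCID_ID = re.compile(r'\d{4}-\d{4}-\d{4}-\d{3}[\dX]')

def author_key(author):
    """Return ('orcid', id) if an ORCID is recorded, else ('name', (ForeName, LastName))."""
    for ident in author.findall('Identifier'):
        if ident.get('Source') == 'ORCID':
            m = ORCID_ID.search(ident.text or '')
            if m:
                return ('orcid', m.group(0))
    return ('name', (author.findtext('ForeName', '-'),
                     author.findtext('LastName', '-')))

authors = ET.fromstring("""
<AuthorList>
  <Author><LastName>Tohoku</LastName><ForeName>Taro</ForeName>
    <Identifier Source="ORCID">https://orcid.org/0000-0000-0000-0000</Identifier>
  </Author>
  <Author><LastName>Sendai</LastName><ForeName>Hanako</ForeName></Author>
</AuthorList>
""")
keys = [author_key(a) for a in authors.findall('Author')]
print(keys)
```

Such a key could replace the (first name, last name) tuple in the dictionary above, so that records with an ORCID are grouped by it while older records keep the name-based grouping.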

Closing remarks

I tried various approaches, but it was difficult because PubMed data is written in many different ways and cannot be processed uniformly. As for equal contribution, I only create a list recording whether each author has a description, so the rest has to be handled on the user side.

Process Pubmed .xml data with python [Part 2]