This article is a memo for myself, but I would appreciate any opinions or advice for improvement.
In the previous article, I learned how to use the library for processing XML data. In this article, I create wrapper classes for processing Pubmed article data.
The wrapper classes let you extract basic information such as the pubmed id, doi, year of publication, and title of each article. In addition, Pubmed article data often does not state who the corresponding author of a paper is. Because this is important information, I handle it as carefully as possible.
The classes also make it possible to look up information about co-first and co-last authorship and equal contribution.
The corresponding author is not explicitly stated in the pubmed data. In other words, you need to look closely at the data to determine who the corresponding author is. I decided to judge as follows.
There are two types of authors, those that represent a person and those that represent a research group. Since what I want to find out is whether a specific individual is the corresponding author of a paper, I consider only human authors.
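Judgment flow (if an item cannot be confirmed, move on to the next item):
1. If there is only one author, that author is the corresponding author.
2. If an author's affiliation information contains an e-mail address, that author is a corresponding author.
3. If exactly one author has affiliation information, that author is the corresponding author.
4. Otherwise, the corresponding author is unknown.
In other words, 4 is typically the case where there are multiple authors, no one has e-mail address information, and multiple authors (maybe all) have affiliation information. If you jump from the pubmed page to the linked paper page, it is usually clear who the corresponding author is, but I do not follow that link here.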
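The judgment flow can be sketched on simplified data before touching any XML. This is only an illustration under assumptions: each author is a plain dict with hypothetical keys `name` and `affiliation`, and the presence of `@` stands in for a real e-mail pattern.

```python
from typing import Dict, List, Optional

EMAIL_HINT = "@"  # crude stand-in for a real e-mail pattern (assumption)


def guess_corresponding(authors: List[Dict[str, Optional[str]]]) -> List[str]:
    """Return the names judged to be corresponding authors; [] means unknown."""
    # Rule 1: a single author is the corresponding author.
    if len(authors) == 1:
        return [authors[0]["name"]]
    # Rule 2: every author whose affiliation text carries an e-mail address.
    with_mail = [a["name"] for a in authors
                 if a["affiliation"] and EMAIL_HINT in a["affiliation"]]
    if with_mail:
        return with_mail
    # Rule 3: exactly one author has affiliation information.
    with_affi = [a["name"] for a in authors if a["affiliation"]]
    if len(with_affi) == 1:
        return with_affi
    # Rule 4: cannot be decided from the record alone.
    return []
```

An empty list is used for "unknown" so that callers can test the result with `len()`.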
According to Pubmed's XML documentation, equal contribution is indicated by adding Y to EqualContrib, and an example of the form <Author EqualContrib="Y"> is given.
When I looked at actual data, there were records where only one author has EqualContrib="Y". In addition to this problem, there are quite a few records where equal contribution is described in the affiliation information instead, in the following form.
<Author>
  <AffiliationInfo>
    <Affiliation>### free-form text written in each author's own way ###</Affiliation>
  </AffiliationInfo>
</Author>
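As a small illustration of the two places where equal contribution can appear, the following sketch reads both the EqualContrib attribute and the affiliation text from a hand-written fragment. The fragment and the names in it are made up for this example and do not come from a real record.

```python
import xml.etree.ElementTree as ET

# Hypothetical record fragment, written by hand for illustration only.
SAMPLE = """
<AuthorList>
  <Author EqualContrib="Y">
    <LastName>Tohoku</LastName>
    <ForeName>Taro</ForeName>
  </Author>
  <Author>
    <LastName>Sendai</LastName>
    <ForeName>Hanako</ForeName>
    <AffiliationInfo>
      <Affiliation>These authors contributed equally to this work.</Affiliation>
    </AffiliationInfo>
  </Author>
</AuthorList>
"""


def describe(author):
    """Return (last name, EqualContrib attribute, list of affiliation texts)."""
    last = author.findtext("LastName", default="-")
    flag = author.get("EqualContrib")  # None when the attribute is absent
    affiliations = [a.text or "" for a in author.findall("AffiliationInfo/Affiliation")]
    return last, flag, affiliations


rows = [describe(a) for a in ET.fromstring(SAMPLE).findall("Author")]
for row in rows:
    print(row)
```

The first author carries the attribute and no affiliation text, while the second carries only the free-form statement, which is why both places have to be checked.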
Examples containing "Equal":
Equal authorship.
Joint last author with equal contributions.
Equal contribution.
Equal contribution as senior author.
A.P. and A.J. are the equal senior authors.
Contribute equal to this work.
Co-first authors with equal contribution.
Equal contributor.
These authors are joint senior authors with equal contribution.
Joint last author with equal contributions.
* Equal contributors.
Examples containing "Equally":
These authors contributed equally to this article.
Both authors contributed equally.
These authors contributed equally to this work.
Contributed equally.
These authors contributed equally and are co-first authors.
In some cases, the affiliation text mentions equality in a way that has nothing to do with equal contribution (example: 32281472). In other cases, "Equal" is simply part of the affiliation name:
Foundation for Research on Equal Opportunity
Social and Health Inequalities Network Pakistan
Center for Health Policy and Inequalities Research
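To see how a simple keyword check behaves on statements like these, here is a minimal sketch. It uses the four substrings that the PubmedArticle code in this article checks for (' equal ', 'Equal ', ' equally ', 'Equally '); the function name `mentions_equal` is made up for this example.

```python
# The four substrings checked by the PubmedArticle code in this article.
KEYWORDS = (" equal ", "Equal ", " equally ", "Equally ")


def mentions_equal(text):
    """True if any of the four keyword forms occurs in the text."""
    return any(k in text for k in KEYWORDS)


samples = [
    "These authors contributed equally to this work.",
    "Equal contribution.",
    "Both authors contributed equally.",
    "Foundation for Research on Equal Opportunity",
    "Center for Health Policy and Inequalities Research",
]
results = {s: mentions_equal(s) for s in samples}
for text, hit in results.items():
    print(hit, text)
```

The check accepts "Foundation for Research on Equal Opportunity" (a false positive) and misses "Both authors contributed equally." because the word is followed by a period rather than a space. This is one reason the matched text is kept for later reading instead of being interpreted automatically.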
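The descriptions vary too widely to read and interpret automatically. So, for each author, I decided to record one of the following values and keep them as a list of ints:
0: no description of equal contribution
1: EqualContrib="Y" is present
2: the affiliation text contains a description related to equal contribution
(I checked 63,686 records for an author with both descriptions and did not find one, so I did not assign a separate value to that case. If one does occur, it is recorded as 1.) I will discuss this classification later, along with the patterns that actually exist.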
Pubmed data has the following three tags:
1. PubmedArticle, 2. Author, 3. Affiliation
I create a class to handle each of them. The class names are PubmedArticle, PubmedAuthor, and PubmedAffiliation, and each one stores the corresponding ElementTree Element object.
These classes are wrappers that keep the Element object intact and encapsulate it for easy inspection. A PubmedArticle object holds references to PubmedAuthor objects, and a PubmedAuthor object creates PubmedAffiliation objects from its own data. Method calls should follow this direction (article, then author, then affiliation) and never go backwards. Below, the classes are described starting from the most downstream one.
I made these three classes and defined the methods shown below, but how many methods should there be in the first place?
A user who is familiar with the pubmed data format does not need wrapper classes at all. On the other hand, I doubt that a user who knows nothing about the format or its data patterns can process everything with these classes alone. This is because the data is written differently from record to record and often cannot be processed uniformly (for example, the year of publication is written in several ways, the corresponding author is often unclear, and equal contribution is described in various ways).
I therefore aim for classes that are convenient for users who have some knowledge of what pubmed data looks like.
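As one concrete case of this non-uniformity, the publication date is sometimes given as a free-form MedlineDate string such as "2019 Mar - Apr" or "2012-2013". The following is a minimal sketch of a defensive way to pull a year and a month out of such strings. The helper names `first_year` and `month_token` are made up for this example, and the rule "the second token is the month if it starts with letters" is my own assumption, not a Pubmed rule.

```python
import re

YEAR = re.compile(r"\d{4}")


def first_year(medline_date):
    """Return the first four-digit run in a MedlineDate string, or None."""
    m = YEAR.search(medline_date or "")
    return m.group(0) if m else None


def month_token(medline_date):
    """Return the second whitespace-separated token if it starts with letters."""
    parts = (medline_date or "").split()
    if len(parts) > 1 and parts[1][:3].isalpha():
        return parts[1]
    return None


for text in ("2019 Mar - Apr", "2012-2013"):
    print(text, first_year(text), month_token(text))
```

Returning None instead of raising keeps a single odd record from stopping a loop over tens of thousands of records.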
Note that one author may have multiple affiliations, and therefore multiple PubmedAffiliation objects. The following numbers correspond to the numbers assigned to the methods in the code.
import xml.etree.ElementTree as ET
import re

class PubmedAffiliation():
    E_MAIL_PATTERN = re.compile(r'[0-9a-z_./?-]+@([0-9a-z-]+\.)+[0-9a-z-]+')

    #1 Initialization method
    def __init__(self, xmlElement):
        self.xml = xmlElement

    #2 Does the affiliation text include an e-mail address?: bool
    #Note: An author whose affiliation includes an e-mail address can be regarded as a corresponding author, but older records may not list e-mail addresses at all.
    def hasAnyMailAddress(self,):
        affiliation = self.xml.text
        if affiliation is None:#Empty <Affiliation> element
            return False
        return self.E_MAIL_PATTERN.search(affiliation) is not None

    #3 Return the affiliation information as text: str
    def affiliation(self,):
        if self.xml is not None and self.xml.text is not None:
            return self.xml.text
        return "-"

    #4 Does the affiliation contain all the words in the given list?: bool
    def isAffiliatedTo(self,words):#True only if all words are included
        for word in words:
            if word not in self.affiliation():
                return False
        return True
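One thing to be aware of is that E_MAIL_PATTERN above matches lower-case characters only. The following sketch compares it with a case-insensitive variant. The variant (adding re.IGNORECASE) is my own assumption, not part of the class, and the affiliation strings are made up for illustration.

```python
import re

# Same pattern as E_MAIL_PATTERN in the class above (lower-case only).
STRICT = re.compile(r'[0-9a-z_./?-]+@([0-9a-z-]+\.)+[0-9a-z-]+')
# Assumed variant: the identical pattern with re.IGNORECASE added.
RELAXED = re.compile(STRICT.pattern, re.IGNORECASE)

# Hypothetical affiliation strings.
texts = [
    "Dept. of Medicine, Example University. taro@example.ac.jp",
    "Dept. of Medicine, Example University. Electronic address: TARO@EXAMPLE.AC.JP",
    "Dept. of Medicine, Example University.",
]
hits = [(bool(STRICT.search(t)), bool(RELAXED.search(t))) for t in texts]
for text, hit in zip(texts, hits):
    print(hit, text)
```

Whether upper-case addresses actually occur often enough to matter would have to be checked against real data.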
The following numbers correspond to the numbers assigned to the methods in the code.
The variable singleCommonAffi is set to None at initialization and is filled in as needed when the PubmedArticle object is initialized. Depending on the pubmed data, only one author may have affiliation information. In that case, I decided to treat this affiliation as the common affiliation of all authors.
class PubmedAuthor():
    #1 Initialization method
    def __init__(self, xmlElement):
        self.xml = xmlElement
        self.singleCommonAffi = None

    #2 Is an e-mail address listed in any affiliation?: bool
    def withAnyMailAddress(self,):#Is this a corresponding author?
        for x in self.xml.findall('AffiliationInfo/Affiliation'):#Affiliation
            pubmedAffiliation = PubmedAffiliation(x)
            if pubmedAffiliation.hasAnyMailAddress():
                return True
        return False

    #3 Return the last name: str
    def lastName(self,):
        x = self.xml.find('LastName')
        if x is not None:
            return x.text
        return "-"

    #4 Return the fore name: str
    def foreName(self,):
        x = self.xml.find('ForeName')
        if x is not None:
            return x.text
        return "-"

    #5 Return the initials: str
    def initials(self,):
        x = self.xml.find('Initials')
        if x is not None:
            return x.text
        return "-"

    #6 List of all affiliations (PubmedAffiliation objects) of this author: list
    def affiliations(self,):
        x = []
        for y in self.xml.findall('AffiliationInfo/Affiliation'):#Affiliation
            x.append(PubmedAffiliation(y))
        return x

    #7 Does any affiliation include all the words in the given list?: bool
    def isAffiliatedTo(self,words):
        for x in self.xml.findall('AffiliationInfo/Affiliation'):#Affiliation
            pubmedAffiliation = PubmedAffiliation(x)
            if pubmedAffiliation.isAffiliatedTo(words):
                return True
        #Without singleCommonAffi, don't look any further
        if self.singleCommonAffi is None:
            return False
        #singleCommonAffi is a PubmedAffiliation object, so delegate to it. True if all specified words are present
        return self.singleCommonAffi.isAffiliatedTo(words)
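On initialization, it receives an xmlElement object and works out the following: the human authors (group names are kept separately), the corresponding authors, and the equal-contribution state of each author.
It has a large number of methods. The numbers correspond to the numbers assigned to the methods in the code.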
class PubmedArticle():
    #0 Initialization method
    def __init__(self, xmlElement):
        self.xml = xmlElement
        self.humanAuthors = []
        self.corespondingAuthors = []
        self.collectiveNames = [] #In some cases a group name is included as an author (non-human author)
        self.singleCommonAffi = None #
        self.equalityStatements = [] #Descriptions of equality
        self.authorStates = []
        #authorStates holds one value for each humanAuthor
        # 0: No description
        # 1: EqualContrib="Y" is present.
        # 2: Affiliation contains a description related to equality.
        #Each author is assigned 0, 1, or 2.
        #Looking at the authors as a whole, there are several patterns
        #Pattern 1: All are 1 ... everyone is co-first and co-last
        #Pattern 2: The first two or three are 1 ... co-first
        #Pattern 3: The last two are 1 ... co-last
        #Pattern 4: The first author is 2 ... there is some description about equality that has to be read to be understood. The description is kept in equalityStatements.
        #Pattern 5: Other

        #Collect human authors.
        for x in self.xml.findall('MedlineCitation/Article/AuthorList/Author'):
            pubmedAuthor = PubmedAuthor(x)
            if x.find('CollectiveName') is not None:#<Author> sometimes holds a group name. Do not count it as an author; manage it separately.
                self.collectiveNames.append(pubmedAuthor)
            else :
                self.humanAuthors.append(pubmedAuthor)

        #Collect corresponding authors. (Along the way, if only one author has affiliation information, keep that affiliation.)
        if len(self.humanAuthors) == 1:#Only one author: that person is the corresponding author.
            self.corespondingAuthors.append(self.humanAuthors[0])
        else:
            for author in self.humanAuthors:
                if author.withAnyMailAddress():#Corresponding author if an e-mail address is written in the affiliation
                    self.corespondingAuthors.append(author)
            if len(self.corespondingAuthors) == 0:
                pubmedAffiliations = []
                humanAuthorsWithAffiliation =[]
                for author in self.humanAuthors:
                    x = author.xml.find('AffiliationInfo/Affiliation')
                    if x is not None:#There is affiliation information
                        humanAuthorsWithAffiliation.append(author)
                        pubmedAffiliations.append(PubmedAffiliation(x))
                if (len(humanAuthorsWithAffiliation) == 1):
                    self.corespondingAuthors.append(humanAuthorsWithAffiliation[0])
                    self.singleCommonAffi = pubmedAffiliations[0]
                    #Give all authors this information
                    for author in self.humanAuthors:
                        author.singleCommonAffi = self.singleCommonAffi

        #Determine whether the record contains information about co-first or co-last authorship (information about equality)
        for author in self.humanAuthors:
            state = 0
            if 'EqualContrib' in author.xml.attrib:
                if author.xml.attrib['EqualContrib'] == 'Y':
                    state = 1
            else :
                for x in author.xml.findall('AffiliationInfo/Affiliation'):
                    text = x.text or ''#An empty <Affiliation> has text None
                    if ' equal ' in text or 'Equal ' in text or ' equally ' in text or 'Equally ' in text:
                        state = 2
                        self.equalityStatements.append(text)
                        break
            self.authorStates.append(state)
    #1 Returns information about co-authorship.
    def coauthorshipInfo(self,):
        if not self.authorStates:#No human authors: nothing to report
            return None
        if all(map(lambda x: x == 1,self.authorStates)):#All 1
            return "All authors are equal contributors."
        if any(map(lambda x: x == 2,self.authorStates)):#At least one is 2
            return "Specific descriptions on co-authorship."
        if self.authorStates[0] == 1 and self.authorStates[-1] == 1:#1 at the beginning and 1 at the end
            return "co-first and co-last authorships are described."
        if self.authorStates[0] == 1:#First is 1
            count = 0
            for x in self.authorStates:
                if x == 1:
                    count += 1
                else:
                    break
            return "co-first authorship is described. " + str(count) + " co-first authors"
        if self.authorStates[-1] == 1:#The last is 1
            count = 0
            for x in reversed(self.authorStates):
                if x == 1:
                    count += 1
                else:
                    break
            return "co-last authorship is described. " + str(count) + " co-last authors"
        return None
    #2 Is it a review?: bool
    def isReview(self,):
        for x in self.xml.findall('MedlineCitation/Article/PublicationTypeList/PublicationType'):
            if (x.text == 'Review'):
                return True
        return False

    #3 Is it a published erratum?: bool
    def isErratum(self,):
        for x in self.xml.findall('MedlineCitation/Article/PublicationTypeList/PublicationType'):
            if (x.text == 'Published Erratum'):
                return True
        return False

    #4 Publication type (the first one listed): str
    def PublicationType(self,):
        for x in self.xml.findall('MedlineCitation/Article/PublicationTypeList/PublicationType'):
            if x.text is not None:
                return x.text
        return "-"

    #5 Document identifier (doi): str
    def doi(self,):
        for x in self.xml.findall('MedlineCitation/Article/ELocationID'):
            if(x.get('EIdType') == 'doi'):
                return x.text
        return "-"

    #6 pubmed id (pmid): str
    def pmid(self,):
        element = self.xml.find('MedlineCitation/PMID')
        if element is not None:
            return element.text
        else:
            return "-"

    #7 Title: str
    def title(self,):
        element = self.xml.find('MedlineCitation/Article/ArticleTitle')
        if element is not None:
            return element.text
        else:
            return "-"

    #8 Journal name: str
    def journal(self,):
        element = self.xml.find('MedlineCitation/Article/Journal/Title')
        if element is not None:
            return element.text
        else:
            return "-"

    #9 Year of publication: str
    #Note: <MedlineDate> is sometimes written like "2019 Mar - Apr".
    #Note: <MedlineDate> is sometimes written like "2012-2013".
    def year(self,flag="all"):
        element = self.xml.find('MedlineCitation/Article/Journal/JournalIssue/PubDate/Year')
        if element is not None:
            return element.text
        else:
            element = self.xml.find('MedlineCitation/Article/Journal/JournalIssue/PubDate/MedlineDate')
            if element is not None:
                if flag == "all":#Returns the entire string by default
                    return element.text
                else:#Otherwise, return the first 4-digit year
                    m = re.search(r'(\d{4})',element.text or '')
                    if m is not None:
                        return m.group(0)
                    else:
                        return "0"
        return "0"

    #10 Publication month: str
    def month(self,):
        element = self.xml.find('MedlineCitation/Article/Journal/JournalIssue/PubDate/Month')
        if element is not None:
            return element.text
        else:
            element = self.xml.find('MedlineCitation/Article/Journal/JournalIssue/PubDate/MedlineDate')
            if element is not None:
                parts = (element.text or '').split(' ')
                if len(parts) > 1:#"2012-2013" has no second token
                    return parts[1]
        return "-"

    #11 Language of the article: str
    def language(self,):
        element = self.xml.find('MedlineCitation/Article/Language')
        if element is not None:
            return element.text
        else:
            return "-"
    #################################################################################
    ##########Queries by author name (tuple).
    #################################################################################

    #12 Find the position of the author in the author list (0 if not an author): int
    #query is a tuple of (ForeName, LastName)
    def positionInAuthors(self,query):#The first author gives 1 (not 0). query is a tuple (ForeName, LastName)
        for x in range( len(self.humanAuthors) ):
            if self.humanAuthors[x].foreName() == query[0] and self.humanAuthors[x].lastName() == query[1]:
                return x + 1
            if self.humanAuthors[x].initials() == query[0] and self.humanAuthors[x].lastName() == query[1]:
                return x + 1
        return 0

    #13 Is the author included in the specified author list?: bool
    #The specified author list is, for example, the list of corresponding authors.
    def isAuthorIn(self,query,authors):#Returns whether the name in query is found in the specified authors. query is a tuple
        for author in authors:
            if ( author.foreName() == query[0] and author.lastName() == query[1]):
                return True
            if ( author.initials() == query[0] and author.lastName() == query[1]):
                return True
        return False

    #14 Is the person specified in the tuple an author?: bool
    def isAuthor(self,query):
        for author in self.humanAuthors:
            if author.foreName() == query[0] and author.lastName() == query[1]:
                return True
            if author.initials() == query[0] and author.lastName() == query[1]:
                return True
        return False

    #15 Is the corresponding author known?: bool
    def isCorrespondingAuthorDefined(self,):
        if len(self.corespondingAuthors) == 0:
            return False
        else:
            return True

    #16 Is the author specified in the tuple a corresponding author?: bool
    def isCorrespondingAuthor(self,query):
        for author in self.corespondingAuthors:
            if ( author.foreName() == query[0] and author.lastName() == query[1]):
                return True
            if ( author.initials() == query[0] and author.lastName() == query[1]):
                return True
        return False
Let's read the data. pubmed_result.xml is an XML data file downloaded from the pubmed page. The file contains multiple Pubmed records. I read it in its entirety and store the root of the element tree in the variable root.
with open("/Users/yoho/Downloads/pubmed_result.xml", "r", encoding="utf-8") as test_data:
    contents = test_data.read()
root = ET.fromstring(contents)
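Reading the whole file at once is simple, but for a very large download it may be worth processing one record at a time. The following is a minimal sketch using ET.iterparse from the standard library. The helper name `iter_articles` is made up for this example, and the in-memory bytes are a tiny hand-written stand-in for pubmed_result.xml.

```python
import io
import xml.etree.ElementTree as ET


def iter_articles(source):
    """Yield each <PubmedArticle> element, then release it to save memory."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "PubmedArticle":
            yield elem
            elem.clear()  # drop the children once the caller is done with this record


# Tiny hand-written stand-in for pubmed_result.xml (illustration only).
demo = io.BytesIO(
    b"<PubmedArticleSet>"
    b"<PubmedArticle><MedlineCitation><PMID>1</PMID></MedlineCitation></PubmedArticle>"
    b"<PubmedArticle><MedlineCitation><PMID>2</PMID></MedlineCitation></PubmedArticle>"
    b"</PubmedArticleSet>"
)
pmids = [a.findtext("MedlineCitation/PMID") for a in iter_articles(demo)]
print(pmids)
```

A file path can be passed instead of the BytesIO object. Because clear() empties each element after use, this approach does not suit code that keeps the wrapper objects for later; in that case the clear() call would have to be dropped.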
How to access basic information:
for pubmedArticleElement in root.findall('PubmedArticle'):
    p = PubmedArticle(pubmedArticleElement)#Make one record a PubmedArticle object
    print(
        p.pmid(),# pubmed id
        p.doi(),# doi (document identifier)
        p.year(flag=1),#Year of publication, year only. Use flag="all" for the whole string
        p.month(),#Publication month
        p.title(),#Paper title
        p.language(),#Language
        p.PublicationType(),#Publication type
        sep = "\t",end="\n")
How to access information other than the basics:
for pubmedArticleElement in root.findall('PubmedArticle'):
    p = PubmedArticle(pubmedArticleElement)#Make one record a PubmedArticle object
    #Number of human authors
    print (str(len(p.humanAuthors)))
    #Access to author names
    for x in p.humanAuthors:
        print(
            x.foreName(), # First Name
            x.lastName(), # Last Name
            sep="\t",end="\t")
    print("")
    #Find out if the corresponding author has been identified
    if len(p.corespondingAuthors) != 0:
        print("Corresponding author can be found from pubmed information",end="\t")
    else :
        print("Corresponding author is unknown from pubmed information",end="\t")
    #Access to corresponding authors
    if len(p.corespondingAuthors) == 0:
        print("Corresponding author is unknown from pubmed information",end="\t")
    else:
        print("Number of corresponding authors:"+str(len(p.corespondingAuthors)),end="\t")
        for x in p.corespondingAuthors:
            print(
                x.foreName(), # First Name
                x.lastName(), # Last Name
                sep=" ",end="\t")
    #Find out if a person is a corresponding author by specifying the First Name and Last Name in a tuple.
    author = ("Taro","Tohoku")
    if p.isAuthorIn(author,p.corespondingAuthors):
        print(author[0] + " " + author[1] + " is a corresponding author of this paper.",end="\t")
    else :
        print(author[0] + " " + author[1] + " is not a corresponding author of this paper.",end="\t")
    #Find out if a person is an author by specifying the First Name and Last Name in a tuple
    if p.isAuthor(author):
        print(author[0] + " " + author[1] + " is an author of this paper.",end="\t")
    else:
        print(author[0] + " " + author[1] + " is not an author of this paper.",end="\t")
    #Find out the position of the author by specifying the First Name and Last Name in a tuple.
    position = p.positionInAuthors(author)
    if position != 0:
        print("Author position: " + str(position),end="\t")
    else:
        print(author[0] + " " + author[1] + " is not an author",end="\t")
Here, I analyzed all pubmed records that include AIDS in the title. The equal-contribution information is in the list of ints authorStates. The number of records is 63,686 (file size 500 MB).
for pubmedArticleElement in root.findall('PubmedArticle'):
    p = PubmedArticle(pubmedArticleElement)
    if any(p.authorStates):
        print(p.pmid(),end="\t")
        print("".join(map(str,p.authorStates)),end="\n")
        if p.authorStates[0] == 2:#A value of 2 means there is some description about co-authorship.
            pass #Omitted: print(" ".join(p.equalityStatements),end="\t")
#output
# pubmed id, co-authorship description state (one digit per author)
# 32209633 000000011
# 30914431 110000000000
# 30912750 100
# 30828360 11000000000000
# 30421884 1100
# 30467102 10000
# 30356992 1100000000
# 29563205 1100000011
# 29728344 111111111
# 29588307 110000000000
# 29254269 110000000000
# 27733330 10
# 26990633 200000000
# 26949197 111000000000000
# 26595543 200000000000
# 26825036 20000000000000
# 26691548 20000
# 26397046 01110000
# 26535055 110
# 26544576 2000000000000
# 26173930 110000000011
# 26166108 20000000000
# 26125144 20000
# 25949269 1111111
# 24906111 20000000
# 24401642 200
# 22350831 110000000000000
# 22192455 11000
# 22098625 1110
# 21129197 11
# 20540714 11
There are various cases: records where everyone has 1, records where only the first author has 2, records where the first two and the last two authors have 1, and so on. Since a value of 1 is written according to the rules, I would like to rely on it, but there are cases where only the first author is 1 (equal to whom?) and cases where the first author is 0 while the following authors are 1 (incomplete data?).
The descriptions behind a value of 2 come in many forms, and you cannot tell what they mean without reading them one by one. Therefore, I decided to use this list of ints as needed to judge what kind of paper it is, and to refer to the stored text for the details of a 2.
Sometimes you want to find out, for a research institution as a whole, who is writing papers and how many. This is the case when you search pubmed for a specific research institution and analyze the resulting XML. First, create a dictionary whose keys are tuples of (First Name, Last Name) and whose values are lists of PubmedArticle objects.
#Creating a dictionary
authorAndArticle = {}#dictionary
for pubmedArticleElement in root.findall('PubmedArticle'):
    p = PubmedArticle(pubmedArticleElement)
    for author in p.humanAuthors:
        if author.isAffiliatedTo(['Graduate School','Sciences']):
            authorFullName = (author.foreName(),author.lastName()) #Use tuple as key
            if authorFullName in authorAndArticle:#If the dictionary already has the key
                authorAndArticle[authorFullName].append(p)
            else:#If the key is not yet in the dictionary
                authorAndArticle[authorFullName] = [p]
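The "append if the key exists, otherwise create a list" step can also be written with collections.defaultdict from the standard library. The following sketch shows the idea on hypothetical (fore name, last name, pmid) rows instead of real PubmedArticle objects, so the names and ids are made up.

```python
from collections import defaultdict

# Hypothetical (fore name, last name, pmid) rows standing in for real records.
rows = [
    ("Taro", "Tohoku", "111"),
    ("Hanako", "Sendai", "111"),
    ("Taro", "Tohoku", "222"),
]

papers_by_author = defaultdict(list)  # a missing key starts as an empty list
for fore, last, pmid in rows:
    papers_by_author[(fore, last)].append(pmid)

for name, pmids in papers_by_author.items():
    print(name, pmids)
```

This is only a matter of taste; the explicit if/else above does the same thing.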
Data is output for each person.
for authorFN in authorAndArticle:
    pubmedArticles = authorAndArticle[authorFN]
    print(authorFN[0] + " " + authorFN[1])
    for pma in pubmedArticles:
        print(' ',end= '')
        #Journal information
        print(pma.pmid(), pma.year(), pma.journal(), sep="\t",end='\t')
        #Co-authorship judgment
        print(pma.coauthorshipInfo(),end='\t')
        #Co-authorship state. A leading ' is added so that Excel treats it as a string
        print("'" + "".join(map(str,pma.authorStates)),end="\t")
        #Position of the author
        print(str(pma.positionInAuthors(authorFN)),end='\t')
        #Number of authors
        print(str(len(pma.humanAuthors)),end='\t')
        #Find out if it's the first author
        if pma.positionInAuthors(authorFN) == 1:
            print("First Author",end='\t')
        else:
            print("____",end='\t')
        #Find out if it's a corresponding author
        if len(pma.corespondingAuthors) == 0:
            print("Corresponding author unknown",end="\t")
        elif pma.isAuthorIn(authorFN,pma.corespondingAuthors):
            if len(pma.corespondingAuthors) == 1:
                print("Corresponding author",end="\t")
            else:
                print("Corresponding author of total " + str(len(pma.corespondingAuthors)) + " corresponding authors",end='\t')
        else:
            print("",end="\t")
        #Find out if it is the last author.
        if pma.positionInAuthors(authorFN) == len(pma.humanAuthors):
            print("Last author",end='\t')
        else:
            print("",end='\t')
        print("")
Now the per-paper data can be converted into per-person data showing who wrote which papers. For old papers whose data does not include the author's First Name, or when a name has changed due to marriage etc., the same person is treated as different people. Also, people with the same first and last names cannot be distinguished. For new papers, ORCID can be used to distinguish people, but unless ORCIDs are assigned to authors retroactively, it seems very difficult to identify authors uniformly.
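For records that do carry an ORCID, the dictionary key could prefer it over the name. The following is only a sketch under assumptions: it assumes the ORCID appears as an <Identifier Source="ORCID"> child of <Author>, the fragment and the all-zero ORCID value are placeholders written for this example, and the helper name `author_key` is made up.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment; the ORCID value is a placeholder, not a real one.
SAMPLE = """
<Author>
  <LastName>Tohoku</LastName>
  <ForeName>Taro</ForeName>
  <Identifier Source="ORCID">0000-0000-0000-0000</Identifier>
</Author>
"""


def author_key(author):
    """Prefer an ORCID identifier as the key; fall back to (fore name, last name)."""
    for ident in author.findall("Identifier"):
        if ident.get("Source") == "ORCID" and ident.text:
            return ("orcid", ident.text.strip())
    return ("name",
            author.findtext("ForeName", default="-"),
            author.findtext("LastName", default="-"))


print(author_key(ET.fromstring(SAMPLE)))
```

The leading "orcid" or "name" tag keeps the two kinds of key from colliding. Real ORCID values may be written in more than one form, so they would probably need normalizing before being used as keys.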
I tried various things, but it was difficult because pubmed data is written in many different ways and cannot be processed uniformly. As for equal contribution, I only created a list that records, for each author, whether there is a description, so the rest needs to be handled on the user side.