This post is an experiment in collecting information from arXiv and trying to extract something useful from it. The procedure I followed this time is as follows.
The question I wanted to explore is whether clusters such as large research groups become visible in the field matching a given keyword. The end result is the co-author network (filtered by number of papers) shown in the graph below.
This is the co-author network obtained by selecting only authors with more than 15 relevant papers among arXiv quant-ph papers from 2015 to 2020 whose title or abstract contains "quantum comput". (Edge darkness indicates the number of co-authored papers, normalized within each cluster.)
**(Addendum: handled variant spellings of author names and corrected the data.)**
The scraping is done in plain Python. (There is also an official arXiv API, but here I query arXiv's Advanced Search pages directly.) The information collected for each year is converted into a pandas DataFrame and saved as a csv file.
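For reference, a minimal sketch of the API route that I did not take here: the arXiv API at http://export.arxiv.org/api/query returns an Atom feed, which can be parsed with, for example, the feedparser package (the query string below is just an illustration):

# Minimal sketch of the arXiv API alternative (not used in this post).
# Assumes the feedparser package is installed; the search query is only an example.
import feedparser

url = ("http://export.arxiv.org/api/query"
       "?search_query=cat:quant-ph+AND+all:%22quantum+computing%22"
       "&start=0&max_results=5")
feed = feedparser.parse(url)
for entry in feed.entries:
    print(entry.title, [a.name for a in entry.authors])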
Information collected:
- Cite: arxiv:XXXXX
- Title: paper title
- Abst: abstract
- Authors: saved as a list of authors, each a tuple of (author name as linked text, string used in the author search query)
- Fields: field information; if the paper is cross-listed, multiple fields are saved
- DOI: recorded when a DOI has been attached
- OrigDateY: year of first submission
- OrigDateM: month of first submission
- Date Info: other date information (e.g. revisions) saved as text
Some of this information ends up unused this time.
**Note: because of arXiv's limits, the search does not work if the number of hits for one year exceeds 10,000. The code therefore checks the count first and, if it exceeds the limit, splits the year into shorter periods (down to monthly) and collects the data in pieces.**
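Before the main code: the month-splitting arithmetic used in the collection loop below can be checked on its own. With three divisions, for example, it yields the ranges 1→5, 5→9, and 9→1 of the following year (a standalone sketch of the same computation):

# Standalone check of the month-splitting arithmetic used in the collection loop below
m_divide = 3  # example number of divisions
for idx in range(m_divide):
    mstart = int(idx*12/m_divide + 1)
    mstop = (int((idx+1)*12/m_divide) + 1) % 12
    print(mstart, "->", mstop)  # 1 -> 5, 5 -> 9, 9 -> 1 (the last range rolls over into the next year)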
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import numpy as np
import math
def month_string_to_number(string):
    m = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4,
         'may': 5, 'jun': 6, 'jul': 7, 'aug': 8,
         'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}
    s = string.strip()[:3].lower()
    try:
        return m[s]
    except KeyError:
        raise ValueError('Not a month')
# Get the number of search results from the search result page
def get_number_of_searchresult(url):
    html_doc = requests.get(url).text
    soup = BeautifulSoup(html_doc, "html.parser")
    tags = soup.find_all("h1", {"class": "title is-clearfix"})
    text = tags[0].text.strip()
    if "Showing" in text and "results" in text:
        stext = text.split(" ")
        datanum = int(stext[3].replace(',', ''))  # number of search results
    else:
        datanum = math.nan
    return datanum
# Collect information from the search result pages and return it as an np.ndarray
def collect_info_from_advancedsearch(urlhead, datanum, key):
    titles = []        # lists for data storage
    absts = []         # abstract
    cites = []         # cite information (arXiv:XXXXXX)
    authors = []       # authors
    dates = []         # date information
    dates_orig_m = []  # month of first submission
    dates_orig_y = []  # year of first submission
    dois = []          # DOI
    fields = []        # field information, including cross-lists

    startnum = 0
    while datanum > startnum:
        print(str(startnum)+"...", end="")
        url = urlhead+str(startnum)  # advanced search URL for this page
        html_doc = requests.get(url).text
        soup = BeautifulSoup(html_doc, "html.parser")
        # title
        tags1 = soup.find_all("p", {"class": "title is-5 mathjax"})
        titles_tmp = [tag.text.strip() for tag in tags1]
        # abstract
        tags2 = soup.find_all("span", {"class": "abstract-full has-text-grey-dark mathjax"})
        absts_tmp = [tag.text[:-7].strip() for tag in tags2 if "Less" in tag.text]
        # cite information
        tags3 = soup.find_all("p", {"class": "list-title is-inline-block"})
        cites_tmp = [tag.select("a")[0].text for tag in tags3]
        # date information
        tags4 = soup.find_all("p", {"class": "is-size-7"})
        text = [tag.text.strip() for tag in tags4 if "originally announced" in tag.text]
        dates_tmp = text
        dates_orig_y_tmp = [txt.split(" ")[-1][:-1] for txt in text]
        dates_orig_m_tmp = [month_string_to_number(txt.split(" ")[-2]) for txt in text]
        # DOI
        tags5 = soup.find_all("div", {"class": "is-marginless"})
        dois_tmp = [tag.text[tag.text.rfind("doi"):].split("\n")[1] for tag in tags5 if key in tag.text]
        # authors
        tags6 = soup.find_all("p", {"class": "authors"})
        auths_tmp = []
        for tag in tags6:
            auths = tag.select("a")
            authlist = [(author.text, author.get("href")[33:]) for author in auths]
            auths_tmp.append(authlist)
        # field (cross-list) information
        tags7 = soup.find_all("div", {"class": "tags is-inline-block"})
        fields_tmp = [tag.text.strip().split("\n") for tag in tags7]
        # append this page's results
        titles.extend(titles_tmp)
        absts.extend(absts_tmp)
        cites.extend(cites_tmp)
        authors.extend(auths_tmp)
        dates.extend(dates_tmp)
        dates_orig_y.extend(dates_orig_y_tmp)
        dates_orig_m.extend(dates_orig_m_tmp)
        dois.extend(dois_tmp)
        fields.extend(fields_tmp)
        # advance to the next page of search results (sizenum is the global page size set below)
        startnum += sizenum

    nt = np.array(titles)
    na = np.array(absts)
    nauth = np.array(authors)
    ncite = np.array(cites)
    nd = np.array(dates)
    ndy = np.array(dates_orig_y)
    ndm = np.array(dates_orig_m)
    ndoi = np.array(dois)
    nfields = np.array(fields)
    npdataset = np.concatenate([[ncite], [nt], [na], [nauth], [nfields], [ndoi], [ndy], [ndm], [nd]], axis=0).T
    print(" collected data number : ", npdataset.shape[0])
    return npdataset
#Dictionary for specifying search target classification of search query
dict_class = {'cs': '&classification-computer_science=y',
              'econ': '&classification-economics=y',
              'eess': '&classification-eess=y',
              'math': '&classification-mathematics=y',
              'q-bio': '&classification-q_biology=y',
              'q-fin': '&classification-q_finance=y',
              'stat': '&classification-statistics=y',
              'all': '&classification-physics=y&classification-physics_archives=all',
              'astro-ph': '&classification-physics=y&classification-physics_archives=astro-ph',
              'cond-mat': '&classification-physics=y&classification-physics_archives=cond-mat',
              'gr-qc': '&classification-physics=y&classification-physics_archives=gr-qc',
              'hep-ex': '&classification-physics=y&classification-physics_archives=hep-ex',
              'hep-lat': '&classification-physics=y&classification-physics_archives=hep-lat',
              'hep-ph': '&classification-physics=y&classification-physics_archives=hep-ph',
              'hep-th': '&classification-physics=y&classification-physics_archives=hep-th',
              'math-ph': '&classification-physics=y&classification-physics_archives=math-ph',
              'nlin': '&classification-physics=y&classification-physics_archives=nlin',
              'nucl-ex': '&classification-physics=y&classification-physics_archives=nucl-ex',
              'nucl-th': '&classification-physics=y&classification-physics_archives=nucl-th',
              'physics': '&classification-physics=y&classification-physics_archives=physics',
              'quant-ph': '&classification-physics=y&classification-physics_archives=quant-ph'}
years = [y for y in range(2015, 2020)]  # note: range(2015, 2020) covers 2015-2019
key = "quant-ph"  # target field: one of the keys of dict_class above
output_fname = "df_quant-ph"  # output file name prefix
url0="https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=&terms-0-field=title"
url1="&classification-include_cross_list=include"
url_daterange="&date-year=&date-filter_by=date_range"
url2="&date-date_type=submitted_date&abstracts=show&size="
urlmid = "&order=-announced_date_first&start="
sizenum = 25
startnum=0
for year in years:
    m_divide = 1  # number of periods the year is divided into
    mstart = 1
    mstop = 1
    url_date = "&date-from_date="+str(year)+"-"+str(mstart).zfill(2)+"-01&date-to_date="+str(year+1)+"-"+str(mstop).zfill(2)+"-01"
    urlhead = url0+dict_class[key]+url1+url_daterange+url_date+url2
    urlmid = "&order=-announced_date_first&start="
    url = urlhead+str(sizenum)+urlmid+str(startnum)
    datanum = get_number_of_searchresult(url)  # total number of search results for the year
    print("Number of search results ("+str(year)+") : "+str(datanum))

    if datanum >= 10000:  # over the limit: split the one-year search into shorter periods
        m_divide = 13
        for month_divide in range(2, 12):  # find a number of divisions for which every period stays under the limit
            flag_numlimit = False
            for idx in range(month_divide):
                mstart = int(idx*12/month_divide+1)
                mstop = (int((idx+1)*12/month_divide)+1) % 12
                if mstop != 1:
                    url_date = "&date-from_date="+str(year)+"-"+str(mstart).zfill(2)+"-01&date-to_date="+str(year)+"-"+str(mstop).zfill(2)+"-01"
                else:
                    url_date = "&date-from_date="+str(year)+"-"+str(mstart).zfill(2)+"-01&date-to_date="+str(year+1)+"-"+str(mstop).zfill(2)+"-01"
                urlhead = url0+dict_class[key]+url1+url_daterange+url_date+url2
                url = urlhead+str(sizenum)+urlmid+str(startnum)
                datanum = get_number_of_searchresult(url)  # number of results for this period
                if datanum >= 10000:
                    flag_numlimit = True
            if not flag_numlimit:
                m_divide = month_divide
                break
        if m_divide > 12:
            print("*** Number of search results is over the limit of 10,000. Please refine your search. ***")

    sizenum = 200
    npdataset = np.empty((0, 9))
    for idx in range(m_divide):
        mstart = int(idx*12/m_divide+1)
        mstop = (int((idx+1)*12/m_divide)+1) % 12
        if mstop != 1:
            url_date = "&date-from_date="+str(year)+"-"+str(mstart).zfill(2)+"-01&date-to_date="+str(year)+"-"+str(mstop).zfill(2)+"-01"
        else:
            url_date = "&date-from_date="+str(year)+"-"+str(mstart).zfill(2)+"-01&date-to_date="+str(year+1)+"-"+str(mstop).zfill(2)+"-01"
        urlhead = url0+dict_class[key]+url1+url_daterange+url_date+url2
        url = urlhead+str(25)+urlmid+str(0)
        datanum = get_number_of_searchresult(url)
        print("Collect search results ..." + url_date + ", Number of search results : " + str(datanum))
        urlhead2 = urlhead+str(sizenum)+urlmid
        npdataset_tmp = collect_info_from_advancedsearch(urlhead2, datanum, key)
        npdataset = np.concatenate([npdataset, npdataset_tmp], axis=0)

    # convert one year's worth of data to a pandas DataFrame and save as csv
    dataset = pd.DataFrame(npdataset)
    dataset.columns = ["Cite", "Title", "Abst", "Authors", "Fields", "DOI", "OrigDateY", "OrigDateM", "Date Info"]
    dataset.to_csv(output_fname+str(year)+".csv")
Next, read and concatenate the csv files collected for each year, then extract from the concatenated dataset the papers whose title or abstract contains the keyword. This time the keyword is "quantum comput".
fname_head = "df_quant-ph"
fname = fname_head + str(2020) + ".csv"
dataset = pd.read_csv(fname, index_col=0)
for year in range(2015, 2020):  # the years collected above; adjust to the files you actually have
    fname = fname_head + str(year) + ".csv"
    print(fname)
    dataset_tmp = pd.read_csv(fname, index_col=0)
    dataset = pd.concat([dataset, dataset_tmp])
dataset = dataset.reset_index()
dataset_r = dataset.query('Title.str.contains("quantum comput") or Abst.str.contains("quantum comput")', engine='python')
However, a plain substring match like this misses hits because of differences in case and word form, so I also added stemming with nltk (based on a reference article). Lemmatization is written in as well, but it is commented out for now.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

def preprocess_text(text):
    text = re.sub('[^a-zA-Z]', ' ', text)    # remove punctuation
    text = text.lower()
    text = re.sub("</?.*?>", " <> ", text)   # remove tags
    text = re.sub("(\\d|\\W)+", " ", text)   # remove special characters and digits
    text = text.split()
    stop_words = set(stopwords.words("english"))
    # Stemming (Lemmatization is left commented out)
    ps = PorterStemmer()
    # lem = WordNetLemmatizer()
    text = [ps.stem(word) for word in text if word not in stop_words]
    text = " ".join(text)
    return text
ndata = (dataset['Title']+" "+ dataset['Abst']).values
ndata= np.array([preprocess_text(text) for text in ndata])
dataset['Keywords'] = ndata
dataset_r=dataset.query('Keywords.str.contains("quantum comput")', engine='python')
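As a quick sanity check of the preprocessing (the input sentence below is made up), "Quantum Computing with 5 qubits" reduces to roughly "quantum comput qubit", so the substring match on "quantum comput" also catches "quantum computation", "quantum computer", and similar forms:

# Quick check of the preprocessing (the input string is only an example)
print(preprocess_text("Quantum Computing with 5 qubits"))  # -> quantum comput qubit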
Next, networkx is used to build a directed graph whose nodes are author names and paper indices, with an edge for each paper-author relation. The parsing below is a bit convoluted because I saved the author information in a format that is awkward to read back from the csv.
Postscript: to handle spelling variants (for example, given names sometimes initialized and sometimes written out), each author is keyed by the initialized query string taken from the arXiv author-search link, and the most common spelling is picked for display. (If you want a different spelling, edit the dictionary auth_dict directly.) Since matching is done on the initialized string, distinct authors can occasionally be merged; to keep the spellings exactly as they appear instead, use the commented-out authname below. The following snippet, run after auth_dict and auths are built further down, prints the spelling variants found for each author key:
authtext = " ".join(auths)
match = re.findall(r"(\()(.*?)\)", authtext)
for key in auth_dict.keys():
    l = [m[1].split(m[1][0])[1] for m in match if key in m[1]]
    c = collections.Counter(l)
    print(c.most_common())
import networkx as nx
import collections

auths = dataset_r['Authors'].values
datanum = auths.shape[0]

# Build the author-name dictionary: initialized query string -> display name
auth_dict = {}
for idx in range(datanum):
    match = re.findall(r"(\()(.*?)\)", auths[idx])
    for m in match:
        auth = m[1].split(m[1][-1])[-2]    # initialized query string
        authname = m[1].split(m[1][0])[1]  # display name as written on the paper
        auth_dict[auth] = authname

# Replace each dictionary entry with the most common spelling
authtext = " ".join(auths)
match = re.findall(r"(\()(.*?)\)", authtext)
for key in auth_dict.keys():
    l = [m[1].split(m[1][0])[1] for m in match if key in m[1]]
    c = collections.Counter(l)
    auth_dict[key] = c.most_common()[0][0]

# Add papers and authors to the graph
G = nx.DiGraph()
for idx in range(datanum):
    match = re.findall(r"(\()(.*?)\)", auths[idx])
    for m in match:
        auth = m[1].split(m[1][-1])[-2]
        # authname = m[1].split(m[1][0])[1]  # to keep the spellings exactly as written, add edges with this authname instead
        G.add_edges_from([(idx, auth_dict[auth])])
Next, authors with few papers are removed from the graph (this time, authors with a degree of 15 or less, i.e. at most 15 papers, are deleted). Paper nodes whose degree drops to 0 as a result are removed as well.
thr = 15
authorlist = [n for n in G.nodes() if type(n) is str]
for auth in authorlist:
    deg = G.degree(auth)
    if deg <= thr:
        G.remove_node(auth)
for idx in range(datanum):
    deg = G.degree(idx)
    if deg <= 0:
        G.remove_node(idx)
Draw the resulting graph, with authors in blue and papers in red. PageRank is used to make the more important vertices larger (following a reference article).
import matplotlib.pyplot as plt

def draw_graph(G, label=False):
    # PageRank to scale node sizes
    pr = nx.pagerank(G)
    pos = nx.spring_layout(G)
    fig = plt.figure(figsize=(8.0, 6.0))
    c = [(0.4, 0.4, 1) if type(n) is str else (1, 0.4, 0.4) for n in G.nodes()]  # authors blue, papers red
    nx.draw_networkx_edges(G, pos, edge_color=(0.3, 0.3, 0.3))
    nx.draw_networkx_nodes(G, pos, node_color=c, node_size=[5000*v for v in pr.values()])
    if label:
        nx.draw_networkx_labels(G, pos)
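To reproduce the figure, calling the function on the graph built above should suffice (usage sketch):

draw_graph(G)
plt.show()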
The output looks as follows. Many papers end up connected to only a single (remaining) author.
Since what I am really after is the co-author network, the paper nodes are collapsed into a weighted graph between authors, where the edge weight is the number of co-authored papers.
import itertools

def convert_weightedGraph(graph):
    graph_new = nx.Graph()
    for node in graph.nodes:
        if type(node) is str:
            continue  # skip author nodes; iterate over paper nodes only
        n_new = [e[1] for e in graph.edges if e[0] == node]  # authors of this paper
        # combinations (not permutations) so that each co-authored paper adds exactly 1 to the weight
        for e_new in itertools.combinations(n_new, 2):
            flag_dup = False
            for e_check in graph_new.edges(data=True):
                if e_new[0] in e_check and e_new[1] in e_check:
                    e_check[2]['weight'] += 1
                    flag_dup = True
            if not flag_dup:
                graph_new.add_edge(e_new[0], e_new[1], weight=1)
    return graph_new
wG=convert_weightedGraph(G)
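As an aside, the same weighted graph can be built more directly, without rescanning the new graph's edges for every pair, by counting author pairs with collections.Counter. A sketch under the same assumptions as above (paper nodes are integers, author nodes are strings); it should yield the same weights:

# Alternative sketch: count co-author pairs directly
import itertools
import collections

pair_counts = collections.Counter()
for node in G.nodes:
    if type(node) is str:
        continue  # paper nodes only
    coauthors = sorted(a for _, a in G.out_edges(node))
    pair_counts.update(itertools.combinations(coauthors, 2))

wG2 = nx.Graph()
for (a, b), w in pair_counts.items():
    wG2.add_edge(a, b, weight=w)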
Draw the resulting graph with the following function. If labels overlap, you can adjust the figure size or the layout argument k (see the reference article).
def draw_weightedG(G, k0=15):
    fig = plt.figure(figsize=(8.0, 6.0))
    pr = nx.pagerank(G)
    # Use a layout that suits your data; k0 tunes the spacing between nodes (tweak to reduce label overlap)
    pos = nx.fruchterman_reingold_layout(G, k=k0/math.sqrt(G.order()))
    # pos = nx.spring_layout(G, k=15/math.sqrt(G.order()))

    x_values, y_values = zip(*pos.values())
    x_max = max(x_values)
    x_min = min(x_values)
    x_margin = (x_max - x_min) * 0.25
    plt.xlim(x_min - x_margin, x_max + x_margin)  # leave a margin so the label text is not cut off

    node_sizes = [5000*v for v in pr.values()]
    edge_colors = [e[2]['weight'] for e in G.edges(data=True)]  # color edges by weight
    nodes = nx.draw_networkx_nodes(G, pos, node_size=node_sizes, node_color='#9999ff')
    edges = nx.draw_networkx_edges(G, pos, node_size=node_sizes, arrowstyle='->',
                                   arrowsize=10, edge_color=edge_colors,
                                   edge_cmap=plt.cm.Blues, width=2)
    nx.draw_networkx_labels(G, pos)
    ax = plt.gca()
    ax.set_axis_off()
The drawing shows that the graph is not connected: it splits into several components.
So I build a connected subgraph starting from each node of the graph. I didn't spot a ready-made tool at the time, so I wrote it out by hand; a shorter alternative using networkx itself is sketched after the code below.
def add_edges_to_wsubgraph(subg, edge_new, node, edges_all):
    subg.add_edges_from([edge_new])
    if node == edge_new[1]:
        node_new = edge_new[0]
    else:
        node_new = edge_new[1]
    edges_new = [e for e in edges_all if node_new in e and e not in subg.edges(data=True)]
    for edge in edges_new:
        if edge not in subg.edges(data=True):
            add_edges_to_wsubgraph(subg, edge, node_new, edges_all)

def separate_wG_by_connectivity(G):
    nodes_all = [n for n in G.nodes()]
    edges_all = [e for e in G.edges(data=True)]
    subgraphs = []
    for node in nodes_all:
        usedflag = any([node in g.nodes for g in subgraphs])
        if usedflag:
            continue
        subg = nx.Graph()
        subg.add_node(node)
        edges_new = [e for e in edges_all if node in e]
        for edge in edges_new:
            if edge not in subg.edges(data=True):
                add_edges_to_wsubgraph(subg, edge, node, edges_all)
        subgraphs.append(subg)
    return subgraphs
subgraphs = separate_wG_by_connectivity(wG)
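For reference, networkx itself can do this split via connected_components; a minimal sketch that should give equivalent subgraphs:

# Alternative sketch: split the weighted graph into connected components with networkx
subgraphs_alt = [wG.subgraph(c).copy() for c in nx.connected_components(wG)]
print(len(subgraphs_alt), "connected components")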
Drawing each of the subgraphs obtained here gives the following three graphs (combined into one image by hand).
cnt = 0
for subg in subgraphs:  # draw each subgraph and save the image
    draw_weightedG(subg)
    plt.savefig("subgraph_"+str(cnt)+".png")
    plt.cla()
    cnt += 1
Starting from data acquisition on arXiv, I built something like a co-author network for a specific keyword. The example shown here came out relatively clean, but depending on the search keyword and the threshold settings, sometimes only one large cluster remains. In such cases it may help to also threshold the edge weights, i.e. the number of co-authored papers, at a suitable value to reveal finer structure. Beyond that, it would be interesting to pull in citation data (references and citing papers), but I couldn't think of an easy source for it, so I'd be glad to hear suggestions.