This post is an experiment in collecting information from arXiv and trying to extract something useful from it. The procedure I followed this time is as follows.
The question I wanted to explore is whether clusters such as large research groups become visible in the field matching a given keyword. The end result is the co-author network (filtered by number of papers) shown in the graph below.
This is the co-author network obtained by selecting only authors with more than 15 relevant papers among arXiv quant-ph papers from 2015 to 2020 whose title or abstract contains "quantum comput". (Edge darkness indicates the number of co-authored papers, normalized within each cluster.)
**(Addendum: handled variant spellings of author names and corrected the data.)**
The scraping is done in plain Python. (There is also an official arXiv API, but here I query arXiv's Advanced Search pages directly.) The information collected for each year is converted into a pandas DataFrame and saved as a csv file.
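For reference, a minimal sketch of the API route that I did not take here: the arXiv API at http://export.arxiv.org/api/query returns an Atom feed, which can be parsed with, for example, the feedparser package (the query string below is just an illustration):

# Minimal sketch of the arXiv API alternative (not used in this post).
# Assumes the feedparser package is installed; the search query is only an example.
import feedparser

url = ("http://export.arxiv.org/api/query"
       "?search_query=cat:quant-ph+AND+all:%22quantum+computing%22"
       "&start=0&max_results=5")
feed = feedparser.parse(url)
for entry in feed.entries:
    print(entry.title, [a.name for a in entry.authors])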
Information collected:
- Cite: arxiv:XXXXX
- Title: paper title
- Abst: abstract
- Authors: saved as a list of authors, each a tuple of (author name as linked text, string used in the author search query)
- Fields: field information; if the paper is cross-listed, multiple fields are saved
- DOI: recorded when a DOI has been attached
- OrigDateY: year of first submission
- OrigDateM: month of first submission
- Date Info: other date information (e.g. revisions) saved as text
Some of this information ends up unused this time.
**Note: because of arXiv's limits, the search does not work if the number of hits for one year exceeds 10,000. The code therefore checks the count first and, if it exceeds the limit, splits the year into shorter periods (down to monthly) and collects the data in pieces.**
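Before the main code: the month-splitting arithmetic used in the collection loop below can be checked on its own. With three divisions, for example, it yields the ranges 1→5, 5→9, and 9→1 of the following year (a standalone sketch of the same computation):

# Standalone check of the month-splitting arithmetic used in the collection loop below
m_divide = 3  # example number of divisions
for idx in range(m_divide):
    mstart = int(idx*12/m_divide + 1)
    mstop = (int((idx+1)*12/m_divide) + 1) % 12
    print(mstart, "->", mstop)  # 1 -> 5, 5 -> 9, 9 -> 1 (the last range rolls over into the next year)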
from bs4 import BeautifulSoup
import requests
import re
import pandas as pd
import numpy as np
import math
def month_string_to_number(string):
    m = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4,
         'may': 5, 'jun': 6, 'jul': 7, 'aug': 8,
         'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}
    s = string.strip()[:3].lower()
    try:
        return m[s]
    except KeyError:
        raise ValueError('Not a month')
# Get the number of search results from the search result page
def get_number_of_searchresult(url):
    html_doc = requests.get(url).text
    soup = BeautifulSoup(html_doc, "html.parser")
    tags = soup.find_all("h1", {"class": "title is-clearfix"})
    text = tags[0].text.strip()
    if "Showing" in text and "results" in text:
        stext = text.split(" ")
        datanum = int(stext[3].replace(',', ''))  # number of search results
    else:
        datanum = math.nan
    return datanum
# Collect information from the search result pages and return it as an np.ndarray
def collect_info_from_advancedsearch(urlhead, datanum, key):
    titles = []        # lists for data storage
    absts = []         # abstract
    cites = []         # cite information (arXiv:XXXXXX)
    authors = []       # authors
    dates = []         # date information
    dates_orig_m = []  # month of first submission
    dates_orig_y = []  # year of first submission
    dois = []          # DOI
    fields = []        # field information, including cross-lists

    startnum = 0
    while datanum > startnum:
        print(str(startnum)+"...", end="")
        url = urlhead+str(startnum)  # advanced search URL for this page
        html_doc = requests.get(url).text
        soup = BeautifulSoup(html_doc, "html.parser")
        # title
        tags1 = soup.find_all("p", {"class": "title is-5 mathjax"})
        titles_tmp = [tag.text.strip() for tag in tags1]
        # abstract
        tags2 = soup.find_all("span", {"class": "abstract-full has-text-grey-dark mathjax"})
        absts_tmp = [tag.text[:-7].strip() for tag in tags2 if "Less" in tag.text]
        # cite information
        tags3 = soup.find_all("p", {"class": "list-title is-inline-block"})
        cites_tmp = [tag.select("a")[0].text for tag in tags3]
        # date information
        tags4 = soup.find_all("p", {"class": "is-size-7"})
        text = [tag.text.strip() for tag in tags4 if "originally announced" in tag.text]
        dates_tmp = text
        dates_orig_y_tmp = [txt.split(" ")[-1][:-1] for txt in text]
        dates_orig_m_tmp = [month_string_to_number(txt.split(" ")[-2]) for txt in text]
        # DOI
        tags5 = soup.find_all("div", {"class": "is-marginless"})
        dois_tmp = [tag.text[tag.text.rfind("doi"):].split("\n")[1] for tag in tags5 if key in tag.text]
        # authors
        tags6 = soup.find_all("p", {"class": "authors"})
        auths_tmp = []
        for tag in tags6:
            auths = tag.select("a")
            authlist = [(author.text, author.get("href")[33:]) for author in auths]
            auths_tmp.append(authlist)
        # field (cross-list) information
        tags7 = soup.find_all("div", {"class": "tags is-inline-block"})
        fields_tmp = [tag.text.strip().split("\n") for tag in tags7]
        # append this page's results
        titles.extend(titles_tmp)
        absts.extend(absts_tmp)
        cites.extend(cites_tmp)
        authors.extend(auths_tmp)
        dates.extend(dates_tmp)
        dates_orig_y.extend(dates_orig_y_tmp)
        dates_orig_m.extend(dates_orig_m_tmp)
        dois.extend(dois_tmp)
        fields.extend(fields_tmp)
        # advance to the next page of search results (sizenum is the global page size set below)
        startnum += sizenum

    nt = np.array(titles)
    na = np.array(absts)
    nauth = np.array(authors)
    ncite = np.array(cites)
    nd = np.array(dates)
    ndy = np.array(dates_orig_y)
    ndm = np.array(dates_orig_m)
    ndoi = np.array(dois)
    nfields = np.array(fields)
    npdataset = np.concatenate([[ncite], [nt], [na], [nauth], [nfields], [ndoi], [ndy], [ndm], [nd]], axis=0).T
    print(" collected data number : ", npdataset.shape[0])
    return npdataset
#Dictionary for specifying search target classification of search query
dict_class = {'cs': '&classification-computer_science=y',
              'econ': '&classification-economics=y',
              'eess': '&classification-eess=y',
              'math': '&classification-mathematics=y',
              'q-bio': '&classification-q_biology=y',
              'q-fin': '&classification-q_finance=y',
              'stat': '&classification-statistics=y',
              'all': '&classification-physics=y&classification-physics_archives=all',
              'astro-ph': '&classification-physics=y&classification-physics_archives=astro-ph',
              'cond-mat': '&classification-physics=y&classification-physics_archives=cond-mat',
              'gr-qc': '&classification-physics=y&classification-physics_archives=gr-qc',
              'hep-ex': '&classification-physics=y&classification-physics_archives=hep-ex',
              'hep-lat': '&classification-physics=y&classification-physics_archives=hep-lat',
              'hep-ph': '&classification-physics=y&classification-physics_archives=hep-ph',
              'hep-th': '&classification-physics=y&classification-physics_archives=hep-th',
              'math-ph': '&classification-physics=y&classification-physics_archives=math-ph',
              'nlin': '&classification-physics=y&classification-physics_archives=nlin',
              'nucl-ex': '&classification-physics=y&classification-physics_archives=nucl-ex',
              'nucl-th': '&classification-physics=y&classification-physics_archives=nucl-th',
              'physics': '&classification-physics=y&classification-physics_archives=physics',
              'quant-ph': '&classification-physics=y&classification-physics_archives=quant-ph'}
years = [y for y in range(2015, 2020)]  # note: range(2015, 2020) covers 2015-2019
key = "quant-ph"  # target field: one of the keys of dict_class above
output_fname = "df_quant-ph"  # output file name prefix
url0="https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=&terms-0-field=title"
url1="&classification-include_cross_list=include"
url_daterange="&date-year=&date-filter_by=date_range"
url2="&date-date_type=submitted_date&abstracts=show&size="
urlmid = "&order=-announced_date_first&start="
sizenum = 25
startnum=0
for year in years:
    m_divide = 1  # number of periods the year is divided into
    mstart = 1
    mstop = 1
    url_date = "&date-from_date="+str(year)+"-"+str(mstart).zfill(2)+"-01&date-to_date="+str(year+1)+"-"+str(mstop).zfill(2)+"-01"
    urlhead = url0+dict_class[key]+url1+url_daterange+url_date+url2
    urlmid = "&order=-announced_date_first&start="
    url = urlhead+str(sizenum)+urlmid+str(startnum)
    datanum = get_number_of_searchresult(url)  # total number of search results for the year
    print("Number of search results ("+str(year)+") : "+str(datanum))

    if datanum >= 10000:  # over the limit: split the one-year search into shorter periods
        m_divide = 13
        for month_divide in range(2, 12):  # find a number of divisions for which every period stays under the limit
            flag_numlimit = False
            for idx in range(month_divide):
                mstart = int(idx*12/month_divide+1)
                mstop = (int((idx+1)*12/month_divide)+1) % 12
                if mstop != 1:
                    url_date = "&date-from_date="+str(year)+"-"+str(mstart).zfill(2)+"-01&date-to_date="+str(year)+"-"+str(mstop).zfill(2)+"-01"
                else:
                    url_date = "&date-from_date="+str(year)+"-"+str(mstart).zfill(2)+"-01&date-to_date="+str(year+1)+"-"+str(mstop).zfill(2)+"-01"
                urlhead = url0+dict_class[key]+url1+url_daterange+url_date+url2
                url = urlhead+str(sizenum)+urlmid+str(startnum)
                datanum = get_number_of_searchresult(url)  # number of results for this period
                if datanum >= 10000:
                    flag_numlimit = True
            if not flag_numlimit:
                m_divide = month_divide
                break
        if m_divide > 12:
            print("*** Number of search results is over the limit of 10,000. Please refine your search. ***")

    sizenum = 200
    npdataset = np.empty((0, 9))
    for idx in range(m_divide):
        mstart = int(idx*12/m_divide+1)
        mstop = (int((idx+1)*12/m_divide)+1) % 12
        if mstop != 1:
            url_date = "&date-from_date="+str(year)+"-"+str(mstart).zfill(2)+"-01&date-to_date="+str(year)+"-"+str(mstop).zfill(2)+"-01"
        else:
            url_date = "&date-from_date="+str(year)+"-"+str(mstart).zfill(2)+"-01&date-to_date="+str(year+1)+"-"+str(mstop).zfill(2)+"-01"
        urlhead = url0+dict_class[key]+url1+url_daterange+url_date+url2
        url = urlhead+str(25)+urlmid+str(0)
        datanum = get_number_of_searchresult(url)
        print("Collect search results ..." + url_date + ", Number of search results : " + str(datanum))
        urlhead2 = urlhead+str(sizenum)+urlmid
        npdataset_tmp = collect_info_from_advancedsearch(urlhead2, datanum, key)
        npdataset = np.concatenate([npdataset, npdataset_tmp], axis=0)

    # convert one year's worth of data to a pandas DataFrame and save as csv
    dataset = pd.DataFrame(npdataset)
    dataset.columns = ["Cite", "Title", "Abst", "Authors", "Fields", "DOI", "OrigDateY", "OrigDateM", "Date Info"]
    dataset.to_csv(output_fname+str(year)+".csv")
Next, read and concatenate the csv files collected for each year, then extract from the concatenated dataset the papers whose title or abstract contains the keyword. This time the keyword is "quantum comput".
fname_head = "df_quant-ph"
fname = fname_head + str(2020) + ".csv"
dataset = pd.read_csv(fname, index_col=0)
for year in range(2015, 2020):  # the years collected above; adjust to the files you actually have
    fname = fname_head + str(year) + ".csv"
    print(fname)
    dataset_tmp = pd.read_csv(fname, index_col=0)
    dataset = pd.concat([dataset, dataset_tmp])
dataset = dataset.reset_index()
dataset_r = dataset.query('Title.str.contains("quantum comput") or Abst.str.contains("quantum comput")', engine='python')
However, a plain substring match like this misses hits because of differences in case and word form, so I also added stemming with nltk (based on a reference article). Lemmatization is written in as well, but it is commented out for now.
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer

def preprocess_text(text):
    text = re.sub('[^a-zA-Z]', ' ', text)    # remove punctuation
    text = text.lower()
    text = re.sub("</?.*?>", " <> ", text)   # remove tags
    text = re.sub("(\\d|\\W)+", " ", text)   # remove special characters and digits
    text = text.split()
    stop_words = set(stopwords.words("english"))
    # Stemming (Lemmatization is left commented out)
    ps = PorterStemmer()
    # lem = WordNetLemmatizer()
    text = [ps.stem(word) for word in text if word not in stop_words]
    text = " ".join(text)
    return text
ndata = (dataset['Title']+" "+ dataset['Abst']).values
ndata= np.array([preprocess_text(text) for text in ndata])
dataset['Keywords'] = ndata
dataset_r=dataset.query('Keywords.str.contains("quantum comput")', engine='python')
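As a quick sanity check of the preprocessing (the input sentence below is made up), "Quantum Computing with 5 qubits" reduces to roughly "quantum comput qubit", so the substring match on "quantum comput" also catches "quantum computation", "quantum computer", and similar forms:

# Quick check of the preprocessing (the input string is only an example)
print(preprocess_text("Quantum Computing with 5 qubits"))  # -> quantum comput qubit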
Next, networkx is used to build a directed graph whose nodes are author names and paper indices, with an edge for each paper-author relation. The parsing below is a bit convoluted because I saved the author information in a format that is awkward to read back from the csv.
Postscript: to handle spelling variants (for example, given names sometimes initialized and sometimes written out), each author is keyed by the initialized query string taken from the arXiv author-search link, and the most common spelling is picked for display. (If you want a different spelling, edit the dictionary auth_dict directly.) Since matching is done on the initialized string, distinct authors can occasionally be merged; to keep the spellings exactly as they appear instead, use the commented-out authname below. The following snippet, run after auth_dict and auths are built further down, prints the spelling variants found for each author key:
authtext = " ".join(auths)
match = re.findall(r"(\()(.*?)\)", authtext)
for key in auth_dict.keys():
    l = [m[1].split(m[1][0])[1] for m in match if key in m[1]]
    c = collections.Counter(l)
    print(c.most_common())
import networkx as nx
import collections

auths = dataset_r['Authors'].values
datanum = auths.shape[0]

# Build the author-name dictionary: initialized query string -> display name
auth_dict = {}
for idx in range(datanum):
    match = re.findall(r"(\()(.*?)\)", auths[idx])
    for m in match:
        auth = m[1].split(m[1][-1])[-2]    # initialized query string
        authname = m[1].split(m[1][0])[1]  # display name as written on the paper
        auth_dict[auth] = authname

# Replace each dictionary entry with the most common spelling
authtext = " ".join(auths)
match = re.findall(r"(\()(.*?)\)", authtext)
for key in auth_dict.keys():
    l = [m[1].split(m[1][0])[1] for m in match if key in m[1]]
    c = collections.Counter(l)
    auth_dict[key] = c.most_common()[0][0]

# Add papers and authors to the graph
G = nx.DiGraph()
for idx in range(datanum):
    match = re.findall(r"(\()(.*?)\)", auths[idx])
    for m in match:
        auth = m[1].split(m[1][-1])[-2]
        # authname = m[1].split(m[1][0])[1]  # to keep the spellings exactly as written, add edges with this authname instead
        G.add_edges_from([(idx, auth_dict[auth])])
Next, authors with few papers are removed from the graph (this time, authors with a degree of 15 or less, i.e. at most 15 papers, are deleted). Paper nodes whose degree drops to 0 as a result are removed as well.
thr = 15
authorlist = [n for n in G.nodes() if type(n) is str]
for auth in authorlist:
    deg = G.degree(auth)
    if deg <= thr:
        G.remove_node(auth)
for idx in range(datanum):
    deg = G.degree(idx)
    if deg <= 0:
        G.remove_node(idx)
Draw the resulting graph, with authors in blue and papers in red. PageRank is used to make the more important vertices larger (following a reference article).
import matplotlib.pyplot as plt

def draw_graph(G, label=False):
    # PageRank to scale node sizes
    pr = nx.pagerank(G)
    pos = nx.spring_layout(G)
    fig = plt.figure(figsize=(8.0, 6.0))
    c = [(0.4, 0.4, 1) if type(n) is str else (1, 0.4, 0.4) for n in G.nodes()]  # authors blue, papers red
    nx.draw_networkx_edges(G, pos, edge_color=(0.3, 0.3, 0.3))
    nx.draw_networkx_nodes(G, pos, node_color=c, node_size=[5000*v for v in pr.values()])
    if label:
        nx.draw_networkx_labels(G, pos)
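To reproduce the figure, calling the function on the graph built above should suffice (usage sketch):

draw_graph(G)
plt.show()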
The output looks as follows. Many papers end up connected to only a single (remaining) author.
Since what I am really after is the co-author network, the paper nodes are collapsed into a weighted graph between authors, where the edge weight is the number of co-authored papers.
import itertools

def convert_weightedGraph(graph):
    graph_new = nx.Graph()
    for node in graph.nodes:
        if type(node) is str:
            continue  # skip author nodes; iterate over paper nodes only
        n_new = [e[1] for e in graph.edges if e[0] == node]  # authors of this paper
        # combinations (not permutations) so that each co-authored paper adds exactly 1 to the weight
        for e_new in itertools.combinations(n_new, 2):
            flag_dup = False
            for e_check in graph_new.edges(data=True):
                if e_new[0] in e_check and e_new[1] in e_check:
                    e_check[2]['weight'] += 1
                    flag_dup = True
            if not flag_dup:
                graph_new.add_edge(e_new[0], e_new[1], weight=1)
    return graph_new
wG=convert_weightedGraph(G)
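As an aside, the same weighted graph can be built more directly, without rescanning the new graph's edges for every pair, by counting author pairs with collections.Counter. A sketch under the same assumptions as above (paper nodes are integers, author nodes are strings); it should yield the same weights:

# Alternative sketch: count co-author pairs directly
import itertools
import collections

pair_counts = collections.Counter()
for node in G.nodes:
    if type(node) is str:
        continue  # paper nodes only
    coauthors = sorted(a for _, a in G.out_edges(node))
    pair_counts.update(itertools.combinations(coauthors, 2))

wG2 = nx.Graph()
for (a, b), w in pair_counts.items():
    wG2.add_edge(a, b, weight=w)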
Draw the resulting graph with the following function. If labels overlap, you can adjust the figure size or the layout argument k (see the reference article).
def draw_weightedG(G, k0=15):
    fig = plt.figure(figsize=(8.0, 6.0))
    pr = nx.pagerank(G)
    # Use a layout that suits your data; k0 tunes the spacing between nodes (tweak to reduce label overlap)
    pos = nx.fruchterman_reingold_layout(G, k=k0/math.sqrt(G.order()))
    # pos = nx.spring_layout(G, k=15/math.sqrt(G.order()))

    x_values, y_values = zip(*pos.values())
    x_max = max(x_values)
    x_min = min(x_values)
    x_margin = (x_max - x_min) * 0.25
    plt.xlim(x_min - x_margin, x_max + x_margin)  # leave a margin so the label text is not cut off

    node_sizes = [5000*v for v in pr.values()]
    edge_colors = [e[2]['weight'] for e in G.edges(data=True)]  # color edges by weight
    nodes = nx.draw_networkx_nodes(G, pos, node_size=node_sizes, node_color='#9999ff')
    edges = nx.draw_networkx_edges(G, pos, node_size=node_sizes, arrowstyle='->',
                                   arrowsize=10, edge_color=edge_colors,
                                   edge_cmap=plt.cm.Blues, width=2)
    nx.draw_networkx_labels(G, pos)
    ax = plt.gca()
    ax.set_axis_off()
The drawing shows that the graph is not connected: it splits into several components.
So I build a connected subgraph starting from each node of the graph. I didn't spot a ready-made tool at the time, so I wrote it out by hand; a shorter alternative using networkx itself is sketched after the code below.
def add_edges_to_wsubgraph(subg, edge_new, node, edges_all):
    subg.add_edges_from([edge_new])
    if node == edge_new[1]:
        node_new = edge_new[0]
    else:
        node_new = edge_new[1]
    edges_new = [e for e in edges_all if node_new in e and e not in subg.edges(data=True)]
    for edge in edges_new:
        if edge not in subg.edges(data=True):
            add_edges_to_wsubgraph(subg, edge, node_new, edges_all)

def separate_wG_by_connectivity(G):
    nodes_all = [n for n in G.nodes()]
    edges_all = [e for e in G.edges(data=True)]
    subgraphs = []
    for node in nodes_all:
        usedflag = any([node in g.nodes for g in subgraphs])
        if usedflag:
            continue
        subg = nx.Graph()
        subg.add_node(node)
        edges_new = [e for e in edges_all if node in e]
        for edge in edges_new:
            if edge not in subg.edges(data=True):
                add_edges_to_wsubgraph(subg, edge, node, edges_all)
        subgraphs.append(subg)
    return subgraphs
subgraphs = separate_wG_by_connectivity(wG)
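For reference, networkx itself can do this split via connected_components; a minimal sketch that should give equivalent subgraphs:

# Alternative sketch: split the weighted graph into connected components with networkx
subgraphs_alt = [wG.subgraph(c).copy() for c in nx.connected_components(wG)]
print(len(subgraphs_alt), "connected components")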
Drawing each of the subgraphs obtained here gives the following three graphs (combined into one image by hand).
cnt = 0
for subg in subgraphs:  # draw each subgraph and save the image
    draw_weightedG(subg)
    plt.savefig("subgraph_"+str(cnt)+".png")
    plt.cla()
    cnt += 1
Starting from data acquisition on arXiv, I built something like a co-author network for a specific keyword. The example shown here came out relatively clean, but depending on the search keyword and the threshold settings, sometimes only one large cluster remains. In such cases it may help to also threshold the edge weights, i.e. the number of co-authored papers, at a suitable value to reveal finer structure. Beyond that, it would be interesting to pull in citation data (references and citing papers), but I couldn't think of an easy source for it, so I'd be glad to hear suggestions.