Using Python scraping and some statistical processing, I visualized the age groups and rating distribution of people participating in AtCoder (a competitive programming site).
First, I scraped and tabulated the birth years of AtCoder participants; the result is shown below. Note that users who do not enter their age in their profile are not counted. As you might expect, young people, especially university students, dominate.
Unsurprisingly, there appears to be a correlation between the number of contests entered and the rating. Note that, by design of AtCoder's rating system, users who have entered 10 or fewer rated contests may display a rating considerably lower than their actual ability. See "About AtCoder Contest Rating" for details.
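For a rough feel for how large that effect is, here is a sketch of the discount term as I understand it from AtCoder's rating documentation; treat the exact constants as my assumption and consult the official page above for the authoritative formula.

```python
import math

def rating_penalty(n):
    """Approximate gap between internal and displayed rating after n rated
    contests, per my reading of AtCoder's rating formula (assumed constants):
    the discount starts at 1200 after one contest and decays as n grows."""
    f_n = math.sqrt(sum(0.81 ** i for i in range(1, n + 1))) / sum(0.9 ** i for i in range(1, n + 1))
    f_1, f_inf = 1.0, 1.0 / math.sqrt(19.0)
    return (f_n - f_inf) / (f_1 - f_inf) * 1200.0

# After 10 contests the discount is still on the order of 150 points, which
# is why ratings with 10 or fewer contests tend to understate ability.
```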
I visualized the ratings of active users by mean and standard deviation for each number of contests entered so far. The mean is the blue dot, and mean ± one standard deviation is shown by the yellow band. Even after accounting for the rating-system behavior above, there appears to be a positive correlation between participation count and rating. I had imagined that beyond roughly 30 contests the correlation would flatten out, with ratings sticking near a ceiling, but in fact the positive trend continues.
As an example, here is a histogram of ratings for users who have entered exactly five contests so far.
Because the mean is strongly influenced by outliers, such as users with prior competitive programming experience whose ability is unusually high from the start, I also visualized the median. Here the median is the blue dot, and the 25th to 75th percentile range is shown by the yellow band. The medians are slightly lower overall than the means.
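The effect is easy to see on toy data: a single strong newcomer shifts the mean far more than the median (the numbers below are made up for illustration).

```python
from statistics import mean, median

# Hypothetical ratings for six users after their first few contests;
# one of them (2800) already has competitive programming experience.
ratings = [400, 450, 500, 520, 560, 2800]

print(mean(ratings))    # ~871.7: the single outlier drags the mean up
print(median(ratings))  # 510.0: the median is barely affected
```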
One question arose while visualizing the age and rating distributions. You have probably heard of the "35-year-old retirement age" theory for programmers, but is there actually a correlation between age and AtCoder rating? So I visualized that too. As mentioned above, ratings of users with 10 or fewer rated contests can significantly understate actual ability, so the graph below is limited to users with more than 10 contests, and uses the median to reduce the influence of outliers. Looking at the result, there appears to be almost no correlation between age and rating. There is little data for users in their 40s and the results vary, so treat that range as indicative only.
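To put a number on "almost no correlation", one option is a rank correlation between birth year and rating. Here is a minimal, dependency-free Spearman sketch; the variable names in the trailing comment are hypothetical.

```python
def spearman(x, y):
    """Spearman rank correlation. Ties get an arbitrary order here, which
    is fine for illustration but not for serious analysis."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        r = [0.0] * len(values)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r

    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# e.g. spearman(birth_years, ratings) over the filtered users; a value near
# zero would back up the "almost no correlation" reading.
```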
The source code is shown below.
from urllib import request
from bs4 import BeautifulSoup

# Changing this URL lets you restrict the population (participation count, etc.)
url = "https://atcoder.jp/ranking/?f.Country=&f.UserScreenName=&f.Affiliation=&f.BirthYearLowerBound=0&f.BirthYearUpperBound=9999&f.RatingLowerBound=0&f.RatingUpperBound=9999&f.HighestRatingLowerBound=0&f.HighestRatingUpperBound=9999&f.CompetitionsLowerBound=1&f.CompetitionsUpperBound=9999&f.WinsLowerBound=0&f.WinsUpperBound=9999&page="
html = request.urlopen(url + "0")
soup = BeautifulSoup(html, "html.parser")  # parse the HTML
ul = soup.find_all("ul")  # elements can be extracted by tag name and attributes

# Find the pagination block and read off the highest page number
a = []
page = 0
for tag in ul:
    try:
        if "pagination" in tag.get("class"):
            a = tag.find_all("a")
            break
    except TypeError:  # tag has no class attribute
        pass
for tag in a:
    try:
        if "ranking" in tag.get("href"):
            page = max(page, int(tag.string))
    except (TypeError, ValueError):  # no href, or non-numeric link text
        pass
# Collect each user's affiliation and user name from every ranking page
organization = []
name = []
for i in range(1, page + 1):
    html = request.urlopen(url + str(i))
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup.find_all("span"):
        classes = tag.get("class")
        if classes and classes[0] == "ranking-affiliation":
            organization.append(str(tag.string))
    for tag in soup.find_all("a"):
        classes = tag.get("class")
        if classes and classes[0] == "username":
            name.append(str(tag.string))
# Collect the remaining table columns (birth year, rating, participations, etc.)
information = []
for i in range(1, page + 1):
    html = request.urlopen(url + str(i))
    soup = BeautifulSoup(html, "html.parser")
    for tbody in soup.find_all("tbody"):
        for tr in tbody:
            temp = []
            for td in tr:
                string_ = str(td.string).strip()
                if len(string_) > 0:
                    temp.append(string_)
            if len(temp) > 0:
                information.append(temp[2:])  # drop the rank and user name cells
information = [[name[i], organization[i]] + information[i] for i in range(len(information))]
#%%
import matplotlib.pyplot as plt

# Bucket ratings by birth year; generation[y] counts users born in year y
year_upper = 2020
rank_dic = {i: [] for i in range(year_upper + 1)}
generation = [0 for i in range(year_upper)]
for row in information:
    try:
        year = int(row[2])                  # birth year column
        rank_dic[year].append(int(row[3]))  # rating column
        generation[year] += 1
    except (ValueError, KeyError, IndexError):  # missing or out-of-range birth year
        pass
# Drop birth years with fewer than 10 users
for year in list(rank_dic.keys()):
    if len(rank_dic[year]) < 10:
        del rank_dic[year]
#%%
import numpy as np
from statistics import mean, median, stdev

# Per-birth-year statistics of the rating distribution
keys = list(rank_dic.keys())
ave_rank = np.array([[i, mean(rank_dic[i])] for i in keys], dtype="float32")
stdev_rank = np.array([[i, stdev(rank_dic[i])] for i in keys], dtype="float32")
max_rank = np.array([[i, max(rank_dic[i])] for i in keys], dtype="float32")
median_rank = np.array([[i, median(rank_dic[i])] for i in keys], dtype="float32")
percent25 = np.array([[i, np.percentile(rank_dic[i], 25)] for i in keys], dtype="float32")
percent75 = np.array([[i, np.percentile(rank_dic[i], 75)] for i in keys], dtype="float32")
# Mean rating by birth year (mean +/- stdev band)
plt.fill_between(ave_rank[:,0], ave_rank[:,1]-stdev_rank[:,1], ave_rank[:,1]+stdev_rank[:,1],facecolor='y',alpha=0.5)
plt.scatter(ave_rank[:,0], ave_rank[:,1])
plt.xlim(1970,2010)
plt.ylim(-100,2000)
plt.tick_params(labelsize=15)
plt.grid()
plt.title("ave")
plt.show()
# Median rating by birth year (25th-75th percentile band)
plt.fill_between(percent25[:,0], percent25[:,1], percent75[:,1],facecolor='y',alpha=0.5)
plt.scatter(median_rank[:,0], median_rank[:,1])
plt.xlim(1970,2010)
plt.ylim(-100,2000)
plt.tick_params(labelsize=15)
plt.grid()
plt.title("med")
plt.show()
# Number of participants by birth year
plt.plot([1996,1996],[-200,5000],zorder=1,linestyle="dashed",color="red")
plt.plot([2001,2001],[-200,5000],zorder=1,linestyle="dashed",color="red")
plt.fill_between([1996,2001], [-200,-200],[5000,5000],facecolor='red',alpha=0.5)
plt.scatter(range(len(generation)), generation,s=80,c="white",zorder=2,edgecolors="black",linewidths=2)
plt.xlim(1960,2010)
plt.ylim(-100,4500)
plt.tick_params(labelsize=15)
plt.grid()
plt.title("population")
plt.show()
#%%
# Bucket ratings by number of contests entered (up to 200)
compe_count = [[] for i in range(201)]
for row in information:
    compe_count[int(row[5])].append(int(row[3]))
ave_rank_count = np.array([[i,mean(X)] if len(X)>5 else [i,None] for i,X in enumerate(compe_count)], dtype = "float32")[1:]
stdev_rank_count = np.array([[i,stdev(X)] if len(X)>5 else [i,None] for i,X in enumerate(compe_count)], dtype = "float32")[1:]
max_rank_count = np.array([[i,max(X)] if len(X)>5 else [i,None] for i,X in enumerate(compe_count)], dtype = "float32")[1:]
min_rank_count = np.array([[i,min(X)] if len(X)>5 else [i,None] for i,X in enumerate(compe_count)], dtype = "float32")[1:]
med_rank_count = np.array([[i,median(X)] if len(X)>5 else [i,None] for i,X in enumerate(compe_count)], dtype = "float32")[1:]
percent25_count = np.array([[i, np.percentile(X, 25)] if len(X) > 5 else [i, None] for i, X in enumerate(compe_count)], dtype="float32")[1:]
percent75_count = np.array([[i, np.percentile(X, 75)] if len(X) > 5 else [i, None] for i, X in enumerate(compe_count)], dtype="float32")[1:]
# Sanity-check histograms for users with 1-19 contests entered
for i, X in enumerate(compe_count[1:20], start=1):
    plt.hist(X, bins=40)
    plt.title(i)
    plt.show()
#Participation count and average score
plt.fill_between(ave_rank_count[:,0],ave_rank_count[:,1]-stdev_rank_count[:,1],ave_rank_count[:,1]+stdev_rank_count[:,1],facecolor='y',alpha=0.5)
plt.scatter(ave_rank_count[:,0], ave_rank_count[:,1],zorder=2)
plt.tick_params(labelsize=15)
plt.grid()
plt.ylim(-100,2500)
#plt.title("ave_count")
plt.show()
#Participation count and central score
plt.fill_between(percent25_count[:,0], percent25_count[:,1], percent75_count[:,1],facecolor='y',alpha=0.5)
plt.scatter(med_rank_count[:,0], med_rank_count[:,1])
plt.tick_params(labelsize=15)
plt.ylim(-100,2500)
plt.grid()
#plt.title("med_count")
plt.show()
This article owes a great deal to the following articles: "I tried to get the rating distribution of AtCoder by web scraping with Python" and "I examined the distribution of AtCoder ratings".