It's been about half a year since I started posting to Qiita, mainly articles related to statistics, machine learning, and data analysis. Let's look back on the articles so far using the Qiita API. (Everything below is calculated from the data as of August 10, 2015.)
We'll look at the results first, then at the Python code that produced them and how to call the Qiita API from Python.
The top 5 articles account for 73% of all stocks. Popularity is clearly skewed toward a few articles... Personally I like "The meaning of fractional division, understood with pizza" at the bottom, but it hasn't been stocked at all. :sweat_smile:
I write articles in the major categories of "machine learning," "statistics," "mathematics," "data analysis," and "others."
| Math |
| --- |
| [Mathematics] Let's visualize what eigenvalues and eigenvectors are |
| The meaning of fractional division, understood with pizza |
Let's look at each tag. Since I basically use Python, the tag with the most articles is Python. Looking at the stock-per-article ratio, "DeepLearning", "Deep learning", and "Chainer" are overwhelmingly high. You can see the excitement around deep learning these days.
"Mathematics" and "machine learning" also seem to have relatively high stock rates.
Tag | Number of articles | Stock quantity | Stock/article ratio |
---|---|---|---|
Python | 30 | 2664 | 88.8 |
statistics | 22 | 1589 | 72.2 |
statistics | 17 | 1274 | 74.9 |
Machine learning | 9 | 1127 | 125.2 |
— | 6 | 376 | 62.7 |
Natural language processing | 6 | 379 | 63.2 |
Math | 6 | 1054 | 175.7 |
matplotlib | 5 | 63 | 12.6 |
MongoDB | 4 | 314 | 78.5 |
MachineLearning | 4 | 148 | 37.0 |
DeepLearning | 2 | 874 | 437.0 |
statistics | 2 | 35 | 17.5 |
scikit-learn | 2 | 55 | 27.5 |
Deep learning | 2 | 874 | 437.0 |
Scraping | 2 | 37 | 18.5 |
Chainer | 2 | 874 | 437.0 |
Database | 1 | 21 | 21.0 |
Data visualization | 1 | 45 | 45.0 |
Statistical test | 1 | 12 | 12.0 |
Way of thinking | 1 | 5 | 5.0 |
Pattern recognition | 1 | 50 | 50.0 |
Note | 1 | 5 | 5.0 |
R | 1 | 16 | 16.0 |
Data analysis | 1 | 40 | 40.0 |
Visualization | 1 | 20 | 20.0 |
math | 1 | 82 | 82.0 |
numpy | 1 | 8 | 8.0 |
Graph database | 1 | 21 | 21.0 |
BeautifulSoup | 1 | 17 | 17.0 |
Statistical modeling | 1 | 28 | 28.0 |
neo4j | 1 | 21 | 21.0 |
Introduction to Statistics | 1 | 11 | 11.0 |
Shown as a graph, it looks like this.
I had imagined that the same few people would stock most of the articles, but at a glance there are actually quite a lot of distinct users. The table below shows the regulars who stock frequently. Thank you! :relaxed:
Ranking | Stock quantity |
---|---|
1 | 22 |
2 | 18 |
3 | 13 |
4 | 10 |
5 | 10 |
6 | 10 |
7 | 9 |
8 | 9 |
9 | 9 |
10 | 9 |
11 | 8 |
12 | 8 |
13 | 8 |
14 | 8 |
15 | 8 |
16 | 8 |
17 | 7 |
18 | 7 |
19 | 7 |
20 | 7 |
Here is a graph of the top 150 users by number of stocks. There were 1,771 unique users in total.
This is a histogram of stocks per user. The distribution is more concentrated in the 1-to-5 range than I imagined. The repeat rate is low... :weary: Going forward, I'll do my best to write articles that people come back for!
An access token can be issued in Qiita via [Settings] → [Applications] → [Issue new token]. Set the acquired token in the code below.
```python
%matplotlib inline
import sys
from collections import defaultdict

import requests
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.style.use('ggplot')

key = '<Access token>'
headers = {'Authorization': 'Bearer %s' % key}

cnt = 0
data_list = []
users = defaultdict(int)
```
Define a `get_stockers` function that counts the users who stocked a given article and returns the total number of stocks.
```python
# ------------------- Get the number of stocks for each article ----------------------- #
def get_stockers(_id):
    url = 'https://qiita.com/api/v2/items/{}/stockers'.format(_id)
    page = 0
    _sum = 0
    while True:
        page += 1
        payload = {'page': page, 'per_page': 20}
        res = requests.get(url, params=payload, headers=headers)
        data = res.json()
        for d in data:
            users[d['id']] += 1  # tally stocks per user
        num = len(data)
        if num == 0:
            break
        _sum += num
    return _sum
```
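This page-by-page pattern recurs for every paginated Qiita v2 endpoint, so it can be factored into a small generic helper. The sketch below is an illustration, not the article's actual code; `fetch_page` is a hypothetical callable standing in for the `requests.get` call, and the stub fetcher only simulates paginated data:

```python
def fetch_all_pages(fetch_page, per_page=20):
    """Collect items from a paginated API until an empty page is returned.

    fetch_page(page, per_page) must return the list of items for that page.
    """
    items = []
    page = 0
    while True:
        page += 1
        data = fetch_page(page, per_page)
        if len(data) == 0:
            break
        items.extend(data)
    return items

# Stub fetcher simulating 45 items served 20 per page
def fake_fetch(page, per_page):
    start = (page - 1) * per_page
    return list(range(45))[start:start + per_page]

all_items = fetch_all_pages(fake_fetch)  # collects all 45 items over 3 pages
```

With a helper like this, `get_stockers` and the article-fetching loop below reduce to choosing the right URL and parameters.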
The loop below fetches all of my posted articles page by page and accumulates them in a list; the stock information for each article is gathered afterwards.
```python
# ------------------- Fetch article information ----------------------- #
url = 'https://qiita.com/api/v2/authenticated_user/items'
while True:
    cnt += 1
    sys.stdout.write("{}, ".format(cnt))
    payload = {'page': cnt, 'per_page': 20}
    res = requests.get(url, params=payload, headers=headers)
    data = res.json()
    if len(data) == 0:
        break
    data_list.extend(data)

res = []
```
Extract the necessary fields from the fetched data and organize them. Private articles (limited-share posts) are excluded.
```python
# ------------------- Data formatting ----------------------- #
for i, d in enumerate(data_list):
    sys.stdout.write("{}, ".format(i))
    # Exclude private articles
    if d['private']:
        continue
    article_info = {}
    for k in ['id', 'title', 'private', 'created_at', 'tags', 'url']:
        article_info[k] = d[k]
    article_info['stock'] = get_stockers(d['id'])
    res.append(article_info)
```
The code below prints each article with its stock count, percentage, and cumulative percentage, in a form that can be pasted directly as a markdown table.
```python
sum_of_stocks = np.sum([r['stock'] for r in res]).astype(np.float32)
cum = 0
print("|Stock quantity|Percentage(%)|Cumulative(%)|Title|")
print("|:----------:|:----------:|:----------:|:----------|")
for i in np.argsort([r['stock'] for r in res])[::-1]:
    r = res[i]
    ratio = r['stock'] / sum_of_stocks * 100
    cum += ratio
    print("|{0}|{1:.1f}|{2:.1f}|[{3}]({4})|".format(r['stock'], ratio, cum, r['title'], r['url']))
```
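The cumulative-percentage logic can be checked on toy data without calling the API. A minimal sketch with made-up articles (the titles, URLs, and stock counts here are hypothetical, not from the real data):

```python
import numpy as np

# Hypothetical articles with made-up stock counts (total = 100)
toy = [
    {'stock': 50, 'title': 'A', 'url': 'http://example.com/a'},
    {'stock': 30, 'title': 'B', 'url': 'http://example.com/b'},
    {'stock': 20, 'title': 'C', 'url': 'http://example.com/c'},
]

def markdown_rows(articles):
    """Yield markdown table rows sorted by stock count, with % and cumulative %."""
    total = float(sum(a['stock'] for a in articles))
    cum = 0.0
    for i in np.argsort([a['stock'] for a in articles])[::-1]:
        a = articles[i]
        ratio = a['stock'] / total * 100
        cum += ratio
        yield "|{0}|{1:.1f}|{2:.1f}|[{3}]({4})|".format(
            a['stock'], ratio, cum, a['title'], a['url'])

rows = list(markdown_rows(toy))
# rows[0] → "|50|50.0|50.0|[A](http://example.com/a)|"
```

The cumulative column is what makes claims like "the top 5 articles account for 73%" easy to read off the table.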
Next, aggregate per tag.
```python
# Tag counts
tag_cnt = defaultdict(int)
for r in res:
    for t in r['tags']:
        tag_cnt[t['name']] += 1

# Number of stocks by tag
tag_stock_cnt = defaultdict(int)
for t in tag_cnt.keys():
    for r in res:
        for _t in r['tags']:
            if t == _t['name']:
                tag_stock_cnt[t] += r['stock']
tag_stock_dict = dict(tag_stock_cnt)

# Reshape so it can go into a DataFrame
tag_list = []
ind_list = []
for k, t in tag_cnt.items():
    ind_list.append(k)
    tag_list.append((t, tag_stock_dict[k]))

# DataFrame construction
tag_list = np.array(tag_list)
df = pd.DataFrame(tag_list, index=ind_list, columns=['cnt', 'stocks'])
n = float(len(tag_cnt))
df['cnt_ratio'] = df['cnt'] / n
df['stock_ratio'] = df['stocks'] / sum_of_stocks

# Print article count, stock count, and stocks per article by tag
df_tag = df.sort_values(by='cnt', ascending=False)
print("|Tag|Number of articles|Stock quantity|Stock/article ratio|")
print("|:----------:|:----------:|:----------:|:----------:|")
for tag, row in df_tag.iterrows():
    print("|[{0}](http://qiita.com/tags/{0})|{1}|{2}|{3:.1f}|".format(
        tag, int(row['cnt']), int(row['stocks']), row['stocks'] / row['cnt']))

# Graph
df[['cnt_ratio', 'stock_ratio']].sort_values(by='cnt_ratio', ascending=False).plot(
    kind="bar", figsize=(17, 8), alpha=0.7,
    title="The ratio of article and stocks for each tag.")
```
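The nested loops for tag counts and per-tag stock totals scan the article list once per tag; both tallies can also be built in a single pass. A sketch using `collections.Counter`, on hypothetical records shaped like the formatted articles above (not the real data):

```python
from collections import Counter

# Hypothetical articles shaped like the formatted records above
articles = [
    {'tags': [{'name': 'Python'}, {'name': 'numpy'}], 'stock': 10},
    {'tags': [{'name': 'Python'}], 'stock': 5},
]

tag_cnt = Counter()     # articles per tag
tag_stocks = Counter()  # total stocks over articles carrying the tag
for a in articles:
    for t in a['tags']:
        name = t['name']
        tag_cnt[name] += 1
        tag_stocks[name] += a['stock']
```

This touches each (article, tag) pair exactly once instead of once per distinct tag, which matters little at this article count but keeps the code shorter.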
Next, aggregate the stock counts per user and display them.
```python
# User aggregation
id_list = []
for _id, cnt in users.items():
    id_list.append((_id, cnt))
df = pd.DataFrame(id_list, columns=["id", "cnt"])

# Top 20 users
print("|Ranking|Stock quantity|")
print("|:----------:|:----------:|")
for i, d in enumerate(df.sort_values(by="cnt", ascending=False)['cnt'][:20]):
    print("| {} | {} |".format(i + 1, d))

# Bar chart of the users with the most stocks
df.sort_values(by="cnt", ascending=False)[:150].plot(
    kind="bar", figsize=(17, 8), alpha=0.6, xticks=[],
    title="The number of stocks from 1 user.", width=1, color="blue")

# Histogram of per-user stock counts
df['cnt'].plot(kind="hist", figsize=(13, 10), alpha=0.7, color="Green", bins=25, xlim=(1, 26),
               title="Histogram of stocked users.")
```