In the previous article [^ 1], I visualized Qiita's popular tags as a monthly Bar Chart Race, so here is the procedure I used.
As mentioned in that article, I largely rely on the work of my predecessors [^ 2].
That method retrieves the articles written within each half-month window and covers the whole period by shifting the window. However, the query is built as
query = "&query=created:>" + start_date + "+created:<" + end_date
with start_date = ["2018-01-15","2018-01-31",...] and end_date = ["2018-01-31","2018-02-15",...].
Because both comparisons are strict, articles posted on the boundary dates are not included. I therefore changed the query to the following.
query = "&query=created:>" + start_date + "+created:<=" + end_date
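The window lists above can be generated rather than typed by hand. The following is a minimal sketch, not the original retrieval code: the helper names `half_month_windows` and `build_query` are hypothetical, and it assumes the boundaries are always the 15th and the last day of each month, as in the example lists.

```python
import calendar
import datetime


def half_month_windows(first, last):
    """Return (start_date, end_date) string pairs between first and last.

    Boundaries are assumed to be the 15th and the last day of every month.
    """
    boundaries = []
    year, month = first.year, first.month
    while (year, month) <= (last.year, last.month):
        last_day = calendar.monthrange(year, month)[1]
        boundaries.append(datetime.date(year, month, 15))
        boundaries.append(datetime.date(year, month, last_day))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    # Keep only the boundaries inside the requested range
    boundaries = [d for d in boundaries if first <= d <= last]
    # Each window ends where the next one starts
    return [(s.isoformat(), e.isoformat()) for s, e in zip(boundaries, boundaries[1:])]


def build_query(start_date, end_date):
    # Exclusive lower bound, inclusive upper bound, as in the corrected query
    return "&query=created:>" + start_date + "+created:<=" + end_date


windows = half_month_windows(datetime.date(2018, 1, 15), datetime.date(2018, 3, 15))
queries = [build_query(start, end) for start, end in windows]
```

With `>` on the start and `<=` on the end, each boundary date belongs to exactly one window. The one exception is the very first start date, which no window covers, so the first window should start one day before the earliest article you want.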
The aggregation script is as follows. See the comments for details.
import copy
import datetime

import pandas as pd
from dateutil.relativedelta import relativedelta
from tqdm import tqdm

# Load the combined results file created earlier
df_all = pd.read_csv("results/summary.csv")
# Start date (overwritten below from the first article's date)
ref_date = datetime.date(2011, 9, 1)
# Sort by created_at
df_all = df_all.sort_values("created_at")
# Extract only the tag and date information
tags_list = list(df_all["tags_str"])
date_list = list(df_all["created_at"])
# Convert to a type that works with relativedelta etc.
date_list = [pd.to_datetime(one) for one in date_list]
# key: tag name, value: cumulative count
tags_dict = dict()
# Year being aggregated; starts at the first year (2011) and is updated whenever it changes
y = date_list[0].year
# Month being aggregated; starts at the first month (9) and is updated whenever it changes
m = date_list[0].month
# First day of the month currently being aggregated
ref_date = datetime.date(y, m, 1)
# Cumulative tag counts (intermediate sums) at the end of each month
monthly_result = []
# Month corresponding to each entry of monthly_result
month = []
for one_tags, one_date in tqdm(zip(tags_list, date_list), total=len(tags_list)):
    # When the month changes, store the finished month and a copy of the counts so far.
    # This runs before counting, so an article never leaks into the previous month,
    # and it repeats so that months without any articles are not skipped.
    while (one_date.year, one_date.month) > (y, m):
        month.append(ref_date)
        monthly_result.append(copy.deepcopy(tags_dict))
        ref_date += relativedelta(months=1)
        y = ref_date.year
        m = ref_date.month
    try:
        # Split the comma-separated text into a list
        tags = one_tags.split(",")
    except AttributeError:
        # NaN sometimes appears (probably when no tag is set), so skip the article
        continue
    # If the tag is already in tags_dict, add 1; otherwise register it with a count of 1
    for one_tag in tags:
        try:
            tags_dict[one_tag] += 1
        except KeyError:
            tags_dict[one_tag] = 1
# Store the final state after the loop
month.append(ref_date)
monthly_result.append(copy.deepcopy(tags_dict))
# For each month's dict, register tags that had not been posted by that month with a count of 0
for one in monthly_result:
    ref_keys = one.keys()
    for one_tag in tags_dict:
        if one_tag not in ref_keys:
            one[one_tag] = 0
# Shape the data for output
monthly_result_num = []
for one_dict in monthly_result:
    # Convert the dict to a list of (tag, count) pairs so it can be sorted
    tmp_list = list(one_dict.items())
    # Sort by tag name
    tmp_list = sorted(tmp_list, key=lambda x: x[0])
    # Keep only the counts
    monthly_result_num.append([one[1] for one in tmp_list])
# Put the tag names, sorted the same way, in a regular column for now
df_align = pd.DataFrame({"tags": sorted(tags_dict)})
# Store the cumulative number of tag registrations up to each month in the DataFrame
for one_date, one_nums in zip(month, monthly_result_num):
    df_align[one_date.strftime("%Y-%m")] = one_nums
# Export to CSV with the tag name as the index
df_align.set_index("tags").to_csv("all_result.csv")
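To make the shape of the output easier to picture, here is a minimal sketch of the same cumulative monthly aggregation on made-up data. It is a simplified stand-in rather than the script itself: it uses `collections.Counter` instead of pandas, and both the sample `articles` and the function name `cumulative_by_month` are hypothetical.

```python
import datetime
from collections import Counter

# Hypothetical sample data: (created_at, comma-separated tags), sorted by date
articles = [
    (datetime.date(2011, 9, 16), "Python,Git"),
    (datetime.date(2011, 9, 20), "Python"),
    (datetime.date(2011, 11, 2), "Ruby,Python"),
]


def cumulative_by_month(articles):
    """Return (sorted tag names, {"YYYY-MM": cumulative counts in tag order})."""
    counts = Counter()
    snapshots = {}
    y, m = articles[0][0].year, articles[0][0].month
    for created_at, tags_str in articles:
        # Close every month that ended before this article was posted
        while (created_at.year, created_at.month) > (y, m):
            snapshots[f"{y}-{m:02d}"] = dict(counts)
            y, m = (y + 1, 1) if m == 12 else (y, m + 1)
        counts.update(tags_str.split(","))
    # Store the last (possibly unfinished) month
    snapshots[f"{y}-{m:02d}"] = dict(counts)
    all_tags = sorted(counts)
    # Tags that had not appeared by a given month get 0
    table = {
        label: [snap.get(tag, 0) for tag in all_tags]
        for label, snap in snapshots.items()
    }
    return all_tags, table


all_tags, table = cumulative_by_month(articles)
```

Here `all_tags` plays the role of the `tags` index and each key of `table` corresponds to one month column of all_result.csv. October 2011 has no articles in the sample, so its column simply repeats the September totals.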
Upload the resulting CSV to the Bar Chart Race template at https://app.flourish.studio/ and you can visualize it!