This article is the day-19 entry of the NTT DoCoMo SI Department Advent Calendar.
Hello! I'm Hashimoto from the read-ahead engine team. At work, we develop personal data analysis technologies for our agent service.
In this article, I introduce **a method for quantitatively evaluating the lifespan of content (here, continued article posting) by applying survival analysis to Qiita article-posting data**.
Survival analysis is a method for analyzing the time until an event occurs and its relationship to that event. It is typically used in medicine to analyze the time until a patient's death (a human lifespan) and in engineering to analyze the time until a component fails (a part's lifespan). This time, **using Qiita posting data, I analyze the lifespan of technical posts**, treating the moment a user stops continuously posting articles on a specific technology as the event! :muscle:
Survival analysis makes it possible to evaluate whether content has a long or short lifespan, and whether its usage declines gradually or suddenly. The evaluation results have various possible uses, such as creating features for content classification tasks, creating features for user classification tasks based on content usage history, and informing decisions about content recommendation measures.
For more information on survival analysis, please refer to the following articles and books.
The rough flow of the analysis in this article is as follows.
The programs in this article were run with Python 3.6 on macOS 10.14.6, using the survival analysis library lifelines 0.22.9.
This article uses the Qiita dataset described in the following post: "I made a dataset from the articles posted to Qiita".
This dataset contains users' article-posting histories, acquired from the API provided by Qiita, and covers posts from 2011 through 2018. This analysis requires that:

- the data is a history of users' content usage, and
- users use the same content (here, the same article tag) on a regular basis.

This data was adopted because it satisfies both conditions.
Reading the data into a pandas DataFrame gives the following.
```python
import pandas as pd

df = pd.read_csv('qiita_data1113.tsv', sep='\t')
df.head()
```
created_at | updated_at | id | title | user | likes_count | comments_count | page_views_count | url | tags |
---|---|---|---|---|---|---|---|---|---|
2011-09-30T22:15:42+09:00 | 2015-03-14T06:17:52+09:00 | 95c350bb66e94ecbe55f | Gentoo is cute Gentoo | {'description': ';-)',... | 1 | 0 | NaN | https://... | [{'name': 'Gentoo', 'versions': []}] |
2011-09-30T21:54:56+09:00 | 2012-03-16T11:30:14+09:00 | 758ec4656f23a1a12e48 | Earthquake early warning code | {'description': 'Emi Tamak... | 2 | 0 | NaN | https://... | [{'name': 'ShellScript', 'versions': []}] |
2011-09-30T20:44:49+09:00 | 2015-03-14T06:17:52+09:00 | 252447ac2ef7a746d652 | parsingdirtyhtmlcodesiskillingmesoftly | {'description': 'Don't call github... | 1 | 0 | NaN | https://... | [{'name': 'HTML', 'versions': []}] |
2011-09-30T14:46:12+09:00 | 2012-03-16T11:30:14+09:00 | d6be6e81aba24f39e3b3 | Objective-How is the following variable x handled in the C class implementation?... | {'description': 'Hello. Hatena... | 2 | 1 | NaN | https://... | [{'name': 'Objective-C', 'versions': []}] |
2011-09-28T16:18:38+09:00 | 2012-03-16T11:30:14+09:00 | c96f56f31667fd464d40 | HTTP::Request->AnyEvent::HTTP->HTTP::Response | {'description'... | 1 | 0 | NaN | https://... | [{'name': 'Perl', 'versions': []}] |
Incidentally, extracting up to three tags per article from the tags column and ranking them by total count gives the following.
index | tag | count |
---|---|---|
0 | JavaScript | 14403 |
1 | Ruby | 14035 |
2 | Python | 13089 |
3 | PHP | 10246 |
4 | Rails | 9274 |
5 | Android | 8147 |
6 | iOS | 7663 |
7 | Java | 7189 |
8 | Swift | 6965 |
9 | AWS | 6232 |
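The ranking above can be reproduced along these lines. This is a minimal sketch with toy rows, assuming the `tags` column stores a stringified list of `{'name': ..., 'versions': ...}` dicts (as in the `df.head()` output) and that up to three tag names are kept per article:

```python
from ast import literal_eval

import pandas as pd

# Toy stand-in for the real df: the tags column is a stringified list of dicts
df = pd.DataFrame({'tags': [
    "[{'name': 'Python', 'versions': []}]",
    "[{'name': 'Python', 'versions': []}, {'name': 'AWS', 'versions': []}]",
    "[{'name': 'Ruby', 'versions': []}]",
]})

# Parse each row, keep at most 3 tag names, then count occurrences overall
tag_lists = df['tags'].apply(lambda s: [t['name'] for t in literal_eval(s)][:3])
tag_counts = tag_lists.explode().value_counts()
print(tag_counts)
```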
Next, extract the necessary data from the DataFrame loaded above.

```python
df_base = <get tags>
df_base.head()
```
user_id | time_stamp | tag |
---|---|---|
kiyoya@github | 2011-09-30 22:15:42+09:00 | Gentoo |
kiyoya@github | 2011-09-30 22:15:42+09:00 | Gentoo |
hoimei | 2011-09-30 21:54:56+09:00 | ShellScript |
inutano | 2011-09-30 20:44:49+09:00 | HTML |
hakobe | 2011-09-30 14:46:12+09:00 | Objective-C |
motemen | 2011-09-28 16:18:38+09:00 | Perl |
ichimal | 2011-09-28 14:41:56+09:00 | common-lisp |
l_libra | 2011-09-28 08:51:27+09:00 | common-lisp |
ukyo | 2011-09-27 23:57:21+09:00 | HTML |
g000001 | 2011-09-27 22:29:04+09:00 | common-lisp |
suginoy | 2011-09-27 10:20:28+09:00 | Ruby |
From each record, the user id, created_at (as time_stamp), and tag were extracted. For articles with multiple tags, up to five were taken, and each was turned into its own record. Note that tag-notation variants (golang vs. Go, Rails vs. RubyOnRails, etc.) are not reconciled.
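The extraction step elided above as `<get tags>` might look roughly like the following. This is a hypothetical sketch: it assumes the `user` column holds a dict-like string with an `'id'` key, which may differ from the real schema.

```python
from ast import literal_eval

import pandas as pd

# Toy stand-in for the raw article DataFrame
df = pd.DataFrame({
    'created_at': ['2011-09-30T22:15:42+09:00', '2011-09-30T21:54:56+09:00'],
    'user': ["{'id': 'kiyoya@github'}", "{'id': 'hoimei'}"],
    'tags': ["[{'name': 'Gentoo', 'versions': []}]",
             "[{'name': 'ShellScript', 'versions': []}]"],
})

rows = []
for _, row in df.iterrows():
    user_id = literal_eval(row['user'])['id']   # assumed 'id' key
    time_stamp = pd.to_datetime(row['created_at'])
    for tag in literal_eval(row['tags'])[:5]:   # up to 5 tags per article
        rows.append({'user_id': user_id, 'time_stamp': time_stamp,
                     'tag': tag['name']})

df_base = pd.DataFrame(rows)
print(df_base)
```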
Next, we convert the data into the two-column format (survival time and event flag) required as input for the lifelines Weibull model. Since this data does not directly reveal the event (the user stopped posting articles) or the survival time (how long continuous posting lasted), we need to define them ourselves.
In this section, an event (the user stopped continuous posting) is considered to have occurred when either of the following holds:

- the gap between two adjacent posts is θ days or longer, or
- the gap between the latest post and the end of the observation period is θ days or longer.

If the gap between the latest post and the observation deadline is less than θ days, the user is treated as censored.
It is a little difficult to understand, so I will explain it with a figure.
The figure above arranges posting timings chronologically for three users. For User A, the gap between the final post and the observation deadline is θ days or more, so an event occurs after the final post. For User B, the gap between the last two posts is θ days or more, so this too is judged an event. For User C, every gap between adjacent posts is less than θ, and the gap between the latest post and the observation deadline is also less than θ days, so the user is treated as censored. The survival time is defined as the time until the event occurs or, for censored users, until the observation deadline.
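These rules can be sketched in a few lines of plain Python. This is a hypothetical toy helper (not the implementation used in this article), with θ = 365 days and 2018/12/01 as the observation deadline:

```python
import datetime

THETA = 365
DEADLINE = datetime.date(2018, 12, 1)

def judge(post_dates, theta=THETA, deadline=DEADLINE):
    """Return (survival_time, event_flag) for a sorted list of post dates.

    A gap of theta days or more between adjacent posts, or between the
    last post and the deadline, counts as an event; otherwise censored.
    """
    duration = 0
    gaps = [(b - a).days for a, b in zip(post_dates, post_dates[1:] + [deadline])]
    for gap in gaps:
        if gap >= theta:
            return duration, True   # event: the user stopped posting
        duration += gap
    return duration, False          # censored at the observation deadline

d = datetime.date
print(judge([d(2015, 1, 1), d(2015, 6, 1)]))    # long silence afterwards -> event
print(judge([d(2018, 1, 1), d(2018, 10, 1)]))   # still active near deadline -> censored
```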
This time, we determine event occurrence and survival time by the rules above. Implementing this logic as `make_survival_dataset()` gives the following. Here θ = 365 days, and 2018/12/01 is the observation deadline. The function takes as its argument a DataFrame already filtered to a specific tag.
```python
import datetime

import pandas as pd
import pytz


def make_survival_dataset(df_qiita_hist, n=365):
    id_list = []
    duration_list = []
    event_flag_list = []
    # Observation deadline (use localize(); passing a pytz timezone via
    # tzinfo= yields a wrong LMT offset)
    dt = pytz.timezone('Asia/Tokyo').localize(datetime.datetime(2018, 12, 1))
    for userid, df_user in df_qiita_hist.groupby('user_id'):
        # Append the observation deadline as a sentinel record
        last = pd.Series(['test', dt, 'last'],
                         index=['user_id', 'time_stamp', 'tag'], name='last')
        df_user = df_user.append(last)
        # Gaps in days between adjacent posts (the first element is NaN)
        day_diff_list = df_user.time_stamp.diff().apply(lambda x: x.days).values
        # Exclude users whose history is too short
        if len(day_diff_list) <= 2:
            continue
        # Scan the gaps for an event (a gap of n days or more)
        event_flag = False
        day_list = []  # gaps that make up the survival time
        for day in day_diff_list[1:]:
            if day >= n:
                event_flag = True
                break
            day_list.append(day)
        # Survival time = total days until the event (or the deadline)
        s = sum(day_list)
        # Exclude zero-day survival times
        if s == 0:
            continue
        id_list.append(userid)
        duration_list.append(s)
        event_flag_list.append(event_flag)
    return pd.DataFrame({'userid': id_list,
                         'duration': duration_list,
                         'event_flag': event_flag_list})
```
Extract the records with the Python tag and feed them to `make_survival_dataset`.

```python
df_python = df_base[df_base['tag'] == 'Python'].sort_values('time_stamp')
df_surv = make_survival_dataset(df_python, n=365)
df_surv.head()
```
userid | duration | event_flag |
---|---|---|
33yuki | 154.0 | False |
5zm | 432.0 | False |
AketiJyuuzou | 57.0 | True |
AkihikoIkeda | 308.0 | False |
Amebayashi | 97.0 | True |
Now you have the data to input to the Weibull model.
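Before fitting, a quick sanity check of the event rate and typical duration can be useful. A sketch using the `head()` rows above as toy values (in practice `df_surv` comes from `make_survival_dataset` over the full data):

```python
import pandas as pd

# Toy values copied from the head() output above
df_surv = pd.DataFrame({
    'userid': ['33yuki', '5zm', 'AketiJyuuzou', 'AkihikoIkeda', 'Amebayashi'],
    'duration': [154.0, 432.0, 57.0, 308.0, 97.0],
    'event_flag': [False, False, True, False, True],
})

event_rate = df_surv['event_flag'].mean()        # share of users who stopped
median_duration = df_surv['duration'].median()   # typical posting span in days
print(event_rate, median_duration)
```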
Input the data created above into the Weibull model, fit the parameters, and plot the survival curve. In addition to Python, let's also plot the data with the Ruby tag.
```python
import lifelines
%matplotlib inline
import matplotlib.pyplot as plt

plt.rcParams['font.family'] = 'IPAexGothic'
_, ax = plt.subplots(figsize=(12, 8))

# Python
name = 'Python'
df_surv = make_survival_dataset(df_base[df_base['tag'] == name].sort_values('time_stamp'), n=365)
wf = lifelines.WeibullFitter().fit(df_surv['duration'], df_surv['event_flag'], label=name)
wf.plot_survival_function(ax=ax, grid=True)

# Ruby
name = 'Ruby'
df_surv = make_survival_dataset(df_base[df_base['tag'] == name].sort_values('time_stamp'), n=365)
wf = lifelines.WeibullFitter().fit(df_surv['duration'], df_surv['event_flag'], label=name)
wf.plot_survival_function(ax=ax, grid=True)

ax.set_ylim([0, 1])
ax.set_xlabel('Lifetime (days)')
ax.set_ylabel('Survival rate')
```
The horizontal axis is the number of days, and the vertical axis is the survival rate (the proportion of users who continue posting). Overall, the survival rate decreases as the days pass. Focusing on Python, the survival rate is just below 0.2 at the 1500-day mark: about 20% of users are still posting 1500 days after their first post, while the remaining 80% have stopped continuous posting. Comparing Python and Ruby at 1500 days, there is a difference of roughly 10 percentage points. From this, it can be said that **overall, Python articles survive longer than Ruby articles, with a stronger tendency toward continuous posting.** Python's long lifespan is likely influenced by the recent growth in demand for it as a machine learning / data analysis tool.
In this way, by defining event occurrence and survival time over a content log and performing survival analysis, the lifespans of different pieces of content can be compared.
According to the lifelines documentation, the survival curve is plotted based on the Weibull survival function:

```math
S(t) = \exp\left(-\left(\frac{t}{\lambda}\right)^{\rho}\right)
```
The survival curve depends on the parameters λ and ρ, and `WeibullFitter`'s `fit` estimates them. Therefore, plotting each tag's fitted (λ, ρ) pair on a two-dimensional graph lets us visually compare how similar the tags' survival curves are.
I narrowed the plot down to tags with 1000 or more posting users in the dataset.
λ is plotted on the vertical axis and ρ on the horizontal axis.
In general, the larger λ is, the longer the survival time; the larger ρ is, the more steeply the survival curve's slope falls off as time passes. Roughly classifying the tags by the sizes of λ and ρ gives the following. :thinking:
- **Large λ (long survival time)**: PHP (and Laravel), Ruby (and Rails), C#, iOS, Android, etc.
  - These are mostly programming languages (and frameworks) and mobile development stacks, often used in products (many users?).
  - Because they are used in products, there is plenty to write about, so continuous posting is easy to sustain.
  - The impact of functional changes from version updates also seems to play a role.
- **Small λ (short survival time)**: CentOS, Ubuntu, PostgreSQL, Nginx, Git, Slack, etc.
  - These are mostly platform tools such as OSes and middleware, and development-support tools such as Git and Slack.
  - Because they are foundational, there is relatively little material, so article posting tends to be short-lived.
- **Large ρ (the survival curve's slope steepens over time)**: ssh, Chrome, Git, Slack, Mac, Windows, etc.
  - These are mostly basic tools.
  - Many articles about basic tools are introductory; continuous posting lasts for a while after it starts but then falls off.
- **Small ρ (the survival curve's slope flattens over time)**: programming languages, middleware, Linux OSes, etc.
  - These are mostly relatively specialized tools and technologies.
  - Posting about highly specialized tools often stops soon after it begins, but some users keep posting for a long time.
It is summarized in the figure below.
For the most part, I feel the results match intuition (?). It is interesting that differences in the parameters relate to the kinds of technologies involved. Some cases, like C# and Objective-C, are more ambiguous, though... :thinking:
We performed survival analysis on Qiita article data and classified content from two viewpoints: the length of survival time and how the slope of the survival curve changes over time. Although the interpretation was rough, we found that differences in the parameters appear related to the kind of technology. The method introduced in this article should also apply to other content logs, so if you have a chance to work with content usage logs, please give it a try. Finally, I'll share a few extra analysis results and wrap up. Have a good year! :raised_hand:
iOS vs. Android
Emacs vs. Vim