When you first think "data science looks fun, I want to give it a try!", you will find plenty of information on "methods", such as packages like pandas and models like SVM, but surprisingly few introductions built around a relatable case. At least, that is my impression. ~~Classifying irises just doesn't excite me.~~
Therefore, the purpose of this article is to let you experience a full data analysis from start to finish, using my hobby analysis, "how to increase the number of views of a VOCALOID producer's first posted work", as the example. I hope it gives your own analyses a little inspiration and motivation.
We will proceed with the analysis in the following order. I am not following any specific reference, but the structure is loosely modeled on the flow of an empirical econometrics paper, with the programming steps woven in. [^1]
[^1]: This analysis follows the classical framework of "form a hypothesis first, then test it with data". The process is therefore different from a data-mining approach of "digging useful knowledge out of messy data".
The main purpose of this article is to walk you through the overall flow of a data analysis, but if any of the following topics used along the way interest you, please stop by!
- How to call Nico Nico Douga's "Snapshot Search API v2" from Python
- Grouping by arbitrary conditions with pandas groupby
- The Mann-Whitney U test (comparing the medians of two groups)
- How to rule out "spurious correlation"
- Basic knowledge of libraries such as numpy, pandas, and matplotlib
- Introductory knowledge of statistics
Imagine you have just finished your first work as a brand-new Vocaloid producer. When posting it to a video site, you naturally want to attract as many views as possible. The quality of the song itself can no longer be changed, so you start thinking about what you can do with the title and the poster's comment.
The first post of a VOCALOID song is almost always tagged "VOCALOID virgin work". On the other hand, few people put "first post" in the title of their work (for example, "[Hatsune Miku Original] ~Title~ [First Post]").
If you add "first post" to the title, what kind of image will viewers have when they see it? There are two possible reactions: "Oh, I'll ask you what kind of newcomer you are" and "I can't help but ask you if you're a newcomer with no track record." If the former reaction is predominant, you can expect an increase in the number of views by adding a word to the title to make it more eye-catching.
Therefore, in this article, I would like to test the hypothesis that adding "first post" to the title of a work tagged "VOCALOID virgin work" may increase its number of views.
The data required for this analysis are:

- Works with the "VOCALOID virgin work" tag
- Their "number of views", "title", and "posting date and time" (the last is used in the second half)

We will use data for the past four years (works posted from 2013 to 2016).
Let's retrieve the work information from Nico Nico Douga's "Snapshot Search API v2". How to use it is documented in the official guide: http://site.nicovideo.jp/search-api-docs/snapshot.html
The points are as follows.
- Defining the search conditions as a dictionary and encoding them with the urllib library keeps the code readable (see url_query in the sample code below).
- Send a GET request with the requests library, convert the response to JSON, and work with that.
- Since you can only get up to 100 works per request, page through the results with offset and a loop. [^2] An error may occur when the offset reaches 1601 or higher (perhaps a server-side limit to avoid overload), so instead of querying 2013 to 2016 at once, split the posting-year filter into 2013, ..., 2016 so the offset never exceeds 1601.
[^2]: How to use offset: for example, if you sort by view count and set offset=30, you get the works ranked after 30th place.
```python
import urllib.parse
import requests
import time
import pandas as pd


class NiconicoApi():

    def __init__(self, keyword):
        self.keyword = keyword

    def gen_url(self, year, offset):
        """Build the request URL for one posting year and one offset."""
        url_body = 'http://api.search.nicovideo.jp/api/v2/video/contents/search?'
        url_query = urllib.parse.urlencode({
            'q': self.keyword,
            'filters[startTime][gte]': '%i-01-01T00:00:00' % year,
            'filters[startTime][lt]': '%i-01-01T00:00:00' % (year + 1),
            '_offset': offset,
            'targets': 'tags',
            'fields': 'title,viewCounter,startTime',
            '_sort': '-viewCounter',
            '_limit': 100,
            '_context': 'apiguide'
        })
        self.url_ = url_body + url_query
        return self

    def get_json(self):
        response = requests.get(self.url_)
        self.json_ = response.json()
        return self.json_


'''Data acquisition'''
data = []
nicoApi = NiconicoApi('VOCALOID virgin work')
for year in range(2013, 2017):
    # Page through the results 100 works at a time until the year is exhausted
    offset = 0
    nicoApi.gen_url(year=year, offset=offset)
    json = nicoApi.get_json()
    while json['data']:
        data += json['data']
        offset += 100
        nicoApi.gen_url(year=year, offset=offset)
        json = nicoApi.get_json()
        time.sleep(1)

'''Conversion to DataFrame'''
df = pd.DataFrame.from_dict(data)
df.shape  # => (4579, 3)
```
This gives us a sample size of 4,579 [^3]. Since this is four years of data, it works out to more than 1,000 new Vocaloid producers being born every year.
[^3]: As of 06/03/2017
Before we get into full-scale analysis, let's get a quick overview of the data.
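For example, you can take a quick peek at the raw DataFrame before plotting. This snippet is not part of the original post, just a standard pandas sanity check:

```python
# One row per work; the columns come from the 'fields' parameter of the API query
print(df.head())

# Summary statistics of the view counts
print(df['viewCounter'].describe())
```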
First, plot the number of views on the vertical axis and the ranking on the horizontal axis. Since we can expect a huge gap in views between popular works and obscure ones, the vertical axis is put on a log scale.
```python
import numpy as np
import matplotlib.pyplot as plt

df = df.sort_values('viewCounter', ascending=False)
ranking = np.arange(len(df.index))
view_counter = df['viewCounter'].values

plt.scatter(ranking, view_counter)
plt.yscale('log')
plt.grid()
plt.title('2013~2016 Ranking & ViewCounter')
plt.xlabel('Ranking')
plt.ylabel('ViewCounter')
plt.show()
```
result:
- It seems that roughly 3/4 of the works are viewed between 100 and 1,000 times (a quick way to check this is sketched below).
- Look how skewed the distribution is, even with the vertical axis on a log scale. The disparity is brutal...
- Having confirmed how skewed the distribution is and that outliers exist, we will keep this in mind for the rest of the analysis.
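As a rough check of the "about 3/4" claim above, one can count the share of works whose view count falls in that range. This is a minimal sketch, not in the original post, and the exact boundaries of 100 and 1,000 are my assumption:

```python
# Share of works with between 100 and 1,000 views (boundaries chosen for illustration)
share = ((df['viewCounter'] >= 100) & (df['viewCounter'] < 1000)).mean()
print('share of works with 100-1000 views: %.2f' % share)
```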
For the analysis, we need to separate the works whose title includes "first post" from those that do not. We will use the pandas groupby method.
You usually pass a column name to groupby, but you can also pass a function. Passing a function makes it easy to group by arbitrary conditions.
This time we want to group by "whether or not the title includes 'first post'", so let's define a function for that condition. We will also include "virgin work", which means the same thing as "first post", in the analysis.
As shown below, define a function that returns a different value depending on whether its argument contains 'First post', 'Virgin work', or neither.
```python
def include_keyword(title):
    if 'First post' in title:
        return 'First post'
    elif 'Virgin work' in title:
        return 'Virgin work'
    else:
        return 'Control group'
```
Pass the defined function as the argument of groupby. The function is then applied to each element of df1's index, and the rows are grouped according to the return values. It works much like mapping a function over the index.
```python
# Use the title column as the index
df1 = df[['viewCounter', 'title']].set_index('title')
# Group by the return value of include_keyword applied to each title
df1_grouped = df1.groupby(include_keyword)
```
Now that the grouping is complete, let's move on to a detailed analysis.
If we compute the mean and median for each group and compare "works with a keyword (hereafter, the treatment group)" against "works without a keyword (the control group)", we can measure the "effect of including the keyword" (the treatment effect).
So let's compute the following for each group: sample size (count), mean, median, and standard deviation.
To compute several descriptive statistics at once, agg is convenient: pass a list of method names to agg, and all of them are applied to df1_grouped in one go.
```python
functions = ['count', 'mean', 'median', 'std']
df1_summary = df1_grouped.agg(functions)
```
result:

- Of the 4,579 samples, 109 works have "first post" in the title; including "virgin work" as well brings the total to 144.
- Both the mean and the median show that the treatment group gets more views than the control group. In other words, adding the phrase "first post" to the title does seem to have attracted attention.
From here on, to simplify the analysis, we will consider just two groups, the "treatment group" and the "control group". Change df1_grouped as follows.
```python
def include_keyword(title):
    if 'First post' in title or 'Virgin work' in title:
        return 'Treatment group'
    else:
        return 'Control group'

df1_grouped = df1.groupby(include_keyword)
df1_summary = df1_grouped.agg(functions)
df1_summary
```
result:
Now, let's look at the difference between the two groups in a graph.
- Draw a histogram with the number of views (log scale) on the horizontal axis and the (normalized) number of samples on the vertical axis.
- Giving plt.hist some transparency (alpha) makes the bars translucent, so the two groups are easier to compare.
```python
# Select the viewCounter column so that the arrays are one-dimensional
X_treated = np.log(df1_grouped.get_group('Treatment group')['viewCounter'].values)
X_untreated = np.log(df1_grouped.get_group('Control group')['viewCounter'].values)

plt.hist(X_treated, normed=True, bins=20, label='Treatment group', color='r', alpha=0.5)
plt.hist(X_untreated, normed=True, bins=20, label='Control group', color='b', alpha=0.5)
plt.legend()
plt.show()
```
result:
The graph also shows the difference in distribution between the treatment group and the control group.
So far, the result we hoped for (that including "first post" in the title increases the number of views) has been suggested. Now let's verify whether the difference in views between the treatment group and the control group found in the previous section is statistically significant, that is, whether it could simply be explained by chance.
As noted earlier, the distribution of view counts is heavily skewed, so we should look at the median rather than the mean. Strictly speaking it is not a comparison of the two groups' medians, but as an alternative we will use the Mann-Whitney U test [^4].
The Mann-Whitney U test takes as its null hypothesis that the two groups' distributions have the same shape [^5]. If this null hypothesis is rejected, we can say that the difference in view counts between the treatment and control groups is unlikely to be explained by chance.
The U test is provided in scipy's stats module. Let's test it.
```python
from scipy import stats

result = stats.mannwhitneyu(X_treated, X_untreated)
print(result.pvalue)  # => 0.00137327945838
```
Since the p-value is 0.0014, we can say that the two groups' distributions differ significantly. You did it!
[^5]: Strictly, performing the Mann-Whitney U test here requires assuming homoscedasticity of the two groups, but this is omitted for simplicity. I also tried the Brunner-Munzel test, which does not require the homoscedasticity assumption, and it too gave a significant result, so the conclusion appears robust. (For the Brunner-Munzel test, see: http://oku.edu.mie-u.ac.jp/~okumura/stat/brunner-munzel.html)
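For reference, recent versions of scipy (1.2 and later) ship a brunnermunzel function in the stats module, so the robustness check mentioned in the footnote can also be sketched in Python. The original footnote points to an R-based explanation; this snippet is my addition:

```python
from scipy import stats

# Brunner-Munzel test: does not assume equal variances between the two groups
bm_result = stats.brunnermunzel(X_treated, X_untreated)
print(bm_result.pvalue)
```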
One thing to watch out for in this kind of analysis is "spurious correlation": correlation does not imply causation.
All we have shown so far is that there is a correlation between putting "first post" in the title and the number of views. Let us consider whether this correlation can be interpreted as a causal relationship.
For example, ice cream sales and water accidents are correlated. But it would be rash to conclude from this that eating ice cream makes you more likely to have a water accident, right?
The real driver is the "season". In econometric terms, deriving a spurious correlation like the one above by forgetting to account for the "season" is called "omitted variable bias" (here, the omitted variable is the season).
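To make the idea concrete, here is a tiny simulated example (my own illustration, not from the original article): both ice cream sales and water accidents are driven by a common "season" variable, so they correlate strongly even though neither causes the other.

```python
import numpy as np

rng = np.random.RandomState(0)

# 'season' is the common cause (think of it as daily temperature over a year)
season = rng.uniform(0, 30, size=365)
# Both variables depend on season plus independent noise
ice_cream_sales = 10 * season + rng.normal(0, 20, size=365)
water_accidents = 0.5 * season + rng.normal(0, 3, size=365)

# A strong correlation appears even though there is no causal link between the two
print(np.corrcoef(ice_cream_sales, water_accidents)[0, 1])
```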
Let's get back to the VOCALOID songs. If there is an "omitted variable" in this analysis, what might it be?
One candidate I came up with is the "posting date". Suppose the following situation holds.
- VOCALOID culture has entered a period of decline, and view counts have been falling from 2013 to 2016.
- Putting "first post" in the title of a VOCALOID song is an old custom that has died out in recent years.
Under these circumstances, the two tendencies "older posting year → more views" and "older posting year → 'first post' in the title" overlap, producing a spurious correlation.
Let's check whether such a spurious correlation exists. We first create two columns representing the "posting year" and "treatment group or control group", and then group on both of them.
```python
df2 = df.copy()
df2['title'] = df2['title'].apply(include_keyword)
df2['startTime'] = pd.DatetimeIndex(df2['startTime'])

# Set 'startTime' as the index so that tz_localize can be applied
df2 = df2.set_index('startTime')
df2.index = df2.index.tz_localize('UTC').tz_convert('Asia/Tokyo')
df2['startTime'] = df2.index

# Keep only the posting year
df2['startTime'] = df2['startTime'].apply(lambda x: x.year)

df2_grouped = df2.groupby(['startTime', 'title']).viewCounter
df2_summary = df2_grouped.agg(functions)
df2_summary
```
Results: even when grouped by posting year, the treatment group and the control group differ in views in every year except 2013.
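As a readability aid (not in the original post), the MultiIndexed summary can be pivoted so that posting years become rows and the two groups become columns, which makes the year-by-year comparison easier to scan:

```python
# Pivot the (startTime, title) MultiIndex: years as rows, groups as columns
df2_summary['median'].unstack('title')
```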
Having roughly confirmed from the table above that the posting year does not explain away the difference, let's now analyze it more rigorously with multiple regression.
Suppose the number of views is determined as follows:

$$\log(viewCounter_i) = \beta_0 + \beta_1 \, title_i + \beta_2 \, timeTrend_i + \epsilon_i$$

Here, the details of each variable are as follows.

- $\log(viewCounter_i)$: log of the number of views of work $i$
- $title_i$: treatment dummy, equal to 1 if the title of work $i$ contains "first post" or "virgin work" and 0 otherwise
- $timeTrend_i$: time trend, the number of days between the posting date of work $i$ and the newest posting date in the data
- $\epsilon_i$: error term
Next, let me explain the advantage of using multiple regression: it lets us estimate "how much the number of views differs between the treatment group and the control group while holding the time trend fixed" ($\beta_1$). This is what allows us to remove an apparent correlation caused by the time trend.
The downside is that estimating $\beta_1$ relies on the mean of the (log) number of views, which makes it more sensitive to outliers.
Let's run it.
```python
import statsmodels.formula.api as smf


def include_keyword(title):
    # Treatment dummy: 1 if the title contains a keyword, 0 otherwise
    if 'First post' in title or 'Virgin work' in title:
        return 1
    else:
        return 0

df3 = df.copy()
df3['title'] = df3['title'].apply(include_keyword)
df3['startTime'] = pd.DatetimeIndex(df3['startTime'])
# Time trend: days from each work's posting date to the newest posting date in the data
df3['timeTrend'] = df3['startTime'].apply(lambda x: (df3['startTime'].max() - x).days)
df3['lviewCounter'] = np.log(df3['viewCounter'])

mod = smf.ols('lviewCounter ~ title + timeTrend', data=df3).fit()
mod.summary()
```
result:
- Judging from the p-value on the coefficient $\beta_1$ of the treatment dummy (the variable title), the treatment effect is significant in the multiple regression as well.
- The time trend (posting date and time) does not appear to affect the number of views.
- The estimate is $\beta_1 = 0.3582$, which can be read as "the treatment group gets roughly 36% more views than the control group" (see the note below on this conversion).
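A side note on the percentage interpretation (my addition, not in the original post): the coefficient 0.3582 is a difference in log views, so "36%" uses the usual small-coefficient approximation; the exact percentage difference implied by the coefficient is $e^{0.3582} - 1 \approx 43\%$.

```python
import numpy as np

beta1 = 0.3582
# Exact percentage difference implied by a log-linear coefficient
print(np.exp(beta1) - 1)  # => approx. 0.43
```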
Since we have confirmed that the "posting date" does not create an omitted variable bias, we can say the result of this analysis is that much more robust. You did it!
This concludes the "validation of analysis", but if you think "Is it better to verify this factor as well?", Please comment!
It has been a long article, but thank you so much for reading this far.
By the way, based on this analysis I added "first post" to my own title and posted the song to Nico Nico Douga, but it got fewer than 200 views. Remember that statistical tendencies do not necessarily apply to individual cases.
With the punch line delivered, this article comes to an end. I would be happy if it gave you a feel for the flow of an analysis, the pitfalls to watch out for, and the techniques you need.