Continuing from the previous post, this is the second part.
Now, let's go back to the starting point once more. Why collect and analyze data at all? Because we want to maximize profit. To make a profit, you gather the materials (data) that underpin your judgment, organize and examine them, and connect what you find to profit-generating action.
In other words, data analysis has a clear motive (purpose) that leads to profit, and value is created only when the results of analysis are linked to actions.
Let's talk about investment here.
There is no single correct investment method that works for everyone, but in equities, for example, there are some prerequisites for gaining an edge over the market.
I think this has a lot in common with the classic gacha mechanics of social games. Namely:

1. It is a game of probability
2. There is distortion (bias)
3. Inflation occurs

Whether a paid pull gets you the item you want is a matter of probability. In other words, even if two players "charge" the same amount, there is a bias in the strength of the "items" they end up with. Also, sooner or later every player grows stronger, causing inflation: when powerful new items are introduced, the value of older items declines relatively. In that respect it is much like the stock market. The details, of course, differ by game system and management policy.
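To make point 1 concrete: if each pull drops the desired item independently with probability p, the chance of getting at least one in n pulls is 1 - (1 - p)^n. A minimal sketch (the 1% drop rate is an assumed example, not any particular game's value):

```python
# Chance of at least one desired item in n pulls, assuming
# independent pulls with a hypothetical drop rate p
def p_at_least_one(p, n):
    return 1 - (1 - p) ** n

for n in (10, 100, 300):
    print(n, round(p_at_least_one(0.01, n), 3))
# 10 -> 0.096, 100 -> 0.634, 300 -> 0.951
```

Even 300 pulls leave a roughly 5 percent chance of missing the item entirely, which is why equal spending can still yield unequal results.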
In any case, there are strategies for improving your odds without relying on a lucky jackpot. One is to increase the number of trials. Even with some hits and misses along the way, if you continue long enough the results converge to the expected value. In stock terms, this resembles a long-term holding strategy: by holding a stock for a long period, the valuation grows over time even through temporary rises and falls. [Warren Buffett](http://d.hatena.ne.jp/keyword/%A5%A6%A5%A9%A1%BC%A5%EC%A5%F3%A1%A6%A5%D0%A5%D5%A5%A7%A5%C3%A5%C8) is the famous example. However, this is a story of "hold a good stock for a long time"; the major premise is that you have scrutinized the stock carefully and the company will grow over the long term. Because that scrutiny matters, it is a bad idea, for example, to hold long-term positions in a fast-moving, volatile industry. Even leaving that aside, the number of companies you can confidently expect to grow over the long term in this uncertain era is quite limited.
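The "increase the number of trials" strategy is essentially the law of large numbers: the average outcome of repeated random trials converges to the expected value. A tiny simulation sketch (the payout values are made up purely for illustration):

```python
import random

random.seed(42)

# Hypothetical gacha: 1% chance of an item worth 100, otherwise worth 1,
# so the expected value per pull is 0.01 * 100 + 0.99 * 1 = 1.99
def pull():
    return 100 if random.random() < 0.01 else 1

for n in (10, 1000, 100000):
    mean = sum(pull() for _ in range(n)) / n
    print(n, mean)
# With a handful of pulls the mean swings wildly; by 100,000 it sits near 1.99
```

The same intuition underlies long-term holding: short-term swings wash out if the per-period expectation is favorable.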
That said, even with a strategy of increasing the number of trials, it costs money every time you spin the gacha, so there is room for ingenuity in spending efficiently. The easiest example to understand is limiting your spending to just what is needed to clear the reward border in a ranking event, keeping the investment to the necessary minimum. Rather than spending blindly, the key is to identify the border.
Let's recall the graph from the previous post. It shows how the border score grew during each month's event, up to last month. Our goal is to identify trends from it and predict the score for the February event.
Looking at the visualized data, a few points stand out.

For example, in November there is a sharp, nearly vertical cliff around the top-reward ranks, and in January there is a cliff at the lower reward. Combining these observations with the content of each event, we can form some hypotheses. Of course, forming hypotheses is a human job. For a consumer-facing title, opinions gathered from social media and bulletin boards are useful material as well.
For example, in November …

Or in January …

And so on.
In any case, we can read a large jump in score at the breakpoints where the reward content changes. When such a pronounced change is driven by qualitative factors, a regression equation cannot be fitted well.
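To see why, consider fitting a single regression line to data with a cliff in it: the line averages across the jump and the residuals blow up around the break. A minimal illustration with synthetic numbers (not the actual event data):

```python
import numpy as np

# Synthetic rank-vs-score curve with a 2000-point cliff at rank 50 (made-up numbers)
x = np.arange(1, 101)
y = np.where(x <= 50, 5000.0 - 10 * x, 3000.0 - 10 * x)

# Fit one global line; the jump is a qualitative break the line cannot capture
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
print(np.abs(residuals).max())  # the largest residuals cluster around the cliff
```

Handling this properly would mean splitting the data at the break or modeling the reward tiers explicitly.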
So this time, once the event is underway, we will aggregate the daily scores for the current month (February) and examine how they compare with the past three months.
Exploratory data analysis means looking at data from various angles with the aim of "acquiring an appropriate purpose and hypotheses."

I have repeatedly emphasized that data analysis requires a clear purpose and hypothesis. But when you lack background knowledge about the analysis target, or when its actual condition is not well understood, you first need to look at the data simply to arrive at a purpose and hypotheses. In statistics, this approach is called exploratory data analysis.
The breakdown of each month was as follows.
November
December
January
Blue represents Saturday and red represents Sunday.
First, let's narrow our focus to the top reward. The data set will be the daily transition of each month's top-reward border score. As before, IPython is a handy environment for this kind of exploratory work.
Here we assume the February event is part-way through, at the end of day 4.
```python
import numpy as np
import pandas as pd

# Read each month's event data (rank-by-day) from CSV
df201411 = pd.read_csv("201411.csv", index_col=0)
df201412 = pd.read_csv("201412.csv", index_col=0)
df201501 = pd.read_csv("201501.csv", index_col=0)
df201502 = pd.read_csv("201502.csv", index_col=0)

# Extract each month's top-reward border row as a Series
s201411 = df201411.loc[700]
s201412 = df201412.loc[800]
s201501 = df201501.loc[1000]
s201502 = df201502.loc[1000]

# Renumber the index as event days 1..n for simplicity
s201411.index = np.arange(1, len(s201411) + 1)
s201412.index = np.arange(1, len(s201412) + 1)
s201501.index = np.arange(1, len(s201501) + 1)
s201502.index = np.arange(1, len(s201502) + 1)

# Concatenate the monthly series into a single data frame
df = pd.concat([s201411, s201412, s201501, s201502], axis=1)

# Label the columns with month numbers
df.columns = [11, 12, 1, 2]

# Visualize
df.plot()
```
Next, let's find the basic statistics.
```python
# Basic statistics
df.describe()
#=>
#                  11              12               1               2
# count      7.000000        7.000000        7.000000        4.000000
# mean  2040017.285714  2166375.142857  3143510.857143  1716607.750000
# std   1466361.613186  1444726.064645  1897020.173703   993678.007807
# min    326615.000000   401022.000000   640897.000000   539337.000000
# 25%   1031257.000000  1181755.500000  1940483.500000  1136160.000000
# 50%   1836312.000000  2020470.000000  3044127.000000  1751315.500000
# 75%   2727857.000000  2862782.500000  4055898.000000  2331763.250000
# max   4598966.000000  4654058.000000  6326789.000000  2824463.000000

# Correlation coefficient
df.corr()
#=>
#           11        12         1         2
# 11  1.000000  0.999157  0.996224  0.996431
# 12  0.999157  1.000000  0.998266  0.997345
# 1   0.996224  0.998266  1.000000  0.999704
# 2   0.996431  0.997345  0.999704  1.000000

# Covariance
df.cov()
#=>
#               11            12             1             2
# 11  2.150216e+12  2.116705e+12  2.771215e+12  6.500842e+11
# 12  2.116705e+12  2.087233e+12  2.735923e+12  6.893663e+11
# 1   2.771215e+12  2.735923e+12  3.598686e+12  1.031584e+12
# 2   6.500842e+11  6.893663e+11  1.031584e+12  9.873960e+11
```
What can we learn from this?
However, four days into the actual February event, the score had dropped.
This had never happened before, in any of the data including what was shown last time. Scores had kept rising (inflating) for a long time; a decline is a first, and it suddenly makes prediction harder.
Once again, let's form some hypotheses.
There are many possibilities.
In any case, since the daily growth of every month is highly correlated (all coefficients above 0.99), it seems reasonable to predict the end-of-event score by referring to the daily growth of the past three months.
We therefore capture the growth of the score as its **percentage change**.
The rate of rise and fall is an investment term: an indicator of price movement that compares two points in time and shows by what percentage a fund's value went up or down. Here we treat the growth in score as if it were the price movement of a fund and compute its rate of change.
```python
# Growth rate from the previous day
df.pct_change()
#=>
#          11        12         1         2
# 1       NaN       NaN       NaN       NaN
# 2  1.315821  1.288455  1.386508  1.475449
# 3  0.726815  0.575413  0.537399  0.623495
# 4  0.405916  0.397485  0.294568  0.303079
# 5  0.295578  0.281922  0.207571       NaN
# 6  0.293197  0.210570  0.206691       NaN
# 7  0.494807  0.484321  0.426303       NaN

# Transposed: growth rate versus the same day of the previous month
df.T.pct_change()
#=>
#            1         2         3         4         5         6         7
# 11       NaN       NaN       NaN       NaN       NaN       NaN       NaN
# 12  0.227813  0.213304  0.106925  0.100287  0.088689  0.019129  0.011979
# 1   0.598159  0.666635  0.626419  0.506643  0.419258  0.414710  0.359413
# 2  -0.158465 -0.127103 -0.078220 -0.072160       NaN       NaN       NaN
```
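As a sanity check on what `pct_change()` computes: the value for day t is (v_t - v_(t-1)) / v_(t-1). Continuing with the `df` built above:

```python
# pct_change() is the day-over-day relative change along the index
manual = (df.loc[2, 11] - df.loc[1, 11]) / df.loc[1, 11]
assert abs(manual - df.pct_change().loc[2, 11]) < 1e-9
```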
The day-over-day growth rate follows much the same pattern every month. It varies a little depending on whether the day is a weekend or a weekday, but growth peaks on the second day, slows in the middle, and then jumps 40 to 50 percent again in the final-day spurt.
Looking next at the same-day growth versus the previous month, February has turned negative.
```python
# Growth rate from the previous month, per day
pct_change = df.T.pct_change()

# Estimate the final border: January's final-day score times
# (1 + February's growth vs. January on the given day)
def estimated_from_reference(day):
    return df.loc[7, 1] * (1 + pct_change.loc[2, day])

estimated = [estimated_from_reference(x) for x in range(1, 7)]
print(estimated)
#=>
# [5324211.8451061565, 5522634.3150592418, 5831908.3162212763, 5870248.3304103278, nan, nan]
# Estimated final border score based on days 1, 2, 3, and 4 (days 5 and 6 not yet available)
```
That gives the estimates above.
Alternatively, we can estimate the scores for days 5 and 6 and the final day, which are yet to come.
```python
# Estimate February's score on a given day from its score on an earlier day,
# scaled by January's rate of change between those two days
def estimated_from_perchange(criteria, day):
    return df.loc[criteria, 2] * (1 + df.pct_change().loc[day, 1])

# Day 5 of February, from January's day-4-to-day-5 rate of change
df.loc[5, 2] = estimated_from_perchange(4, 5)
#=> 3410740.086731

# Day 6 likewise
df.loc[6, 2] = estimated_from_perchange(5, 6)
#=> 4115709.258368

# And the final day
df.loc[7, 2] = estimated_from_perchange(6, 7)
#=> 5870248.330410
```
This fills in the missing values in the data frame. From it we could predict, as of day 4, that the top-reward border would land around 5.87 million points.
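To eyeball the result, we can simply re-plot the data frame from the same session, since February's days 5 through 7 now hold the estimated values:

```python
# February's curve now extends to the final day using the estimates
df.plot()
```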
For reference, the actual results came in at 3,487,426 points on day 5 (102.2% of the day-4 prediction), 4,094,411 points on day 6 (99.5%), and 5,728,959 points on the final day (97.5%). So aiming at the predicted 5.87 million points would have cleared the actual border with room to spare and secured the top reward.
Day | Forecast | Actual | Actual / Forecast
---|---|---|---
Day 5 | 3,410,740 | 3,487,426 | 102.2%
Day 6 | 4,115,709 | 4,094,411 | 99.5%
Final day | 5,870,248 | 5,728,959 | 97.5%
The results of this exploratory analysis can be written up as a program and saved, using the powerful features of IPython introduced earlier, so the calculation can be re-run daily while following the event's progress. In practice this demonstrated that borders can be predicted with very high accuracy.
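As a sketch of what such a saved routine might look like (the function is hypothetical, not the original code): each day, update the CSV with the newest border score, rebuild `df` as above, and re-run the same-day-growth estimate.

```python
# Hypothetical daily routine: estimate the final border from the latest
# available day, using the previous month's final score and the
# month-over-month growth rate on that day
def estimate_final_border(df, prev_month, cur_month, last_day=7):
    growth = df.T.pct_change()                 # month-over-month growth per day
    latest = df[cur_month].last_valid_index()  # most recent day with real data
    return df.loc[last_day, prev_month] * (1 + growth.loc[cur_month, latest])

print(estimate_final_border(df, prev_month=1, cur_month=2))
```

This assumes `df` holds only observed scores (future days left as NaN) so that `last_valid_index()` points at the latest actual day.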
In the world of investment there is an indicator called the [up-down ratio](http://ja.wikipedia.org/wiki/%E9%A8%B0%E8%90%BD%E3%83%AC%E3%82%B7%E3%82%AA). It is a short-term indicator that takes the ratio of advancing to declining issues across all listed stocks, and from it you can read whether the market is overbought.
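As a rough sketch of how such an indicator is computed (one common definition is advancing issues divided by declining issues over a 25-day window, times 100; the price data here is hypothetical):

```python
import pandas as pd

# closes: rows = trading days, columns = individual stocks (toy data)
closes = pd.DataFrame({
    "A": [100, 101, 99, 102],
    "B": [200, 198, 199, 197],
    "C": [50, 51, 52, 53],
})

diff = closes.diff()
advancers = (diff > 0).sum(axis=1)  # issues up from the previous day
decliners = (diff < 0).sum(axis=1)  # issues down from the previous day

# 25 trading days in practice; a 2-day window here to fit the toy data
ratio = advancers.rolling(2).sum() / decliners.rolling(2).sum() * 100
print(ratio)  # readings far above 100 are commonly read as overbought
```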
As an actual example, let's look at the Nikkei 225 on March 4, 2015.
Up-down ratio vs. Nikkei average comparison chart: http://nikkei225jp.com/data/touraku.html
On that day selling led, and the Nikkei average fell about 200 yen during the morning session. Looking at the up-down ratio just before this, it showed very high readings of 130 to 140. In essence this says the market was running hot and overbought, a sign that the Nikkei average was due for a sharp fall. After the actual 200-yen drop, the ratio returned to its normal range, buying came back in, and the market rebounded.
In this way, common methods apply to the fundamentals of both the numerical analysis of a small world such as a game and the analysis of the large world of finance and economics. Of course, sudden upheavals can bring unforeseen circumstances, but the attitude of analyzing things scientifically on a daily basis is what matters.