Sometimes I want to trim the data before taking a moving average. I also want the average to be weighted, so that the most recent values have more influence. The underlying trend is basically steady, but the values occasionally spike or rise sharply, and I want a moving average that ignores those points in the time series. You may have felt the same. No? I will write it up anyway. I had a little trouble with pandas rolling().apply(), so I want to note down what I did.
A trimmed mean is computed by sorting the values to be averaged by size and excluding N% of them from one end or from both ends before averaging.
Suppose we have the following 10 data points.
If you take the average normally,
A trimmed (pruned) mean removes values like this before averaging. For a moving average, the idea is to trim and then average the values inside each window. The benefit is that outliers are excluded, which keeps the average from being pulled toward them.
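As a rough illustration of the idea, here is a minimal sketch of a trimmed mean. The sample values and the helper name trimmed_mean are made up for this example, and treating one-sided trimming as "drop only the largest values" is my assumption.

```python
import numpy as np

def trimmed_mean(values, cut=0.1, both_sides=True):
    """Mean after dropping a fraction `cut` of the sorted values from the top
    (and also from the bottom when both_sides is True)."""
    ordered = np.sort(np.asarray(values, dtype=float))
    k = int(len(ordered) * cut)           # number of values dropped per side
    lower = k if both_sides else 0        # one-sided: keep the smallest values
    upper = len(ordered) - k
    return ordered[lower:upper].mean()

# Hypothetical data: nine ordinary values and one outlier
sample = [10, 11, 12, 13, 14, 15, 16, 17, 18, 100]

plain = np.mean(sample)                   # pulled upward by the outlier
trimmed = trimmed_mean(sample, cut=0.1)   # drops 10 and 100 before averaging
print(plain, trimmed)
```

With this sample the ordinary mean is 22.6, while the trimmed mean is 14.5 because the outlier 100 no longer takes part in the average.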
Weighted averages are explained in many places, so I will omit the details. The values are multiplied by weights and then averaged. In a weighted moving average, the weights are usually chosen so that the most recent values have the greatest influence.
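For reference, a minimal sketch of a linearly weighted moving average with pandas. The weights 1, 2, ..., n (oldest to newest) are one common choice, and the series values and the name linear_wma are assumptions for this example.

```python
import numpy as np
import pandas as pd

def linear_wma(window):
    # Weights 1..n, so the newest value in the window gets the largest weight
    weights = np.arange(1, len(window) + 1)
    return np.sum(weights * window) / weights.sum()

# Hypothetical series
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# raw=True hands each window to linear_wma as a NumPy array
wma3 = s.rolling(3).apply(linear_wma, raw=True)
print(wma3)
```

For the first full window (1, 2, 3) the result is (1*1 + 2*2 + 3*3) / 6, about 2.33, which sits closer to the newest value than the simple mean of 2.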
This time we want a trimmed weighted moving average, so within each moving-average window we first trim the values and then take the weighted average of what remains.
import.py
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import datetime as dt
from dateutil.relativedelta import relativedelta
sns.set()
Create data modeled on the trend in the number of active users of a smartphone app.
make_data.py
# Decay curve
def exp_func(x, a, b):
    return b*(x**a)
x=np.linspace(1,36,36)
data=exp_func(x, -0.5, 100000)
#Data frame
df=pd.DataFrame({'x':x,'y':data})
df=df.astype(int)
#Create a month column because it is an assumption of time series data
init_mon=dt.datetime(2017,df['x'].values[0],1)
months=[init_mon]
for mon in range(1,len(df)):
    months.append(init_mon + relativedelta(months=mon))
df['month']=months
df.index=months
display(df.head())
# plot
fig=plt.figure(figsize=(12,6))
ax=plt.subplot(1,1,1)
df.plot.bar('month','y',ax=ax)
plt.show()
The number of active users of a smartphone app can jump temporarily during a campaign or event, so let's modify the data to simulate that situation.
change_data.py
df2=df.copy()
df2.loc[(df2.index.month==1)&(df2.index.year>=2018), 'y']=df2['y']*1.6
df2.loc[(df2.index.month==2)&(df2.index.year>=2018), 'y']=df2['y']*1.4
df2.loc[(df2.index.month==3)&(df2.index.year>=2018), 'y']=df2['y']*1.2
fig=plt.figure(figsize=(12,6))
ax=plt.subplot(1,1,1)
df2.plot.bar('month','y',ax=ax)
plt.show()
This gives data in which the number of active users appears to jump during campaigns such as a first-anniversary event.
This time, let's write code that uses the trimmed weighted moving average as the predicted value. The prediction for three months ahead is the trimmed weighted moving average as of the current month. (Example: the forecast for 2018-08 is the trimmed weighted moving average computed as of 2018-05.)
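To make the three-month offset concrete, here is a small sketch of how shift and rolling line up. The series values are hypothetical, and a plain mean stands in for the trimmed weighted average just to keep the alignment easy to follow.

```python
import pandas as pd

# Hypothetical monthly values 1..8 for 2018-01 through 2018-08
idx = pd.date_range("2018-01-01", periods=8, freq="MS")
s = pd.Series(range(1, 9), index=idx, dtype=float)

# After shift(3), the row for 2018-08 holds the value observed in 2018-05
shifted = s.shift(3)

# A 6-month window over the shifted column therefore only sees data
# up to 2018-05 when it produces the value for 2018-08
forecast = shifted.rolling(6, min_periods=1).mean()

print(pd.DataFrame({"actual": s, "shifted": shifted, "forecast": forecast}))
```

In the 2018-08 row, the shifted column shows 5.0 (the 2018-05 value), and the forecast is the mean of the values available up to 2018-05 within the window.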
First, the trim simple moving average function
sma.py
def sma(roll_list):
    # Remove NaN values from roll_list
    roll_list=roll_list[~np.isnan(roll_list)]
    # Sort roll_list in ascending order
    sorted_roll_list=sorted(roll_list)
    # Half the length of the sorted roll_list
    harf_span=round(len(sorted_roll_list)/2)
    if harf_span > 0:
        # Take the values of roll_list below the middle sorted value and average them
        harf_index=np.where(roll_list < sorted_roll_list[harf_span])
        roll_list_harf=roll_list[harf_index]
        sma = np.sum(roll_list_harf) / len(roll_list_harf)
    else:
        # roll_list has length 1 or less, so no median can be taken;
        # use the value of roll_list as it is
        roll_list_harf=roll_list[0]
        sma = roll_list_harf
    return sma
Then the trim-weighted moving average function
sma.py
def wma(roll_list):
    # Remove NaN values from roll_list
    roll_list=roll_list[~np.isnan(roll_list)]
    # Sort roll_list in ascending order
    sorted_roll_list=sorted(roll_list)
    # Half the length of the sorted roll_list
    harf_span=round(len(sorted_roll_list)/2)
    # Take the values of roll_list below the middle sorted value (time order is kept)
    harf_index=np.where(roll_list < sorted_roll_list[harf_span])
    roll_list_harf=roll_list[harf_index]
    # Weight the kept values 1, 2, ... (oldest to newest) and take the weighted average
    weight = np.arange(len(roll_list_harf)) + 1
    wma = np.sum(weight * roll_list_harf) / weight.sum()
    return wma
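One thing to watch with the strict < comparison is that tied values at the cutoff are all dropped, and a window with a single value leaves nothing to average (the result is NaN). As a possible alternative, and not the method used in this post, here is a sketch that keeps a fixed number of the lowest values by position. The name trimmed_wma and the keep parameter are hypothetical.

```python
import numpy as np

def trimmed_wma(window, keep=3):
    """Weighted average of the `keep` lowest values, weighted oldest to newest."""
    window = np.asarray(window, dtype=float)
    window = window[~np.isnan(window)]
    if window.size == 0:
        return np.nan
    k = min(keep, window.size)
    # Positions of the k smallest values; a stable sort breaks ties by position
    lowest = np.argsort(window, kind="stable")[:k]
    # Put the kept values back into chronological order before weighting
    kept = window[np.sort(lowest)]
    weights = np.arange(1, k + 1)
    return np.sum(weights * kept) / weights.sum()

# Hypothetical 6-month window containing spikes (50 and 60)
print(trimmed_wma([10, 50, 20, 30, 60, 40]))   # uses 10, 20, 30
print(trimmed_wma([5, 5, 5, 5, 5, 5]))         # ties no longer produce NaN
```

For a full 6-value window without ties this keeps the same three values as wma above; it differs for shorter windows, where wma keeps roughly half of the values, and for windows with ties.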
Next, using the functions above, we set the window size to 6 months, trim everything except the 3 lowest values in the window, and take the weighted moving average. pandas rolling makes it easy to compute a moving average, and apply lets you run your own function on each window. The roll_list that appears in sma and wma above is the array of values in the window (period) specified with rolling. Unless told otherwise, apply appears to pass that array to the function as a Series. Because sma and wma were written assuming an ndarray, a Series causes an error. To avoid the error, pass raw = True to apply so that the window arrives as an ndarray.
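To check what the function actually receives, here is a small sketch that records the type of the window for each setting of raw. The helper record_type and the sample series are made up for this check.

```python
import numpy as np
import pandas as pd

seen = {}

def record_type(key):
    # Build a function that remembers the type of the window it was given
    def _f(window):
        seen[key] = type(window).__name__
        return float(np.sum(window))
    return _f

# Hypothetical series
s = pd.Series([1.0, 2.0, 3.0, 4.0])

s.rolling(2).apply(record_type("raw=True"), raw=True)
s.rolling(2).apply(record_type("raw=False"), raw=False)

print(seen)
```

On recent pandas versions this shows an ndarray for raw=True and a Series for raw=False, which is why functions written with NumPy positional indexing need raw=True.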
moving_mean.py
# Create a column holding the value from 3 months earlier
df2['y_shift'] = df2['y'].shift(3)
# Trimmed SMA of the values up to 3 months earlier
df2['y_sma'] = df2['y_shift'].rolling(6,min_periods=1).apply(sma, raw = True)
# Trimmed WMA of the values up to 3 months earlier
df2['y_wma'] = df2['y_shift'].rolling(6,min_periods=1).apply(wma, raw = True)
# Where the WMA could not be calculated (NaN), fall back to the SMA value
df2.loc[pd.isna(df2['y_wma']), 'y_wma']=df2['y_sma']
display(df2.head(10))
# plot
fig=plt.figure(figsize=(12,6))
ax=plt.subplot(1,1,1)
df2.plot.bar('month','y',ax=ax,color='b',alpha=0.9)
df2.plot.bar('month','y_wma',ax=ax,color='r',alpha=0.6)
plt.show()
In the data frame, the yellow line is the actual value, the light blue line is the value for the last 3 months, and the green line is the value for the lowest 3 months of the last 6 months. In the graph, blue is the actual value and red is the trimmed weighted moving average of the data up to 3 months earlier (the predicted value).
For example, take a look at June 2018.
·Actual value:
Trimming before taking the weighted moving average removes the impact of the sharp rise in January 2018 on the subsequent calculations.
I think this is convenient when you want a moving average that is not affected by abnormal values in time-series data where such values occur frequently.