Introduction

Sometimes I want to trim the data before taking a moving average. I also want the average to be weighted, so that the most recent values have more influence. The situation I have in mind is a time series whose overall trend is steady but whose values occasionally spike, and I would like a moving average that ignores those spikes. Maybe you have felt the same. No? Well, I am writing it up anyway. I also had a little trouble with `rolling().apply()`, so I want to note that down.

Truncated mean

A truncated (trimmed) mean is computed by sorting the values to be averaged by size and excluding N% of them from one side or both sides before averaging. Suppose we have the following 10 values: [10,24,31,34,65,86,87,88,99,101]

Taking the ordinary mean gives $(10+24+31+34+65+86+87+88+99+101) \div 10 = 62.50$. If 10% of the data is removed from each side (here, the smallest and the largest value), the result is $(24+31+34+65+86+87+88+99) \div 8 = 64.25$.

Removing values like this and then averaging is the trimmed mean. For a moving average, the idea is to trim and average the values inside each window. The benefit is that outliers are excluded, so the average is not pulled toward them.
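The arithmetic above can be checked with a few lines of plain Python. This is only a minimal sketch: `trimmed_mean` is a hypothetical helper name, and it assumes the same proportion is cut from both ends and that the proportion is below 0.5 so that at least one value remains.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    ordered = sorted(values)
    # Number of values to drop on each side (rounded down)
    k = int(len(ordered) * proportion)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

data = [10, 24, 31, 34, 65, 86, 87, 88, 99, 101]

plain = sum(data) / len(data)        # ordinary mean: 62.5
trimmed = trimmed_mean(data, 0.10)   # drops 10 and 101: 64.25
print(plain, trimmed)
```

With 10 values and a proportion of 0.10, exactly one value is dropped from each end, which matches the hand calculation.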

Weighted average

This is explained in many other places, so I will skip the details. In short, each value is multiplied by a weight before averaging. In a weighted moving average, the weights are usually chosen so that the most recent value has the largest influence.

Since this time we want a trim-weighted moving average, we first trim the values inside each moving-average window and then take the weighted average of what remains.
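As a small illustration of the weighting idea, here is a sketch using `np.average` with linearly increasing weights (1 for the oldest value up to n for the newest). The sample window values are made up for the example; linear weights are one common choice, not the only one.

```python
import numpy as np

# Hypothetical window, ordered oldest -> newest
window = np.array([100.0, 110.0, 120.0])

# Linear weights 1, 2, 3: the newest value counts three times as much as the oldest
weights = np.arange(1, len(window) + 1)

sma_value = window.mean()                        # simple average: 110.0
wma_value = np.average(window, weights=weights)  # (100*1 + 110*2 + 120*3) / 6

print(sma_value, wma_value)
```

Because the series is rising, the weighted average sits closer to the newest value than the simple average does.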

Try it out in Python

Package import

`import.py`


import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import datetime as dt
from dateutil.relativedelta import relativedelta
sns.set()

Data creation

Create data modeled on how the number of active users of a smartphone app changes over time.

`make_data.py`


# Decay curve: a power function y = b * x**a (a < 0 makes it decay)
def exp_func(x, a, b):
    return b*(x**a)
x=np.linspace(1,36,36)
data=exp_func(x, -0.5, 100000)

# Data frame (cast to int so that user counts are whole numbers)
df=pd.DataFrame({'x':x,'y':data})
df=df.astype(int)

# Add a month column, since this is meant to be monthly time-series data
init_mon=dt.datetime(2017,int(df['x'].values[0]),1)
months=[init_mon]
for mon in range(1,len(df)):
    months.append(init_mon + relativedelta(months=mon))
df['month']=months
df.index=months
display(df.head())  # display() is available in Jupyter/IPython

# plot
fig=plt.figure(figsize=(12,6))
ax=plt.subplot(1,1,1)
df.plot.bar('month','y',ax=ax)
plt.show()

Data processing (entering abnormal values)

The number of active users of a smartphone app can rise considerably for a short time because of a campaign or event, so let's modify the data to simulate that situation.

`change_data.py`


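df2=df.copy()
# Cast to float first, because the multiplied values are no longer integers
df2['y']=df2['y'].astype(float)
# Inflate January to March of 2018 and later to simulate a campaign spike
df2.loc[(df2.index.month==1)&(df2.index.year>=2018), 'y']=df2['y']*1.6
df2.loc[(df2.index.month==2)&(df2.index.year>=2018), 'y']=df2['y']*1.4
df2.loc[(df2.index.month==3)&(df2.index.year>=2018), 'y']=df2['y']*1.2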

fig=plt.figure(figsize=(12,6))
ax=plt.subplot(1,1,1)
df2.plot.bar('month','y',ax=ax)
plt.show()

We now have data that looks as if a campaign, such as a first-anniversary event, temporarily boosted the number of active users.

Try to take the trim-weighted moving average

This time, let's write the code with the policy of using the trim-weighted moving average as a predicted value. The prediction for a given month is the trim-weighted moving average computed 3 months earlier. (Example: the forecast for 2018-08 is the trim-weighted moving average as of 2018-05.)

First, the trimmed simple moving average function
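The 3-month lag can be sketched with `Series.shift` on a toy monthly series. The values below are invented for the illustration, and a plain rolling mean stands in for the trim-weighted average so that only the alignment is shown.

```python
import pandas as pd

# Hypothetical monthly values, March 2018 .. August 2018
months = pd.date_range("2018-03-01", periods=6, freq="MS")
y = pd.Series([10, 20, 30, 40, 50, 60], index=months, dtype=float)

# Move every value 3 rows (months) forward: the row for 2018-08 now holds the 2018-05 value
y_shift = y.shift(3)

# Any rolling statistic on y_shift therefore only sees data up to 3 months earlier
forecast = y_shift.rolling(3, min_periods=1).mean()

print(pd.DataFrame({"y": y, "y_shift": y_shift, "forecast": forecast}))
```

The first three rows of `y_shift` are NaN because there is no data 3 months before them in this toy series.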

`sma.py`


def sma(roll_list):
    # Drop NaN values from roll_list
    roll_list=roll_list[~np.isnan(roll_list)]
    # Sort roll_list in ascending order
    sorted_roll_list=sorted(roll_list)
    # Index at half the length of the sorted list (the trimming threshold)
    half_span=round(len(sorted_roll_list)/2)
    if half_span > 0:
        # Keep the values strictly below the threshold (roughly the lower half) and average them
        half_index=np.where(roll_list < sorted_roll_list[half_span])
        roll_list_half=roll_list[half_index]
        result = np.sum(roll_list_half) / len(roll_list_half)
    else:
        # roll_list has at most one element, so there is nothing to trim;
        # use its value as is
        result = roll_list[0]
    return result

Next, the trim-weighted moving average function

`sma.py`


def wma(roll_list):
    # Drop NaN values from roll_list
    roll_list=roll_list[~np.isnan(roll_list)]
    # Sort roll_list in ascending order
    sorted_roll_list=sorted(roll_list)
    # Index at half the length of the sorted list (the trimming threshold)
    half_span=round(len(sorted_roll_list)/2)
    # Keep the values strictly below the threshold, in their original (time) order
    half_index=np.where(roll_list < sorted_roll_list[half_span])
    roll_list_half=roll_list[half_index]
    # Weights 1, 2, ..., n so that the newest kept value counts most.
    # If nothing was kept (e.g. a single-element window), this is 0/0 and returns NaN.
    weight = np.arange(len(roll_list_half)) + 1
    result = np.sum(weight * roll_list_half) / weight.sum()
    return result

Next, using the functions above, we set the window size to 6 months, trim away everything except the 3 months with the lowest values, and take the weighted moving average. A moving average is easy to compute with pandas `rolling`, and `apply` lets you run your own function on each window. The `roll_list` that appears in `sma` and `wma` above is the array of values that falls inside the window (period) specified by `rolling`. By default, it seems that this data is passed to the function as a Series. The `sma` and `wma` functions were written assuming an ndarray, so they raise an error when given a Series. To prevent that, `raw = True` is passed to `apply` so that each window arrives as an ndarray.
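To see the effect of `raw` directly, the following sketch records the type of the object that `apply` hands to a function in each mode. The helper `record_type` and the toy series are made up for this check.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])
seen = {}

def record_type(key):
    """Build a window function that remembers what type it was called with."""
    def inner(window):
        seen[key] = type(window).__name__
        return window.sum()  # .sum() exists on both ndarray and Series
    return inner

# raw=True: each window is passed as a NumPy ndarray
s.rolling(2).apply(record_type("raw_true"), raw=True)
# raw=False: each window is passed as a pandas Series
s.rolling(2).apply(record_type("raw_false"), raw=False)

print(seen)
```

Functions that rely on positional or boolean ndarray indexing, like the ones above, are therefore safest with `raw=True`.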

`moving_mean.py`


# Column holding the value from 3 months earlier
df2['y_shift'] = df2['y'].shift(3)
# Trimmed SMA of the values up to 3 months earlier
df2['y_sma'] = df2['y_shift'].rolling(6,min_periods=1).apply(sma, raw = True)
# Trimmed WMA of the values up to 3 months earlier
df2['y_wma'] = df2['y_shift'].rolling(6,min_periods=1).apply(wma, raw = True)
# Where the WMA could not be computed (NaN), fall back to the SMA value
df2.loc[pd.isna(df2['y_wma']), 'y_wma']=df2['y_sma']

display(df2.head(10))

# plot
fig=plt.figure(figsize=(12,6))
ax=plt.subplot(1,1,1)
df2.plot.bar('month','y',ax=ax,color='b',alpha=0.9)
df2.plot.bar('month','y_wma',ax=ax,color='r',alpha=0.6)
plt.show()

In the data frame, the actual value is marked in yellow, the values for the most recent 3 months in light blue, and the 3 lowest values of the most recent 6 months in green. In the graph, blue is the actual value and red is the predicted value, that is, the trim-weighted moving average of the data up to 3 months earlier.

For example, take a look at June 2018.

- Actual value: 23,570
- Weighted moving average over 3 months without trimming (weighted average of January 2018 to March 2018): $(3 \times 30,983 + 2 \times 37,416 + 1 \times 44,376) \div (3+2+1) \approx 35,359$
- Weighted moving average over 3 months with trimming (weighted average of the 3 lowest months from October 2017 to March 2018): $(3 \times 30,983 + 2 \times 28,867 + 1 \times 30,151) \div (3+2+1) = 30,139$

Trimming and taking a weighted moving average will eliminate the impact of the sharp rise in January 2018 on subsequent calculations.
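The June 2018 comparison can be reproduced without the full data frame. The sketch below rebuilds the six values for October 2017 to March 2018 from the same decay curve and campaign multipliers used earlier, then applies the same "keep values below the middle of the sorted window" rule as `wma`. The helper names `base_value` and `linear_wma` are introduced only for this check.

```python
import numpy as np

def base_value(x):
    # Same decay curve as in make_data.py, truncated to an integer
    return int(100000 * x ** -0.5)

# x = 10..15 corresponds to October 2017 .. March 2018
boost = {13: 1.6, 14: 1.4, 15: 1.2}  # campaign multipliers from change_data.py
window = np.array([base_value(x) * boost.get(x, 1.0) for x in range(10, 16)])

def linear_wma(values):
    # Weights 1..n, the newest value weighted most
    weights = np.arange(1, len(values) + 1)
    return np.sum(weights * values) / weights.sum()

# Without trimming: weighted average of the latest 3 months (Jan..Mar 2018)
untrimmed = linear_wma(window[-3:])

# With trimming: keep only values below the middle element of the sorted window
threshold = np.sort(window)[round(len(window) / 2)]
trimmed = linear_wma(window[window < threshold])

print(untrimmed, trimmed)  # roughly 35359.5 and 30138.9
```

The trimmed result keeps November 2017, December 2017 and March 2018, so the January and February spikes no longer enter the forecast.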

Summary

I think this is handy when you want a moving average that is not affected by abnormal values, for time-series data in which such values appear frequently.
