This article introduces Donald Knuth's algorithm, which is a good variance calculation algorithm. I usually rely on the library to calculate the variance, and I haven't considered the algorithm in particular, so I hope I can take this opportunity to deepen my knowledge.
First, I'll explain why you need such an algorithm. The unbiased variance is expressed by the following formula.
\sigma^2 = \frac{1}{n(n-1)} \biggl(\sum_{i=1}^{n}x_i^2-\Bigl(\sum_{i=1}^{n}x_i\Bigr)^2\biggr) =\frac{1}{n-1}\Bigl(E[X^2]-E[X]^2\Bigr)
Now, here are two cumulative sums. The cumulative sum of 1. $ x $ itself and the cumulative sum of 2. $ x ^ 2 $. There are two main problems at this time. First, the accuracy of numerical calculations is not guaranteed when $ x $ is a very large number. Second, you have to do large calculations again when you have more samples. To solve these challenges, let's look at Donald Knuth's algorithm in the next section.
To solve the above-mentioned problems, this algorithm was designed to calculate the variance sequentially. The algorithm is shown below. $ M_k $ represents the mean and $ s_k $ is the numerator of the variance.
Initialization: $ M_1 = x_1, S_1 = 0 $ Calculate the following recurrence formula in $ k \ geq 2 $
\begin{align}
M_k &= M_{k-1} + \frac{(x_k - M_{k-1})}{k}\\
S_k &= S_{k-1} + (x_k - M_{k-1})*(x_k - M_k)
\end{align}
The estimator of the unbiased variance to be calculated is $ s ^ 2 = S_k / (k-1) $ in $ k \ geq 2 $.
Let's implement it in Python to make sure that the above algorithm actually works. Since it is a recurrence formula, we will implement it with a recursive function. I made it possible to check each print statement so that you can see how it is being calculated sequentially.
def calc_var(x, k=0, M=0, s=0):
print('k=', k)
print('M=', M)
print('s=', s)
print('-----------------------')
if k == 0:
M = x[0]
s = 0
delta = x[k] - M
M += delta/(k+1)
s += delta*(x[k]-M)
k += 1
if k == len(x):
return M, s/(k-1)
else:
return calc_var(x, k=k, M=M, s=s)
x = [3.3, 5, 7.2, 12, 4, 6, 10.3]
print(calc_var(x))
Execution result
k= 0
M= 0
s= 0
-----------------------
k= 1
M= 3.3
s= 0.0
-----------------------
k= 2
M= 4.15
s= 1.4449999999999996
-----------------------
k= 3
M= 5.166666666666667
s= 7.646666666666666
-----------------------
k= 4
M= 6.875
s= 42.6675
-----------------------
k= 5
M= 6.3
s= 49.279999999999994
-----------------------
k= 6
M= 6.25
s= 49.355
-----------------------
(6.828571428571428, 10.56904761904762)
Let's actually calculate and compare the mean and variance with numpy. The keyword argument ddof = 1 is for calculating as unbiased variance.
import numpy as np
x_arr = np.array(x)
print((x_arr.mean(), x_arr.var(ddof=1)))
Output result
(6.828571428571428, 10.569047619047621)
I was able to calculate sequentially like this. Actually, I don't think that python recursion is faster than numpy, so I don't think I have a chance to implement it myself. I received it. I hope it will be helpful to everyone.
Accurately computing running variance https://www.johndcook.com/blog/standard_deviation/
Recommended Posts