When running statistical numerical experiments, you sometimes want to generate correlated random numbers. A quick search brings up a method based on the Cholesky decomposition. That approach handles any number of variables, but I only needed the $2$-variable case, so I'll explain it in a more hands-on way[^1].

[^1]: In the end, the result is the same as performing the Cholesky decomposition for the $2$-variable case.
Suppose you have independent random numbers $X$ and $Y$, each with mean zero and equal variance. Then a random number $Z$ that correlates with $X$ with correlation coefficient $\rho$ and has the same variance can be constructed as
Z = \rho X + \sqrt{1 - \rho^2} Y
You can construct it like this. In this article, I will explain why this works.
Let the independent random numbers $X$ and $Y$ have mean zero and the same variance $\sigma^2$. Since $X$ and $Y$ are independent, their covariance and correlation coefficient are zero.
Next, define the random number $Z$ that we want to correlate with $X$ as a superposition of $X$ and $Y$:

Z = a X + b Y

where $a$ and $b$ are constants.
The reason for this form is that the following extreme examples work out:

| $a$ | $b$ | $Z$ | Correlation with $X$ |
|---|---|---|---|
| $1$ | $0$ | $X$ | $1$ |
| $0$ | $1$ | $Y$ | $0$ |
| $-1$ | $0$ | $-X$ | $-1$ |

Looking at this table, it seems that an arbitrary correlation coefficient can be produced by adjusting the balance between $a$ and $b$[^2]. So the expression for $Z$ above is a promising candidate for a random number with an arbitrary correlation with $X$.
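These extreme examples are easy to check numerically. Below is a quick sketch of mine (not from the original article) that generates uniform $X$ and $Y$ and prints the sample correlation for each choice of $(a, b)$:

```python
import numpy as np
import numpy.random as rand

size = int(1e5)
x = rand.rand(size) - 0.5  # uniform random numbers with mean zero
y = rand.rand(size) - 0.5

# Extreme choices of (a, b) and the correlations they produce
for a, b in [(1, 0), (0, 1), (-1, 0)]:
    z = a * x + b * y
    print("a = {:2d}, b = {:2d} -> correlation = {:+.3f}"
          .format(a, b, np.corrcoef(x, z)[0][1]))
```

The first and third cases give correlations of exactly $\pm 1$, while the middle case hovers around zero with sampling noise.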
Now let's actually calculate the correlation coefficient $\rho$ between $X$ and $Z$. The correlation coefficient is
\rho = \frac{\mathrm{Cov}[X, Z]}{\sqrt{\mathrm{Var}[X]} \sqrt{\mathrm{Var}[Z]}}
which is how it can be computed. Here, $\mathrm{Cov}$ and $\mathrm{Var}$ denote the covariance and the variance, respectively. Let's start with the covariance.
\begin{align}
\mathrm{Cov}[X, Z] &= \mathrm{Cov}[X, a X + b Y] & (\because \text{definition of } Z)
\\
&= a \, \mathrm{Cov}[X, X] + b \, \mathrm{Cov}[X, Y] & (\because \text{bilinearity of } \mathrm{Cov})
\\
&= a \, \sigma^2 & (\because \mathrm{Cov}[X, X] = \mathrm{Var}[X] = \sigma^2,\ \mathrm{Cov}[X, Y] = 0)
\end{align}
The calculation may look fiddly, but if any step is unclear, you can reach it by going back to the definitions and working through it calmly. If that's too much hassle, you don't have to follow every step.
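If you'd rather confirm $\mathrm{Cov}[X, Z] = a\,\sigma^2$ numerically than follow the algebra, here is a small check of mine (the values of $a$ and $b$ are arbitrary choices, not from the article):

```python
import numpy as np
import numpy.random as rand

size = int(1e6)
a, b = 0.3, 0.7  # arbitrary constants for the check
sigma2 = 1 / 12  # variance of a uniform random number on [-0.5, 0.5]

x = rand.rand(size) - 0.5
y = rand.rand(size) - 0.5
z = a * x + b * y

# The sample covariance should be close to a * sigma^2
print("Cov[X, Z]   =", np.cov(x, z)[0][1])
print("a * sigma^2 =", a * sigma2)
```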
Similarly, for the variance of $Z$,
\begin{align}
\mathrm{Var}[Z] &= \mathrm{Var}[a X + b Y] & (\because \text{definition of } Z)
\\
&= a^2 \mathrm{Var}[X] + b^2 \mathrm{Var}[Y] & (\because \text{properties of } \mathrm{Var},\ X \text{ and } Y \text{ independent})
\\
&= a^2 \sigma^2 + b^2 \sigma^2 &
\end{align}
Therefore, the correlation coefficient that I wanted to calculate is
\begin{align}
\rho &= \frac{\mathrm{Cov}[X, Z]}{\sqrt{\mathrm{Var}[X]} \sqrt{\mathrm{Var}[Z]}}
\\
&= \frac{a \ \sigma^2}{\sqrt{\sigma^2} \sqrt{a^2 \sigma^2 + b^2 \sigma^2}}
\\
&= \frac{a}{\sqrt{a^2 + b^2}}
\end{align}
which is quite a clean expression. Solving this equation for $b$ gives
b = \frac{\sqrt{1 - \rho^2}}{\rho} a
If, just to tidy up the expression, you substitute $a = c \rho$ (where $c$ is a positive constant), you get
Z = c \left( \rho X + \sqrt{1 - \rho^2} Y \right)
This $c$ is a parameter that controls the variance of $Z$; in fact,
\begin{align}
\mathrm{Var}[Z] &= \mathrm{Var}\left[{c \left( \rho X + \sqrt{1 - \rho^2} Y \right)} \right]
\\
&= c^2 \left[ \rho^2 \sigma^2 + (1 - \rho^2) \sigma^2 \right]
\\
&= c^2 \sigma^2
\end{align}
so it has the effect of multiplying the variance by $c^2$ (that is, multiplying the standard deviation by $c$). Setting $c = 1$, the opening formula
Z = \rho X + \sqrt{1 - \rho^2} Y
is obtained.
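As a sanity check of the whole derivation, the following sketch of mine (with $c = 2$ as an arbitrary example) confirms that $Z = c(\rho X + \sqrt{1 - \rho^2}\,Y)$ has variance $c^2 \sigma^2$ and correlation $\rho$ with $X$:

```python
import numpy as np
import numpy.random as rand

size = int(1e6)
rho, c = 0.6, 2.0  # target correlation and an arbitrary scale factor

x = rand.rand(size) - 0.5  # uniform, variance 1/12
y = rand.rand(size) - 0.5
z = c * (rho * x + (1 - rho ** 2) ** 0.5 * y)

# Expect variance c^2 * (1/12) and correlation rho
print("variance:", np.var(z), "(expected:", c ** 2 / 12, ")")
print("correlation:", np.corrcoef(x, z)[0][1], "(expected:", rho, ")")
```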
[^2]: "Arbitrary" here means anywhere in the range $-1$ to $1$.
Now let's verify with Python 3 that the random numbers obtained this way are actually what we want. For now, let's use uniform random numbers for both $X$ and $Y$.
```python
import numpy as np
import numpy.random as rand
import matplotlib.pyplot as plt

size = int(1e3)  # Size of the vector
rho_in = 0.6  # Correlation coefficient between the two random variables

# Generate uniform random numbers with mean = 0
x = rand.rand(size) - 0.5
y = rand.rand(size) - 0.5

# Generate random numbers correlated with x
z = rho_in * x + (1 - rho_in ** 2) ** 0.5 * y

# Calculate stats
variance = np.var(z)
rho_out = np.corrcoef(x, z)[1][0]
print("variance: {}, correlation: {}".format(variance, rho_out))

# Plot results
fig, ax = plt.subplots()
ax.scatter(x, z, s=1)
ax.set_xlabel('X')
ax.set_ylabel('Z')
plt.show()
```
The result looks like this.

```
variance: 0.08332907524577293, correlation: 0.5952102586280674
```
Hmm? The variance looks odd. The correlation coefficient, on the other hand, seems close to the target ($0.6$).
That's as it should be. The generated $Z$ has the same variance as $X$. Here $X$ is a uniform random number on the interval $[-0.5, 0.5]$, whose [variance is $1/12$](http://www.geocities.jp/jun930/etc/varianceofrandom.html), so we get a value close to $1/12 = 0.0833\ldots$.
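The $1/12$ figure is easy to verify numerically; here is a one-off check of mine (not part of the original code):

```python
import numpy as np
import numpy.random as rand

# A uniform random number on [-0.5, 0.5] has variance
# (width)^2 / 12 = 1 / 12 ~ 0.0833
x = rand.rand(int(1e6)) - 0.5
print("sample variance:", np.var(x), " theory:", 1 / 12)
```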
The resulting $Z$ therefore has the variance of a uniform random number. If you don't like using a uniform random number for $Y$, you can change `y` to a normal random number:

```python
y = rand.randn(size) * (1 / 12) ** 0.5
```

Here the standard deviation of a uniform random number, $(1/12)^{1/2}$, is applied so that $X$ and $Y$ have equal variance.
The output is below.
```
variance: 0.0823649369923887, correlation: 0.5983078612793685
```
This also looks pretty good, so this is the version I used.
However, one caveat: the random numbers $Z$ created by these two methods are not uniform random numbers. If both $X$ and $Y$ are uniform random numbers, the histogram of $Z$ looks like a trapezoid, as below.
The same is true when using a normal random number for $Y$; in that case the shoulders of the trapezoid look slightly rounded.
This is because the sum of two random variables following (different) uniform distributions is generally not uniform; in statistical terms, the uniform distribution lacks the [reproductive property](https://ja.wikipedia.org/wiki/%E5%86%8D%E7%94%9F%E6%80%A7). There are also attempts to forcibly transform the result back into a uniform random number.
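To see this lack of reproductive property concretely, the following sketch of mine builds $Z$ from two uniform random numbers, as in the first example, and counts histogram bins. For a uniform result all bins would be roughly equal; instead, the middle bins are far denser than the edge bins, which is the trapezoid shape:

```python
import numpy as np
import numpy.random as rand

size = int(1e6)
rho = 0.6

x = rand.rand(size) - 0.5
y = rand.rand(size) - 0.5
z = rho * x + (1 - rho ** 2) ** 0.5 * y  # ranges over [-0.7, 0.7]

# Bin counts: flat in the middle, ramping down toward the edges
counts, _ = np.histogram(z, bins=14, range=(-0.7, 0.7))
print(counts)
```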
A distribution that does have the reproductive property is, for example, everyone's favorite: the normal distribution. So let's make $X$ and $Y$ normal random numbers! That gives the following version. The only things that have changed from the code above are `x` and `y`.
```python
import numpy as np
import numpy.random as rand
import matplotlib.pyplot as plt

size = int(1e4)  # Size of the vector
rho_in = 0.6  # Correlation coefficient between the two random variables

# Generate normal random numbers
x = rand.randn(size)
y = rand.randn(size)

# Generate random numbers correlated with x
z = rho_in * x + (1 - rho_in ** 2) ** 0.5 * y

# Calculate stats
variance = np.var(z)
rho_out = np.corrcoef(x, z)[1][0]
print("variance: {}, correlation: {}".format(variance, rho_out))

# Plot results
fig, ax = plt.subplots()
ax.scatter(x, z, s=1)
ax.set_xlabel('X')
ax.set_ylabel('Z')
plt.show()
```
The output looks like this.

```
variance: 0.9884508169450733, correlation: 0.5936089719053868
```
I aimed for a correlation of $0.6$ using normal random numbers with variance $1$, so this is pretty close! Of course, as long as the correlation coefficient stays within $[-1, 1]$, it can also be negative or exactly $1$[^3].
Here is the plot.
I'll also show a histogram. The dashed red line is the theoretical prediction: a normal distribution with mean zero and variance $1$. It matches nicely!
[^3]: By the way, if you specify a value outside $[-1, 1]$ for $\rho$, something strange happens. If you are interested, think about it or try it.
We have now seen how to create correlated random numbers, and why the construction works. If you are not used to it, the derivation may be confusing, but there is no harm in practicing how to compute the variance of a new random variable from the properties of variance (it is a simpler calculation than it looks). If you are about to learn statistics, following the calculations in this article is good exercise.
We also found that naively combining uniform random numbers yields an odd distribution. I used the ($X$: uniform, $Y$: normal) version because my experiment didn't care much about the shape of the distribution, but if the distribution matters to you, using normal random numbers throughout is probably the quickest option.
**Generation of $n$ correlated pseudo-random numbers (with Python sample) - Qiita** is the sequel, which explains how to create $n$ correlated random numbers.