When running statistical numerical experiments, you sometimes want to generate correlated random numbers. A quick search brings up a method based on the Cholesky decomposition. That approach handles any number of variables, but I only needed the $2$-variable case, so I'll explain it in a more hands-on way[^1].

[^1]: In the end, the result is the same as performing the Cholesky decomposition for the $2$-variable case.
Suppose you have independent random numbers $X$ and $Y$, each with mean zero and equal variance. Then a random number $Z$ that correlates with $X$ with correlation coefficient $\rho$ and has the same variance can be constructed as
Z = \rho X + \sqrt{1 - \rho^2} Y
You can construct it like this. In this article, I will explain why this works.
Let the independent random numbers $X$ and $Y$ have mean zero and the same variance $\sigma^2$. Since $X$ and $Y$ are independent, their covariance and correlation coefficient are zero.
Next, define the random number $Z$ that we want to correlate with $X$ as a superposition of $X$ and $Y$:

Z = a X + b Y

where $a$ and $b$ are constants.
The reason for this form is that the following extreme examples work out:

| $a$ | $b$ | $Z$ | Correlation with $X$ |
|---|---|---|---|
| $1$ | $0$ | $X$ | $1$ |
| $0$ | $1$ | $Y$ | $0$ |
| $-1$ | $0$ | $-X$ | $-1$ |

Looking at this table, it seems that an arbitrary correlation coefficient can be produced by adjusting the balance between $a$ and $b$[^2]. So the expression for $Z$ above is a promising candidate for a random number with an arbitrary correlation with $X$.
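These extreme examples are easy to check numerically. Below is a quick sketch of mine (not from the original article) that generates uniform $X$ and $Y$ and prints the sample correlation for each choice of $(a, b)$:

```python
import numpy as np
import numpy.random as rand

size = int(1e5)
x = rand.rand(size) - 0.5  # uniform random numbers with mean zero
y = rand.rand(size) - 0.5

# Extreme choices of (a, b) and the correlations they produce
for a, b in [(1, 0), (0, 1), (-1, 0)]:
    z = a * x + b * y
    print("a = {:2d}, b = {:2d} -> correlation = {:+.3f}"
          .format(a, b, np.corrcoef(x, z)[0][1]))
```

The first and third cases give correlations of exactly $\pm 1$, while the middle case hovers around zero with sampling noise.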
Now let's actually calculate the correlation coefficient $\rho$ between $X$ and $Z$. The correlation coefficient is
\rho = \frac{\mathrm{Cov}[X, Z]}{\sqrt{\mathrm{Var}[X]} \sqrt{\mathrm{Var}[Z]}}
which is how it can be computed. Here, $\mathrm{Cov}$ and $\mathrm{Var}$ denote the covariance and the variance, respectively. Let's start with the covariance.
\begin{align}
\mathrm{Cov}[X, Z] &= \mathrm{Cov}[X, a X + b Y] & (\because \text{definition of } Z)
\\
&= a \, \mathrm{Cov}[X, X] + b \, \mathrm{Cov}[X, Y] & (\because \text{bilinearity of } \mathrm{Cov})
\\
&= a \, \sigma^2 & (\because \mathrm{Cov}[X, X] = \mathrm{Var}[X] = \sigma^2,\ \mathrm{Cov}[X, Y] = 0)
\end{align}
The calculation may look fiddly, but if any step is unclear, you can reach it by going back to the definitions and working through it calmly. If that's too much hassle, you don't have to follow every step.
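If you'd rather confirm $\mathrm{Cov}[X, Z] = a\,\sigma^2$ numerically than follow the algebra, here is a small check of mine (the values of $a$ and $b$ are arbitrary choices, not from the article):

```python
import numpy as np
import numpy.random as rand

size = int(1e6)
a, b = 0.3, 0.7  # arbitrary constants for the check
sigma2 = 1 / 12  # variance of a uniform random number on [-0.5, 0.5]

x = rand.rand(size) - 0.5
y = rand.rand(size) - 0.5
z = a * x + b * y

# The sample covariance should be close to a * sigma^2
print("Cov[X, Z]   =", np.cov(x, z)[0][1])
print("a * sigma^2 =", a * sigma2)
```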
Similarly, for the variance of $Z$,
\begin{align}
\mathrm{Var}[Z] &= \mathrm{Var}[a X + b Y] & (\because \text{definition of } Z)
\\
&= a^2 \mathrm{Var}[X] + b^2 \mathrm{Var}[Y] & (\because \text{properties of } \mathrm{Var},\ X \text{ and } Y \text{ independent})
\\
&= a^2 \sigma^2 + b^2 \sigma^2 &
\end{align}
Therefore, the correlation coefficient that I wanted to calculate is
\begin{align}
\rho &= \frac{\mathrm{Cov}[X, Z]}{\sqrt{\mathrm{Var}[X]} \sqrt{\mathrm{Var}[Z]}}
\\
&= \frac{a \ \sigma^2}{\sqrt{\sigma^2} \sqrt{a^2 \sigma^2 + b^2 \sigma^2}}
\\
&= \frac{a}{\sqrt{a^2 + b^2}}
\end{align}
which is quite a clean expression. Solving this equation for $b$ gives
b = \frac{\sqrt{1 - \rho^2}}{\rho} a
If, just to tidy up the expression, you substitute $a = c \rho$ (where $c$ is a positive constant), you get
Z = c \left( \rho X + \sqrt{1 - \rho^2} Y \right)
This $c$ is a parameter that controls the variance of $Z$; in fact,
\begin{align}
\mathrm{Var}[Z] &= \mathrm{Var}\left[{c \left( \rho X + \sqrt{1 - \rho^2} Y \right)} \right]
\\
&= c^2 \left[ \rho^2 \sigma^2 + (1 - \rho^2) \sigma^2 \right]
\\
&= c^2 \sigma^2
\end{align}
so it has the effect of multiplying the variance by $c^2$ (that is, multiplying the standard deviation by $c$). Setting $c = 1$, the opening formula
Z = \rho X + \sqrt{1 - \rho^2} Y
is obtained.
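As a sanity check of the whole derivation, the following sketch of mine (with $c = 2$ as an arbitrary example) confirms that $Z = c(\rho X + \sqrt{1 - \rho^2}\,Y)$ has variance $c^2 \sigma^2$ and correlation $\rho$ with $X$:

```python
import numpy as np
import numpy.random as rand

size = int(1e6)
rho, c = 0.6, 2.0  # target correlation and an arbitrary scale factor

x = rand.rand(size) - 0.5  # uniform, variance 1/12
y = rand.rand(size) - 0.5
z = c * (rho * x + (1 - rho ** 2) ** 0.5 * y)

# Expect variance c^2 * (1/12) and correlation rho
print("variance:", np.var(z), "(expected:", c ** 2 / 12, ")")
print("correlation:", np.corrcoef(x, z)[0][1], "(expected:", rho, ")")
```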
[^2]: "Arbitrary" here means anywhere in the range $-1$ to $1$.
Now let's verify with Python 3 that the random numbers obtained this way are actually what we want. For now, let's use uniform random numbers for both $X$ and $Y$.
```python
import numpy as np
import numpy.random as rand
import matplotlib.pyplot as plt

size = int(1e3)  # Size of the vector
rho_in = 0.6  # Correlation coefficient between the two random variables

# Generate uniform random numbers with mean = 0
x = rand.rand(size) - 0.5
y = rand.rand(size) - 0.5

# Generate random numbers correlated with x
z = rho_in * x + (1 - rho_in ** 2) ** 0.5 * y

# Calculate stats
variance = np.var(z)
rho_out = np.corrcoef(x, z)[1][0]
print("variance: {}, correlation: {}".format(variance, rho_out))

# Plot results
fig, ax = plt.subplots()
ax.scatter(x, z, s=1)
ax.set_xlabel('X')
ax.set_ylabel('Z')
plt.show()
```
The result looks like this.

```
variance: 0.08332907524577293, correlation: 0.5952102586280674
```
Hmm? The variance looks odd. The correlation coefficient, on the other hand, seems close to the target ($0.6$).
That's as it should be. The generated $Z$ has the same variance as $X$. Here $X$ is a uniform random number on the interval $[-0.5, 0.5]$, whose [variance is $1/12$](http://www.geocities.jp/jun930/etc/varianceofrandom.html), so we get a value close to $1/12 = 0.0833\ldots$.
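The $1/12$ figure is easy to verify numerically; here is a one-off check of mine (not part of the original code):

```python
import numpy as np
import numpy.random as rand

# A uniform random number on [-0.5, 0.5] has variance
# (width)^2 / 12 = 1 / 12 ~ 0.0833
x = rand.rand(int(1e6)) - 0.5
print("sample variance:", np.var(x), " theory:", 1 / 12)
```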
The resulting $Z$ therefore has the variance of a uniform random number. If you don't like using a uniform random number for $Y$, you can change `y` to a normal random number:

```python
y = rand.randn(size) * (1 / 12) ** 0.5
```

Here the standard deviation of a uniform random number, $(1/12)^{1/2}$, is applied so that $X$ and $Y$ have equal variance.
The output is below.
```
variance: 0.0823649369923887, correlation: 0.5983078612793685
```
This also looks pretty good, so this is the version I used.
However, one caveat: the random numbers $Z$ created by these two methods are not uniform random numbers. If both $X$ and $Y$ are uniform random numbers, the histogram of $Z$ looks like a trapezoid, as below.
The same is true when using a normal random number for $Y$; in that case the shoulders of the trapezoid look slightly rounded.
This is because the sum of two random variables following (different) uniform distributions is generally not uniform; in statistical terms, the uniform distribution lacks the [reproductive property](https://ja.wikipedia.org/wiki/%E5%86%8D%E7%94%9F%E6%80%A7). There are also attempts to forcibly transform the result back into a uniform random number.
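To see this lack of reproductive property concretely, the following sketch of mine builds $Z$ from two uniform random numbers, as in the first example, and counts histogram bins. For a uniform result all bins would be roughly equal; instead, the middle bins are far denser than the edge bins, which is the trapezoid shape:

```python
import numpy as np
import numpy.random as rand

size = int(1e6)
rho = 0.6

x = rand.rand(size) - 0.5
y = rand.rand(size) - 0.5
z = rho * x + (1 - rho ** 2) ** 0.5 * y  # ranges over [-0.7, 0.7]

# Bin counts: flat in the middle, ramping down toward the edges
counts, _ = np.histogram(z, bins=14, range=(-0.7, 0.7))
print(counts)
```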
A distribution that does have the reproductive property is, for example, everyone's favorite: the normal distribution. So let's make $X$ and $Y$ normal random numbers! That gives the following version. The only things that have changed from the code above are `x` and `y`.
```python
import numpy as np
import numpy.random as rand
import matplotlib.pyplot as plt

size = int(1e4)  # Size of the vector
rho_in = 0.6  # Correlation coefficient between the two random variables

# Generate normal random numbers
x = rand.randn(size)
y = rand.randn(size)

# Generate random numbers correlated with x
z = rho_in * x + (1 - rho_in ** 2) ** 0.5 * y

# Calculate stats
variance = np.var(z)
rho_out = np.corrcoef(x, z)[1][0]
print("variance: {}, correlation: {}".format(variance, rho_out))

# Plot results
fig, ax = plt.subplots()
ax.scatter(x, z, s=1)
ax.set_xlabel('X')
ax.set_ylabel('Z')
plt.show()
```
The output looks like this.

```
variance: 0.9884508169450733, correlation: 0.5936089719053868
```
I aimed for a correlation of $0.6$ using normal random numbers with variance $1$, so this is pretty close! Of course, as long as the correlation coefficient stays within $[-1, 1]$, it can also be negative or exactly $1$[^3].
Here is the plot.
I'll also show a histogram. The dashed red line is the theoretical prediction: a normal distribution with mean zero and variance $1$. It matches nicely!
[^3]: By the way, if you specify a value outside $[-1, 1]$ for $\rho$, something strange happens. If you are interested, think about it or try it.
We have now seen how to create correlated random numbers, and why the construction works. If you are not used to it, the derivation may be confusing, but there is no harm in practicing how to compute the variance of a new random variable from the properties of variance (it is a simpler calculation than it looks). If you are about to learn statistics, following the calculations in this article is good exercise.
We also found that naively combining uniform random numbers yields an odd distribution. I used the ($X$: uniform, $Y$: normal) version because my experiment didn't care much about the shape of the distribution, but if the distribution matters to you, using normal random numbers throughout is probably the quickest option.
**Generation of $n$ correlated pseudo-random numbers (with Python sample) - Qiita** is the sequel, which explains how to create $n$ correlated random numbers.