TL;DR
I was doing distributed processing with numpy and pandas, and the variances they computed did not match. Why? I'm leaving this note because I got stuck on it.
Test with a simple, randomly generated matrix. The values match, but the variances really do not.
import numpy as np
import pandas as pd
X = np.random.randn(10, 10)
df = pd.DataFrame(data=X)
np.allclose(X, df.values)
# True
X_var = np.var(X, axis=1)
df_var = df.var(axis=1)
np.allclose(X_var, df_var.values)
# False
When I actually checked the documentation, numpy.var defaults to ddof = 0 (divide by N, the population variance), while pandas.DataFrame.var defaults to ddof = 1 (divide by N - 1, the unbiased sample variance).
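To make the effect of ddof concrete, here is a minimal sketch that computes both variances by hand for a single vector and compares them with the library defaults. The variable names (rng, x, sq_dev, pop_var, sample_var) and the fixed seed are my own choices for illustration, not part of the original test.

```python
import numpy as np
import pandas as pd

# Illustrative data; the seed is arbitrary and only makes the run repeatable.
rng = np.random.default_rng(0)
x = rng.standard_normal(10)
n = x.size

# Sum of squared deviations from the mean, shared by both definitions.
sq_dev = np.sum((x - x.mean()) ** 2)

pop_var = sq_dev / n            # divisor N     -> ddof = 0 (numpy default)
sample_var = sq_dev / (n - 1)   # divisor N - 1 -> ddof = 1 (pandas default)

numpy_default = np.var(x)
pandas_default = pd.Series(x).var()

print(np.isclose(pop_var, numpy_default))
print(np.isclose(sample_var, pandas_default))
```

The only difference between the two results is the divisor, so for small N (10 here) the gap is large enough for np.allclose to report a mismatch.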
If you pass the same ddof to both, the results match.
X_var_ddof1 = np.var(X, ddof=1, axis=1)
df_var_ddof1 = df.var(axis=1)
np.allclose(X_var_ddof1, df_var_ddof1.values)
# True
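The alignment also works in the other direction. The sketch below is an addition of mine, not part of the original test: it passes ddof=0 to pandas so that it follows the numpy default, and then checks that the two defaults differ only by the factor N / (N - 1). The names X_var_ddof0, df_var_ddof0, match_ddof0, rescaled and match_rescaled are hypothetical.

```python
import numpy as np
import pandas as pd

X = np.random.randn(10, 10)
df = pd.DataFrame(data=X)

# Align on ddof=0 instead: numpy uses its default, pandas is told explicitly.
X_var_ddof0 = np.var(X, axis=1)
df_var_ddof0 = df.var(axis=1, ddof=0)
match_ddof0 = np.allclose(X_var_ddof0, df_var_ddof0.values)

# The two defaults differ only by a constant factor N / (N - 1),
# where N is the number of values in each row.
n = X.shape[1]
rescaled = X_var_ddof0 * n / (n - 1)
match_rescaled = np.allclose(rescaled, df.var(axis=1).values)

print(match_ddof0, match_rescaled)
```

Whichever direction you choose, writing ddof explicitly on both sides keeps the comparison independent of either library's default.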
I was puzzled that the calculation results did not match, but it turned out there is a subtle difference in the default ddof between numpy and pandas. I wish they were unified, but I'm publishing this memo in case someone else gets stuck on the same thing.