Differences between numpy and pandas methods for finding variance

TL;DR

I was computing variance with both numpy and pandas, and the results didn't match. Why? I'm leaving a note here because it tripped me up.

With default arguments, the var methods in numpy and pandas give different results

Let's test with a simple, randomly generated matrix. The values are identical, but the variances really don't match.

import numpy as np
import pandas as pd

X = np.random.randn(10, 10)
df = pd.DataFrame(data=X)

np.allclose(X, df.values)
# True

X_var = np.var(X, axis=1)
df_var = df.var(axis=1)

np.allclose(X_var, df_var.values)
# False

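When I actually checked the documentation, numpy.var defaults to ddof=0, while pandas.DataFrame.var defaults to ddof=1. With N values, the sum of squared deviations is divided by N - ddof, so numpy returns the population variance (divide by N) and pandas returns the unbiased sample variance (divide by N - 1).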
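To make the effect of ddof concrete, here is a minimal sketch that computes both variances by hand for a tiny array and compares them with the library defaults. The array `data` and the variable names are my own illustrative choices, not part of the original test.

```python
import numpy as np
import pandas as pd

# Small hand-checkable sample (illustrative values, chosen for easy arithmetic)
data = np.array([1.0, 2.0, 3.0, 4.0])
n = data.size

# Sum of squared deviations from the mean: 2.25 + 0.25 + 0.25 + 2.25 = 5.0
sq_dev = np.sum((data - data.mean()) ** 2)

pop_var = sq_dev / n           # ddof=0: divide by N     -> 1.25
sample_var = sq_dev / (n - 1)  # ddof=1: divide by N - 1 -> 1.666...

# numpy's default matches the population variance,
# pandas' default matches the sample variance
numpy_default = np.var(data)
pandas_default = pd.Series(data).var()

print(np.isclose(numpy_default, pop_var))      # True
print(np.isclose(pandas_default, sample_var))  # True
```

For small N the gap is large (1.25 vs. about 1.67 here); it shrinks as N grows, which is why the mismatch can be easy to miss on big datasets.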

If you set ddof to the same value on both sides, the results match.

X_var_ddof1 = np.var(X, ddof=1, axis=1)
df_var_ddof1 = df.var(axis=1)  # pandas already defaults to ddof=1

np.allclose(X_var_ddof1, df_var_ddof1.values)
# True
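The alignment also works in the other direction. As a rough sketch, assuming you want the population variance instead, you can leave numpy at its default and pass ddof=0 to pandas. The variable names here are just illustrative.

```python
import numpy as np
import pandas as pd

X = np.random.randn(10, 10)
df = pd.DataFrame(data=X)

# Keep numpy's default (ddof=0) and tell pandas to use ddof=0 as well
X_var_ddof0 = np.var(X, axis=1)
df_var_ddof0 = df.var(axis=1, ddof=0)

match = np.allclose(X_var_ddof0, df_var_ddof0.values)
print(match)
# True
```

Whichever side you change, passing ddof explicitly in both calls avoids depending on either library's default.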

I assumed the two calculations would give the same result, but in fact numpy and pandas have slightly different defaults. I wish they were unified, but I'm publishing this memo in case someone else gets stuck on the same thing.
