When you start a data analysis, it is a good idea to first check [summary statistics](https://ja.wikipedia.org/wiki/%E8%A6%81%E7%B4%84%E7%B5%B1%E8%A8%88%E9%87%8F) such as the mean and variance of the data. However, checking summary statistics alone is sometimes not enough.
For example, consider the following data.[^1]
import pandas as pd
import seaborn as sns
# Load the data
df = pd.read_csv('https://git.io/vD7ui')
# Scatter plot of each dataset, without a regression line
sns.lmplot(x='x', y='y', col='data', hue='data', col_wrap=2, fit_reg=False, data=df)
The scatter plots show that the four datasets are clearly different, yet their means and standard deviations are almost identical.
# Mean
df.groupby('data').mean()
data | x | y |
---|---|---|
0 | 9 | 7.500909 |
1 | 9 | 7.500909 |
2 | 9 | 7.500000 |
3 | 9 | 7.500909 |
# Standard deviation
df.groupby('data').std()
data | x | y |
---|---|---|
0 | 3.316625 | 2.031568 |
1 | 3.316625 | 2.031657 |
2 | 3.316625 | 2.030424 |
3 | 3.316625 | 2.030579 |
The values differ slightly in the later decimal places, but they are almost the same.
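To put a number on "almost the same", here is a minimal sketch that measures how far the means and standard deviations spread across the four datasets. It assumes the CSV holds the standard values of Anscombe's quartet, typed in by hand here so the snippet runs without a download. The names `quartet`, `df_local`, `stats`, and `spread` are illustrative and do not come from the original code.

```python
import pandas as pd

# Anscombe's quartet entered by hand (assumption: the CSV contains these values).
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    0: (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    1: (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    2: (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    3: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Build one long table with a 'data' column, mirroring the layout of df.
frames = [pd.DataFrame({'data': label, 'x': xs, 'y': ys})
          for label, (xs, ys) in quartet.items()]
df_local = pd.concat(frames, ignore_index=True)

# Mean and standard deviation per dataset, side by side.
stats = df_local.groupby('data')[['x', 'y']].agg(['mean', 'std'])

# Largest minus smallest value of each statistic across the four datasets.
spread = stats.max() - stats.min()
print(spread)
```

If the assumption about the data holds, every entry of `spread` should be far smaller than the statistics themselves, which is what the tables above suggest.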
The fitted regression lines are also virtually identical.
# Scatter plot with regression line
sns.lmplot(x='x', y='y', col='data', hue='data', col_wrap=2, fit_reg=True, data=df)
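The plot shows the lines only visually. As a rough check, the slope and intercept of each line can also be computed directly. The sketch below uses `np.polyfit` for an ordinary least-squares fit, which is a stand-in for whatever seaborn does internally. It again assumes the CSV holds the standard Anscombe values, entered by hand, and `quartet` and `fits` are illustrative names.

```python
import numpy as np

# Anscombe's quartet entered by hand (assumption: the CSV contains these values).
x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    0: (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    1: (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    2: (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    3: ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
        [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

fits = {}
for label, (xs, ys) in quartet.items():
    # A degree-1 polynomial fit returns [slope, intercept].
    slope, intercept = np.polyfit(xs, ys, deg=1)
    fits[label] = (slope, intercept)
    print(f"data {label}: y = {intercept:.3f} + {slope:.3f} * x")
```

Under that assumption, all four fits should come out close to y = 3 + 0.5x, differing only in the later decimal places.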
In pandas, the `describe` method displays these summary statistics all at once.
# Summary statistics
df.groupby('data').describe()
data | statistic | x | y |
---|---|---|---|
0 | count | 11.000000 | 11.000000 |
0 | mean | 9.000000 | 7.500909 |
0 | std | 3.316625 | 2.031568 |
0 | min | 4.000000 | 4.260000 |
0 | 25% | 6.500000 | 6.315000 |
0 | 50% | 9.000000 | 7.580000 |
0 | 75% | 11.500000 | 8.570000 |
0 | max | 14.000000 | 10.840000 |
1 | count | 11.000000 | 11.000000 |
1 | mean | 9.000000 | 7.500909 |
1 | std | 3.316625 | 2.031657 |
1 | min | 4.000000 | 3.100000 |
1 | 25% | 6.500000 | 6.695000 |
1 | 50% | 9.000000 | 8.140000 |
1 | 75% | 11.500000 | 8.950000 |
1 | max | 14.000000 | 9.260000 |
2 | count | 11.000000 | 11.000000 |
2 | mean | 9.000000 | 7.500000 |
2 | std | 3.316625 | 2.030424 |
2 | min | 4.000000 | 5.390000 |
2 | 25% | 6.500000 | 6.250000 |
2 | 50% | 9.000000 | 7.110000 |
2 | 75% | 11.500000 | 7.980000 |
2 | max | 14.000000 | 12.740000 |
3 | count | 11.000000 | 11.000000 |
3 | mean | 9.000000 | 7.500909 |
3 | std | 3.316625 | 2.030579 |
3 | min | 8.000000 | 5.250000 |
3 | 25% | 8.000000 | 6.170000 |
3 | 50% | 8.000000 | 7.040000 |
3 | 75% | 8.000000 | 8.190000 |
3 | max | 19.000000 | 12.500000 |
The means and standard deviations match what we saw earlier, but the quartiles differ slightly. Dataset 3 in particular stands out: its minimum, 25%, 50%, and 75% values of x are all 8, while its maximum is 19.
Datasets like these, which have different scatter plots but nearly the same summary statistics and regression line, are known as Anscombe's example (also called Anscombe's quartet). This is why it is important to draw a scatter plot in addition to checking the statistics.
However, real data is rarely two-dimensional. In that case you need some extra step, such as using [Principal Component Analysis (PCA)](https://ja.wikipedia.org/wiki/%E4%B8%BB%E6%88%90%E5%88%86%E5%88%86%E6%9E%90) to reduce the data to two dimensions before visualizing it.
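As a rough illustration of that idea, the sketch below projects higher-dimensional data onto its first two principal components using only NumPy's SVD. It is not taken from the original article: the function `pca_2d` and the random `samples` array are hypothetical stand-ins for your own data, and a library implementation such as scikit-learn's PCA could be used instead.

```python
import numpy as np


def pca_2d(data):
    """Project the rows of `data` onto the first two principal components."""
    # Center each column so the components describe variance around the mean.
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


# Hypothetical 5-dimensional data: 100 samples drawn from a normal distribution.
rng = np.random.default_rng(0)
samples = rng.normal(size=(100, 5))

projected = pca_2d(samples)
print(projected.shape)
```

The two columns of `projected` can then be used as the x and y values of an ordinary scatter plot.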
[^1]: Rows with the same value in the data column belong to the same dataset.