An article on statistics was posted on President Online.

Is there a "correlation" between breakfast and work hours and business performance? http://president.jp/articles/-/12416

The article above uses no mathematical formulas, so it is easy to follow, and its explanations are detailed, which makes it a good starting point for statistics. However, it assumes that the calculations are done by hand in Excel, which is a bit of a hassle.

So I would like to solve the same problem with Python, which I have been using so far.

Problem and its solution

The problem is to investigate, for a group of employees, whether three variables are correlated: the probability of having eaten breakfast (= breakfast rate), the time of arrival at work, and business performance. Examining the correlation between variables in this way is the basis of many statistical methods.

Let's call the variables X, Y, and Z so that they are easy to handle on a computer. First, I prepared the data as a CSV file.
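As a rough sketch of the layout this assumes, the CSV has one row per employee, three comma-separated columns, and no header row. The rows below are made-up placeholder values, not the article's data, and `csv_text` and `sample` are names introduced only for this illustration.

```python
import io

import pandas as pd

# Hypothetical rows for illustration only; these are NOT the article's data.
# Each row is one employee: breakfast rate (X), arrival time (Y), performance (Z).
csv_text = """80,-10,105
20,0,95
50,-5,100
"""

# names= supplies the column labels because the file has no header row
sample = pd.read_csv(io.StringIO(csv_text), names=["X", "Y", "Z"])
print(sample)
```

Passing `names=` matters here: without it, pandas would treat the first data row as the header.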

Calculate basic statistics

First, find the statistics that appear on page 2 of the article. Read the data above and compute basic statistics such as the mean and standard deviation. With pandas this takes only a couple of lines.

import pandas as pd

# The CSV has no header row, so name the columns explicitly
data = pd.read_csv("data.csv", names=['X', 'Y', 'Z'])
data.describe()
# =>
#                 X          Y           Z
# count    7.000000   7.000000    7.000000
# mean    42.571429  -8.571429   98.714286
# std     42.968427  14.920424    8.440266
# min      0.000000 -40.000000   88.000000
# 25%      5.000000 -10.000000   92.000000
# 50%     33.000000  -5.000000  100.000000
# 75%     77.500000   0.000000  104.500000
# max    100.000000   5.000000  110.000000
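One detail worth knowing when comparing with hand calculations: the std row of describe() is the sample standard deviation (divisor n - 1), not the population one (divisor n). The following is a small sketch of the difference using only the standard library; the values are made up for illustration and are not the article's data.

```python
import statistics

# Made-up values for illustration only
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)
sample_std = statistics.stdev(values)       # divisor n - 1, like the std row of describe()
population_std = statistics.pstdev(values)  # divisor n

print(mean, round(sample_std, 4), population_std)
#=> 5 2.1381 2.0
```

If your Excel results differ slightly from pandas, checking which of the two definitions was used is a reasonable first step.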

Draw a scatterplot matrix

The original article draws scatter plots to examine the correlations. Let's do the same in Python. To look at the correlations among all the variables at once, a scatter plot matrix is the quickest way.

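import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix  # pandas.tools.plotting in very old pandas versions

# scatter_matrix creates its own figure, so no plt.figure() call is needed
scatter_matrix(data)
plt.savefig("image.png")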

Find the correlation coefficient

The correlation coefficient is obtained by dividing the covariance of two variables by the product of their standard deviations, but with pandas it can be computed with a single method call.

data.corr()
#=>
#           X         Y         Z
# X  1.000000  0.300076  0.550160
# Y  0.300076  1.000000 -0.545455
# Z  0.550160 -0.545455  1.000000

This reproduces the correlation matrix on page 5 of the article in one step. As a general guideline, a correlation is considered strong when its absolute value is 0.7 or more, so, as the original article says, the correlations here are borderline.
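To make the definition concrete, here is a minimal sketch of the calculation written out by hand: covariance divided by the product of the two standard deviations. The helper name `pearson_r` and the sample values are assumptions for illustration; they are not the article's data.

```python
import math


def pearson_r(a, b):
    """Pearson correlation: covariance / (std of a * std of b)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Sample covariance and sample standard deviations (divisor n - 1).
    # The divisor cancels in the ratio, so using n would give the same r.
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b)) / (n - 1)
    std_a = math.sqrt(sum((p - mean_a) ** 2 for p in a) / (n - 1))
    std_b = math.sqrt(sum((q - mean_b) ** 2 for q in b) / (n - 1))
    return cov / (std_a * std_b)


# Made-up values for illustration only
xs = [1, 2, 3, 4, 5]
zs = [2, 4, 5, 4, 5]
r = pearson_r(xs, zs)
print(round(r, 4))
#=> 0.7746
```

Applied to two columns of the real data, this should agree with the corresponding entry of the correlation matrix.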

Perform regression analysis

Finally, find the regression equations that appear at the end of page 4 of the article. They can be obtained by simple regression analysis using [scipy.stats.linregress](http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.linregress.html), one of SciPy's statistical functions.

from scipy import stats

# Extract each column as a NumPy array (.ix was removed from pandas; use .iloc)
x = data.iloc[:, 0].values
y = data.iloc[:, 1].values
z = data.iloc[:, 2].values

# Regression equation for X and Z
slope, intercept, r_value, p_value, std_err = stats.linregress(x, z)
print(slope, intercept, r_value)
#=> 0.108067677706 94.113690292 0.550160142939

# Regression equation for Y and Z
slope, intercept, r_value, p_value, std_err = stats.linregress(y, z)
print(slope, intercept, r_value)
#=> -0.308556149733 96.0695187166 -0.545455364632

Here, slope is the slope of the regression line, intercept is its intercept, and r_value is the correlation coefficient. Taking the slope as a and the intercept as b gives the linear equation y = ax + b.

For example, the regression of Z on X gives the equation y = 0.11x + 94.11 (rounded to two decimal places), where x is X and y is Z.
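As a cross-check, in simple linear regression the slope equals the correlation coefficient times the ratio of the standard deviations, and the intercept follows from the means. The sketch below plugs in the rounded values from the describe() and corr() output shown earlier; the `predict` helper and the input 50 are only illustrative assumptions.

```python
# Summary statistics copied from the describe() and corr() output above
mean_x, std_x = 42.571429, 42.968427
mean_z, std_z = 98.714286, 8.440266
r_xz = 0.550160

# slope = r * (std of Z / std of X); intercept = mean of Z - slope * mean of X
slope = r_xz * std_z / std_x
intercept = mean_z - slope * mean_x
print(round(slope, 4), round(intercept, 2))
#=> 0.1081 94.11


def predict(x_value):
    """Value of Z on the regression line for a given X (hypothetical helper)."""
    return slope * x_value + intercept


print(round(predict(50), 2))
#=> 99.52
```

The results match the linregress output up to the rounding of the inputs, which is a quick sanity check that the two views of the data agree.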

Summary

Using Python made the statistical analysis even easier than in Excel. Examining the correlation between two variables is one of the basics of statistics and comes up often in real problems. Once you get used to it, you will be able to perform these analyses in a very short time.

Try to calculate a statistical problem in Python