An article on statistics was posted on President Online.
Is there a "correlation" between breakfast and work hours and business performance? http://president.jp/articles/-/12416
The article above avoids mathematical formulas entirely and explains each step in detail, so it is easy to follow and a good introduction to statistics. However, it assumes that the calculations are done by hand in Excel, which is a bit of a hassle.
So I would like to work through the same problems in Python, which I have been using so far.
The problem is to investigate whether three variables measured for each employee are correlated: the probability of having eaten breakfast (the breakfast rate), the time of arrival at work, and business performance. Examining the correlation between variables in this way is the basis of many statistical methods.
Let's call the variables X, Y, and Z so that they are easy to handle on a computer. First, I prepared the data as a CSV file.
Start with the statistics that appear on page 2 of the article. Read the CSV data and compute basic statistics such as the mean and standard deviation. This is easy with pandas and takes only a few seconds.
import pandas as pd
# The CSV has no header row, so name the columns explicitly
data = pd.read_csv("data.csv", names=['X', 'Y', 'Z'])
data.describe()
# =>
# X Y Z
# count 7.000000 7.000000 7.000000
# mean 42.571429 -8.571429 98.714286
# std 42.968427 14.920424 8.440266
# min 0.000000 -40.000000 88.000000
# 25% 5.000000 -10.000000 92.000000
# 50% 33.000000 -5.000000 100.000000
# 75% 77.500000 0.000000 104.500000
# max 100.000000 5.000000 110.000000
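One detail worth knowing when comparing these numbers with Excel: the std row from describe() is the sample standard deviation (divided by n - 1), which is the pandas default. The following is a minimal sketch of the difference using only the standard library. The values are hypothetical and chosen for easy arithmetic; they are not the article's data.

```python
import statistics

# Hypothetical values for illustration only (not the article's data)
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.mean(values)

# Sample standard deviation: divides by n - 1 (what pandas reports by default)
sample_std = statistics.stdev(values)

# Population standard deviation: divides by n
population_std = statistics.pstdev(values)

print(mean, sample_std, population_std)
```

With only seven employees in the data, the choice of denominator changes the result noticeably, so it is worth checking which one a tool uses before comparing outputs.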
The original article draws scatter plots to examine the correlation. Let's do the same in Python. To see the relationships between all the variables at once, the quickest way is to draw a scatter plot matrix.
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix
# scatter_matrix creates its own figure, so no separate plt.figure() is needed
scatter_matrix(data)
plt.savefig("image.png")
The correlation coefficient is the covariance of two variables divided by the product of their standard deviations, but with pandas it can be obtained with a single method call.
data.corr()
#=>
# X Y Z
# X 1.000000 0.300076 0.550160
# Y 0.300076 1.000000 -0.545455
# Z 0.550160 -0.545455 1.000000
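To make the definition concrete, here is a minimal sketch that computes the correlation coefficient by hand as covariance divided by the product of the standard deviations. The function name pearson_r and the sample values are my own for illustration; they are not from the article or from pandas.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient: covariance / (std of xs * std of ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sample covariance and sample standard deviations (n - 1 denominator)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (std_x * std_y)

# Hypothetical values for illustration only (not the article's data)
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]
r = pearson_r(xs, ys)
print(round(r, 4))
```

The n - 1 factors cancel between the numerator and the denominator, so the same value comes out whether the sample or the population versions are used.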
This reproduces the correlation matrix on page 5 in one shot. As a general guideline, an absolute value of 0.7 or more is said to indicate a strong correlation, so with coefficients of about 0.55 and -0.55 against Z, the correlation here is borderline, as the original article describes.
Finally, find the regression equation that appears at the end of page 4. It can be obtained by simple regression analysis using [scipy.stats.linregress](http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.linregress.html), one of SciPy's statistical functions.
from scipy import stats
# Extract each column as a NumPy array
x = data.iloc[:, 0].values
y = data.iloc[:, 1].values
z = data.iloc[:, 2].values
# Regression equation for X and Z
slope, intercept, r_value, p_value, std_err = stats.linregress(x, z)
print(slope, intercept, r_value)
#=> 0.108067677706 94.113690292 0.550160142939
# Regression equation for Y and Z
slope, intercept, r_value, p_value, std_err = stats.linregress(y, z)
print(slope, intercept, r_value)
#=> -0.308556149733 96.0695187166 -0.545455364632
Here slope is the slope, intercept is the intercept, and r_value is the correlation coefficient. With the slope as a and the intercept as b, this gives the linear equation y = ax + b.
For example, the regression line for X and Z is z = 0.11x + 94.11 (rounded to two decimal places).
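As a rough illustration of how the fitted line might be used, the sketch below plugs a breakfast rate into the X and Z equation to get a predicted performance value. The coefficients are copied from the linregress output above; the function name predict_z is hypothetical.

```python
# Coefficients copied from the linregress output for X and Z
SLOPE_XZ = 0.108067677706
INTERCEPT_XZ = 94.113690292

def predict_z(x):
    """Return the performance Z predicted by the fitted line for breakfast rate X."""
    return SLOPE_XZ * x + INTERCEPT_XZ

for rate in (0, 50, 100):
    print(rate, round(predict_z(rate), 2))
# 0 94.11
# 50 99.52
# 100 104.92
```

With only seven data points and a correlation of about 0.55, these predictions should be read as a rough tendency rather than a reliable forecast.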
Using Python made the statistical analysis even easier than in Excel. Examining the correlation between two variables is one of the basics of statistics and is often applied to real problems, so once you get used to it, you will be able to perform these analyses in a very short time.