Data Scientist Training Course Chapter 3 Day 3

I don't have much time, but I will proceed little by little.

Yesterday I got as far as correlation. Today picks up with Pearson.

Pearsonr

sp.stats.pearsonr(student_data_math.G1, student_data_math.G3)
(0.8014679320174141, 9.001430312276602e-90)

The first value in the result, 0.801, is the correlation coefficient. The closer it is to 1, the stronger the positive linear correlation between the two variables.

So what is the second value, the 9.001...? I checked the reference.

Returns
r : float
    Pearson's correlation coefficient
p-value : float
    2-tailed p-value

The p-value roughly indicates the probability of an uncorrelated system producing datasets that have a Pearson correlation at least as extreme as the one computed from these datasets.

scipy.stats.pearsonr

I'm still not sure, so I'll fall back on an article written in Japanese.

Python: Check the correlation of features with SciPy

According to this article, the p-value is the "significance probability", so I looked into that term further.

Significance probability: the value used to decide whether to reject the null hypothesis and adopt the alternative hypothesis in a statistical hypothesis test. It is compared against a threshold called the significance level, for which 5% and 1% are generally used.

Hmm. Is that really Japanese? It's still hard to follow, but as I understand it, if the significance probability is less than 5%, the obtained correlation coefficient is unlikely to be a product of chance and can be trusted. Here the p-value is about 9.0e-90, far below 5%. I'm not confident that my understanding is correct.
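To check my reading of the p-value, here is a minimal sketch of the "compare it with 5%" rule. The scores are made up for illustration (they are not the student data), and the names g1, g3 and ALPHA are my own.

```python
import numpy as np
from scipy import stats

# Made-up scores for illustration only (NOT the course's student data).
g1 = np.array([5, 7, 8, 10, 11, 12, 13, 15, 16, 18], dtype=float)
g3 = np.array([4, 8, 7, 10, 12, 11, 14, 15, 17, 18], dtype=float)

# pearsonr returns the correlation coefficient and the two-tailed p-value.
r, p_value = stats.pearsonr(g1, g3)

ALPHA = 0.05  # significance level; 5% is a common choice, 1% is stricter
is_significant = p_value < ALPHA

print(f"r = {r:.3f}, p-value = {p_value:.3g}, significant at 5%: {is_significant}")
```

If the p-value is below the chosen level, a correlation this strong would rarely show up in uncorrelated data, which is the sense in which the coefficient "can be trusted".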

However, note that pearsonr only measures linear correlation, so it is not useful when the relationship is non-linear. pearsonr is not always the right tool. Perhaps that will come up in later chapters.
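A small sketch of that limitation, using made-up data: y_curved depends completely on x, but through a symmetric curve, so the Pearson coefficient comes out close to zero. The variable names are my own.

```python
import numpy as np
from scipy import stats

# Made-up data: x is symmetric around 0.
x = np.linspace(-3, 3, 61)
y_linear = 2 * x + 1   # perfectly linear in x
y_curved = x ** 2      # fully determined by x, but not linear

r_linear, _ = stats.pearsonr(x, y_linear)
r_curved, _ = stats.pearsonr(x, y_curved)

print(f"linear relationship: r = {r_linear:.3f}")
print(f"curved relationship: r = {r_curved:.3f}")
```

So a coefficient near zero only says there is no linear relationship, not that the variables are unrelated.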

PairPlot

The syntax is as follows

seaborn.pairplot( DataFrame )

This graphically displays the relationships between the numeric columns in the DataFrame. In the example, 4 columns of the DataFrame are displayed.

A histogram is displayed on the diagonal, where a variable meets itself, and a scatter plot of each pair of variables is displayed everywhere else, so that the correlation can be seen.

When I tried pairplot on the example DataFrame without any preprocessing, the resulting figure was too big to capture properly. Incidentally, the following was enough to save the displayed figure to a file:

plot = sns.pairplot(DataFrame)
plot.savefig("output.png")

When I looked up how to do this, I found advice to call get_figure() and then savefig, but that seems to be the approach for older versions, and now it raises an error.

Simple regression analysis

The details come in later chapters, so for now I just want to understand the terminology.

Objective variable: the numerical value (variable) that we want to obtain
Explanatory variable: a variable used to obtain, that is to explain, the objective variable

Simple regression analysis seems to work by assuming an equation in which the objective variable is expressed by a single explanatory variable, and then solving for that equation.

To proceed with these, we will use sklearn.
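To get a feel for the terms, here is a minimal sketch with sklearn's LinearRegression. The data is made up so that the objective variable is exactly 3 times the explanatory variable plus 2. This is my own toy example, not the course's method or data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data for illustration only.
# Explanatory variable: sklearn expects a 2D array (rows = samples).
x = np.arange(10, dtype=float).reshape(-1, 1)
# Objective variable: exactly y = 3x + 2, so the fit should recover 3 and 2.
y = 3.0 * x.ravel() + 2.0

model = LinearRegression()
model.fit(x, y)

slope = model.coef_[0]        # coefficient of the single explanatory variable
intercept = model.intercept_  # constant term of the equation
predicted = model.predict(np.array([[10.0]]))[0]

print(f"y = {slope:.2f} * x + {intercept:.2f}")
print(f"prediction for x = 10: {predicted:.2f}")
```

With a single explanatory variable the assumed equation is a straight line, and fitting means finding its slope and intercept.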

I've got a rough idea, but I'll go over the whole thing again tomorrow. It's slow going, but it can't be helped.