Hello. This is Hayashi @ Ienter.
Last time on this blog, I covered the mathematical approach to regression analysis.
Recently, Python has been used more and more for statistical analysis and machine learning. I think one reason is that Python has powerful libraries for numerical computation and data visualization.
This time, I will introduce "scikit-learn", a very powerful Python library that is often used in machine learning. Let's use it to run a regression analysis on the sample data from the previous post.
Install the Python distribution "Anaconda". It installs Python itself together with libraries commonly used in science, technology, mathematics, and data analysis, all at once. Packages are available for Windows, macOS, and Linux. This time I will install the Python 3.5 version.
Installing Anaconda also installs a tool called "Jupyter Notebook". We will use it for the rest of this post.
By the way, "Jupyter Notebook" takes "IPython", which runs Python interactively on the command line, and extends it to the browser.
First, create a working directory "jupyter_work" and start "Jupyter Notebook" from there.
$ mkdir jupyter_work
$ cd jupyter_work
$ jupyter notebook
The browser will start and you will see a screen like the one below.
To create a new Notebook, select "Python [Root]" from the "New" drop-down menu on the right side of the screen.
Then, the following interactive input screen will be displayed. Now you are ready to code.
Enter Python code in the "In []:" input field. You can enter multiple lines by pressing the Return key.
To execute the input code, click the ![Screenshot 2016-08-01 13.35.34.png](https://qiita-image-store.s3.amazonaws.com/0/134453/3f3fb05b-7f55-aa8a-3a12-25c3158f37e4.png) button on the toolbar, or hold down the Shift key and press the Return key.
First, let's run a regression analysis on the simple data from the previous post. The data was as follows. The regression line obtained there was y = 1.4x + 2.0 (slope 1.4, intercept 2.0). This time, we will check that result with scikit-learn.
First, import the required libraries.
numpy is a numerical calculation library. matplotlib is a library for drawing graphs, and pyplot provides a procedural interface to matplotlib's object-oriented API. pandas is a library that supports spreadsheet-style data analysis; this time we will use its two-dimensional table type, DataFrame (something like an Excel sheet). sklearn is the machine learning library, and from it we use linear_model, the module that provides linear regression models. The last line, "%matplotlib inline", is a command that makes graphs appear inside the browser.
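The imports described above might look like the following. This is a sketch reconstructed from the description, not necessarily the exact cell from the original post; the aliases np, plt, and pd are the common conventions and are assumed here.

```python
import numpy as np                # numerical calculation
import matplotlib.pyplot as plt   # procedural plotting interface
import pandas as pd               # DataFrame and data analysis helpers
from sklearn import linear_model  # linear regression models

# In a Jupyter Notebook cell, also run the magic command below so that
# graphs are drawn in the browser. It is IPython syntax, not valid in a
# plain Python script, so it is left commented out here.
# %matplotlib inline
```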
Prepare the X and Y data from the previous post as DataFrames. In statistical terms, X is the explanatory variable and Y is the objective (response) variable.
Next, create an instance of the linear regression model and train it with the fit method.
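As a rough illustration, the data could be set up like this. The numbers below are placeholders, not the data from the previous post: they are made-up values in the range 0 to 5, constructed so that their least-squares line is y = 1.4x + 2.0.

```python
import pandas as pd

# Placeholder values (NOT the data from the previous post), constructed so
# that the least-squares line through them is y = 1.4x + 2.0.
X = pd.DataFrame([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])  # explanatory variable
Y = pd.DataFrame([2.5, 2.4, 5.3, 6.2, 7.6, 9.0])  # objective variable

# Each DataFrame has one row per sample and a single column.
print(X.shape, Y.shape)
```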
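A minimal sketch of this training step is shown below, again using placeholder X and Y values (not the data from the previous post).

```python
import pandas as pd
from sklearn import linear_model

# Placeholder data (NOT the data from the previous post).
X = pd.DataFrame([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
Y = pd.DataFrame([2.5, 2.4, 5.3, 6.2, 7.6, 9.0])

# Create the linear regression model and train it on X and Y.
model = linear_model.LinearRegression()
model.fit(X, Y)
```

After fit returns, the learned parameters are available as attributes of model.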
In addition, prepare the input data for prediction (px). px is an array that runs from the minimum value (0) to the maximum value (5) of the X data in steps of 0.01.
However, the predict method of the linear_model model expects its input as a two-dimensional array like the following, so px has to be passed in that shape.
[[0.00], [0.01], [0.02], [0.03], ...]
Here, [:, np.newaxis] is used to convert a one-dimensional array to a two-dimensional array.
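One way to build such an array is sketched here. The bounds 0 and 5 are written as literals for simplicity; note that np.arange does not include the stop value itself.

```python
import numpy as np

# 1-D array: 0.00, 0.01, 0.02, ... (the stop value 5 is excluded)
px_1d = np.arange(0, 5, 0.01)

# [:, np.newaxis] adds a second axis, turning shape (n,) into shape (n, 1),
# i.e. one row per sample with a single feature column.
px = px_1d[:, np.newaxis]

print(px_1d.ndim, px.ndim)
```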
Pass the prediction input px to the predict method and store the result in py. Then plot the data: plt.scatter() draws the original X and Y as red dots, and plt.plot() draws the predicted line through px and py in blue.
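Putting the prediction and plotting steps together might look like the sketch below. X and Y are placeholder values rather than the data from the previous post. In a notebook with %matplotlib inline the figure appears under the cell; in a plain script you would call plt.show() afterwards.

```python
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import linear_model

# Placeholder data (NOT the data from the previous post).
X = pd.DataFrame([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
Y = pd.DataFrame([2.5, 2.4, 5.3, 6.2, 7.6, 9.0])

model = linear_model.LinearRegression()
model.fit(X, Y)

# Prediction input as a 2-D column, then the predicted values.
px = np.arange(0, 5, 0.01)[:, np.newaxis]
py = model.predict(px)

# Original data as red dots (column 0 of each DataFrame),
# predicted line in blue.
plt.scatter(X[0], Y[0], color="r")
plt.plot(px, py, color="b")
```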
Check that the slope a of the drawn line and its Y-axis intercept b are the expected values of 1.4 and 2.0, respectively. These values are stored in model.coef_ and model.intercept_.
You can confirm that the expected values are output.
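The check could be written as follows. The placeholder data here is not the data from the previous post; it was constructed so that the fitted line is y = 1.4x + 2.0, which lets the same check be reproduced. Because Y is passed as a one-column DataFrame, coef_ and intercept_ come back as arrays rather than plain numbers.

```python
import pandas as pd
from sklearn import linear_model

# Placeholder data (NOT the data from the previous post), constructed so
# that the least-squares line is y = 1.4x + 2.0.
X = pd.DataFrame([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
Y = pd.DataFrame([2.5, 2.4, 5.3, 6.2, 7.6, 9.0])

model = linear_model.LinearRegression()
model.fit(X, Y)

print(model.coef_)       # slope a, as an array of shape (1, 1)
print(model.intercept_)  # intercept b, as an array of shape (1,)
```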
Sample data for machine learning is also bundled with scikit-learn. One of the samples is Boston house price data. The original data appears to come from the site linked here. Below is code that runs a regression analysis on the relationship between the number of rooms, which is one of the features included in the data, and the price of the house.
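The following is a hedged sketch of such an analysis, not the original code from this post. The helper name fit_rooms_vs_price is hypothetical. The Boston data was loaded with sklearn.datasets.load_boston, which was removed in scikit-learn 1.2, so the loading step is shown only as a comment and applies to older versions.

```python
import numpy as np
from sklearn import linear_model


def fit_rooms_vs_price(rooms, prices):
    """Fit prices = a * rooms + b and return (a, b).

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    X = np.asarray(rooms, dtype=float)[:, np.newaxis]  # shape (n, 1)
    y = np.asarray(prices, dtype=float)                # shape (n,)
    model = linear_model.LinearRegression()
    model.fit(X, y)
    return float(model.coef_[0]), float(model.intercept_)


# With scikit-learn older than 1.2, the Boston data could be used like this:
#   from sklearn.datasets import load_boston
#   boston = load_boston()
#   rooms = boston.data[:, 5]   # column 5 is RM, average rooms per dwelling
#   prices = boston.target      # median house value
#   a, b = fit_rooms_vs_price(rooms, prices)
```

A positive slope a would correspond to prices rising with the number of rooms.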
As you would expect, the house price tends to increase as the number of rooms increases.
That's all for this story!