[Python3] Let's analyze data using machine learning! (Regression)

Target

--People who are interested in AI --People who are interested but don't know what to do --People who want to know what AI can do

Premise

--I have a Google account --Know the basic usage of Google Drive --Know the basic usage of Python3 --Does not deal with the theory of machine learning

See below for an overview of machine learning

Machine learning starting from scratch (overview of machine learning) --Qiita

goal

--You can predict the value of the target column of the table data with some accuracy. --Understand a few technical terms required for data analysis --Data can be displayed as a graph

ランダムフォレスト02.png

Previous knowledge

The model used depends on the type of data to be predicted.

--Classification Predicting two types of labels, 0 or 1. Example) Positive / negative virus test, whether or not a customer cancels the service

--Regression Predicting numbers Example) Sales, number of visitors

This time we are dealing with regression

flow

  1. Preparing for Google Colaboratory
  1. Read data
  2. Check the data --Top display (head) --Column description --Detailed information (describe) --Missing value and type confirmation (info) of each column --Histogram --Correlation confirmation (scatter_matrix) --Check the median house price by location (Basemap) --What you learned so far --part1
  3. Model training and prediction-part1 --Split data for learning / verification / testing ――Why do you prepare verification data? --Define a function to draw the accuracy of the prediction result --Linear regression learning and prediction accuracy visualization (no preprocessing) --XGBoost learning and prediction accuracy visualization (no preprocessing) --Random forest learning and visualization of prediction accuracy (no preprocessing) --What you learned so far --part2
  4. Pretreatment --Confirmation of outliers --Delete outliers --Split all_data into train and test
  5. Model training and prediction-part2 --Split data for learning / verification / testing
  1. Summary
  2. Other references

Preparing for Google Colaboratory

  1. Access Google Drive
  2. Create a folder for work (this time create a folder named handson)
  3. Move to the created folder
  4. Right-click => Other => Select Google Colaboratory

initial screen

It should look like the following (partially omitted) 初期画面.png

Runtime connection

Open sample_data from the tab on the left and if you can confirm that california_housing_ {train/test} .csv exists, it's OK 左側.png

Screen description

セルの説明.png red frame : text cell (markdown notation) blue frame : code cell (python) green frame : output result

You can execute a cell by pressing "Shift" + "Enter" or the start icon on the left side of the cell with the cell selected. Execution is done cell by cell, and the value of the variable is retained even if the cell is moved. The order in which the code cells are executed can be seen by the number on the left side of the blue frame . Since the last variable of the code cell is output, there is no need to write print (hi) (super convenient!)

Add cell

You can add it with the "+ code", "+ text" on the upper left of the image below or the button displayed when you move the cursor to the center of the cell. コードorテキストの表示.png

See below for detailed usage

-How to use Google Colab – Summary of advantages / disadvantages and basic operations of Google Colaboratory -How to use Google Colaboratory [Complete Manual] -Summary of how to use Google Colab --Qiita -Colaboratory Shortcut Memorandum --Qiita

From here, add cells as appropriate and execute

Data reading

This time, we will use the sample data california housing data. Sample data is prepared in advance in Colaboratory

import pandas as pd
pd.options.display.precision = 3

train = pd.read_csv('sample_data/california_housing_train.csv')
test = pd.read_csv('sample_data/california_housing_test.csv')

What is pandas? -wiki

pandas is a library that provides functions to support data analysis in the programming language Python. In particular, it provides data structures and operations for manipulating mathematical tables and time series data.

The number of digits after the decimal point is set in pd.options.display.precision (this time, for convenience of explanation, it is set to 3 digits after the decimal point). You can read csv files with read_csv

Data confirmation

From here, let's look at the data

Top display (head)

train.head()

train_head.png You can check the table data up to the first 5 rows with head (if you enter a value in the argument, that number will be displayed)

Let's check test as well!

Description of each column

Column name Explanation
longitude longitude
latitude latitude
housing_median_age Median age of blocks
total_rooms Total number of rooms in the block
total_bedrooms Total number of bedrooms in the block
population Total number of people living in the block
households Number of households per block
median_income Median household income in a block[$ 10,000]
median_house_value Median house prices in blocks[Dollar]

Details of each column are as follows

California Housing Data Set Description

The column to be predicted this time is median_house_value, which is called ** objective variable **. The other columns are the data required for prediction and are called ** explanatory variables **.

Review detailed information (describe)

train.describe()

train_describe.png

describe shows the following information for each column

name meaning
count Number of lines
mean Average value
std standard deviation
min minimum value
25% First quartile
50% Median
75% Third quartile
max Maximum value

Let's check test as well!

Missing value and type confirmation (info) of each column

train.info()

train_info.png Since the value of each item of Non-Null Count is 17000, which is the same as count confirmed in describe, it can be seen that there is no missing value. If there is a missing value, it is necessary to complete it with some value or delete the line and do not use it for training data.

Let's check test as well!

Histogram

Installation of libraries required for graph display

import matplotlib.pyplot as plt
%matplotlib inline

Histogram display

train.hist(bins=50, figsize=(10, 10))

train_hist.png

What is matplotlib? -wiki

Matplotlib is a graph drawing library for the programming language Python and its scientific computing library NumPy.

--% matplotlib inline is like magic --Histogram display with hist --bins is the number of bars --Change the graph size with figsize (please choose if you like)

Let's check test as well!

Correlation confirmation (scatter_matrix)

pd.plotting.scatter_matrix(train, diagonal='kde', figsize=(10, 10))

train_scatterMatrix.png Select kernel density estimation with diagonal ='kde'

References

Let's check test as well!

Check median home prices by location (Basemap)

Basemap installation

!apt-get install libgeos-3.5.0
!apt-get install libgeos-dev
!pip install https://github.com/matplotlib/basemap/archive/master.zip

When you see the following, restart the runtime from the button below basemap_restart.png After rebooting, run from above, including installing Basemap OK when Successfully built base map is displayed

Install Basemap and define a function to display plots on the map

from mpl_toolkits.basemap import Basemap

def basemap(x, y, target):
  fig = plt.figure(dpi=300)
  xyrange = 0.5
  m = Basemap(projection="cyl", resolution="i", llcrnrlat=y.min()-xyrange, urcrnrlat=y.max()+xyrange, llcrnrlon=x.min()-xyrange, urcrnrlon=x.max()+xyrange)
  m.drawparallels(np.arange(-90, 90, 0.5), labels=[True, False, True, False],linewidth=0.0, fontsize=4)
  m.drawmeridians(np.arange(0, 360, 0.5), labels=[True, False, False, True],linewidth=0.0, rotation=45, fontsize=4)
  m.drawcoastlines(color='k')
  m.arcgisimage(service='World_Street_Map', verbose=True, xpixels=1000, dpi=300)
  plt.scatter(x, y, c=target, s=5, cmap=plt.get_cmap('jet'), alpha=0.4, linewidths=0)
  plt.colorbar()

Plot data around California

basemap(train['longitude'], train['latitude'], train['median_house_value'])

train_basemap.png References

-Visualize location information with Basemap-Qiita -How to install and use basemap on Google Colab?

Let's check test as well!

What I learned so far --part1

  1. The distribution of train and test data is almost the same (because each item and histogram of describe are almost similar)
  2. The maximum values ​​of housing_median_age and median_house_value seem to be outliers (from the histogram).
  3. total_rooms, total_bedrooms, population, and households are likely to be correlated (from the correlation graph).
  4. Median home prices for seaside properties appear to be high (from Basemap)

What are outliers? -wiki

Outliers (outliers) are values ​​that deviate significantly from other values ​​in statistics.

Based on these, the next pretreatment should be performed However, in order to ** understand the importance of preprocessing **, we will work on learning machine learning models without dare to do it.

Model training and prediction-part1

Try learning without preprocessing The model used this time is as follows

--Linear regression (linear model) --XGBoost (decision tree model) --Random forest (decision tree model)

** Features of linear model **

--Features are numerical values --Cannot handle missing values --Standardization required ――The accuracy is not high by itself, and it is almost impossible to beat GBDT and neural networks.

** Features of decision tree model **

--Features are numerical values --Can handle missing values --Easy to get accuracy without parameter tuning ――The accuracy does not drop easily even if you add unnecessary features. --No need to scale (standardize or normalize) features

If you want to make predictions with a certain degree of accuracy, the decision tree model is easier and easier to handle. This time, as a comparison, we also deal with linear models. The linear model requires standardization of features, but in part1, we do not dare to standardize, and in part2 we check how much the accuracy changes.

What is standardization? -Data Science Handbook

Standardization in statistics is the operation of converting given data into data with a mean of 0 and a variance of 1.

Divide data for learning / verification / testing

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

#Objective variable
obj_var = 'median_house_value'

#Explanatory variables for testing
x_test = test.drop(obj_var, axis=1)
#Objective variable for testing
y_test = test[obj_var]

#Explanatory variables for learning
x = train.drop(obj_var, axis=1)
#Objective variable for learning
y = train[obj_var]

#Divide the training data into training and verification
x_train, x_valid, y_train, y_valid = train_test_split(x, y, test_size=0.2, random_state=144, shuffle=True)

--Split data for training and verification with train_test_split --Divided learning and verification at a ratio of 8: 2 with test_size = 0.2 --random_state is a pseudo-random number, otherwise the result will change each time it is executed.

Please use your favorite numbers

--Shuffle the split with shuffle = True

If not specified (False), 80% of the data is divided for learning and the remaining 20% ​​is for verification.

--mean_squared_error will be used later to check the prediction accuracy

Why prepare verification data?

If the prediction accuracy is evaluated using only the training data and the test data, a model specialized for predicting the test data will be created (called overfitting).

For example, if you want to predict the weather using data from January to May of a certain year. If the prediction accuracy is improved only by evaluation using test data (overfitting), the weather from January to May of that year can be predicted well. However, the weather from January to May of the following year will be different from the previous year and the prediction accuracy will be worse. This is because it is unpredictable for unknown data.

If verification data is prepared so that the two data for verification and testing can be predicted with the same accuracy, it is possible to predict unknown data while maintaining a certain degree of accuracy (). Prevention of overfitting)

What is overfitting? --wiki

Overfitting (English: overfitting), overfitting (Kakigo) and overtraining (Kakushu, English: overtraining) are learned for training data in statistics and machine learning. However, it refers to a state in which unknown data (test data) cannot be adapted or generalized. Due to lack of generalization ability.

This time, we will use the simplest hold-out method. Besides, various such as cross validation and k-fold

Define a function to draw the accuracy of the prediction result

def predict_visualizer(valid, valid_predict, test, test_predict, title):
  valid_array = np.array(valid)
  test_array = np.array(test)
  va_pred_array = np.array(valid_predict)
  ts_pred_array = np.array(test_predict)
  y_values = np.concatenate([valid_array, va_pred_array]).flatten()
  ymin, ymax = np.amin(y_values), np.amax(y_values)
  fig = plt.figure(figsize=(10, 10))
  plt.scatter(valid_array, va_pred_array, label='valid', color='aqua', alpha=0.4)
  plt.scatter(test_array, ts_pred_array, label='test', color='tomato', alpha=0.4)
  plt.plot([ymin, ymax], [ymin, ymax])
  plt.xlabel('test', fontsize=24)
  plt.ylabel('predict', fontsize=24)
  plt.title(f'{title}-Test-Predict Plot', fontsize=24)
  plt.tick_params(labelsize=16)
  plt.legend()
  plt.show()

Linear regression learning and prediction accuracy visualization (no preprocessing)

from sklearn.linear_model import LinearRegression

lr = LinearRegression()
lr.fit(x_train, y_train)
y_pred = lr.predict(x_valid)
y_pred_test = lr.predict(x_test)
print('='*20)
print('LinearRegression')
print(f'valid_RMSE: {np.sqrt(mean_squared_error(y_valid, y_pred))}')
print(f'test_RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_test))}')
predict_visualizer(y_valid, y_pred, y_test, y_pred_test, 'lr')

線形回帰01.png

--Train the model with fit --Predict the objective variable with predict

This time we predict two data for verification and two for testing respectively In order to evaluate the prediction accuracy, this time we use an average squared error called RMSE (Root Mean Squared Error). The calculation formula is as follows


RMSE=\sqrt{\displaystyle\frac{1}{N}\sum_{i=1}^{n}(y_i-\hat{y_i})^2}

$ N $: Total number of forecasted data $ y_i $: $ i $ th actual value $ \ hat {y_i} $: $ i $ th predicted value


Find the mean squared error with mean_sqared_error and take the square root with np.sqrt From the result, RMSE was about 70,000. When forecasting home prices with this model, it means that you can forecast with an average error of $ 70,000.

As a way of looking at the scatter plot, the closer the point is to the blue straight line, the more accurate the prediction. Looking at this result, when test is 500000, it is not well predicted This is thought to be due to the strong influence of outliers.

Let's look at other models in the same way

XGBoost learning and visualization of prediction results (no preprocessing)

import xgboost as XGB

xgb = XGB.XGBRegressor(random_state=144)
xgb.fit(x_train, y_train)
y_pred = xgb.predict(x_valid)
y_pred_test = xgb.predict(x_test)
print('='*20)
print('XGBRegressor')
print(f'valid_RMSE: {np.sqrt(mean_squared_error(y_valid, y_pred))}')
print(f'test_RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_test))}')
predict_visualizer(y_valid, y_pred, y_test, y_pred_test, 'xgbr')

XGBoost01.png From the result, RMSE was about 55,000. It can be predicted with higher accuracy than the linear regression model. However, it is the same that it is not well predicted when test is 500000.

Random forest training and visualization of prediction results (no preprocessing)

from sklearn.ensemble import RandomForestRegressor

rfr = RandomForestRegressor(random_state=144, n_estimators=100)
rfr.fit(x_train, y_train)
y_pred = rfr.predict(x_valid)
y_pred_test = rfr.predict(x_test)
print('='*20)
print('RandomForestClassifier')
print(f'valid_RMSE: {np.sqrt(mean_squared_error(y_valid, y_pred))}')
print(f'test_RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_test))}')
predict_visualizer(y_valid, y_pred, y_test, y_pred_test, 'rfr')

ランダムフォレスト01.png You can set the number of trees with n_estimators (100 this time) From the result, RMSE was about 49,000. Predicts with higher accuracy than XGBoost However, it is the same that it is not well predicted when test is 500000.

What I learned so far-part2

model RMSE for verification RMSE for testing
Linear regression 70139 69766
XGBoost 55100 55100
Random forest 48458 49924
  1. Random forest makes the most accurate predictions
  2. None of the models are good at predicting when test is 500000

It can be expected that the outliers seem to be bad, as mentioned in "What I learned so far-part 1". From here, pre-processing eliminates the effects of outliers.

Preprocessing

Combine all data (train and test) into one before preprocessing

train_mid = train.copy()
test_mid = test.copy()

train_mid['train_test'] = 'train'
test_mid['train_test'] = 'test'

all_data = pd.concat([train_mid, test_mid]).reset_index(drop=True)
all_data.head()

--Do not pollute the data of train and test with copy () --Added train_test column --Combine multiple DataFrames with pd.concat --Reset index with reset_index --Delete the original index with drop = True (otherwise the original index will be included)

Check outliers

Item 2. of "What I learned so far --part 1".

The maximum value of housing_median_age and median_house_value seems to be outliers (from the histogram)

To be specific

all_data['housing_median_age'].value_counts()

(Part of the result omitted) housing_median_age.png You can check the frequency of occurrence of each value with value_counts () in descending order. From the results, it can be seen that the maximum value (see describe) of 52.0 is the highest.

Also check the median_house_value column

all_data['median_house_value'].value_counts()

(Part of the result omitted) median_house_value.png From the results, it can be seen that the maximum value of 500001.0 is the highest.

From the two results, it was found that the maximum value of each was an outlier, which is considered to have an adverse effect on the prediction accuracy.

Remove outliers

Delete each maximum value

all_data = all_data.query(' housing_median_age < 52 and median_house_value < 500000 ').reset_index(drop=True)

Only rows that match the conditions can be extracted with query Don't forget reset_index as the index will be skipped by extraction

Let's see how it changed with describe and hist!

Split all_data into train and test

Since the pre-processing to delete the outliers is finished, split it into train and test

train = all_data.query(" train_test == 'train' ").drop('train_test', axis=1)
test = all_data.query(" train_test == 'test' ").drop('train_test', axis=1)

Model training and prediction-part2

Proceed in much the same way as "Model Learning and Prediction-part 1"

Divide data for learning / verification / testing

Same as the processing of part1

x_test = test.drop(obj_var, axis=1)
y_test = test[obj_var]

x = train.drop(obj_var, axis=1)
y = train[obj_var]

x_train, x_valid, y_train, y_valid = train_test_split(x, y, test_size=0.2, random_state=144, shuffle=True)

Standardization

Linear model requires standardization I didn't dare to do it in "Model Learning and Prediction-part 1"

from sklearn.preprocessing import StandardScaler

std_scl = StandardScaler()
x_train = std_scl.fit_transform(x_train)
x_valid = std_scl.transform(x_valid)
x_test = std_scl.transform(x_test)

Details of the StandardScalar method are below

Scale conversion (preprocessing) of data with Scikit-learn --Hatena Blog

Learning of linear regression and visualization of prediction accuracy (with preprocessing)

Same as the processing of part1

lr = LinearRegression()
lr.fit(x_train, y_train)
y_pred = lr.predict(x_valid)
y_pred_test = lr.predict(x_test)
print('='*20)
print('LinearRegression')
print(f'valid_RMSE: {np.sqrt(mean_squared_error(y_valid, y_pred))}')
print(f'test_RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_test))}')
predict_visualizer(y_valid, y_pred, y_test, y_pred_test, 'lr')

線形回帰02.png From the result, RMSE was about 60,000. Since it was about 70,000 in part1, the accuracy has improved.

XGBoost learning and visualization of prediction accuracy (with preprocessing)

Same as the processing of part1

xgb = XGB.XGBRegressor(random_state=144)
xgb.fit(x_train, y_train)
y_pred = xgb.predict(x_valid)
y_pred_test = xgb.predict(x_test)
print('='*20)
print('XGBRegressor')
print(f'valid_RMSE: {np.sqrt(mean_squared_error(y_valid, y_pred))}')
print(f'test_RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_test))}')
predict_visualizer(y_valid, y_pred, y_test, y_pred_test, 'xgbr')

XGBoost02.png From the result, RMSE was about 50,000. Since it was about 55,000 in part1, the accuracy has improved.

Random forest learning and prediction accuracy visualization (with preprocessing)

Same as the processing of part1

rfr = RandomForestRegressor(random_state=144, n_estimators=100)
rfr.fit(x_train, y_train)
y_pred = rfr.predict(x_valid)
y_pred_test = rfr.predict(x_test)
print('='*20)
print('RandomForestClassifier')
print(f'valid_RMSE: {np.sqrt(mean_squared_error(y_valid, y_pred))}')
print(f'test_RMSE: {np.sqrt(mean_squared_error(y_test, y_pred_test))}')
predict_visualizer(y_valid, y_pred, y_test, y_pred_test, 'rfr')

ランダムフォレスト02.png From the result, RMSE was about 45,000. Since it was about 49,000 in part1, the accuracy has improved.

Summary

--I tried machine learning with Google Colaboratory --Check the california_housing data --Comparison of prediction accuracy with and without pretreatment --It was confirmed that the prediction accuracy of the model learned by preprocessing was improved.

** part1 result **

model RMSE for verification RMSE for testing
Linear regression 70139 69766
XGBoost 55100 55100
Random forest 48458 49924

** part2 result **

model RMSE for verification RMSE for testing
Linear regression 58568 60750
XGBoost 48254 51184
Random forest 42473 45217

The only pre-processing we did this time was to delete outliers. However, data analysis, which is said to account for 80% of the total process of preprocessing, still has problems to be improved.

If you are interested in data analysis even a little, please try other preprocessing for better prediction accuracy.

Other references

-[1st California Home Price Forecast] Confirm accuracy without pretreatment -[2nd California Home Price Forecast] Feature Engineering & Data Cleaning (Data Cleansing) -[3rd California Home Price Forecast] Select and Evaluate the Best Machine Learning Model

Recommended Posts

[Python3] Let's analyze data using machine learning! (Regression)
Let's analyze Covid-19 (Corona) data using Python [For beginners]
Machine learning with python (2) Simple regression analysis
Python: Preprocessing in machine learning: Data acquisition
[Python] First data analysis / machine learning (Kaggle)
Stock price forecast using machine learning (regression)
Python: Preprocessing in machine learning: Data conversion
[Machine learning] Regression analysis using scikit learn
Data supply tricks using deques in machine learning
Machine learning logistic regression
Data analysis using Python 0
[Python] Data analysis, machine learning practice (Kaggle) -Data preprocessing-
Machine learning linear regression
I started machine learning with Python Data preprocessing
Python: Supervised Learning (Regression)
What I learned about AI / machine learning using Python (1)
Coursera Machine Learning Challenges in Python: ex2 (Logistic Regression)
What I learned about AI / machine learning using Python (3)
Coursera Machine Learning Challenges in Python: ex1 (Linear Regression)
Tech-Circle Let's start application development using machine learning (self-study)
Memo for building a machine learning environment using Python
What I learned about AI / machine learning using Python (2)
Data set for machine learning
Machine Learning: Supervised --Linear Regression
Machine learning with Python! Preparation
Understand machine learning ~ ridge regression ~.
Python data analysis learning notes
Python Machine Learning Programming> Keywords
Supervised machine learning (classification / regression)
Beginning with Python machine learning
Data analysis using python pandas
Machine learning stacking template (regression)
Machine learning algorithm (logistic regression)
[Python] Save PDF from Google Colaboratory to Google Drive! -Let's collect data for machine learning-
Build an environment for machine learning using Python on MacOSX
Python beginners publish web applications using machine learning [Part 1] Introduction
What I learned about AI and machine learning using Python (4)
Python learning memo for machine learning by Chainer Chapter 7 Regression analysis
Data acquisition using python googlemap api
Python: Application of supervised learning (regression)
Machine learning with python (1) Overall classification
Machine learning summary by Python beginners
Machine learning beginners try linear regression
Machine learning algorithm (multiple regression analysis)
Machine learning algorithm (simple regression analysis)
Classification and regression in machine learning
<For beginners> python library <For machine learning>
Inversely analyze a machine learning model
Let's see using input in python
Machine learning in Delemas (data acquisition)
Python: Preprocessing in Machine Learning: Overview
Preprocessing in machine learning 2 Data acquisition
"Scraping & machine learning with Python" Learning memo
Let's analyze voice with Python # 1 FFT
Preprocessing in machine learning 4 Data conversion
Basic machine learning procedure: ② Prepare data
Application development using Azure Machine Learning
How to collect machine learning data
[Python machine learning] Recommendation of using Spyder for beginners (as of August 2020)
Try to analyze online family mahjong using Python (PART 1: Take DATA)
[Python] Analyze Splatoon 2 league match data using a correlation coefficient table