Introduction

I tried out PyCaret, a machine learning library that was released recently. It automates feature analysis and the comparison of multiple models, so I expect it to significantly reduce the time data scientists spend on this kind of work.

This time, I will use PyCaret to solve a regression problem: predicting Boston real estate prices.

Previous articles:
1. I tried to classify wine quality with PyCaret
2. I tried to predict Titanic survival with PyCaret

1. Install PyCaret

Run the command below to install PyCaret. I use Anaconda, so I created and activated a new virtual environment dedicated to PyCaret before installing. Installing into an existing Conda-managed environment may cause an error (probably due to a conflict between pip and conda).

pip install pycaret
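Since pip/conda conflicts can leave a package installed in a different environment than the one you are running, it can help to confirm from Python that PyCaret is visible. The following is a minimal sketch using only the standard library; the helper name installed_version is my own and is not part of PyCaret.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return version(package_name)
    except PackageNotFoundError:
        return None


# Report whether PyCaret is visible from the currently active environment.
pycaret_version = installed_version("pycaret")
if pycaret_version is None:
    print("pycaret is not installed in this environment")
else:
    print("pycaret", pycaret_version)
```

If this prints that pycaret is not installed right after a successful pip install, pip probably installed into a different environment than the one the interpreter is running in.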

2. Data acquisition

PyCaret provides several open-source datasets through get_data(). You can check the list of available datasets at the link below.
https://pycaret.org/get-data/#datasets

This time we will use the Boston Real Estate Price Dataset.

from pycaret.datasets import get_data
dataset = get_data('boston')

Results

Let's take a look at the contents of the data using profile_report(), a method that pandas-profiling adds to pandas DataFrames.

import pandas_profiling
dataset.profile_report()

result

Description of the data

The Boston real estate dataset has 506 rows and 14 columns. The variables are described below.

crim: Per capita crime rate by town
zn: Proportion of residential land zoned for lots over 25,000 square feet
indus: Proportion of non-retail business acres per town
chas: Charles River dummy variable (= 1 if the tract bounds the river; 0 otherwise)
nox: Nitrogen oxide concentration (parts per 10 million)
rm: Average number of rooms per dwelling
age: Proportion of owner-occupied units built before 1940 (the dataset was published in 1978)
dis: Weighted distance to five Boston employment centers
rad: Index of accessibility to radial highways
tax: Full-value property tax rate per $10,000
ptratio: Pupil-teacher ratio by town
black: 1000 (Bk - 0.63)^2, where Bk is the proportion of black residents in the town
lstat: Percentage of the population with lower status
medv (target variable): Median value of owner-occupied homes (in $1000s)
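The black column is the only variable defined by a formula, so a small worked example may help. This sketch assumes Bk is expressed as a fraction between 0 and 1; the function name black_feature is my own.

```python
def black_feature(bk):
    """Compute 1000 * (Bk - 0.63)^2 for a proportion Bk between 0 and 1."""
    return 1000 * (bk - 0.63) ** 2


# Evaluate the transform at the two extremes and at its minimum.
for bk in (0.0, 0.63, 1.0):
    print(f"Bk = {bk:.2f} -> black = {black_feature(bk):.1f}")
```

Because the formula is a parabola with its minimum at Bk = 0.63, two different proportions can map to the same value, so the column cannot be inverted back to Bk uniquely.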

3. Data preprocessing

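Use sample() to split the dataset: 90% for training and 10% held out as test data. The held-out rows are selected by their original index, so the index must not be reset until after the split.

# Sample 90% of the rows, keeping the original index for now.
data = dataset.sample(frac=0.9, random_state=786)
# The held-out 10% are the rows whose original index was not sampled.
data_unseen = dataset.drop(data.index).reset_index(drop=True)
data = data.reset_index(drop=True)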

print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))

Result

Data for Modeling: (455, 14)
Unseen Data For Predictions: (51, 14)
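The idea behind this split can be sketched without pandas: pick 90% of the row labels at random, then keep every label that was not picked as the held-out set. The point is that the held-out rows must be chosen using the original row labels of the sample, not labels that have been renumbered from zero. The function holdout_split below is a hypothetical helper, and Python's random module will not pick the same rows as pandas does for the same seed.

```python
import random


def holdout_split(n_rows, frac, seed):
    """Split row labels 0..n_rows-1 into a random training part and the remaining holdout part."""
    rng = random.Random(seed)
    n_train = round(n_rows * frac)
    train_idx = sorted(rng.sample(range(n_rows), n_train))
    train_set = set(train_idx)
    # Everything that was not sampled becomes the holdout set.
    holdout_idx = [i for i in range(n_rows) if i not in train_set]
    return train_idx, holdout_idx


train_idx, holdout_idx = holdout_split(n_rows=506, frac=0.9, seed=786)
print(len(train_idx), len(holdout_idx))
```

With 506 rows this gives 455 training rows and 51 held-out rows, matching the shapes printed above, and the two sets share no row.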

Use setup() to preprocess the data. Specify the target variable with the argument target='medv'.

from pycaret.regression import *
exp_reg101 = setup(data = data, target = 'medv',session_id=12)

Result (up to 10 items)

4. Model comparison

Use compare_models() to fit multiple regression models to the dataset and summarize the results in a table. This is very useful when deciding which regression model to use.

PyCaret provides more than 10 regression models. The list is available at the link below.

https://pycaret.org/regression/

compare_models()

The CatBoost Regressor scored RMSE = 3.1399 and R^2 = 0.859. Since the goal this time is to try out PyCaret itself rather than to get the best score, I will continue with Linear Regression (R^2 = 0.6739), which ranked 8th.
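For reference, the two metrics quoted here can be written out in a few lines. This is a sketch of the standard definitions of RMSE and R^2, not PyCaret's own code, and the sample values are made up for illustration.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the average squared residual."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical toy values, not taken from the Boston data.
y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 8.0]
print(rmse(y_true, y_pred))
print(r2_score(y_true, y_pred))
```

RMSE is in the same unit as the target (here, $1000s of medv), so lower is better, while R^2 is unitless and closer to 1 is better.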

result

5. Generation of analytical model

Select a regression model and train it with create_model(). This time, we will use the Linear Regression model.

lr = create_model('lr')

Result: the mean R^2 was 0.6739 (k-fold cross-validation, 10 folds).
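To make the k-fold procedure concrete, here is a minimal sketch of how row indices are divided into folds. Each fold is used once for validation while the other folds are used for training, and the reported score is the mean over the folds. The generator k_fold_indices is my own illustration; it does not shuffle the rows, which real libraries usually do, and the sample count of 25 is arbitrary.

```python
def k_fold_indices(n_samples, n_folds):
    """Yield (train_indices, validation_indices) for each of n_folds contiguous folds."""
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds take one additional sample each.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        validation = list(range(start, stop))
        train = list(range(0, start)) + list(range(stop, n_samples))
        yield train, validation
        start = stop


folds = list(k_fold_indices(n_samples=25, n_folds=10))
print([len(validation) for _, validation in folds])
```

Every row lands in exactly one validation fold, so each row contributes to the score exactly once.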

6. Tuning the analytical model

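Tune the model with tune_model(). In the PyCaret release used here, tune_model() takes the model ID string; from PyCaret 2.0 onward it takes the trained model object instead, i.e. tune_model(lr).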

tuned_lr = tune_model('lr')

result

The mean R^2 was 0.6739 both before and after tuning, so tuning brought no improvement. For Linear Regression, which has almost no hyperparameters to adjust, tune_model() is unlikely to help much.
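One likely reason is that ordinary least squares has a closed-form solution: for a given training set, the coefficients are fully determined, with nothing like a regularization strength or tree depth to search over. The sketch below shows this for the single-feature case. It is an illustration only, with made-up data, and I am assuming that PyCaret's 'lr' model is a plain least-squares fit.

```python
def fit_simple_ols(xs, ys):
    """Return (slope, intercept) minimizing squared error for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical points lying exactly on y = 2x + 1.
slope, intercept = fit_simple_ols([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(slope, intercept)
```

Fitting the same data again always gives the same line, which is consistent with the identical score before and after tuning.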

7. Visualization of analytical model

Visualize the analysis results using plot_model.

plot_model(tuned_lr)

result

8. Evaluation of analytical model

evaluate_model() lets you run several evaluations from a single interactive view.

evaluate_model(lr)

Clicking one of the buttons in the yellow frame displays the corresponding evaluation result.

result

9. Forecast

After finalizing the model with finalize_model(), make predictions with predict_model(). The held-out test data (here, data_unseen) is used for prediction.

final_lr = finalize_model(tuned_lr)
unseen_predictions = predict_model(final_lr, data=data_unseen)
unseen_predictions.head()

The Label column holds the predicted values, and the medv column holds the actual values.

result

10. Summary

We analyzed the regression problem with PyCaret.

10.1 List of PyCaret functions used for analysis

Data preprocessing: setup()
Compare models: compare_models()
Generate analytical model: create_model()
Tuning: tune_model()
Visualization: plot_model()
Evaluation: evaluate_model()
Prediction: finalize_model(), predict_model()
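Several of the functions listed above chain together roughly as follows. This is a sketch that assumes PyCaret is installed and that get_data('boston') is available; the wrapper function run_boston_workflow is my own naming, and the comparison, tuning, and plotting steps are left out for brevity.

```python
def run_boston_workflow(session_id=12):
    """Run the main steps of this article and return predictions on the held-out rows."""
    # Imported lazily so that defining this function does not require PyCaret.
    from pycaret.datasets import get_data
    from pycaret.regression import (
        setup, create_model, finalize_model, predict_model,
    )

    dataset = get_data('boston')

    # Hold out 10% of the rows, using the original row labels to pick them.
    data = dataset.sample(frac=0.9, random_state=786)
    data_unseen = dataset.drop(data.index).reset_index(drop=True)
    data = data.reset_index(drop=True)

    setup(data=data, target='medv', session_id=session_id)
    lr = create_model('lr')        # cross-validated linear regression
    final_lr = finalize_model(lr)  # refit on all rows passed to setup()
    return predict_model(final_lr, data=data_unseen)
```

In a notebook, calling unseen_predictions = run_boston_workflow() should reproduce the flow of sections 2 to 9. Depending on the PyCaret version, setup() may ask you to confirm the inferred data types before continuing.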

11. References

1. PyCaret Home Page, http://www.pycaret.org/
2. PyCaret Classification, https://pycaret.org/classification/
3. I tried using PyCaret at the fastest speed, https://qiita.com/s_fukuzawa/items/5dd40a008dac76595eea
4. I tried to classify the quality of wine by PyCaret, https://qiita.com/kotai2003/items/c8fa7e55230d0fa0cc8e
5. I tried to predict the survival of Titanic with PyCaret, https://qiita.com/kotai2003/items/a377f45ddee9829ed2c5

I tried to predict Boston real estate prices with PyCaret
