I tried out PyCaret, a machine learning library that was released the other day. It automates feature analysis of the data and performance comparison across multiple models, so I expect it to cut the time data scientists spend on this work significantly.
This time, I will use PyCaret on a regression problem: predicting Boston housing prices.
Previous articles: 1. I tried to classify wine quality with PyCaret 2. I tried to predict Titanic survival with PyCaret
Run the command below to install PyCaret. I use Anaconda, and I created a new virtual environment dedicated to PyCaret before installing. Installing into an existing conda-managed environment may produce errors (probably because of conflicts between pip and conda).
pip install pycaret
PyCaret provides several open source datasets through get_data(). The list of available datasets is at the link below. https://pycaret.org/get-data/#datasets
This time we will use the Boston housing price dataset.
from pycaret.datasets import get_data
dataset = get_data('boston')
Results
Let's look at the contents of the data using profile_report() from pandas-profiling.
import pandas_profiling
dataset.profile_report()
result
A description of the data.
The Boston housing dataset has 506 rows and 14 columns. The variables are as follows.
crim: Per capita crime rate by town
zn: Proportion of residential land zoned for lots over 25,000 square feet
indus: Proportion of non-retail business acres per town
chas: Charles River dummy variable (= 1 if the tract bounds the river; 0 otherwise)
nox: Nitric oxides concentration (parts per 10 million)
rm: Average number of rooms per dwelling
age: Proportion of owner-occupied units built before 1940 (the dataset was published in 1978)
dis: Weighted distances to five Boston employment centers
rad: Index of accessibility to radial highways
tax: Full-value property tax rate per $10,000
ptratio: Pupil-teacher ratio by town
black: 1000 (Bk - 0.63)^2, where Bk is the proportion of Black residents by town
lstat: Percentage of lower-status population
medv (target variable): Median value of owner-occupied homes (in $1000s)
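Use sample() to split the dataset: 90% for modeling and 10% held out as unseen test data. The held-out rows are selected by dropping the sampled index labels from the original frame, so the index of data must be reset only after data_unseen has been built.
data = dataset.sample(frac=0.9, random_state=786)
data_unseen = dataset.drop(data.index).reset_index(drop=True)
# Reset the index only after the held-out rows have been selected.
data = data.reset_index(drop=True)
print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))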
Result (up to 10 items)
Data for Modeling: (455, 14)
Unseen Data For Predictions: (51, 14)
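With this kind of split, the order of drop() and reset_index() matters: drop() removes rows by index label, so it only excludes the sampled rows if the sample still carries its original labels. The following is a minimal sketch on a toy DataFrame, not the Boston data; the frame and the `row_id` column are made up purely to track which rows end up where.

```python
import pandas as pd

# Toy stand-in for the real dataset. `row_id` is a hypothetical column
# used only to check where each original row ends up.
toy = pd.DataFrame({"row_id": range(20), "medv": [float(i) for i in range(20)]})

# Sample 90% of the rows, keeping the original index labels for now.
train = toy.sample(frac=0.9, random_state=786)

# Drop the sampled labels from the original frame to get the held-out rows.
unseen = toy.drop(train.index)

# Only now is it safe to reset both indexes.
train = train.reset_index(drop=True)
unseen = unseen.reset_index(drop=True)

# The two parts should share no rows and together cover the whole frame.
overlap = set(train["row_id"]) & set(unseen["row_id"])
print(train.shape, unseen.shape, overlap)
```

If `overlap` is not empty, some rows are in both the modeling data and the "unseen" data, and scores on the unseen data will look better than they should.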
Use setup() to preprocess the data. Specify the target variable with the argument target='medv'.
from pycaret.regression import *
exp_reg101 = setup(data = data, target = 'medv',session_id=12)
Result (up to 10 items)
Use compare_models() to fit multiple regression models to the dataset and summarize the results in one table. This is very useful when deciding which regression model to use.
PyCaret provides more than 10 regression models; the list is at the link below.
https://pycaret.org/regression/
compare_models()
The CatBoost Regressor reached RMSE = 3.1399 and R^2 = 0.859. Since the goal here is to try out PyCaret itself rather than to maximize accuracy, we will continue with Linear Regression (R^2 = 0.6739), which ranked 8th.
result
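For reference, the two metrics quoted above can be computed by hand. This is a minimal sketch using the standard definitions of RMSE and R^2; the four values are made up for illustration and are not taken from the Boston data or from PyCaret's output.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Made-up values, for illustration only.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]

print(round(rmse(y_true, y_pred), 4))  # 0.433
print(round(r2(y_true, y_pred), 4))    # 0.9625
```

RMSE is in the same units as the target (here, thousands of dollars for medv), while R^2 is unitless, which is why the table reports both.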
Select a regression model and train it with create_model(). This time, we will use the Linear Regression model.
lr = create_model('lr')
The mean R^2 was 0.6739 (k-fold cross-validation with 10 folds).
result
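The mean over 10 folds can be illustrated directly with scikit-learn. This is only a sketch of the general idea of k-fold scoring, run on synthetic data; it is an assumption that this resembles what create_model() does internally, and the data, seed, and coefficients are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic regression data (hypothetical, not the Boston dataset).
rng = np.random.default_rng(12)
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 0.5]) + 1.0 + rng.normal(scale=0.1, size=200)

# Split into 10 folds; each fold is used once as the validation set.
cv = KFold(n_splits=10, shuffle=True, random_state=12)
fold_r2 = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2")

# The reported score is the average of the per-fold scores.
mean_r2 = fold_r2.mean()
print(len(fold_r2), round(mean_r2, 4))
```

Looking at the individual entries of `fold_r2`, not just the mean, shows how much the score depends on which rows land in the validation fold.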
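Next, tune the model with tune_model(). In the PyCaret version used here it takes the model ID string; from PyCaret 2.0 onward it takes the trained model object instead, as in tune_model(lr).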
tuned_lr = tune_model('lr')
result
The mean R^2 was 0.6739 before tuning and 0.6739 after tuning, so there was no improvement. Plain linear regression has almost no hyperparameters to search over, so tune_model() cannot be expected to help much for this model.
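The lack of improvement can be reproduced outside PyCaret. The sketch below runs a grid search over `fit_intercept`, one of the few settings scikit-learn's LinearRegression exposes, on synthetic data. The search space PyCaret actually uses for 'lr' is not reproduced here; the grid, data, and seed are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

# Synthetic data with a clear intercept (hypothetical, not the Boston dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 2))
y = X @ np.array([2.0, -1.0]) + 10.0 + rng.normal(scale=0.5, size=150)

cv = KFold(n_splits=10, shuffle=True, random_state=0)

# Score of the untuned model with default settings.
baseline = cross_val_score(LinearRegression(), X, y, cv=cv, scoring="r2").mean()

# "Tuning": the only meaningful choice is whether to fit an intercept.
search = GridSearchCV(
    LinearRegression(),
    param_grid={"fit_intercept": [True, False]},
    cv=cv,
    scoring="r2",
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_ - baseline, 6))
```

The search settles on the default setting, so the tuned score matches the baseline. Models with real hyperparameters, such as regularized or tree-based regressors, leave more room for tuning.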
Visualize the results with plot_model(). Called without a plot argument, it draws the residual plot.
plot_model(tuned_lr)
result
evaluate_model() lets you browse multiple evaluation plots from a single call.
evaluate_model(lr)
Clicking a button in the yellow frame displays the corresponding evaluation result.
result
After finalizing the model with finalize_model(), make predictions with predict_model(). The held-out test data (here, data_unseen) is used for the predictions.
final_lr = finalize_model(tuned_lr)
unseen_predictions = predict_model(final_lr, data=data_unseen)
unseen_predictions.head()
The Label column holds the predicted value. The medv column holds the actual value.
result
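Once predictions and actual values sit side by side, the error on the held-out data can be summarized in a couple of lines. The sketch below uses a small made-up frame in place of the real predict_model() output; the column name `Label` follows the article's output, so check the name your PyCaret version produces, and `abs_error` is a hypothetical helper column.

```python
import pandas as pd

# Made-up rows standing in for the output of predict_model().
preds = pd.DataFrame(
    {
        "medv": [24.0, 21.6, 34.7, 33.4],   # actual values
        "Label": [25.0, 20.6, 33.7, 35.4],  # predicted values
    }
)

# Per-row absolute error, then the mean absolute error over all rows.
preds["abs_error"] = (preds["medv"] - preds["Label"]).abs()
mae = preds["abs_error"].mean()

print(preds)
print(round(mae, 2))
```

Sorting by `abs_error` is a quick way to find the rows the model gets most wrong.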
1. PyCaret Home Page, http://www.pycaret.org/
2. PyCaret Classification, https://pycaret.org/classification/
3. I tried using PyCaret at the fastest speed, https://qiita.com/s_fukuzawa/items/5dd40a008dac76595eea
4. I tried to classify the quality of wine by PyCaret, https://qiita.com/kotai2003/items/c8fa7e55230d0fa0cc8e
5. I tried to predict the survival of Titanic with PyCaret, https://qiita.com/kotai2003/items/a377f45ddee9829ed2c5