The other day I tried PyCaret, a machine learning library that was released recently, and realized that anyone can build models easily. It really was easy! You can go from preprocessing to tuning and prediction in fewer than 10 lines of code. There are still many parts I haven't figured out, such as the arguments, but I decided to write this PyCaret article first. If you notice anything, please leave a comment.
Run the code below to install it. This is just my impression, but it took only a few minutes. When I installed it locally I got an error, so I gave up on that for now.
! pip install pycaret
This time we will use the Boston housing dataset. You can load the data with the following code.
from pycaret.datasets import get_data
boston_data = get_data('boston')
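get_data() returns a pandas DataFrame, so the usual inspection methods work on it. The quick check below is just an illustration, not part of the original steps.
# Quick sanity check (illustrative): boston_data is a pandas DataFrame
print(boston_data.shape)
boston_data.head()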
Perform preprocessing.
The data and the target variable are defined and initialized in setup().
Since we are solving a regression problem this time, we import from pycaret.regression. For classification problems, use pycaret.classification instead.
You can also perform tasks such as natural language processing and clustering.
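For reference, here is a minimal sketch of what the classification module would look like. The 'diabetes' dataset and its 'Class variable' target column come from PyCaret's bundled examples and are my assumptions here, not something used in this post.
# Minimal classification sketch (assumption: 'diabetes' sample dataset with
# target column 'Class variable')
from pycaret.datasets import get_data
from pycaret.classification import setup, compare_models

clf_data = get_data('diabetes')
clf_exp = setup(clf_data, target='Class variable')
compare_models()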
setup() handles missing values, encoding of categorical data, the train-test split, and so on. For more information, see here.
from pycaret.regression import *
exp1 = setup(boston_data, target = 'medv')
Run it to complete the setup.
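setup() also takes optional arguments; to the best of my knowledge it accepts things like the train/test ratio and a random seed, as in the sketch below. The specific values are my own, not from the original call.
# Illustrative only: fix the train/test split ratio and the random seed
exp1 = setup(boston_data, target='medv', train_size=0.7, session_id=123)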
Let's compare models and pick one. You can compare models with the single line below; it took a few minutes. Being able to check the evaluation metrics in one table is convenient! By default, 10-fold cross-validation is used, and you can specify the number of folds and the metric to sort by through the arguments.
compare_models()
The execution results are displayed as a table of models sorted by their evaluation metrics.
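As an illustration, the fold count and sort metric can be passed as arguments; the values below are my own choices, and the exact metric names may vary by version.
# Illustrative only: 5-fold CV, sorted by RMSE instead of the default metric
compare_models(fold=5, sort='RMSE')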
Select a model and train it. This time I'm using Random Forest (chosen purely by feel). This function returns a table of k-fold scores along with the trained model object. Being able to check the standard deviation as well is very convenient!
rf = create_model('rf')
By typing a period after the trained object, you can inspect its attributes and methods.
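For example, assuming the returned object is a standard scikit-learn RandomForestRegressor (my understanding, not something stated in this post), attributes like the following are available.
# Illustrative only: attributes of the underlying scikit-learn estimator
rf.n_estimators          # number of trees
rf.feature_importances_  # impurity-based feature importances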
Tuning can also be done in one line.
tuned_rf = tune_model('rf')
You can get the hyperparameters with the following.
tuned_rf.get_params()
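tune_model() also takes arguments that control the search; as far as I know, n_iter sets the number of random-search iterations. The sketch below is illustrative only.
# Illustrative only: run more random-search iterations during tuning
tuned_rf_more = tune_model('rf', n_iter=50)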
Let's visualize the model's accuracy. The regression plot is shown below; for classification problems, you can choose the output according to the metric. I somewhat regret not picking a classification problem here, since there are many more visualization variations for classification.
plot_model(tuned_rf)
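Other regression plot types can be requested through the plot argument; the names below are taken from the library's documented options as I understand them and may differ by version.
# Illustrative only: other plot types for regression
plot_model(tuned_rf, plot='error')    # prediction error plot
plot_model(tuned_rf, plot='feature')  # feature importance plot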
The model is interpreted using SHAP. Check the SHAP GitHub repository for how to read the plots and how to interpret the model.
interpret_model(tuned_rf)
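interpret_model() also supports other SHAP-based plots via the plot argument, as far as I know; the example below is illustrative and version-dependent.
# Illustrative only: other SHAP-based interpretation plots
interpret_model(tuned_rf, plot='correlation')            # dependence-style plot
interpret_model(tuned_rf, plot='reason', observation=0)  # explanation for one row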
Predictions on the test data are written as follows. The execution returns predictions for the 30% hold-out test set created by the train-test split in setup().
rf_holdout_pred = predict_model(rf)
When making predictions on new data, pass the dataset to the data argument.
predictions = predict_model(rf, data=boston_data)
The prediction results are added as a new column on the far right of the returned dataframe.
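Assuming the prediction column is named 'Label' (which is my assumption and may differ by version), you can pull it out like this.
# Illustrative only: 'Label' is assumed to be the prediction column name
predictions[['medv', 'Label']].head()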
Thank you for reading to the end. If you have any questions, please leave a comment.