I tried out PyCaret, a machine learning library that was released recently. It automates feature analysis of the data and performance comparison across multiple models, and I think it will significantly reduce the time data scientists spend on this kind of work.
This time, I apply PyCaret to the Titanic survival prediction problem, submit the predictions to Kaggle, and look at the resulting score.
**This is a follow-up to my previous article, "I tried to classify the quality of wine with PyCaret".**
Run the command below to install it. I use Anaconda, but I created a virtual environment dedicated to PyCaret and installed it there. Installing into an existing Conda-managed environment may cause errors (probably due to conflicts between pip and conda).
pip install pycaret
train.csv and test.csv can be downloaded from Kaggle's Titanic page: https://www.kaggle.com/c/titanic/data
import pandas as pd
train_data = pd.read_csv("train.csv")
Let's take a look at the contents of the data using profile_report() from pandas-profiling.
import pandas_profiling
train_data.profile_report()
Use setup() to preprocess the data, passing Survived as the target variable.
from pycaret.classification import *
exp_titanic = setup(data = train_data, target = 'Survived')
Result (up to 10 items)
Use compare_models() to train multiple classification models on the dataset and summarize the results in a table. This is very useful when deciding which classification model to use.
PyCaret provides more than 10 classification models; they are listed on the PyCaret Classification page (see the references).
The CatBoost Classifier reached an accuracy of 83.63%. For the rest of this article, I evaluate PyCaret using the Random Forest Classifier, which ranked 9th.
Select a classification model and train it with create_model(). This time, we use the Random Forest Classifier.
rf = create_model('rf', round=2)
The model is then tuned with tune_model().
tuned_rf = tune_model('rf', round=2)
The average accuracy before tuning was 0.80, and the average accuracy after tuning was 0.81.
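The "average accuracy" here is the mean over the cross-validation folds that create_model() and tune_model() report (10 folds by default, as far as I know). Below is a minimal sketch of that averaging in plain Python. The labels are made up, the folds are simple interleaved splits rather than the stratified folds I believe PyCaret uses, and a majority-class baseline stands in for a real model, so treat it as an illustration of the idea only.

```python
from statistics import mean


def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx); fold i validates on every n_folds-th row."""
    indices = list(range(n_samples))
    for i in range(n_folds):
        valid = indices[i::n_folds]
        valid_set = set(valid)
        train = [j for j in indices if j not in valid_set]
        yield train, valid


def majority_class(labels):
    """Return 1 if more than half of the labels are 1, otherwise 0."""
    return 1 if 2 * sum(labels) > len(labels) else 0


# Made-up labels (1 = survived); NOT taken from the Titanic data.
y = [0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0]

fold_scores = []
for train_idx, valid_idx in kfold_indices(len(y), n_folds=3):
    # "Train" the baseline on the training part of the fold.
    guess = majority_class([y[j] for j in train_idx])
    # Score it on the held-out part of the fold.
    correct = sum(1 for j in valid_idx if y[j] == guess)
    fold_scores.append(correct / len(valid_idx))

mean_accuracy = mean(fold_scores)
print("per-fold accuracy:", fold_scores)
print("mean accuracy:", round(mean_accuracy, 2))
```

The single number quoted before and after tuning corresponds to mean_accuracy in this sketch: one score per fold, then the average.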
Visualize the analysis results using plot_model().
First, plot the ROC curve together with its AUC.
plot_model(tuned_rf, plot = 'auc')
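As a reminder of what the AUC value on this plot means, here is a minimal sketch that computes it by hand: AUC is the fraction of (positive, negative) pairs in which the positive example gets the higher score. The labels and scores are made up for illustration, and roc_auc is my own helper name, not a PyCaret function.

```python
def roc_auc(y_true, y_score):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count correctly ordered pairs; a tie counts as half a win.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = survived) and predicted survival probabilities.
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.4, 0.35, 0.8, 0.1, 0.7]

auc = roc_auc(y_true, y_score)
print(f"AUC = {auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is why a curve that hugs the top-left corner is a good sign.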
Then plot the confusion matrix.
plot_model(tuned_rf, plot = 'confusion_matrix')
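For reference, the four cells of a binary confusion matrix can be counted by hand as in the sketch below. The labels are made up, and confusion_matrix here is my own small helper (rows are actual classes, columns are predicted classes), not the function PyCaret calls internally.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred):
    """Return [[TN, FP], [FN, TP]] for binary labels 0 and 1."""
    counts = Counter(zip(y_true, y_pred))
    # Keys are (actual, predicted) pairs; missing pairs count as 0.
    return [
        [counts[(0, 0)], counts[(0, 1)]],
        [counts[(1, 0)], counts[(1, 1)]],
    ]


# Made-up actual and predicted labels (1 = survived).
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 0, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
(tn, fp), (fn, tp) = cm

# Accuracy is the share of predictions on the diagonal.
accuracy = (tn + tp) / len(y_true)
print(cm)
print("accuracy:", accuracy)
```

Reading the plot the same way, the diagonal cells are the correct predictions and the off-diagonal cells are the two kinds of error.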
evaluate_model() lets you run several evaluations at once.
evaluate_model(tuned_rf)
Clicking a button in the yellow frame displays the corresponding evaluation result.
After finalizing the model with finalize_model(), make predictions with predict_model(). The test data (test.csv in this case) is used for prediction.
final_rf = finalize_model(tuned_rf)
data_unseen = pd.read_csv('test.csv')
result = predict_model(final_rf, data = data_unseen)
The Label column represents the result of the prediction.
I uploaded this result to Kaggle. The score was 0.76076.
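Kaggle's Titanic competition expects a CSV file with the two columns PassengerId and Survived. A minimal sketch of building that file from the prediction result follows. It assumes that result still carries the PassengerId column from test.csv next to the Label column; the small stand-in DataFrame and the file name submission.csv are my own choices for illustration.

```python
import pandas as pd

# Stand-in for the DataFrame returned by predict_model(); in the real run,
# use `result` directly. The values here are made up.
result = pd.DataFrame(
    {
        "PassengerId": [892, 893, 894],
        "Label": [0, 1, 0],
    }
)

# Keep only the ID and the prediction, and rename Label to the column
# name that Kaggle expects.
submission = (
    result[["PassengerId", "Label"]]
    .rename(columns={"Label": "Survived"})
    .astype({"Survived": int})
)

# Write without the index so the file has exactly two columns.
submission.to_csv("submission.csv", index=False)
print(submission.head())
```

If Label comes back as a string or float column, the astype call converts it to the integer 0/1 values of the submission format.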
1. PyCaret Home Page, http://www.pycaret.org/
2. PyCaret Classification, https://pycaret.org/classification/
3. I tried using PyCaret at the fastest speed, https://qiita.com/s_fukuzawa/items/5dd40a008dac76595eea
4. I tried to classify the quality of wine with PyCaret, https://qiita.com/kotai2003/items/c8fa7e55230d0fa0cc8e