I tried out PyCaret, the machine learning library released recently. It automates feature analysis of the data and performance comparison across multiple models, which should substantially reduce the time data scientists spend on this kind of work.
This time, I apply PyCaret to the Titanic survival prediction problem, submit the predictions to Kaggle, and check the resulting score.
**This is a follow-up to the previously published article, "I tried to classify wine quality with PyCaret."**
Run the command below to install it. I use Anaconda, and I created a virtual environment dedicated to PyCaret before installing. Installing into an existing conda-managed environment may cause an error (probably due to a conflict between pip and conda).
pip install pycaret
train.csv and test.csv can be downloaded from Kaggle's Titanic competition page. https://www.kaggle.com/c/titanic/data
import pandas as pd
train_data = pd.read_csv("train.csv")
train_data.head()
Results
Let's take a look at the contents of the data using pandas-profiling's profile_report().
import pandas_profiling
train_data.profile_report()
Results
Use setup() to preprocess the data, specifying Survived as the target variable in the arguments.
from pycaret.classification import *
exp_titanic = setup(data = train_data, target = 'Survived')
Result (up to 10 items)
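For reference, setup() also accepts optional arguments, for example to fix the random seed or to exclude columns from modeling. The snippet below is only a hedged sketch (the ignored columns are a hypothetical choice, not part of the run above); check the PyCaret documentation for the parameters supported by your version.
# Sketch: optional setup() arguments (verify against your PyCaret version)
exp_titanic = setup(
    data = train_data,
    target = 'Survived',
    session_id = 123,                      # fix the random seed for reproducibility
    ignore_features = ['Name', 'Ticket']   # hypothetical columns to exclude
)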
Use compare_models() to analyze the dataset with multiple classification models and summarize the results in a table. This is a very useful feature when deciding which classification model to use.
PyCaret provides more than 10 classification models, which can be confirmed at the link below.
https://pycaret.org/classification/
compare_models()
The accuracy of the CatBoost Classifier was 83.63%. This time, we evaluate PyCaret's performance using the Random Forest Classifier, which ranked 9th.
result
Select a classification model and train it with create_model(). This time, we use the Random Forest Classifier.
dt = create_model('rf', round=2)
result
The model is then tuned using tune_model().
tuned_rf = tune_model('rf', round=2)
result
The average accuracy before tuning was 0.80, and the average accuracy after tuning was 0.81.
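If you want to check which hyperparameters the tuning settled on, the tuned estimator can be inspected directly. This is a small sketch under the assumption that tune_model() returns a scikit-learn style estimator, so get_params() is available.
# Sketch: inspect the hyperparameters chosen by tune_model()
print(tuned_rf.get_params())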
Visualize the analysis results using plot_model().
First, plot the ROC curve (AUC).
plot_model(tuned_rf, plot = 'auc')
result
Then plot the confusion matrix.
plot_model(tuned_rf, plot = 'confusion_matrix')
result
Multiple evaluations can be viewed at the same time using evaluate_model().
evaluate_model(tuned_rf)
Pressing the buttons in the yellow frame displays each evaluation result.
result
After finalizing the model with finalize_model(), make predictions with predict_model(). The test data (test.csv in this case) is used for prediction.
final_rf = finalize_model(tuned_rf)
data_unseen = pd.read_csv('test.csv')
result = predict_model(final_rf, data = data_unseen)
The Label column represents the result of the prediction.
result
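The submission step itself is not shown above, so here is a hedged sketch of how the prediction result could be turned into the submission.csv format Kaggle expects (PassengerId and Survived columns). It assumes predict_model() returns the original test columns plus the Label column described above.
# Sketch: build a Kaggle submission file from the prediction result
submission = result[['PassengerId', 'Label']].rename(columns={'Label': 'Survived'})
submission['Survived'] = submission['Survived'].astype(int)   # Kaggle expects 0/1 integers
submission.to_csv('submission.csv', index=False)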
I uploaded this result to Kaggle. The score was 0.76076.
1. PyCaret Home Page, http://www.pycaret.org/
2. PyCaret Classification, https://pycaret.org/classification/
3. I tried using PyCaret at the fastest speed, https://qiita.com/s_fukuzawa/items/5dd40a008dac76595eea
4. I tried to classify the quality of wine with PyCaret, https://qiita.com/kotai2003/items/c8fa7e55230d0fa0cc8e