I tried to classify the quality of wine with PyCaret

Introduction

I tried PyCaret, a machine learning library that was released recently. It automates the analysis of data features and the performance comparison of multiple models, so I expect it to significantly reduce the time data scientists spend on this work.

1. Install PyCaret

Run the command below to install it. I use Anaconda, so I created and activated a virtual environment dedicated to PyCaret before installing. Installing into an existing Conda-managed environment may cause errors (probably because of conflicts between pip and conda).

pip install pycaret
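To confirm that the package landed in the environment you expect, a small check like the following may help. This is only a sketch: the helper name installed_version is made up for illustration, and it relies on the standard-library importlib.metadata module (Python 3.8 or later).

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("pycaret")
    if version is None:
        print("pycaret is not installed in this environment")
    else:
        print("pycaret version:", version)
```

Running it inside the dedicated environment and again outside it shows quickly whether pip installed into the place you intended.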

2. Data acquisition

This time we will use the Wine Quality dataset.

The dataset consists of 11 explanatory variables and 1 objective variable (quality) that represents the quality of the wine.

Explanatory variables

  1. fixed acidity
  2. volatile acidity
  3. citric acid
  4. residual sugar
  5. chlorides
  6. free sulfur dioxide
  7. total sulfur dioxide
  8. density
  9. pH
  10. sulphates
  11. alcohol

Objective variable

  12. quality (score between 0 and 10)

The dataset can be downloaded from the following site.

http://archive.ics.uci.edu/ml/datasets/Wine+Quality
http://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv

import pandas as pd

# The UCI file is semicolon-separated, so set the separator explicitly
dataset = pd.read_csv('winequality-white.csv', sep=';', encoding='utf-8')
dataset.head()

[Figure: output of dataset.head()]

Next, 95% of the rows are set aside as training data and the remaining 5% as test data (called unseen data).

# Sample the training rows, keeping the original index for now
data = dataset.sample(frac=0.95, random_state=786)
# Drop exactly the sampled rows; reset the indexes only after the drop
data_unseen = dataset.drop(data.index).reset_index(drop=True)
data = data.reset_index(drop=True)
print('Data for Modeling: ' + str(data.shape))
print('Unseen Data For Predictions: ' + str(data_unseen.shape))

Result

Data for Modeling: (4653, 12)
Unseen Data For Predictions: (245, 12)
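One detail worth checking in a split like this is that the hold-out rows really are absent from the training rows. The sketch below runs the same kind of split on a toy DataFrame (the column values are made up) and verifies that the two parts do not overlap. The rows are dropped by their original index labels, and the indexes are reset only afterwards.

```python
import pandas as pd

# Toy stand-in for the wine table: 100 rows with a default 0..99 index.
toy = pd.DataFrame({
    "alcohol": range(100),                          # unique values, used as row IDs
    "quality": [i % 7 + 3 for i in range(100)],     # made-up scores
})

# Sample first and keep the original index labels for now.
train = toy.sample(frac=0.95, random_state=786)

# Drop exactly the sampled rows, using their original labels.
unseen = toy.drop(train.index)

# Only now renumber both parts from 0.
train = train.reset_index(drop=True)
unseen = unseen.reset_index(drop=True)

# No row of the hold-out set should appear in the training set.
overlap = set(train["alcohol"]) & set(unseen["alcohol"])
print(train.shape, unseen.shape, len(overlap))
```

If the index were reset before the drop, the labels passed to drop() would no longer identify the sampled rows, and the two parts could share rows.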

3. Data preprocessing

Use setup() to preprocess the data. Pass quality as the target argument to specify the objective variable.

from pycaret.classification import *
exp_clf101 = setup(data = data, target = 'quality', session_id=123) 

Result (first 10 items)

[Figure: setup() output]

4. Model comparison

Use compare_models() to fit multiple classification models to the dataset and summarize the results in a table. This is very useful when deciding which classification model to use.

PyCaret provides more than 10 classification models. The full list is available at the link below.

https://pycaret.org/classification/

compare_models()

The Extra Trees Classifier reached an accuracy of 65.95%. By comparison, analyzing the same data with only an SVM from the start gives an accuracy of 32.03% (see "Wine quality data, SVM analysis case").

Result

[Figure: compare_models() output]

5. Generation of analytical model

Select a classification model and train it with create_model(). This time, we will use the Light Gradient Boosting Machine (LightGBM) model.

lightgbm = create_model('lightgbm')

Result

[Figure: create_model() output]

6. Tuning the analytical model

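Tune the model's hyperparameters with tune_model(). In the PyCaret release used here, tune_model() takes the model ID string; PyCaret 2.0 and later take the trained model object instead, for example tune_model(lightgbm).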

tuned_lightgbm = tune_model('lightgbm')

Result

The average accuracy before tuning was 0.6393, and the average accuracy after tuning was 0.6414.

[Figure: tune_model() output]
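The "average accuracy" above can be read as the mean of the per-fold scores from cross-validation. The sketch below illustrates that idea in plain Python, assuming 10 folds. The helper names, the labels, and the majority-class baseline are all made up for illustration, and PyCaret's own fold handling may differ (for example by shuffling or stratifying).

```python
from statistics import mean


def k_fold_indices(n_samples, n_folds):
    """Yield (train_idx, valid_idx) lists for contiguous, non-shuffled folds."""
    fold_sizes = [n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        valid = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, valid
        start += size


def majority_class_accuracy(labels, train_idx, valid_idx):
    """Score a baseline that always predicts the most common training label."""
    train_labels = [labels[i] for i in train_idx]
    majority = max(set(train_labels), key=train_labels.count)
    hits = sum(1 for i in valid_idx if labels[i] == majority)
    return hits / len(valid_idx)


# Made-up quality labels, purely for illustration.
labels = [6, 5, 6, 7, 6, 5, 6, 6, 5, 7] * 5

fold_scores = [majority_class_accuracy(labels, tr, va)
               for tr, va in k_fold_indices(len(labels), n_folds=10)]
print("per-fold accuracy:", fold_scores)
print("mean accuracy:", round(mean(fold_scores), 4))
```

A difference as small as 0.6393 versus 0.6414 is best compared against the spread of the per-fold scores before concluding that tuning helped.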

7. Visualization of analytical model

Visualize the analysis results with plot_model().

First, plot the ROC curve together with its AUC.

plot_model(tuned_lightgbm, plot = 'auc')

Result

[Figure: plot_model() output, plot = 'auc']
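For a multiclass target such as quality, an AUC value is commonly computed for one class against all the others. The sketch below shows one way to obtain such a value from predicted scores, using the interpretation of AUC as the probability that a randomly chosen positive example is scored higher than a randomly chosen negative one. The function name and the numbers are made up, and this is not a description of how plot_model() computes its curves internally.

```python
def roc_auc(scores, is_positive):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, p in zip(scores, is_positive) if p]
    negatives = [s for s, p in zip(scores, is_positive) if not p]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Made-up predicted probabilities for the class "quality == 6" and the true labels.
true_quality = [6, 5, 6, 7, 5, 6]
prob_quality_6 = [0.8, 0.3, 0.6, 0.7, 0.2, 0.4]

auc = roc_auc(prob_quality_6, [q == 6 for q in true_quality])
print("one-vs-rest AUC for quality 6:", round(auc, 3))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the class from the rest.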

Then plot the confusion matrix.

plot_model(tuned_lightgbm, plot = 'confusion_matrix')

Result

[Figure: plot_model() output, plot = 'confusion_matrix']
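To make the reading of a confusion matrix concrete, here is a minimal sketch that builds one by hand with the standard library. The labels are made up for illustration, and the function name confusion_matrix is only a local helper, not a PyCaret call. Rows are actual classes and columns are predicted classes.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Return (labels, matrix) where matrix[i][j] counts actual labels[i] predicted as labels[j]."""
    labels = sorted(set(actual) | set(predicted))
    counts = Counter(zip(actual, predicted))
    matrix = [[counts[(a, p)] for p in labels] for a in labels]
    return labels, matrix


# Made-up labels for illustration only.
actual = [5, 5, 6, 6, 6, 7, 7, 5]
predicted = [5, 6, 6, 6, 5, 7, 6, 5]

labels, matrix = confusion_matrix(actual, predicted)
print("labels:", labels)
for label, row in zip(labels, matrix):
    print(label, row)
```

The diagonal holds the correct predictions, so the off-diagonal cells show which quality scores the model confuses with each other.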

8. Evaluation of analytical model

evaluate_model() makes it possible to view multiple evaluations in one place.

evaluate_model(tuned_lightgbm)

Clicking one of the buttons (framed in yellow in the screenshot) displays the corresponding evaluation result.

Result

[Figure: evaluate_model() output]

9. Prediction

After finalizing the model with finalize_model(), make predictions with predict_model(). For the predictions, we use the test data (here, data_unseen).

final_lightgbm = finalize_model(tuned_lightgbm)
unseen_predictions = predict_model(final_lightgbm, data=data_unseen)
unseen_predictions.head()

The last column, Label, holds the predicted values.

Result

[Figure: output of unseen_predictions.head()]
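Because the unseen data still contains the true quality values, the predictions can be checked against them. The sketch below does this on a made-up stand-in for the returned frame. It assumes the prediction column is named Label, as in the output described above, and that both columns can be cast to integers.

```python
import pandas as pd

# Made-up stand-in for the frame returned by predict_model():
# the original 'quality' column plus the predicted 'Label' column.
toy_predictions = pd.DataFrame({
    "quality": [6, 5, 6, 7, 5, 6, 6, 8],
    "Label":   [6, 6, 6, 6, 5, 6, 5, 7],
})

# Cast both columns to int in case they come back with different dtypes.
is_correct = toy_predictions["quality"].astype(int) == toy_predictions["Label"].astype(int)

accuracy = is_correct.mean()
print("accuracy on unseen data:", accuracy)

# Rows where the prediction missed, useful for a quick look at the errors.
print(toy_predictions[~is_correct])
```

With the real unseen_predictions frame in place of toy_predictions, the same lines give the accuracy on the 245 held-out rows.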

10. Summary

  1. We analyzed a wine quality dataset with PyCaret.
  2. It is very easy to use. I think its analysis capabilities are comparable to those of the commercial analysis tools Alteryx and DataRobot.
  3. Next time, I would like to tackle a regression problem with PyCaret.

I wrote a follow-up article. If you like, please read it as well: "I tried to predict Titanic survival with PyCaret"

10.1 List of PyCaret functions used for analysis

  1. Data preprocessing: setup()
  2. Compare models: compare_models()
  3. Generate analytical model: create_model()
  4. Tuning: tune_model()
  5. Visualization: plot_model()
  6. Evaluation: evaluate_model()
  7. Prediction: finalize_model(), predict_model()

11. References

  1. PyCaret Home Page, http://www.pycaret.org/
  2. Wine Quality Dataset, http://archive.ics.uci.edu/ml/datasets/Wine+Quality
  3. PyCaret Classification, https://pycaret.org/classification/
  4. I tried using PyCaret at the fastest speed, https://qiita.com/s_fukuzawa/items/5dd40a008dac76595eea
  5. [Python] Judging the quality of wine by machine learning, https://ymgsapo.com/2019/01/06/ai-wine-quality/
