Until now I have used pandas-profiling for EDA, but I kept seeing Sweetviz mentioned here and there, so I tried it.
I used the Titanic dataset.
EDA stands for **Exploratory Data Analysis**. When carrying out data analysis work such as machine learning, it is performed in order to understand the data:
- Data visualization
- Understanding the characteristics of the data
- Understanding the relationships within the data

EDA is a deep topic once you look into the details, so I will skip them in this article. Please refer to the following articles.
- [Introduction to Data Scientists] Let's try basic operations of exploratory data analysis (EDA) using Python
- What is EDA (Exploratory Data Analysis)?
- Exploratory Data Analysis (EDA)
Sweetviz is a library that semi-automates many of the tasks involved in EDA. Let's go straight to an example run on the Titanic data.
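For reference, a minimal sketch of such a run is shown below. It is not necessarily the code used for this article: the function name `make_sweetviz_report`, the display names "Train" / "Test", and the output file name are assumptions made for illustration, while `sv.compare` and `show_html` are Sweetviz's documented entry points.

```python
def make_sweetviz_report(train_df, test_df, target="Survived",
                         out_path="sweetviz_report.html"):
    """Build a Sweetviz report comparing training data and inference data."""
    # Imported lazily so this function can be defined even where sweetviz
    # is not installed; the import only happens when the function is called.
    import sweetviz as sv

    # compare() takes [dataframe, display name] pairs.
    # target_feat is the objective variable (present only in the training data).
    report = sv.compare([train_df, "Train"], [test_df, "Test"], target_feat=target)

    # Write a standalone HTML file without opening a browser window.
    report.show_html(out_path, open_browser=False)
    return out_path
```

With the Kaggle Titanic files loaded through `pd.read_csv("train.csv")` and `pd.read_csv("test.csv")` (file names assumed), calling `make_sweetviz_report(train, test)` should write the report to `sweetviz_report.html`.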
Running it produces an HTML report. Let's look at its contents in three parts.
In part (1), you can check the **characteristics of the entire data** and the **correlation coefficients**. One of the big advantages of Sweetviz overall is that you can see the **training data and inference data separately**. This part of the report shows the overall characteristics **for each of the training data and the inference data**.
You can also check the correlation coefficients by pressing the **Associations** button.
The above is an example with the training data, but the same view is available for the inference data. By looking at how the correlation coefficients differ between the training data and the inference data, you may be able to guess whether their distributions differ.
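The idea of comparing correlations between the two datasets can be sketched in plain pandas as follows. This is only a rough analogue that uses Pearson correlation on numeric columns, not what Sweetviz computes internally; the function name `correlation_gap` and the toy values are made up for illustration.

```python
import pandas as pd


def correlation_gap(train_df, test_df):
    """Absolute difference between the train and test correlation matrices."""
    # Only numeric columns that exist in both frames can be compared.
    test_numeric = set(test_df.select_dtypes("number").columns)
    shared = [c for c in train_df.select_dtypes("number").columns
              if c in test_numeric]
    return (train_df[shared].corr() - test_df[shared].corr()).abs()


# Toy data (not the real Titanic values).
train = pd.DataFrame({"Age": [22, 38, 26, 35],
                      "Fare": [7.0, 71.0, 8.0, 53.0],
                      "Survived": [0, 1, 1, 1]})
test = pd.DataFrame({"Age": [34, 47, 62, 27],
                     "Fare": [8.0, 7.0, 10.0, 9.0]})

gap = correlation_gap(train, test)
print(gap)
```

A large entry in `gap` means that the relationship between two features differs between the datasets, which is one hint that their distributions may differ.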
In part (2), you can check the following.
- Distribution of the objective variable (Survived)
- Distribution of the explanatory variables
- **Positive rate (the ratio of rows where the objective variable is 1)**
- **Comparison of the above three between the training data and the inference data**
Being able to see the distributions is a given, but being able to see the **"positive rate"** and the **"comparison of training data and inference data"** is very convenient. By looking at these, you can estimate the following:
- **Roughly how accurate the AI's predictions are likely to be**
- **Whether there is a problem with how the training data and inference data were acquired, or with how many records there are** (in many cases the timing and the users differ between the two, but if the distributions are reasonably similar, you can judge that there is no problem with the data acquisition flow)
- **Which values of the explanatory variables have a high positive rate**

In other words, you can anticipate quite a lot in advance, before implementing AI algorithms and calculating predictive accuracy or explainability with tools such as LIME / SHAP. If you can anticipate the results beforehand, you will not blindly believe what the AI outputs, and it becomes easier to interpret the results.
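To make the two quantities concrete, here is a minimal pandas sketch of the positive rate per value and of the train/test share comparison. It assumes a 0/1 objective variable; the helper names `positive_rate` and `share_by_value` and the toy rows are illustrative, not taken from Sweetviz.

```python
import pandas as pd


def positive_rate(df, feature, target="Survived"):
    """Share of rows with target == 1 for each value of `feature`."""
    # For a 0/1 target, the mean within each group is the positive rate.
    return df.groupby(feature)[target].mean()


def share_by_value(train_df, test_df, feature):
    """Relative frequency of each value of `feature` in train and test."""
    return pd.DataFrame({
        "train": train_df[feature].value_counts(normalize=True),
        "test": test_df[feature].value_counts(normalize=True),
    }).fillna(0.0)  # a value missing from one dataset has share 0


# Toy data (not the real Titanic values).
train = pd.DataFrame({"Sex": ["male", "female", "female", "male"],
                      "Survived": [0, 1, 1, 0]})
test = pd.DataFrame({"Sex": ["male", "male", "female", "male"]})

rates = positive_rate(train, "Sex")
shares = share_by_value(train, test, "Sex")
print(rates)
print(shares)
```

If the `train` and `test` columns of `shares` are far apart for an important feature, that is a reason to question the data acquisition flow before trusting any model score.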
In part (3), you can check a little more detailed information about each feature.
For example, in addition to the information in part (2), the following are displayed.
- Missing rate
- Features with a high correlation coefficient
- Most frequent values
- Values in descending order
These are standard items. pandas-profiling only shows the top 5 of the **values in descending order**, so Sweetviz is useful if you want to see a few more, or if there are 5 or more outliers. Still, the real **benefits of using Sweetviz** seem to be the two items that were also shown in part (2):
- **Positive rate (the ratio of rows where the objective variable is 1)**
- **Comparison of training data and inference data**
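The per-feature items listed for part (3), such as the missing rate, the most frequent values, and the largest values, can be approximated in plain pandas as in the sketch below. The helper name `feature_details` and the sample values are made up; the parameter `n` plays the role of "how many rows to show", which is where the two libraries differ.

```python
import pandas as pd


def feature_details(series, n=10):
    """Missing rate, most frequent values, and largest values of one column."""
    return {
        "missing_rate": series.isna().mean(),
        # value_counts() sorts by count, most frequent first.
        "most_frequent": series.value_counts().head(n),
        "largest": series.dropna().sort_values(ascending=False).head(n),
    }


# Toy data (not the real Titanic values).
fare = pd.Series([7.25, 71.28, None, 8.05, 8.05, 512.33], name="Fare")
details = feature_details(fare, n=3)
print(details["missing_rate"])
print(details["largest"].tolist())
```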
The code and the HTML output by Sweetviz are in the following GitHub repository. You can just look at the HTML, and the code is easy to run as well. https://github.com/yuomori0127/sweetviz_titanic
Official page: Sweetviz
Let's also look at **pandas-profiling**, another library for EDA. A Titanic example using **pandas-profiling** is published on Colab.
https://colab.research.google.com/github/pandas-profiling/pandas-profiling/blob/master/examples/titanic/titanic.ipynb
You get all of this just by passing in the data, which I think is more than enough. In fact, I have used it a lot.
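"Just by passing in the data" looks roughly like the sketch below. The wrapper name `make_profile_report`, the report title, and the output file name are assumptions for illustration. Note also that newer releases of the library are distributed under the name ydata-profiling, in which case the import line would need to change accordingly.

```python
def make_profile_report(df, out_path="pandas_profiling_report.html"):
    """Write a pandas-profiling HTML report for one dataframe."""
    # Imported lazily so this function can be defined even where
    # pandas-profiling is not installed.
    from pandas_profiling import ProfileReport

    profile = ProfileReport(df, title="Titanic profile")
    profile.to_file(out_path)
    return out_path
```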
The benefit of pandas-profiling that Sweetviz lacks is that **it suggests explanatory variables to delete when preprocessing the data**. It flags variables with, among other things:
- High cardinality (number of distinct values)
- Many missing values
- Many zeros
- A high correlation coefficient

This is a very convenient feature that Sweetviz does not have: you get these suggestions without having to look at the plots and distributions one by one and decide on thresholds yourself.
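To show what such suggestions amount to, here is a hand-rolled sketch of the four checks. It is not pandas-profiling's implementation: the function name `flag_columns` and all threshold values are hypothetical and chosen only for illustration.

```python
import pandas as pd


def flag_columns(df, max_cardinality=50, max_missing=0.5,
                 max_zeros=0.5, max_corr=0.9):
    """Return {column: [reasons]} for columns crossing a (hypothetical) threshold."""
    flags = {}
    for col in df.columns:
        reasons = []
        numeric = pd.api.types.is_numeric_dtype(df[col])
        # Cardinality is only checked for non-numeric (categorical-like) columns.
        if not numeric and df[col].nunique() > max_cardinality:
            reasons.append("high cardinality")
        if df[col].isna().mean() > max_missing:
            reasons.append("many missing values")
        if numeric and (df[col] == 0).mean() > max_zeros:
            reasons.append("many zeros")
        if reasons:
            flags[col] = reasons

    # For each highly correlated numeric pair, flag the later column.
    corr = df.select_dtypes("number").corr().abs()
    cols = list(corr.columns)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if corr.loc[a, b] > max_corr:
                flags.setdefault(b, []).append(f"highly correlated with {a}")
    return flags


# Toy data (not the real Titanic values).
df = pd.DataFrame({
    "Ticket": ["A1", "B2", "C3", "D4"],
    "Cabin": [None, None, None, "C85"],
    "Parch": [0, 0, 0, 1],
    "Fare": [7.0, 8.0, 9.0, 10.0],
    "FareCopy": [14.0, 16.0, 18.0, 20.0],
})
flags = flag_columns(df, max_cardinality=3)
print(flags)
```

Writing the checks out by hand makes it clear that every flag depends on a threshold someone has to choose, so each suggestion is a starting point for inspection rather than a decision.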
Official page: pandas-profiling
I made a comparison table of Sweetviz and pandas-profiling.
Both have the basic EDA functions, but the details differ a little. I can't list every feature, so this is a fairly selective extract.
# | Comparison items | Sweetviz | pandas-profiling |
---|---|---|---|
1 | Display of distribution | 〇 | 〇 |
2 | Display of basic statistics | 〇 | 〇 |
3 | Display of missing rate | 〇 | 〇 |
4 | Display of correlation coefficient | 〇 | 〇 |
5 | Data display in order of frequency | 〇 | 〇 |
6 | Data display in order of value | 〇 | △ (top 5 only) |
7 | Display of positive rate | 〇 | × |
8 | Comparison of training data and inference data | 〇 | × |
9 | Suggest explanatory variables to delete | × | 〇 |
Personally, I recommend **Sweetviz**.
After all, **"display of positive rate"** and **"comparison of training data and inference data"** are very convenient. The advantage of pandas-profiling is **"suggestion of explanatory variables to delete"**. It is of course a great feature, but **you cannot simply apply the suggestions without looking at the data and thinking for yourself**, and since I can't tell whether a suggestion is valid, I end up not relying on it. That said, I do appreciate its **potential to prevent oversights**.
It's hard to say which one is better, but both are easy to run, so please try both and use the one that suits you!