Processing Titanic data with the preprocessing library DataLiner (Drop edition)

Introduction

In this article I would like to try out DataLiner, a library that makes data preprocessing easy, on the Titanic dataset famous from Kaggle. (Data: https://www.kaggle.com/c/titanic/data) Incidentally, some features have been removed from Kaggle's version of the Titanic data; the original dataset here contains a few more. (Data: https://www.openml.org/d/40945)

Let's get started. This time we will cover the DropXX family of transformers, and at the end we will also show how to run them through a scikit-learn Pipeline.

Installation

! pip install -U dataliner

Titanic data

First, load train.csv.

import pandas as pd
import dataliner as dl

df = pd.read_csv('train.csv')
target_col = 'Survived'

X = df.drop(target_col, axis=1)
y = df[target_col]

The loaded training data is the familiar Titanic DataFrame (PassengerId, Pclass, Name, Sex, Age, and so on).

DropColumns

Simply deletes the specified columns. For example, PassengerId, which is just a sequential passenger number, is not very useful for modeling, so let's delete it.

trans = dl.DropColumns('PassengerId')
trans.fit_transform(X)

The PassengerId column has been removed. It is also possible to pass a list and delete multiple columns at the same time.
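For example, a minimal sketch dropping two columns at once (the choice of Ticket as the second column is just for illustration):

trans = dl.DropColumns(['PassengerId', 'Ticket'])
trans.fit_transform(X)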

Of course you could achieve the same thing by calling df.drop on combined train and test data, but this approach is less error-prone and makes the intent of each step clear. For example, it can be applied to the test data as follows.

X_test = pd.read_csv('test.csv')
trans.transform(X_test)


In addition, as described later, combining these transformers with a scikit-learn pipeline lets you assemble the data preprocessing and feature engineering flow efficiently and at a high level of abstraction.

DropNoVariance

Removes features that have no variance, i.e. columns that contain only a single value. The Titanic data has no such feature, so let's create one first.

X['Test_Feature'] = 1  # constant column: every row holds the same value


Now, let's apply DropNoVariance.

trans = dl.DropNoVariance()
trans.fit_transform(X)


The Test_Feature column has been removed. DropNoVariance works on both numeric and categorical columns, and if several columns qualify, all of them are dropped.
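As a quick sanity check, the same zero-variance condition can be inspected with plain pandas (a sketch of the idea, not DataLiner's internals):

X.columns[X.nunique(dropna=False) <= 1]  # columns holding a single value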

DropHighCardinality

Deletes columns with a very large number of categories. When there are too many categories it becomes hard to encode them properly, so dropping such columns quickly is a handy way to get a first pipeline working end to end. (In Kaggle competitions and the like, you would instead lower the cardinality by extracting information from these columns and grouping the values.)

trans = dl.DropHighCardinality()
trans.fit_transform(X)


If you want, you can also see which columns have been deleted as follows:

trans.drop_columns

array(['Name', 'Ticket', 'Cabin'], dtype='<U6')

You can see that the features with a very large number of categories have been deleted.
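To confirm this, you can count the unique values per column with plain pandas (independent of DataLiner):

X.nunique().sort_values(ascending=False)

Name, Ticket, and Cabin should appear near the top.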

DropLowAUC

Fits a logistic regression of the target variable against each single feature and removes the features whose AUC falls below a given threshold. This can be used for feature selection. Categorical variables are converted into dummy variables internally before the regression. A threshold of around 0.55 is usually recommended, but this time we will set it higher for clarity.

trans = dl.DropLowAUC(threshold=0.65)
trans.fit_transform(X, y)


Only Pclass, Sex, and Fare remain, all of which are known to correlate strongly with the target variable in the Titanic data. This transformer is particularly effective after a step such as OneHotEncoding has made the number of dimensions explode.
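To get a feel for what DropLowAUC computes, here is a rough sketch of the per-feature AUC idea using scikit-learn directly (an illustration of the concept, not the library's actual implementation):

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

for col in ['Pclass', 'Sex', 'Fare']:
    feature = pd.get_dummies(X[[col]])        # dummy-encode if categorical
    feature = feature.fillna(feature.mean())  # naive NaN handling for this sketch
    model = LogisticRegression(max_iter=1000).fit(feature, y)
    auc = roc_auc_score(y, model.predict_proba(feature)[:, 1])
    print(col, round(auc, 3))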

DropHighCorrelation

Identifies pairs of features with a high Pearson correlation coefficient, keeps only the one that correlates more strongly with the target variable, and deletes the rest. Features that are exact duplicates of one another are also removed at the same time. With gradient-boosted trees you don't have to worry much about correlated features, but for linear regression you should remove them even when using regularization.

trans = dl.DropHighCorrelation(threshold=0.5)
trans.fit_transform(X, y)

Fare has been dropped.
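To see why Fare was the one chosen, you can inspect the pairwise Pearson correlations of the numeric columns with plain pandas (again just a sketch, not DataLiner's internals):

X.select_dtypes('number').corr().round(2)

Fare and Pclass are strongly (negatively) correlated, and since Pclass relates more strongly to the target, Fare is the one removed.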

Used with Pipeline

This is where this style of preprocessing really pays off. Let's run the five Drop transformers introduced in this article as a single pipeline.

from sklearn.pipeline import make_pipeline

process = make_pipeline(
    dl.DropColumns('PassengerId'),
    dl.DropNoVariance(),
    dl.DropHighCardinality(),
    dl.DropLowAUC(threshold=0.65),
    dl.DropHighCorrelation(),
)

process.fit_transform(X, y)

This applies all the steps introduced above in one go.

Let's apply this to the test data.

X_test = pd.read_csv('test.csv')
process.transform(X_test)

The same preprocessing that was fitted on the training data is applied to the test data in no time. If you then save the fitted pipeline as a pickle, it is ready for reuse or for deployment in a service!
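For instance, a minimal sketch of that pickling step (the file name is arbitrary):

import pickle

with open('preprocess_pipeline.pkl', 'wb') as f:
    pickle.dump(process, f)

with open('preprocess_pipeline.pkl', 'rb') as f:
    process_loaded = pickle.load(f)

process_loaded.transform(X_test)  # same result as above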

In conclusion

This time I introduced the Drop transformers of DataLiner. Next, I would like to cover the Encoding transformers.

DataLiner release article: https://qiita.com/shallowdf20/items/36727c9a18f5be365b37
GitHub: https://github.com/shallowdf20/dataliner
PyPI: https://pypi.org/project/dataliner/
