In this post I will try out the DataLiner library, which makes data preprocessing easy, using the Titanic dataset famous from Kaggle. (Data: https://www.kaggle.com/c/titanic/data) Note that some features have been removed from Kaggle's version of the Titanic data; the dataset here appears to contain a few additional features. (Data: https://www.openml.org/d/40945)
Let's get started. This time I will introduce the Drop-type transformers, and at the end I will also show how to run them through a Pipeline.
! pip install -U dataliner
First, load train.csv.
import pandas as pd
import dataliner as dl
df = pd.read_csv('train.csv')
target_col = 'Survived'
X = df.drop(target_col, axis=1)
y = df[target_col]
The loaded training data looks like this.
DropColumns simply deletes the specified columns. For example, PassengerId, which is just a passenger number, is not very useful for modeling, so let's delete it.
trans = dl.DropColumns('PassengerId')
trans.fit_transform(X)
PassengerId has been removed. You can also pass a list to delete multiple columns at once.
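For reference, here is a pandas-only sketch on synthetic data of what passing a list achieves (the column values here are made up; the equivalent DataLiner call would be `dl.DropColumns(['PassengerId', 'Name'])`):

```python
import pandas as pd

# Tiny stand-in for the Titanic frame (hypothetical values)
X_demo = pd.DataFrame({
    'PassengerId': [1, 2, 3],
    'Name': ['A', 'B', 'C'],
    'Fare': [7.25, 71.28, 7.92],
})

# Same effect as dl.DropColumns(['PassengerId', 'Name']).fit_transform(X_demo)
X_dropped = X_demo.drop(['PassengerId', 'Name'], axis=1)
print(X_dropped.columns.tolist())  # ['Fare']
```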
Of course, you could just call df.drop on the combined train and test data, but this approach is less error-prone and makes the intent explicit. For example, it can be applied to the test data as follows.
X_test = pd.read_csv('test.csv')
trans.transform(X_test)
Furthermore, as described later, combining it with the scikit-learn Pipeline lets you assemble the preprocessing and feature engineering flow efficiently and abstractly.
DropNoVariance removes features that have no variance, i.e. columns containing only a single value. The Titanic data has no such feature, so let's create one first.
X['Test_Feature'] = 1
Now, let's apply DropNoVariance.
trans = dl.DropNoVariance()
trans.fit_transform(X)
The feature is gone. It works on both numeric and categorical columns, and if several columns qualify, all of them are deleted.
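What DropNoVariance does can be sketched with plain pandas: keep only columns that take more than one distinct value. A minimal version on synthetic data (hypothetical values):

```python
import pandas as pd

X_demo = pd.DataFrame({
    'Pclass': [1, 3, 2],
    'Embarked': ['S', 'C', 'S'],
    'Test_Feature': [1, 1, 1],   # constant column: no variance
})

# Keep only columns with more than one distinct value
X_kept = X_demo.loc[:, X_demo.nunique() > 1]
print(X_kept.columns.tolist())  # ['Pclass', 'Embarked']
```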
DropHighCardinality deletes columns with a very large number of categories. Too many categories are hard to encode properly, so it can be a good idea to drop them early just to get a first end-to-end run working. (In Kaggle and the like, you would instead lower the cardinality by extracting information from these columns and grouping them.)
trans = dl.DropHighCardinality()
trans.fit_transform(X)
If you want, you can also see which columns have been deleted as follows:
trans.drop_columns
array(['Name', 'Ticket', 'Cabin'], dtype='<U6')
You can see that the features with a very large number of categories have been deleted.
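As an aside, the grouping approach mentioned above can be sketched in pandas: instead of dropping Name outright, extract the title to obtain a low-cardinality feature. This is a common Titanic trick; the regex below is illustrative, not part of DataLiner:

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Cumings, Mrs. John Bradley',
    'Heikkinen, Miss. Laina',
])

# Extract the title between the comma and the period
titles = names.str.extract(r',\s*([^.]+)\.', expand=False)
print(titles.tolist())  # ['Mr', 'Mrs', 'Miss']
```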
DropLowAUC fits a logistic regression of the objective variable on each single feature and removes features whose AUC falls below a threshold. It can be used for feature selection. Categorical variables are internally converted to dummy variables before the logistic regression. A threshold of about 0.55 is usually recommended, but here we set it higher for clarity.
trans = dl.DropLowAUC(threshold=0.65)
trans.fit_transform(X, y)
Only Pclass, Sex, and Fare remain, all of which are known to correlate strongly with the objective variable in the Titanic data. This transformer is effective after OneHotEncoding or similar causes the number of dimensions to explode.
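The idea behind DropLowAUC can be sketched without the library: for each feature, measure how well it alone ranks the target. A minimal version on synthetic data (the `single_feature_auc` helper is hypothetical; the library fits a logistic regression, but for a single numeric feature a rank-based AUC gives the same ordering):

```python
import pandas as pd

def single_feature_auc(x, y):
    """AUC of one numeric feature against a binary target, computed
    from rank statistics (Mann-Whitney U)."""
    ranks = x.rank(method='average')
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    u = ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2
    auc = u / (n_pos * n_neg)
    return max(auc, 1 - auc)  # direction does not matter for filtering

# Synthetic data: 'Good' separates the target cleanly, 'Noise' does not
X_demo = pd.DataFrame({'Good': [1, 2, 3, 10, 11, 12],
                       'Noise': [5, 1, 9, 2, 8, 3]})
y_demo = pd.Series([0, 0, 0, 1, 1, 1])

aucs = {col: single_feature_auc(X_demo[col], y_demo) for col in X_demo}
keep = [col for col, auc in aucs.items() if auc >= 0.65]
print(aucs)   # 'Good' scores 1.0, 'Noise' stays near 0.5
print(keep)   # only 'Good' survives the 0.65 threshold
```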
DropHighCorrelation identifies pairs of features with a high Pearson correlation coefficient, keeps the one more correlated with the objective variable, and deletes the rest. Features with exactly identical content are also deduplicated at the same time. With boosting trees you don't need to worry much about this, but for linear regression it is worth deleting correlated features even when using regularization.
trans = dl.DropHighCorrelation(threshold=0.5)
trans.fit_transform(X, y)
Fare has been removed.
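What DropHighCorrelation does can be sketched with plain pandas on synthetic data (the values and the 0.8 threshold are made up for this toy example):

```python
import pandas as pd

X_demo = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3],
    'Fare':   [80, 40, 30, 25, 8, 7],   # strongly correlated with Pclass
    'Age':    [40, 28, 35, 22, 30, 19],
})
y_demo = pd.Series([1, 1, 1, 0, 0, 0])

pair_corr = X_demo.corr().abs()
target_corr = X_demo.corrwith(y_demo).abs()

drop = set()
cols = list(X_demo.columns)
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        if pair_corr.loc[a, b] > 0.8:  # highly correlated pair
            # keep whichever correlates more strongly with the target
            drop.add(a if target_corr[a] < target_corr[b] else b)

X_reduced = X_demo.drop(columns=sorted(drop))
print(sorted(drop))                 # ['Fare']
print(X_reduced.columns.tolist())   # ['Pclass', 'Age']
```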
Finally, the Pipeline. This is the biggest advantage of preprocessing in this style. Let's chain the five Drop transformers introduced above into a single pipeline.
from sklearn.pipeline import make_pipeline
process = make_pipeline(
dl.DropColumns('PassengerId'),
dl.DropNoVariance(),
dl.DropHighCardinality(),
dl.DropLowAUC(threshold=0.65),
dl.DropHighCorrelation(),
)
process.fit_transform(X, y)
All of the processing introduced above is applied in one go, with the following result.
Let's apply this to the test data.
X_test = pd.read_csv('test.csv')
process.transform(X_test)
The same processing applied during training was applied to the test data in no time. If you then save the pipeline with pickle, it is ready for reuse or for deployment in a service!
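A minimal sketch of saving and restoring a fitted pipeline with pickle, using a stand-in scikit-learn pipeline in place of the `process` object fitted above:

```python
import pickle
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in for the fitted DataLiner pipeline from the article
demo_process = make_pipeline(StandardScaler())
X_demo = pd.DataFrame({'Fare': [7.25, 71.28, 7.92, 53.10]})
demo_process.fit(X_demo)

# Save the fitted pipeline...
with open('process.pkl', 'wb') as f:
    pickle.dump(demo_process, f)

# ...and load it back later for inference on new data
with open('process.pkl', 'rb') as f:
    loaded = pickle.load(f)
```

Any scikit-learn-compatible pipeline, including one built from DataLiner transformers, can be persisted this way.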
That's it for the Drop transformers in DataLiner. Next time I would like to introduce the Encoding transformers.
DataLiner release article: https://qiita.com/shallowdf20/items/36727c9a18f5be365b37 GitHub: https://github.com/shallowdf20/dataliner PyPI: https://pypi.org/project/dataliner/