I made DataLiner, a data preprocessing library for machine learning.
It packages up the steps I use in the data-processing / feature-engineering stage of machine learning modeling into a single preprocessing library. Each class follows scikit-learn's transformer API, so it can be fit_transformed on its own or dropped into a Pipeline. There are still functions and preprocessing steps it does not cover, so I will keep updating it regularly, and bug reports, fixes, and pull requests for new functions and new preprocessing steps would be very encouraging.
GitHub: https://github.com/shallowdf20/dataliner
PyPI: https://pypi.org/project/dataliner/
Documentation: https://shallowdf20.github.io/dataliner/preprocessing.html
Install it with pip. If you built your Python environment with Anaconda, run the following in Anaconda Prompt.
pip install -U dataliner
Let's use everyone's favorite Titanic dataset as an example. Note that X must be a pandas.DataFrame and y a pandas.Series; any other data type raises an error. Now, prepare the data to be processed.
import pandas as pd
import dataliner as dl
df = pd.read_csv('train.csv')
target_col = 'Survived'
X = df.drop(target_col, axis=1)
y = df[target_col]
The familiar Titanic data is now stored in X.
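For reference, you can confirm the columns like this (the list below assumes the standard Kaggle train.csv):
print(X.columns.tolist())
# ['PassengerId', 'Pclass', 'Name', 'Sex', 'Age',
#  'SibSp', 'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked']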
Now let's try it out. First, DropHighCardinality, which automatically drops features with too many categories.
dhc = dl.DropHighCardinality()
dhc.fit_transform(X)
You can see that high-cardinality features such as Name and Ticket have been dropped. As an aside, for Titanic you would normally squeeze information out of these columns to improve accuracy.
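As a hypothetical illustration of that aside (plain pandas, not a DataLiner feature), the honorific buried in Name can be extracted before the column is dropped:
# Hypothetical example: pull the honorific (Mr, Mrs, Miss, ...) out of Name
# as a new low-cardinality categorical feature.
X['Title'] = X['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)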
Next, let's try the familiar target encoding. This version smooths each category mean using the overall mean of y as a prior probability.
tme = dl.TargetMeanEncoding()
tme.fit_transform(X, y)
It automatically recognized the categorical columns and encoded each category using the target variable.
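As a rough sketch of the smoothing idea (illustrative only; DataLiner's exact formula may differ), each category mean is blended with the global mean, weighted by how often the category appears:
# Illustrative smoothing, not necessarily DataLiner's exact formula:
# rare categories are pulled toward the global prior.
prior = y.mean()
stats = y.groupby(X['Embarked']).agg(['mean', 'count'])
weight = 10  # hypothetical smoothing strength
smoothed = (stats['count'] * stats['mean'] + weight * prior) / (stats['count'] + weight)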
Many data scientists use Pipeline for efficiency, and of course every DataLiner class can be used the same way.
from sklearn.pipeline import make_pipeline
process = make_pipeline(
    dl.DropNoVariance(),
    dl.DropHighCardinality(),
    dl.BinarizeNaN(),
    dl.ImputeNaN(),
    dl.OneHotEncoding(),
    dl.DropHighCorrelation(),
    dl.StandardScaling(),
    dl.DropLowAUC(),
)
process.fit_transform(X, y)
After all these steps, the data ends up looking like this.
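Since the output is a plain pandas DataFrame, you can inspect exactly what survived (the columns you get depend on your data and library version):
processed = process.fit_transform(X, y)  # same call as above, keeping the result
print(processed.shape)
print(processed.columns.tolist())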
For Titanic there is also test.csv, a hold-out set kept aside for evaluation, so read it in and apply the same processing.
X_test = pd.read_csv('test.csv')
process.transform(X_test)
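Because everything speaks the scikit-learn API, the whole preprocessing pipeline can also feed straight into a model. A minimal sketch (LogisticRegression is my choice for illustration, not something DataLiner prescribes):
from sklearn.linear_model import LogisticRegression

# Chain the preprocessing pipeline with a classifier and train end to end.
model = make_pipeline(process, LogisticRegression())
model.fit(X, y)
predictions = model.predict(X_test)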
That's it.
The currently available classes are listed below. I would like to expand the functions and preprocessing in future updates.
**5/3 postscript:** I wrote an introductory article for each class:
- Try processing Titanic data with the preprocessing library DataLiner (Drop)
- Try processing Titanic data with the preprocessing library DataLiner (Encoding)
- Try processing Titanic data with the preprocessing library DataLiner (Conversion)
- Try processing Titanic data with the preprocessing library DataLiner (Append)
- **BinarizeNaN**: Finds columns that contain missing values and creates a new feature indicating whether each value was missing.
- **ClipData**: Clips numeric data at the q-th percentile, replacing values beyond the upper and lower limits with those limits.
- **CountRowNaN**: Creates a new feature that is the row-wise count of missing values for each record.
- **DropColumns**: Drops the specified columns.
- **DropHighCardinality**: Drops columns with a large number of categories.
- **DropHighCorrelation**: Removes features whose Pearson correlation coefficient exceeds the threshold, keeping the feature that is more correlated with the target variable.
- **DropLowAUC**: Runs a logistic regression of y on each individual feature and drops features whose AUC falls below the threshold.
- **DropNoVariance**: Drops features that contain only one unique value.
- **GroupRareCategory**: Groups the rarely occurring categories in a categorical column.
- **ImputeNaN**: Imputes missing values. By default, numeric data is filled with the mean and categorical data with the mode.
- **OneHotEncoding**: Converts categorical variables into dummy variables.
- **TargetMeanEncoding**: Replaces each category of a categorical variable with a smoothed mean of the target variable.
- **StandardScaling**: Scales numeric data to mean 0 and variance 1.
- **MinMaxScaling**: Scales numeric data to the range 0 to 1.
- **CountEncoding**: Replaces category values with their number of occurrences.
- **RankedCountEncoding**: Ranks categories by their number of occurrences and replaces them with that rank. Effective when multiple categories occur the same number of times.
- **FrequencyEncoding**: Replaces category values with their frequency of occurrence. No ranked version is provided, since it would be identical to RankedCountEncoding.
- **RankedTargetMeanEncoding**: Ranks categories by the mean of the target variable within each category and replaces them with that rank.
- **AppendAnomalyScore**: Adds an Isolation Forest outlier score as a feature.
- **AppendCluster**: Runs KMeans and adds the resulting cluster as a feature. Scaling the data is recommended.
- **AppendClusterDistance**: Runs KMeans and adds the distance to each cluster as a feature. Scaling the data is recommended.
- **AppendPrincipalComponent**: Runs principal component analysis and adds the principal components as features. Scaling the data is recommended.
- **AppendArithmeticFeatures**: Applies the four arithmetic operations to pairs of features and adds new features that score better on the evaluation metric than the features they were built from. (ver1.2.0)
- **RankedEvaluationMetricEncoding**: Dummy-encodes each category, runs a logistic regression of the target variable on each category column, ranks the categories by the resulting metric (AUC by default), and encodes the original category with that rank. (ver1.2.0)
- **AppendClassificationModel**: Trains a classifier on the input and adds the predicted label or score as a feature. (ver1.2.0)
- **AppendEncoder**: Adds the output of each DataLiner encoder as a new feature instead of replacing the original column. (ver1.2.0)
- **AppendClusterTargetMean**: After clustering, adds the mean of the target variable within each cluster as a feature. (ver1.2.0)
- **PermutationImportanceTest**: Selects features based on how much the model's evaluation metric deteriorates when a given feature's data is randomly shuffled versus left intact. (ver1.2.0)
- **UnionAppend**: Runs DataLiner's Append-type classes in parallel instead of serially, then joins their outputs and adds them to the original features. The classes must be passed as a list. (ver1.3.1)
- **load_titanic**: Loads the Titanic data. (ver1.3.1)
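All of these follow the same transformer pattern, so they compose freely. For example, here is a minimal sketch that adds KMeans cluster labels after scaling, per the "scaling recommended" notes above (default parameters assumed throughout):
pipe = make_pipeline(
    dl.ImputeNaN(),
    dl.OneHotEncoding(),
    dl.StandardScaling(),
    dl.AppendCluster(),
)
pipe.fit_transform(X, y)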
I packaged up the steps I found myself repeating over and over, figuring other people probably have similar needs, and made it public. Out of the myriad possible preprocessing steps, I hope the main ones come together into one cohesive library.
Once again, bug reports, fixes, and pull requests for new features and new preprocessing steps are welcome!