Introduction

This is the third article in a series introducing each preprocessing step in DataLiner, a preprocessing library for Python. This time I would like to introduce the conversion (transformation) classes.

Release article: https://qiita.com/shallowdf20/items/36727c9a18f5be365b37

Installation

! pip install -U dataliner

Data preparation

Prepare Titanic data as usual.

import pandas as pd
import dataliner as dl

df = pd.read_csv('train.csv')
target_col = 'Survived'

X = df.drop(target_col, axis=1)
y = df[target_col]

PassengerId	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
1	3	Braund, Mr. Owen Harris	male	22	1	A/5 21171	7.250	NaN	S
2	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38	1	PC 17599	71.283	C85	C
3	3	Heikkinen, Miss. Laina	female	26	0	STON/O2. 3101282	7.925	NaN	S
4	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	113803	53.100	C123	S
5	3	Allen, Mr. William Henry	male	35	0	373450	8.050	NaN	S

StandardScaling / StandardizeData (deprecated)

Standardizes the data to mean 0 and variance 1. Unlike libraries such as scikit-learn, it automatically detects the numeric columns and scales only those, even when categorical columns are present, and it returns a pandas DataFrame, which makes subsequent processing easy. StandardizeData has been renamed to StandardScaling; using the old name issues a deprecation warning, and it will be removed in ver. 1.3.0.

trans = dl.StandardScaling() 
Xt = trans.fit_transform(X)

PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
-1.729	0.8269	Braund, Mr. Owen Harris	male	-0.5300	0.4326	-0.4734	A/5 21171	-0.5022	NaN	S
-1.725	-1.5652	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	0.5714	0.4326	-0.4734	PC 17599	0.7864	C85	C
-1.721	0.8269	Heikkinen, Miss. Laina	female	-0.2546	-0.4743	-0.4734	STON/O2. 3101282	-0.4886	NaN	S
-1.717	-1.5652	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	0.3649	0.4326	-0.4734	113803	0.4205	C123	S
-1.714	0.8269	Allen, Mr. William Henry	male	0.3649	-0.4743	-0.4734	373450	-0.4861	NaN	S
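To make the behaviour of StandardScaling concrete, here is a minimal sketch in plain pandas of the same idea: standardize only the numeric columns and pass everything else through. This is not DataLiner's implementation; the helper name `standard_scale_numeric` and the toy data are hypothetical, and the use of the sample standard deviation (pandas' default, ddof=1) is an assumption.

```python
import pandas as pd


def standard_scale_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Standardize numeric columns only; leave other columns untouched.

    Hypothetical helper for illustration, not part of DataLiner.
    """
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    # Assumption: sample standard deviation (ddof=1), the pandas default.
    out[num_cols] = (out[num_cols] - out[num_cols].mean()) / out[num_cols].std()
    return out


# Toy data standing in for the Titanic columns.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Sex": ["male", "female", "female"],
})
scaled = standard_scale_numeric(toy)
print(scaled)
```

The categorical column Sex comes back unchanged, which is the convenience the description above refers to.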

MinMaxScaling

Rescales the numeric data so that the values fall between 0 and 1.

trans = dl.MinMaxScaling() 
Xt = trans.fit_transform(X)

PassengerId	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0.000000	1	Braund, Mr. Owen Harris	male	0.2712	0.125	A/5 21171	0.01415	NaN	S
0.001124	0	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	0.4722	0.125	PC 17599	0.13914	C85	C
0.002247	1	Heikkinen, Miss. Laina	female	0.3214	0.000	STON/O2. 3101282	0.01547	NaN	S
0.003371	0	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	0.4345	0.125	113803	0.10364	C123	S
0.004494	1	Allen, Mr. William Henry	male	0.4345	0.000	373450	0.01571	NaN	S
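The min-max rescaling shown above can be sketched in plain pandas as follows. This is an illustration of the formula (x - min) / (max - min) applied per numeric column, not DataLiner's own code; the helper name `min_max_scale_numeric` and the toy values are hypothetical.

```python
import pandas as pd


def min_max_scale_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Rescale numeric columns to the [0, 1] range; keep other columns as-is.

    Hypothetical helper for illustration, not part of DataLiner.
    """
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    col_min = out[num_cols].min()
    col_range = out[num_cols].max() - col_min
    out[num_cols] = (out[num_cols] - col_min) / col_range
    return out


toy = pd.DataFrame({
    "Fare": [7.25, 71.25, 39.25],
    "Embarked": ["S", "C", "S"],
})
scaled = min_max_scale_numeric(toy)
print(scaled)
```

The same formula explains the Pclass column in the table above: with classes 1 to 3, class 3 maps to 1 and class 1 maps to 0.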

BinarizeNaN

Finds the columns that contain missing values and, for each one, adds a new binary column that flags whether the value was missing.

trans = dl.BinarizeNaN() 
Xt = trans.fit_transform(X)

PassengerId	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	Cabin_NaNFlag
1	3	Braund, Mr. Owen Harris	male	22	1	A/5 21171	7.250	NaN	S	1
2	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38	1	PC 17599	71.283	C85	C	0
3	3	Heikkinen, Miss. Laina	female	26	0	STON/O2. 3101282	7.925	NaN	S	1
4	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	113803	53.100	C123	S	0
5	3	Allen, Mr. William Henry	male	35	0	373450	8.050	NaN	S	1
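As a rough pandas equivalent of BinarizeNaN, the sketch below adds a 0/1 flag column for every column that contains at least one missing value. The `_NaNFlag` suffix is taken from the output above; the helper name `add_nan_flags` and the toy data are hypothetical, and this is not DataLiner's implementation.

```python
import numpy as np
import pandas as pd


def add_nan_flags(df: pd.DataFrame, suffix: str = "_NaNFlag") -> pd.DataFrame:
    """Add a 0/1 indicator column for each column that has missing values.

    Hypothetical helper for illustration, not part of DataLiner.
    """
    out = df.copy()
    cols_with_nan = [col for col in df.columns if df[col].isna().any()]
    for col in cols_with_nan:
        out[col + suffix] = df[col].isna().astype(int)
    return out


toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0],
    "Cabin": [np.nan, "C85", np.nan],
})
flagged = add_nan_flags(toy)
print(flagged)
```

Only Cabin gets a flag column here, because Age has no missing values in the toy data.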

CountRowNaN

Counts the missing values in each data point (row) and adds the total as a new feature.

trans = dl.CountRowNaN() 
Xt = trans.fit_transform(X)

PassengerId	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	NaN_Totals
1	3	Braund, Mr. Owen Harris	male	22	1	A/5 21171	7.250	NaN	S	1
2	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38	1	PC 17599	71.283	C85	C	0
3	3	Heikkinen, Miss. Laina	female	26	0	STON/O2. 3101282	7.925	NaN	S	1
4	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	113803	53.100	C123	S	0
5	3	Allen, Mr. William Henry	male	35	0	373450	8.050	NaN	S	1
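The row-wise count can be reproduced in plain pandas with `isna().sum(axis=1)`. The sketch below uses the `NaN_Totals` column name from the output above; the toy data is hypothetical and the code is an illustration rather than DataLiner's implementation.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [np.nan, np.nan, "C123"],
})

counted = toy.copy()
# Count missing values across the columns of each row.
counted["NaN_Totals"] = toy.isna().sum(axis=1)
print(counted)
```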

ImputeNaN

Imputes missing values. By default, numeric columns are filled with the mean and categorical columns with the mode. The strategies can be changed with the num_strategy and cat_strategy arguments.

trans = dl.ImputeNaN() 
Xt = trans.fit_transform(X)

PassengerId	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
1	3	Braund, Mr. Owen Harris	male	22	1	A/5 21171	7.250	B96 B98	S
2	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38	1	PC 17599	71.283	C85	C
3	3	Heikkinen, Miss. Laina	female	26	0	STON/O2. 3101282	7.925	B96 B98	S
4	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	113803	53.100	C123	S
5	3	Allen, Mr. William Henry	male	35	0	373450	8.050	B96 B98	S
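The default behaviour described above (mean for numeric columns, mode for categorical columns) can be sketched in plain pandas as follows. The helper name `impute_mean_mode` and the toy data are hypothetical, and the handling of ties and all-missing columns is an assumption, not something taken from DataLiner.

```python
import numpy as np
import pandas as pd


def impute_mean_mode(df: pd.DataFrame) -> pd.DataFrame:
    """Fill numeric columns with the mean and other columns with the mode.

    Hypothetical helper for illustration, not part of DataLiner.
    """
    out = df.copy()
    for col in out.columns:
        # Skip columns with nothing to fill, or nothing to compute a fill from.
        if not out[col].isna().any() or out[col].isna().all():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            fill_value = out[col].mean()
        else:
            # mode() ignores NaN; on ties, take the first value returned.
            fill_value = out[col].mode().iloc[0]
        out[col] = out[col].fillna(fill_value)
    return out


toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": ["B96 B98", np.nan, "B96 B98"],
})
imputed = impute_mean_mode(toy)
print(imputed)
```

In the Titanic output above, the missing Cabin values are filled with B96 B98, which is what mode imputation looks like on a high-cardinality column.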

ClipData

Computes lower and upper quantile bounds and replaces values below the lower bound or above the upper bound with the corresponding bound. The threshold argument controls how much is clipped; by default the bounds are the 1% and 99% quantiles.

trans = dl.ClipData() 
Xt = trans.fit_transform(X)

PassengerId	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
9.9	3	Braund, Mr. Owen Harris	male	22	1	A/5 21171	7.250	NaN	S
9.9	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38	1	PC 17599	71.283	C85	C
9.9	3	Heikkinen, Miss. Laina	female	26	0	STON/O2. 3101282	7.925	NaN	S
9.9	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	113803	53.100	C123	S
9.9	3	Allen, Mr. William Henry	male	35	0	373450	8.050	NaN	S
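Quantile clipping of this kind can be sketched in plain pandas with `quantile` and `clip`. The helper name `clip_numeric` and its parameters `lower_q` and `upper_q` are hypothetical and do not reflect DataLiner's signature. The toy column mimics PassengerId running from 1 to 891; with linear interpolation its 1% quantile is 9.9, which is consistent with the first rows of the table above.

```python
import numpy as np
import pandas as pd


def clip_numeric(df: pd.DataFrame, lower_q: float = 0.01,
                 upper_q: float = 0.99) -> pd.DataFrame:
    """Clip numeric columns to their [lower_q, upper_q] quantile range.

    Hypothetical helper for illustration, not part of DataLiner.
    """
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    lower = out[num_cols].quantile(lower_q)
    upper = out[num_cols].quantile(upper_q)
    # axis=1 aligns the per-column bounds with the DataFrame's columns.
    out[num_cols] = out[num_cols].clip(lower=lower, upper=upper, axis=1)
    return out


toy = pd.DataFrame({"PassengerId": np.arange(1, 892, dtype=float)})
clipped = clip_numeric(toy)
print(clipped.head())
print(clipped.tail())
```

Values between the two bounds are left as they are; only the tails are pulled in.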

GroupRareCategory

Replaces infrequent categories in categorical variables with the single string "RareCategory". This helps reduce cardinality and is useful before applying OneHotEncoding. The threshold argument sets the share of the data at or below which a category is replaced; the default is 1%.

trans = dl.GroupRareCategory() 
Xt = trans.fit_transform(X)

PassengerId	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
1	3	RareCategory	male	22	1	RareCategory	7.250	NaN	S
2	1	RareCategory	female	38	1	RareCategory	71.283	RareCategory	C
3	3	RareCategory	female	26	0	RareCategory	7.925	NaN	S
4	1	RareCategory	female	35	1	RareCategory	53.100	RareCategory	S
5	3	RareCategory	male	35	0	RareCategory	8.050	NaN	S
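The grouping of rare categories can be sketched in plain pandas using relative frequencies from `value_counts(normalize=True)`. The helper name `group_rare` and the toy data are hypothetical. Treating every non-numeric column as categorical and computing frequencies over non-missing values are assumptions that may differ from what DataLiner does.

```python
import pandas as pd


def group_rare(df: pd.DataFrame, threshold: float = 0.01,
               label: str = "RareCategory") -> pd.DataFrame:
    """Replace categories whose relative frequency is <= threshold with label.

    Hypothetical helper for illustration, not part of DataLiner.
    """
    out = df.copy()
    # Assumption: every non-numeric column is treated as categorical.
    cat_cols = out.select_dtypes(exclude="number").columns
    for col in cat_cols:
        freq = out[col].value_counts(normalize=True)  # NaN is not counted
        rare = freq[freq <= threshold].index
        # Keep frequent values (and NaN) as-is; relabel the rare ones.
        out[col] = out[col].where(~out[col].isin(rare), label)
    return out


toy = pd.DataFrame({
    "Sex": ["male"] * 120 + ["female"] * 79 + ["unknown"] * 1,
    "Age": [30.0] * 200,
})
grouped = group_rare(toy)
print(grouped["Sex"].value_counts())
```

This also shows why Name and Ticket collapse entirely to RareCategory in the table above: nearly every value in those columns appears only a handful of times.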

Conclusion

This time I introduced DataLiner's conversion classes. Next time I will introduce the Append classes, which are the last of the preprocessing steps implemented at the moment (ver. 1.1.6).

DataLiner release article: https://qiita.com/shallowdf20/items/36727c9a18f5be365b37
GitHub: https://github.com/shallowdf20/dataliner
PyPI: https://pypi.org/project/dataliner/

Try to process Titanic data with preprocessing library DataLiner (conversion)
