Introduction

This is the fourth article in a series introducing each preprocessing step in DataLiner, a preprocessing library for Python. This time I cover the Append family of transformers, which completes the tour of all preprocessing currently implemented.
I plan to release Ver1.2 with some additional preprocessing after Golden Week (GW), and I will write another introductory article at that time.

Release article: https://qiita.com/shallowdf20/items/36727c9a18f5be365b37
Documentation: https://shallowdf20.github.io/dataliner/preprocessing.html

Installation

! pip install -U dataliner

Data preparation

Prepare the Titanic data as usual.

import pandas as pd
import dataliner as dl
from sklearn.pipeline import make_pipeline  # used to chain the steps below

df = pd.read_csv('train.csv')
target_col = 'Survived'

X = df.drop(target_col, axis=1)
y = df[target_col]

PassengerId	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
1	3	Braund, Mr. Owen Harris	male	22	1	A/5 21171	7.250	NaN	S
2	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	female	38	1	PC 17599	71.283	C85	C
3	3	Heikkinen, Miss. Laina	female	26	0	STON/O2. 3101282	7.925	NaN	S
4	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35	1	113803	53.100	C123	S
5	3	Allen, Mr. William Henry	male	35	0	373450	8.050	NaN	S

AppendAnomalyScore

Trains an Isolation Forest on the data and adds the resulting outlier score as a new feature. Missing values must be imputed and categorical variables encoded before use.

trans = dl.AppendAnomalyScore()
process = make_pipeline(
    dl.ImputeNaN(),
    dl.RankedTargetMeanEncoding(),
    trans
)
process.fit_transform(X, y)

PassengerId	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked	Anomaly_Score
1	3	640	2	22	1	141	7.250	144	3	0.04805
2	1	554	1	38	1	351	71.283	101	1	-0.06340
3	3	717	1	26	0	278	7.925	144	3	0.04050
4	1	803	1	35	1	92	53.100	33	3	-0.04854
5	3	602	2	35	0	113	8.050	144	3	0.06903
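To make the idea concrete, here is a minimal sketch that uses scikit-learn's IsolationForest directly on toy data. It is not DataLiner's actual implementation: the toy columns `a` and `b`, the parameters, and the use of `decision_function` as the score are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Toy numeric data (hypothetical): 200 points near the origin plus one obvious outlier.
rng = np.random.RandomState(0)
data = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0]]])
X_toy = pd.DataFrame(data, columns=["a", "b"])

# Fit the forest on the data, then append its score as a new column.
forest = IsolationForest(n_estimators=100, random_state=0)
forest.fit(X_toy)
X_scored = X_toy.copy()
X_scored["Anomaly_Score"] = forest.decision_function(X_toy)

# With decision_function, lower scores mean "more anomalous".
print(X_scored.sort_values("Anomaly_Score").head())
```

In scikit-learn, lower `decision_function` values indicate more anomalous rows. Whether DataLiner's Anomaly_Score follows the same sign convention is an assumption here, so check the documentation before interpreting the values.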

AppendCluster

Clusters the data with k-means++ and adds the number of the cluster each row belongs to as a new feature. Missing values must be imputed and categorical variables encoded before use. Scaling is also recommended.

trans = dl.AppendCluster()
process = make_pipeline(
    dl.ImputeNaN(),
    dl.RankedTargetMeanEncoding(),
    dl.StandardScaling(),
    trans
)
process.fit_transform(X, y)

PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Cluster_Number
-1.729	0.8269	0.7538	0.7373	-0.5921	0.4326	-0.4734	-0.8129	-0.5022	0.4561	0.5856	5
-1.725	-1.5652	0.4197	-1.3548	0.6384	0.4326	-0.4734	0.1102	0.7864	-0.6156	-1.9412	2
-1.721	0.8269	1.0530	-1.3548	-0.2845	-0.4743	-0.4734	-0.2107	-0.4886	0.4561	0.5856	4
-1.717	-1.5652	1.3872	-1.3548	0.4077	0.4326	-0.4734	-1.0282	0.4205	-2.3103	0.5856	0
-1.714	0.8269	0.6062	0.7373	0.4077	-0.4743	-0.4734	-0.9359	-0.4861	0.4561	0.5856	5
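The same idea can be sketched with scikit-learn's KMeans on toy data. This is only an illustration under stated assumptions, not DataLiner's own code: the blob data, `n_clusters=3`, and the other parameters are hypothetical choices.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): three well-separated blobs of 50 points each.
rng = np.random.RandomState(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
data = np.vstack([c + rng.normal(size=(50, 2)) for c in centers])
X_toy = pd.DataFrame(data, columns=["a", "b"])

# Scale first so that no single feature dominates the Euclidean distance.
scaled = StandardScaler().fit_transform(X_toy)

# k-means++ initialisation, then append each row's cluster label.
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
X_clustered = X_toy.copy()
X_clustered["Cluster_Number"] = kmeans.fit_predict(scaled)

print(X_clustered["Cluster_Number"].value_counts())
```

The cluster numbers are arbitrary labels, so a tree-based model can use them as they are, while a linear model would usually need them one-hot encoded.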

AppendClusterDistance

Clusters the data with k-means++ and adds the distance from each row to each cluster as new features. Missing values must be imputed and categorical variables encoded before use. Scaling is also recommended.

trans = dl.AppendClusterDistance()
process = make_pipeline(
    dl.ImputeNaN(),
    dl.RankedTargetMeanEncoding(),
    dl.StandardScaling(),
    trans
)
process.fit_transform(X, y)

PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Cluster_Distance_0	Cluster_Distance_1	Cluster_Distance_2	Cluster_Distance_3	Cluster_Distance_4	Cluster_Distance_5	Cluster_Distance_6	Cluster_Distance_7
-1.729	0.8269	0.7538	0.7373	-0.5921	0.4326	-0.4734	-0.8129	-0.5022	0.4561	0.5856	4.580	2.794	3.633	4.188	3.072	2.363	4.852	5.636
-1.725	-1.5652	0.4197	-1.3548	0.6384	0.4326	-0.4734	0.1102	0.7864	-0.6156	-1.9412	3.434	4.637	3.374	4.852	3.675	4.619	6.044	3.965
-1.721	0.8269	1.0530	-1.3548	-0.2845	-0.4743	-0.4734	-0.2107	-0.4886	0.4561	0.5856	4.510	3.410	3.859	3.906	2.207	2.929	5.459	5.608
-1.717	-1.5652	1.3872	-1.3548	0.4077	0.4326	-0.4734	-1.0282	0.4205	-2.3103	0.5856	2.604	5.312	4.063	5.250	4.322	4.842	6.495	4.479
-1.714	0.8269	0.6062	0.7373	0.4077	-0.4743	-0.4734	-0.9359	-0.4861	0.4561	0.5856	4.482	2.632	3.168	4.262	3.097	2.382	5.724	5.593
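As a rough sketch of how such distance features can be built, scikit-learn's `KMeans.transform` returns the distance from each row to every cluster center. The toy data and `n_clusters=3` below are assumptions for illustration, and DataLiner's internals may differ.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): three blobs of 40 points each.
rng = np.random.RandomState(1)
centers = np.array([[0.0, 0.0], [8.0, 0.0], [0.0, 8.0]])
data = np.vstack([c + rng.normal(size=(40, 2)) for c in centers])
X_toy = pd.DataFrame(data, columns=["a", "b"])

scaled = StandardScaler().fit_transform(X_toy)
kmeans = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0)
kmeans.fit(scaled)

# transform() gives an (n_samples, n_clusters) array of distances to the centers.
distances = kmeans.transform(scaled)
X_dist = X_toy.copy()
for i in range(distances.shape[1]):
    X_dist[f"Cluster_Distance_{i}"] = distances[:, i]

print(X_dist.head())
```

Compared with a single cluster label, the distances carry more information: the smallest distance in each row identifies the assigned cluster, and the remaining columns show how close the row is to the other clusters.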

AppendPrincipalComponent

Performs principal component analysis on the data and adds the principal components as new features. Missing values must be imputed and categorical variables encoded before use. Scaling is also recommended.

trans = dl.AppendPrincipalComponent()
process = make_pipeline(
    dl.ImputeNaN(),
    dl.RankedTargetMeanEncoding(),
    dl.StandardScaling(),
    trans
)
process.fit_transform(X, y)

PassengerId	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Principal_Component_0	Principal_Component_1	Principal_Component_2	Principal_Component_3	Principal_Component_4
-1.729	0.8269	0.7538	0.7373	-0.5921	0.4326	-0.4734	-0.8129	-0.5022	0.4561	0.5856	-1.0239	0.1683	0.2723	-0.7951	-1.839
-1.725	-1.5652	0.4197	-1.3548	0.6384	0.4326	-0.4734	0.1102	0.7864	-0.6156	-1.9412	2.2205	0.1572	1.3115	-0.9589	-1.246
-1.721	0.8269	1.0530	-1.3548	-0.2845	-0.4743	-0.4734	-0.2107	-0.4886	0.4561	0.5856	-0.6973	0.2542	0.6843	-0.5943	-1.782
-1.717	-1.5652	1.3872	-1.3548	0.4077	0.4326	-0.4734	-1.0282	0.4205	-2.3103	0.5856	2.7334	0.2536	-0.2722	-1.5439	-1.530
-1.714	0.8269	0.6062	0.7373	0.4077	-0.4743	-0.4734	-0.9359	-0.4861	0.4561	0.5856	-0.7770	-0.7732	0.2852	-0.9750	-1.641
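For reference, the same kind of feature can be sketched with scikit-learn's PCA. The toy columns and `n_components=2` are hypothetical choices for illustration and do not reflect how DataLiner is implemented or configured.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): four columns, where "d" is strongly correlated with "a".
rng = np.random.RandomState(0)
X_toy = pd.DataFrame(rng.normal(size=(100, 4)), columns=list("abcd"))
X_toy["d"] = X_toy["a"] * 2 + rng.normal(scale=0.1, size=100)

# PCA is sensitive to feature scale, so standardise first.
scaled = StandardScaler().fit_transform(X_toy)

# Project onto the first two principal components and append them as columns.
pca = PCA(n_components=2)
components = pca.fit_transform(scaled)
X_pca = X_toy.copy()
for i in range(components.shape[1]):
    X_pca[f"Principal_Component_{i}"] = components[:, i]

print(pca.explained_variance_ratio_)
```

Because the original columns are kept alongside the components, the new features add a decorrelated summary of the data rather than replacing it.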

Conclusion

This article introduced the Append transformers of DataLiner. I plan to write further introductory articles covering new features as DataLiner is updated.

DataLiner release article: https://qiita.com/shallowdf20/items/36727c9a18f5be365b37
Documentation: https://shallowdf20.github.io/dataliner/preprocessing.html
GitHub: https://github.com/shallowdf20/dataliner
PyPI: https://pypi.org/project/dataliner/

Try to process Titanic data with preprocessing library DataLiner (Append)
