This is the fourth article in a series introducing each preprocessing step in DataLiner, a preprocessing library for Python.
This time I will introduce the Append family of transformers. With this, every preprocessing step currently implemented has been covered.
I plan to release Ver1.2 with some additional preprocessing after GW (Golden Week), and I would like to write another introductory article at that time.
Release article: https://qiita.com/shallowdf20/items/36727c9a18f5be365b37
Documentation: https://shallowdf20.github.io/dataliner/preprocessing.html
! pip install -U dataliner
Prepare the Titanic data as usual.
import pandas as pd
import dataliner as dl
from sklearn.pipeline import make_pipeline
df = pd.read_csv('train.csv')
target_col = 'Survived'
X = df.drop(target_col, axis=1)
y = df[target_col]
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|
1 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.250 | NaN | S |
2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.283 | C85 | C |
3 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.100 | C123 | S |
5 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.050 | NaN | S |
AppendAnomalyScore
An Isolation Forest is trained on the data, and the resulting anomaly score is added as a new feature. Missing-value imputation and categorical-variable encoding are required before use.
trans = dl.AppendAnomalyScore()
process = make_pipeline(
    dl.ImputeNaN(),                 # fill missing values first
    dl.RankedTargetMeanEncoding(),  # turn categorical columns into numbers
    trans
)
process.fit_transform(X, y)
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Anomaly_Score |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3 | 640 | 2 | 22 | 1 | 0 | 141 | 7.250 | 144 | 3 | 0.04805 |
2 | 1 | 554 | 1 | 38 | 1 | 0 | 351 | 71.283 | 101 | 1 | -0.06340 |
3 | 3 | 717 | 1 | 26 | 0 | 0 | 278 | 7.925 | 144 | 3 | 0.04050 |
4 | 1 | 803 | 1 | 35 | 1 | 0 | 92 | 53.100 | 33 | 3 | -0.04854 |
5 | 3 | 602 | 2 | 35 | 0 | 0 | 113 | 8.050 | 144 | 3 | 0.06903 |
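To make the idea concrete, here is a minimal sketch of the same kind of feature built directly with scikit-learn's `IsolationForest`. It is not DataLiner's implementation: the toy data, the column names, and the choice of `decision_function` as the score are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Toy numeric data standing in for imputed, encoded features (assumption).
rng = np.random.default_rng(0)
X_num = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
X_num.loc[0] = [8.0, 8.0, 8.0]  # plant one obvious outlier in row 0

# Train the forest on the data itself.
forest = IsolationForest(n_estimators=100, random_state=0)
forest.fit(X_num)

# Append the score as a new column. With scikit-learn's decision_function,
# lower (negative) values indicate more anomalous rows.
X_out = X_num.copy()
X_out["Anomaly_Score"] = forest.decision_function(X_num)

print(X_out.sort_values("Anomaly_Score").head())
```

In this sketch the planted outlier receives the lowest score, so a downstream model can use the new column as a signal of how unusual each row is.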
AppendCluster
The data is clustered with KMeans++, and the number of the cluster to which each sample belongs is added as a new feature. Missing-value imputation and categorical-variable encoding are required before use. Scaling is also recommended.
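trans = dl.AppendCluster()
process = make_pipeline(
    dl.ImputeNaN(),                 # fill missing values first
    dl.RankedTargetMeanEncoding(),  # turn categorical columns into numbers
    dl.StandardScaling(),           # put all columns on a comparable scale
    trans
)
process.fit_transform(X, y)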
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Cluster_Number |
---|---|---|---|---|---|---|---|---|---|---|---|
-1.729 | 0.8269 | 0.7538 | 0.7373 | -0.5921 | 0.4326 | -0.4734 | -0.8129 | -0.5022 | 0.4561 | 0.5856 | 5 |
-1.725 | -1.5652 | 0.4197 | -1.3548 | 0.6384 | 0.4326 | -0.4734 | 0.1102 | 0.7864 | -0.6156 | -1.9412 | 2 |
-1.721 | 0.8269 | 1.0530 | -1.3548 | -0.2845 | -0.4743 | -0.4734 | -0.2107 | -0.4886 | 0.4561 | 0.5856 | 4 |
-1.717 | -1.5652 | 1.3872 | -1.3548 | 0.4077 | 0.4326 | -0.4734 | -1.0282 | 0.4205 | -2.3103 | 0.5856 | 0 |
-1.714 | 0.8269 | 0.6062 | 0.7373 | 0.4077 | -0.4743 | -0.4734 | -0.9359 | -0.4861 | 0.4561 | 0.5856 | 5 |
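The following is a rough sketch of how a cluster-number feature like this could be produced with scikit-learn's `KMeans` using k-means++ initialization. It is an illustration under assumptions, not DataLiner's own code: the two-blob toy data, the column names, and `n_clusters=2` are made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy data: two well-separated blobs (assumption, for illustration only).
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
X_num = pd.DataFrame(np.vstack([blob_a, blob_b]), columns=["f1", "f2"])

# Scale first, since k-means relies on Euclidean distance.
scaled = StandardScaler().fit_transform(X_num)
X_scaled = pd.DataFrame(scaled, columns=X_num.columns)

# Cluster with k-means++ initialization and append the label of each row.
kmeans = KMeans(n_clusters=2, init="k-means++", n_init=10, random_state=0)
X_clustered = X_scaled.copy()
X_clustered["Cluster_Number"] = kmeans.fit_predict(X_scaled)

print(X_clustered["Cluster_Number"].value_counts())
```

The cluster numbers are arbitrary identifiers, so tree-based models can usually consume them directly, while linear models may need them one-hot encoded.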
AppendClusterDistance
The data is clustered with KMeans++, and the distance from each sample to each cluster is added as a new feature. Missing-value imputation and categorical-variable encoding are required before use. Scaling is also recommended.
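trans = dl.AppendClusterDistance()
process = make_pipeline(
    dl.ImputeNaN(),                 # fill missing values first
    dl.RankedTargetMeanEncoding(),  # turn categorical columns into numbers
    dl.StandardScaling(),           # put all columns on a comparable scale
    trans
)
process.fit_transform(X, y)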
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Cluster_Distance_0 | Cluster_Distance_1 | Cluster_Distance_2 | Cluster_Distance_3 | Cluster_Distance_4 | Cluster_Distance_5 | Cluster_Distance_6 | Cluster_Distance_7 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
-1.729 | 0.8269 | 0.7538 | 0.7373 | -0.5921 | 0.4326 | -0.4734 | -0.8129 | -0.5022 | 0.4561 | 0.5856 | 4.580 | 2.794 | 3.633 | 4.188 | 3.072 | 2.363 | 4.852 | 5.636 |
-1.725 | -1.5652 | 0.4197 | -1.3548 | 0.6384 | 0.4326 | -0.4734 | 0.1102 | 0.7864 | -0.6156 | -1.9412 | 3.434 | 4.637 | 3.374 | 4.852 | 3.675 | 4.619 | 6.044 | 3.965 |
-1.721 | 0.8269 | 1.0530 | -1.3548 | -0.2845 | -0.4743 | -0.4734 | -0.2107 | -0.4886 | 0.4561 | 0.5856 | 4.510 | 3.410 | 3.859 | 3.906 | 2.207 | 2.929 | 5.459 | 5.608 |
-1.717 | -1.5652 | 1.3872 | -1.3548 | 0.4077 | 0.4326 | -0.4734 | -1.0282 | 0.4205 | -2.3103 | 0.5856 | 2.604 | 5.312 | 4.063 | 5.250 | 4.322 | 4.842 | 6.495 | 4.479 |
-1.714 | 0.8269 | 0.6062 | 0.7373 | 0.4077 | -0.4743 | -0.4734 | -0.9359 | -0.4861 | 0.4561 | 0.5856 | 4.482 | 2.632 | 3.168 | 4.262 | 3.097 | 2.382 | 5.724 | 5.593 |
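As a hedged illustration of the distance features, the sketch below uses scikit-learn's `KMeans.transform`, which returns the Euclidean distance from each row to every cluster center. Whether DataLiner computes its columns exactly this way is an assumption. The toy blobs and `n_clusters=3` are also made up (the output above has eight distance columns). Scaling is skipped here only because the toy features already share a scale.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy data: three blobs around known centers (assumption, for illustration).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(40, 2)) for c in centers])
X_num = pd.DataFrame(points, columns=["f1", "f2"])

n_clusters = 3
kmeans = KMeans(n_clusters=n_clusters, init="k-means++", n_init=10, random_state=0)
kmeans.fit(X_num)

# transform() gives an (n_samples, n_clusters) matrix of distances
# from each row to each fitted cluster center.
distances = kmeans.transform(X_num)
dist_cols = [f"Cluster_Distance_{i}" for i in range(n_clusters)]
dist_df = pd.DataFrame(distances, columns=dist_cols, index=X_num.index)

# Append one distance column per cluster to the original features.
X_dist = pd.concat([X_num, dist_df], axis=1)
print(X_dist.head())
```

Compared with a single cluster number, the distances keep information about how close a row is to every cluster, not only the nearest one.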
AppendPrincipalComponent
Principal component analysis is performed on the data, and the principal components are added as new features. Missing-value imputation and categorical-variable encoding are required before use. Scaling is also recommended.
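trans = dl.AppendPrincipalComponent()
process = make_pipeline(
    dl.ImputeNaN(),                 # fill missing values first
    dl.RankedTargetMeanEncoding(),  # turn categorical columns into numbers
    dl.StandardScaling(),           # put all columns on a comparable scale
    trans
)
process.fit_transform(X, y)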
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Principal_Component_0 | Principal_Component_1 | Principal_Component_2 | Principal_Component_3 | Principal_Component_4 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
-1.729 | 0.8269 | 0.7538 | 0.7373 | -0.5921 | 0.4326 | -0.4734 | -0.8129 | -0.5022 | 0.4561 | 0.5856 | -1.0239 | 0.1683 | 0.2723 | -0.7951 | -1.839 |
-1.725 | -1.5652 | 0.4197 | -1.3548 | 0.6384 | 0.4326 | -0.4734 | 0.1102 | 0.7864 | -0.6156 | -1.9412 | 2.2205 | 0.1572 | 1.3115 | -0.9589 | -1.246 |
-1.721 | 0.8269 | 1.0530 | -1.3548 | -0.2845 | -0.4743 | -0.4734 | -0.2107 | -0.4886 | 0.4561 | 0.5856 | -0.6973 | 0.2542 | 0.6843 | -0.5943 | -1.782 |
-1.717 | -1.5652 | 1.3872 | -1.3548 | 0.4077 | 0.4326 | -0.4734 | -1.0282 | 0.4205 | -2.3103 | 0.5856 | 2.7334 | 0.2536 | -0.2722 | -1.5439 | -1.530 |
-1.714 | 0.8269 | 0.6062 | 0.7373 | 0.4077 | -0.4743 | -0.4734 | -0.9359 | -0.4861 | 0.4561 | 0.5856 | -0.7770 | -0.7732 | 0.2852 | -0.9750 | -1.641 |
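For readers who want to see the mechanics, here is a small sketch that appends principal components using scikit-learn's `PCA`. It only illustrates the general technique and is not taken from DataLiner's source: the toy data, column names, and `n_components=2` are assumptions (the output above shows five components).

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Toy data with two strongly correlated columns (assumption, for illustration).
rng = np.random.default_rng(0)
base = rng.normal(size=(300, 2))
X_num = pd.DataFrame({
    "f1": base[:, 0],
    "f2": base[:, 0] + 0.1 * base[:, 1],  # nearly a copy of f1
    "f3": rng.normal(size=300),
})

# Standardize first so no column dominates because of its units.
scaled = StandardScaler().fit_transform(X_num)
X_scaled = pd.DataFrame(scaled, columns=X_num.columns)

# Project onto the leading principal components.
n_components = 2
pca = PCA(n_components=n_components)
components = pca.fit_transform(X_scaled)

# Append the projections as new columns next to the original features.
pc_cols = [f"Principal_Component_{i}" for i in range(n_components)]
pc_df = pd.DataFrame(components, columns=pc_cols, index=X_scaled.index)
X_pca = pd.concat([X_scaled, pc_df], axis=1)

print(pca.explained_variance_ratio_)
```

The appended columns are mutually uncorrelated and ordered by explained variance, which is why they can be useful in addition to the original, possibly correlated, features.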
This article introduced the Append family of DataLiner. In the future, I would like to write further introductory articles on new features whenever DataLiner is updated.
DataLiner release article: https://qiita.com/shallowdf20/items/36727c9a18f5be365b37
Documentation: https://shallowdf20.github.io/dataliner/preprocessing.html
GitHub: https://github.com/shallowdf20/dataliner
PyPI: https://pypi.org/project/dataliner/