This is the third article in a series introducing each preprocessing step in DataLiner, a preprocessing library for Python. This time I cover the conversion classes.
Release article: https://qiita.com/shallowdf20/items/36727c9a18f5be365b37
! pip install -U dataliner
As usual, prepare the Titanic data.
import pandas as pd
import dataliner as dl
df = pd.read_csv('train.csv')
target_col = 'Survived'
X = df.drop(target_col, axis=1)
y = df[target_col]
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|
1 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.250 | NaN | S |
2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.283 | C85 | C |
3 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.100 | C123 | S |
5 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.050 | NaN | S |
StandardScaling / StandardizeData (deprecated)
Standardizes the data to mean 0 and variance 1. Unlike the scalers in libraries such as scikit-learn, it detects the numeric columns automatically even when categorical columns are present, and it returns a pandas DataFrame, which makes subsequent processing easy. StandardizeData has been renamed to StandardScaling, so using the old name issues a deprecation warning, and it will be removed in ver. 1.3.0.
trans = dl.StandardScaling()
Xt = trans.fit_transform(X)
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|
-1.729 | 0.8269 | Braund, Mr. Owen Harris | male | -0.5300 | 0.4326 | -0.4734 | A/5 21171 | -0.5022 | NaN | S |
-1.725 | -1.5652 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 0.5714 | 0.4326 | -0.4734 | PC 17599 | 0.7864 | C85 | C |
-1.721 | 0.8269 | Heikkinen, Miss. Laina | female | -0.2546 | -0.4743 | -0.4734 | STON/O2. 3101282 | -0.4886 | NaN | S |
-1.717 | -1.5652 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 0.3649 | 0.4326 | -0.4734 | 113803 | 0.4205 | C123 | S |
-1.714 | 0.8269 | Allen, Mr. William Henry | male | 0.3649 | -0.4743 | -0.4734 | 373450 | -0.4861 | NaN | S |
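To make the behavior concrete, here is a minimal pandas sketch of the same idea: standardize only the numeric columns and leave everything else untouched. This is not DataLiner's implementation. The function name `standard_scale_numeric` is hypothetical, and the use of pandas' default sample standard deviation (ddof=1) is an assumption.

```python
import pandas as pd


def standard_scale_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: standardize numeric columns, keep the rest as-is."""
    out = df.copy()
    # Pick up numeric columns automatically; string columns are skipped.
    num_cols = out.select_dtypes(include="number").columns
    # Assumption: sample standard deviation (pandas default, ddof=1).
    out[num_cols] = (out[num_cols] - out[num_cols].mean()) / out[num_cols].std()
    return out


toy = pd.DataFrame({"Age": [1, 2, 3], "Sex": ["male", "female", "male"]})
scaled = standard_scale_numeric(toy)
```

A constant column has a standard deviation of 0 and would become NaN in this sketch, so a real implementation needs to handle that case.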
MinMaxScaling
Scales the data so that each numeric column falls between 0 and 1.
trans = dl.MinMaxScaling()
Xt = trans.fit_transform(X)
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|
0.000000 | 1 | Braund, Mr. Owen Harris | male | 0.2712 | 0.125 | 0 | A/5 21171 | 0.01415 | NaN | S |
0.001124 | 0 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 0.4722 | 0.125 | 0 | PC 17599 | 0.13914 | C85 | C |
0.002247 | 1 | Heikkinen, Miss. Laina | female | 0.3214 | 0.000 | 0 | STON/O2. 3101282 | 0.01547 | NaN | S |
0.003371 | 0 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 0.4345 | 0.125 | 0 | 113803 | 0.10364 | C123 | S |
0.004494 | 1 | Allen, Mr. William Henry | male | 0.4345 | 0.000 | 0 | 373450 | 0.01571 | NaN | S |
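The transformation above corresponds to the usual min-max formula, (x - min) / (max - min), applied column by column. The following is a rough pandas sketch under that assumption. The name `min_max_scale_numeric` is hypothetical and the code is not taken from DataLiner.

```python
import pandas as pd


def min_max_scale_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: rescale numeric columns to the [0, 1] range."""
    out = df.copy()
    num_cols = out.select_dtypes(include="number").columns
    col_min = out[num_cols].min()
    col_max = out[num_cols].max()
    # (x - min) / (max - min); a constant column would produce NaN here.
    out[num_cols] = (out[num_cols] - col_min) / (col_max - col_min)
    return out


toy = pd.DataFrame({"Fare": [1, 3, 5], "Embarked": ["S", "C", "S"]})
scaled = min_max_scale_numeric(toy)
```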
BinarizeNaN
Finds the columns that contain missing values and, for each one, adds a new binary column indicating whether the value was missing.
trans = dl.BinarizeNaN()
Xt = trans.fit_transform(X)
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Age_NaNFlag | Cabin_NaNFlag | Embarked_NaNFlag |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.250 | NaN | S | 0 | 1 | 0 |
2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.283 | C85 | C | 0 | 0 | 0 |
3 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 0 | 1 | 0 |
4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.100 | C123 | S | 0 | 0 | 0 |
5 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.050 | NaN | S | 0 | 1 | 0 |
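For readers who want to see the logic spelled out, a plain pandas equivalent might look like the sketch below. The `_NaNFlag` suffix follows the column names in the table above. The function name `add_nan_flags` is hypothetical and this is not DataLiner's own code.

```python
import numpy as np
import pandas as pd


def add_nan_flags(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add a 0/1 flag column for each column with NaN."""
    out = df.copy()
    # Only columns that contain at least one missing value get a flag.
    cols_with_nan = df.columns[df.isna().any()]
    for col in cols_with_nan:
        out[f"{col}_NaNFlag"] = df[col].isna().astype(int)
    return out


toy = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [None, "C85", None],
})
flagged = add_nan_flags(toy)
```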
CountRowNaN
For each data point (row), counts the missing values and adds the count as a new feature.
trans = dl.CountRowNaN()
Xt = trans.fit_transform(X)
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | NaN_Totals |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.250 | NaN | S | 1 |
2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.283 | C85 | C | 0 |
3 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | 1 |
4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.100 | C123 | S | 0 |
5 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.050 | NaN | S | 1 |
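In plain pandas, the same feature can be approximated in one line by counting missing values along each row. The sketch below reuses the `NaN_Totals` column name from the table above. The function name `count_row_nan` is hypothetical.

```python
import numpy as np
import pandas as pd


def count_row_nan(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add the per-row count of missing values."""
    out = df.copy()
    # isna() gives a boolean frame; summing along axis=1 counts NaN per row.
    out["NaN_Totals"] = df.isna().sum(axis=1)
    return out


toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0],
    "Cabin": [None, None, "C123"],
})
counted = count_row_nan(toy)
```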
ImputeNaN
Imputes missing values. By default, numeric columns are filled with the mean and categorical columns with the mode. These defaults can be changed with the num_strategy and cat_strategy arguments.
trans = dl.ImputeNaN()
Xt = trans.fit_transform(X)
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|
1 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.250 | B96 B98 | S |
2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.283 | C85 | C |
3 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | B96 B98 | S |
4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.100 | C123 | S |
5 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.050 | B96 B98 | S |
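The default behavior described above (mean for numeric columns, mode for categorical columns) can be sketched in pandas as follows. This only illustrates the default strategy. The function name `impute_mean_mode` is hypothetical, and picking the first value when several modes tie is an assumption of this sketch.

```python
import numpy as np
import pandas as pd


def impute_mean_mode(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: fill NaN with the mean (numeric) or mode (other)."""
    out = df.copy()
    for col in out.columns:
        # Skip columns with nothing to fill, or nothing to learn from.
        if not out[col].isna().any() or out[col].isna().all():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].mean())
        else:
            # mode() can return several values on ties; take the first one.
            out[col] = out[col].fillna(out[col].mode().iloc[0])
    return out


toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 30.0],
    "Cabin": ["B96", None, "B96", "C85"],
})
imputed = impute_mean_mode(toy)
```

In the table above, the missing Cabin values are all filled with the same value, which is what mode imputation produces.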
ClipData
Clips each numeric column at given quantiles: values below the lower quantile are replaced with the lower bound, and values above the upper quantile are replaced with the upper bound. The threshold argument controls how much is clipped. By default, the bounds are the 1% and 99% quantiles.
trans = dl.ClipData()
Xt = trans.fit_transform(X)
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|
9.9 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.250 | NaN | S |
9.9 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.283 | C85 | C |
9.9 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
9.9 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.100 | C123 | S |
9.9 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.050 | NaN | S |
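As an illustration of quantile clipping, here is a small pandas sketch. The function name `clip_quantiles` and its `lower` and `upper` parameters are hypothetical and do not mirror DataLiner's threshold argument. The sketch also assumes pandas' default linear interpolation for quantiles.

```python
import pandas as pd


def clip_quantiles(df: pd.DataFrame, lower: float = 0.01,
                   upper: float = 0.99) -> pd.DataFrame:
    """Hypothetical helper: clip numeric columns to [lower, upper] quantiles."""
    out = df.copy()
    for col in out.select_dtypes(include="number").columns:
        lo = out[col].quantile(lower)
        hi = out[col].quantile(upper)
        # Values outside [lo, hi] are replaced by the nearest bound.
        out[col] = out[col].clip(lower=lo, upper=hi)
    return out


# An ID column running from 1 to 891, like PassengerId in train.csv.
toy = pd.DataFrame({"PassengerId": range(1, 892), "Sex": ["male"] * 891})
clipped = clip_quantiles(toy)
```

With IDs from 1 to 891, the 1% quantile is 1 + 0.01 × 890 = 9.9, which is consistent with the PassengerId values shown in the table above.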
GroupRareCategory
Replaces infrequent categories in categorical variables with the string "RareCategory", which helps reduce cardinality. It is useful to apply before OneHotEncoding. The threshold argument sets the share of the data at or below which a category is replaced. The default is 1%.
trans = dl.GroupRareCategory()
Xt = trans.fit_transform(X)
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|
1 | 3 | RareCategory | male | 22 | 1 | 0 | RareCategory | 7.250 | NaN | S |
2 | 1 | RareCategory | female | 38 | 1 | 0 | RareCategory | 71.283 | RareCategory | C |
3 | 3 | RareCategory | female | 26 | 0 | 0 | RareCategory | 7.925 | NaN | S |
4 | 1 | RareCategory | female | 35 | 1 | 0 | RareCategory | 53.100 | RareCategory | S |
5 | 3 | RareCategory | male | 35 | 0 | 0 | RareCategory | 8.050 | NaN | S |
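A rough pandas version of this grouping is sketched below. It is not DataLiner's code. The function name `group_rare` is hypothetical, and the sketch assumes that a category's share is measured against the total number of rows and that the columns hold plain strings.

```python
import pandas as pd


def group_rare(df: pd.DataFrame, threshold: float = 0.01,
               label: str = "RareCategory") -> pd.DataFrame:
    """Hypothetical helper: merge infrequent categories into one label."""
    out = df.copy()
    # Treat every non-numeric column as categorical (assumes string columns).
    for col in out.select_dtypes(exclude="number").columns:
        # Share of all rows taken by each category; NaN is not counted.
        share = out[col].value_counts() / len(out)
        rare = share[share <= threshold].index
        # Keep frequent values and NaN, replace rare values with the label.
        out[col] = out[col].where(~out[col].isin(rare), label)
    return out


toy = pd.DataFrame({
    "Sex": ["male", "female"] * 100,
    "Ticket": [f"T{i}" for i in range(200)],
    "Cabin": ["C85"] + [None] * 199,
})
grouped = group_rare(toy)
```

As in the table above, high-cardinality columns such as Ticket collapse entirely into "RareCategory", while missing values stay missing.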
That covers DataLiner's conversion classes. Next time I will introduce the Append classes, which will complete the tour of the preprocessing implemented as of ver. 1.1.6.
DataLiner release article: https://qiita.com/shallowdf20/items/36727c9a18f5be365b37
GitHub: https://github.com/shallowdf20/dataliner
PyPI: https://pypi.org/project/dataliner/