I have released DataLiner 1.2.0, a preprocessing library for machine learning. This release adds six new preprocessing classes, which I introduce below.
GitHub: https://github.com/shallowdf20/dataliner
PyPI: https://pypi.org/project/dataliner/
Documentation: https://shallowdf20.github.io/dataliner/preprocessing.html
Install using pip.
```
!pip install -U dataliner
```
As usual, we will use the Titanic dataset.
```python
import pandas as pd
import dataliner as dl
# make_pipeline and RandomForestClassifier are used in the examples below
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier

df = pd.read_csv('train.csv')
target_col = 'Survived'
X = df.drop(target_col, axis=1)
y = df[target_col]
```
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|
1 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.250 | NaN | S |
2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.283 | C85 | C |
3 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.100 | C123 | S |
5 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.050 | NaN | S |
Now let's take a look at each one.
AppendArithmeticFeatures

Applies the four basic arithmetic operations to the features in the data, and appends a new feature only when it scores higher on the evaluation metric than the features used to compute it. Evaluation is done with logistic regression. By default the operation is multiplication and the metric is AUC, but addition, subtraction, and division, as well as accuracy, are also available. Missing values must be imputed before use.
```python
process = make_pipeline(
    dl.ImputeNaN(),
    dl.AppendArithmeticFeatures(metric='roc_auc', operation='multiply')
)
process.fit_transform(X, y)
```
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | PassengerId_multiply_Age | PassengerId_multiply_SibSp | PassengerId_multiply_Parch | Pclass_multiply_Age |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.250 | B96 B98 | S | 22 | 1 | 0 | 66 |
2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.283 | C85 | C | 76 | 2 | 0 | 38 |
3 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | B96 B98 | S | 78 | 0 | 0 | 78 |
4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.100 | C123 | S | 140 | 4 | 0 | 35 |
5 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.050 | B96 B98 | S | 175 | 0 | 0 | 105 |
In this way, new features are added.
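To make the gating concrete, here is a minimal sketch of the idea: a product feature is kept only when its single-feature score beats both of its parents. This is my own illustration, not DataLiner's implementation; the helper names (auc_of, append_products) are hypothetical, and it assumes a numeric, NaN-free DataFrame and the default 5-fold roc_auc evaluation.

```python
from itertools import combinations

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def auc_of(col, y):
    """Mean 5-fold AUC of a logistic regression fit on a single column."""
    return cross_val_score(LogisticRegression(), col.to_frame(), y,
                           cv=5, scoring='roc_auc').mean()


def append_products(X_num, y):
    """Append a_multiply_b only when it outscores both parent features."""
    X_new = X_num.copy()
    base = {c: auc_of(X_num[c], y) for c in X_num.columns}
    for a, b in combinations(X_num.columns, 2):
        candidate = X_num[a] * X_num[b]
        if auc_of(candidate, y) > max(base[a], base[b]):
            X_new[f'{a}_multiply_{b}'] = candidate
    return X_new
```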
RankedEvaluationMetricEncoding

Each category is first turned into a dummy variable, and a logistic regression is fit between each category column and the target variable. A ranking is then built from the resulting metric (AUC by default), and the original categories are encoded with their rank. Since a 5-fold logistic regression is fitted for every category, the computation becomes enormous for high-cardinality features, so it is recommended to reduce cardinality beforehand with DropHighCardinality or GroupRareCategory.
```python
process = make_pipeline(
    dl.ImputeNaN(),
    dl.RankedEvaluationMetricEncoding()
)
process.fit_transform(X, y)
```
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|
1 | 3 | 640 | 2 | 22 | 1 | 0 | 288 | 7.250 | 1 | 1 |
2 | 1 | 554 | 1 | 38 | 1 | 0 | 284 | 71.283 | 77 | 2 |
3 | 3 | 717 | 1 | 26 | 0 | 0 | 256 | 7.925 | 1 | 1 |
4 | 1 | 803 | 1 | 35 | 1 | 0 | 495 | 53.100 | 112 | 1 |
5 | 3 | 602 | 2 | 35 | 0 | 0 | 94 | 8.050 | 1 | 1 |
You can also check how important each category of a categorical variable is by outputting the ranking:
```python
process['rankedevaluationmetricencoding'].dic_corr_['Embarked']
```
Category | Rank | Evaluation_Metric |
---|---|---|
S | 1 | 0.5688 |
C | 2 | 0.5678 |
Q | 3 | 0.4729 |
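The encoding itself can be pictured with a short sketch: fit a 5-fold logistic regression on a single dummy column per category, then map each category to its rank. This is an illustration under my own naming (rank_encode), not DataLiner's actual code; applied to Embarked it would produce a ranking like the table above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def rank_encode(col, y):
    """Rank categories by single-dummy 5-fold AUC; rank 1 = highest AUC."""
    scores = {}
    for category in col.unique():
        dummy = (col == category).astype(int).to_frame()
        scores[category] = cross_val_score(LogisticRegression(), dummy, y,
                                           cv=5, scoring='roc_auc').mean()
    order = sorted(scores, key=scores.get, reverse=True)
    mapping = {cat: rank for rank, cat in enumerate(order, start=1)}
    return col.map(mapping), scores
```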
AppendClassificationModel

A classifier is trained on the input data and its predictions are appended as a new feature. Any scikit-learn-compliant model can be used. If the model implements the predict_proba method, you can append a probability score instead of a label by passing probability=True. Because a model is trained internally, missing-value imputation and categorical encoding are generally required first.
```python
process = make_pipeline(
    dl.ImputeNaN(),
    dl.TargetMeanEncoding(),
    dl.AppendClassificationModel(model=RandomForestClassifier(n_estimators=300, max_depth=5),
                                 probability=False)
)
process.fit_transform(X, y)
```
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Predicted_RandomForestClassifier |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3 | 0.3838 | 0.1889 | 22 | 1 | 0 | 0.3838 | 7.250 | 0.3039 | 0.3390 | 0 |
2 | 1 | 0.3838 | 0.7420 | 38 | 1 | 0 | 0.3838 | 71.283 | 0.3838 | 0.5536 | 1 |
3 | 3 | 0.3838 | 0.7420 | 26 | 0 | 0 | 0.3838 | 7.925 | 0.3039 | 0.3390 | 1 |
4 | 1 | 0.3838 | 0.7420 | 35 | 1 | 0 | 0.4862 | 53.100 | 0.4862 | 0.3390 | 1 |
5 | 3 | 0.3838 | 0.1889 | 35 | 0 | 0 | 0.3838 | 8.050 | 0.3039 | 0.3390 | 0 |
With probability=True, the predicted probability of class 1 is appended instead:
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Predicted_RandomForestClassifier |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3 | 0.3838 | 0.1889 | 22 | 1 | 0 | 0.3838 | 7.250 | 0.3039 | 0.3390 | 0.1497 |
2 | 1 | 0.3838 | 0.7420 | 38 | 1 | 0 | 0.3838 | 71.283 | 0.3838 | 0.5536 | 0.8477 |
3 | 3 | 0.3838 | 0.7420 | 26 | 0 | 0 | 0.3838 | 7.925 | 0.3039 | 0.3390 | 0.5401 |
4 | 1 | 0.3838 | 0.7420 | 35 | 1 | 0 | 0.4862 | 53.100 | 0.4862 | 0.3390 | 0.8391 |
5 | 3 | 0.3838 | 0.1889 | 35 | 0 | 0 | 0.3838 | 8.050 | 0.3039 | 0.3390 | 0.1514 |
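As a rough sketch of what the class does (not DataLiner's code; append_prediction is a hypothetical helper), the behavior of the probability flag can be pictured like this:

```python
from sklearn.ensemble import RandomForestClassifier


def append_prediction(X_num, y, model=None, probability=False):
    """Fit a model and append its prediction (or class-1 score) as a column."""
    model = model or RandomForestClassifier(n_estimators=300, max_depth=5)
    model.fit(X_num, y)
    X_new = X_num.copy()
    name = f'Predicted_{type(model).__name__}'
    if probability:
        # the second column of predict_proba is the score for class 1
        X_new[name] = model.predict_proba(X_num)[:, 1]
    else:
        X_new[name] = model.predict(X_num)
    return X_new
```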
AppendEncoder

The various encoders included in DataLiner replace the categorical columns with their encoded values in place. In some cases, however, you may want to keep the original column and add the encoding as a new feature (with TargetMeanEncoding, for example). Wrapping the encoder in this class appends the encoded values as new features instead of replacing the originals.
```python
process = make_pipeline(
    dl.ImputeNaN(),
    dl.AppendEncoder(encoder=dl.TargetMeanEncoding())
)
process.fit_transform(X, y)
```
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Name_TargetMeanEncoding | Sex_TargetMeanEncoding | Ticket_TargetMeanEncoding | Cabin_TargetMeanEncoding | Embarked_TargetMeanEncoding |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.250 | B96 B98 | S | 0.3838 | 0.1889 | 0.3838 | 0.3039 | 0.3390 |
2 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38 | 1 | 0 | PC 17599 | 71.283 | C85 | C | 0.3838 | 0.7420 | 0.3838 | 0.3838 | 0.5536 |
3 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.925 | B96 B98 | S | 0.3838 | 0.7420 | 0.3838 | 0.3039 | 0.3390 |
4 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.100 | C123 | S | 0.3838 | 0.7420 | 0.4862 | 0.4862 | 0.3390 |
5 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.050 | B96 B98 | S | 0.3838 | 0.1889 | 0.3838 | 0.3039 | 0.3390 |
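The wrapper idea can be sketched as follows. This is illustrative only: append_encoded is a hypothetical helper, and I assume the inner encoder follows the sklearn fit_transform(X, y) convention, as the DataLiner encoders do in the pipelines above.

```python
def append_encoded(X, y, encoder):
    """Run the inner encoder on a copy and append suffixed result columns."""
    cat_cols = X.select_dtypes(include='object').columns
    encoded = encoder.fit_transform(X.copy(), y)
    X_new = X.copy()
    suffix = type(encoder).__name__
    for col in cat_cols:
        # e.g. Cabin -> Cabin_TargetMeanEncoding, keeping the original column
        X_new[f'{col}_{suffix}'] = encoded[col]
    return X_new
```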
AppendClusterTargetMean

Clusters the data and assigns each row a cluster number (identical to AppendCluster up to this point), then replaces each cluster number with the mean of the target variable within that cluster and appends it as a new feature. Missing-value imputation and categorical encoding are required.
```python
process = make_pipeline(
    dl.ImputeNaN(),
    dl.TargetMeanEncoding(),
    dl.AppendClusterTargetMean()
)
process.fit_transform(X, y)
```
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | cluster_mean |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | 3 | 0.3838 | 0.1889 | 22 | 1 | 0 | 0.3838 | 7.250 | 0.3039 | 0.3390 | 0.3586 |
2 | 1 | 0.3838 | 0.7420 | 38 | 1 | 0 | 0.3838 | 71.283 | 0.3838 | 0.5536 | 0.3586 |
3 | 3 | 0.3838 | 0.7420 | 26 | 0 | 0 | 0.3838 | 7.925 | 0.3039 | 0.3390 | 0.3586 |
4 | 1 | 0.3838 | 0.7420 | 35 | 1 | 0 | 0.4862 | 53.100 | 0.4862 | 0.3390 | 0.3586 |
5 | 3 | 0.3838 | 0.1889 | 35 | 0 | 0 | 0.3838 | 8.050 | 0.3039 | 0.3390 | 0.3586 |
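A minimal sketch of the mechanism, assuming KMeans with a fixed number of clusters (the clustering algorithm and n_clusters=8 are my assumptions here, not necessarily DataLiner's defaults):

```python
import pandas as pd
from sklearn.cluster import KMeans


def append_cluster_target_mean(X_num, y, n_clusters=8):
    """Assign KMeans clusters, then append the mean of y within each cluster."""
    labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(X_num)
    cluster_means = y.groupby(labels).mean()
    X_new = X_num.copy()
    X_new['cluster_mean'] = pd.Series(labels, index=X_num.index).map(cluster_means)
    return X_new
```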
PermutationImportanceTest

This is a feature selection method. It evaluates how much the model's evaluation metric deteriorates when the values of a single feature are randomly shuffled. If shuffling a feature barely affects the metric, that feature is judged to contribute little to the prediction and is dropped.
```python
process = make_pipeline(
    dl.ImputeNaN(),
    dl.TargetMeanEncoding(),
    dl.PermutationImportanceTest()
)
process.fit_transform(X, y)
```
Pclass | Sex | Age | SibSp | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|
3 | 0.1889 | 22 | 1 | 0.3838 | 7.250 | 0.3039 | 0.3390 |
1 | 0.7420 | 38 | 1 | 0.3838 | 71.283 | 0.3838 | 0.5536 |
3 | 0.7420 | 26 | 0 | 0.3838 | 7.925 | 0.3039 | 0.3390 |
1 | 0.7420 | 35 | 1 | 0.4862 | 53.100 | 0.4862 | 0.3390 |
3 | 0.1889 | 35 | 0 | 0.3838 | 8.050 | 0.3039 | 0.3390 |
PassengerId, Name, and Parch have been removed. You can check the dropped features as follows:
```python
process['permutationimportancetest'].drop_columns_
```
```
['PassengerId', 'Name', 'Parch']
```
You can also adjust the sensitivity via the threshold argument. See the documentation for details.
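The test itself can be sketched as follows; the helper name, the choice of logistic regression as the probe model, and the default threshold value here are illustrative, not DataLiner's actual defaults.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score


def permutation_test_drop(X_num, y, threshold=0.0001, seed=0):
    """Drop features whose shuffling barely hurts the 5-fold AUC."""
    rng = np.random.RandomState(seed)
    model = LogisticRegression(max_iter=1000)
    base = cross_val_score(model, X_num, y, cv=5, scoring='roc_auc').mean()
    drop_columns = []
    for col in X_num.columns:
        X_shuf = X_num.copy()
        X_shuf[col] = rng.permutation(X_shuf[col].values)
        score = cross_val_score(model, X_shuf, y, cv=5, scoring='roc_auc').mean()
        if base - score < threshold:  # metric barely dropped -> feature unhelpful
            drop_columns.append(col)
    return X_num.drop(columns=drop_columns), drop_columns
```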
That covers the newly added preprocessing. RankedEvaluationMetricEncoding sometimes gives better accuracy than TargetMeanEncoding, so I often try it. PermutationImportanceTest also runs faster than Boruta or step-wise selection, yet the results are often surprisingly similar; I think it is useful when you want to select features more rigorously than with DropLowAUC.
Release article: [Updated Ver1.1.9] I made a data preprocessing library DataLiner for machine learning
Preprocessing added before 1.2 is introduced in the following articles:
- Try processing Titanic data with the preprocessing library DataLiner (Drop)
- Try processing Titanic data with the preprocessing library DataLiner (Encoding)
- Try processing Titanic data with the preprocessing library DataLiner (Conversion)
- Try processing Titanic data with the preprocessing library DataLiner (Append)