With the release of DataLiner 1.3.1, all the main features that were originally planned are now implemented. From here on, development will consist mainly of bug fixes and new preprocessing classes, added as they come up.
Release article: https://qiita.com/shallowdf20/items/36727c9a18f5be365b37
GitHub: https://github.com/shallowdf20/dataliner
Documentation: https://shallowdf20.github.io/dataliner/preprocessing.html
! pip install -U dataliner
This release contains the following four changes.
- UnionAppend implemented
- StandardizeData removed (renamed to StandardScaling)
- ArithmeticFeatureGenerator removed (renamed to AppendArithmeticFeatures)
- load_titanic implemented
Next, let's look at how to use these in practice.
First, import the packages used in this article.
import dataliner as dl
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
Starting with this version, the Titanic dataset is bundled with the package, so it is easy to try things out. Use the load_titanic function to load the sample data.
X, X_test, y = dl.load_titanic()
Now X is train.csv without the 'Survived' column, X_test is test.csv, and y is the 'Survived' column of train.csv.
This time, I will introduce a new process called **UnionAppend**.
In DataLiner, every preprocessing class that adds new features derived from existing ones is, as a rule, named AppendXxx.
That is why ArithmeticFeatureGenerator was renamed to AppendArithmeticFeatures in this version. The exceptions are BinarizeNaN and CountRowNaN: in principle these run before missing-value imputation and categorical encoding, so they keep their current names.
Now suppose you want to add as many features as you can, and you build a pipeline like the following.
process = make_pipeline(
    dl.ImputeNaN(),
    dl.TargetMeanEncoding(),
    dl.StandardScaling(),
    dl.AppendCluster(),
    dl.AppendAnomalyScore(),
    dl.AppendPrincipalComponent(),
    dl.AppendClusterTargetMean(),
    dl.AppendClassificationModel(model=RandomForestClassifier(n_estimators=100, max_depth=5)),
    dl.AppendClusterDistance(),
    dl.AppendArithmeticFeatures(),
)
process.fit_transform(X, y)
With this approach, the feature added by AppendCluster, for example, becomes part of the input to the next step, AppendAnomalyScore. (The same holds for every AppendXxx step after it, so the input features keep growing from step to step.)
You may instead want to run these steps in parallel rather than in series, so that every AppendXxx step is computed from the same base features. In that case you can use UnionAppend.
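The serial behaviour described above can be illustrated with a small, library-free sketch. The functions `append_sum`, `append_count`, and `run_serial` are hypothetical toy stand-ins invented for this illustration; they are not DataLiner classes and say nothing about how DataLiner is implemented internally.

```python
# Toy stand-ins for Append-style steps. These are NOT DataLiner classes.
def append_sum(row):
    """Return the row plus one new feature: the sum of all current values."""
    return {**row, "sum_all": sum(row.values())}


def append_count(row):
    """Return the row plus one new feature: the number of columns it received."""
    return {**row, "n_columns": len(row)}


def run_serial(row, steps):
    """Apply the steps one after another, each on the previous step's output."""
    for step in steps:
        row = step(row)
    return row


row = {"a": 1.0, "b": 2.0}
serial = run_serial(row, [append_sum, append_count])

# append_count ran second, so it also counted the column added by append_sum.
print(serial)  # {'a': 1.0, 'b': 2.0, 'sum_all': 3.0, 'n_columns': 3}
```

Here `n_columns` is 3 rather than 2, because the second step received the column that the first step had just added.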
process = make_pipeline(
    dl.ImputeNaN(),
    dl.TargetMeanEncoding(),
    dl.StandardScaling(),
    dl.UnionAppend([
        dl.AppendCluster(),
        dl.AppendAnomalyScore(),
        dl.AppendPrincipalComponent(),
        dl.AppendClusterTargetMean(),
        dl.AppendClassificationModel(model=RandomForestClassifier(n_estimators=100, max_depth=5)),
        dl.AppendClusterDistance(),
        dl.AppendArithmeticFeatures(),
    ]),
)
process.fit_transform(X, y)
When you pass the AppendXxx instances you want to apply to UnionAppend as a list, every step inside UnionAppend works from the same base features, and the results of all the AppendXxx steps are combined and returned. The execution result is as follows.
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Cluster_Number | Anomaly_Score | Principal_Component_0 | Principal_Component_1 | Principal_Component_2 | Principal_Component_3 | Principal_Component_4 | cluster_mean | Predicted_RandomForestClassifier | Cluster_Distance_0 | Cluster_Distance_1 | Cluster_Distance_2 | Cluster_Distance_3 | Cluster_Distance_4 | Cluster_Distance_5 | Cluster_Distance_6 | Cluster_Distance_7 | Age_multiply_SibSp | PassengerId_multiply_SibSp | SibSp_multiply_Parch |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
-1.729 | 0.8269 | -0.9994 | -0.7373 | -0.5921 | 0.4326 | -0.4734 | -0.1954 | -0.5022 | -0.3479 | -0.5397 | 1 | 0.094260 | -1.4177 | 0.1906 | -0.35640 | -1.398 | -0.5801 | 0.1677 | 0 | 2.861 | 1.265 | 4.352 | 3.466 | 5.616 | 3.461 | 2.782 | 5.667 | -0.2561 | -0.7479 | -0.2048 |
-1.725 | -1.5652 | -0.9994 | 1.3548 | 0.6384 | 0.4326 | -0.4734 | -0.1954 | 0.7864 | 0.1665 | 2.0434 | 5 | -0.047463 | 1.9956 | 0.1777 | -0.14888 | -2.449 | 0.6941 | 0.4874 | 1 | 3.768 | 4.335 | 5.799 | 3.681 | 3.946 | 3.028 | 4.993 | 4.830 | 0.2762 | -0.7463 | -0.2048 |
-1.721 | 0.8269 | -0.9994 | 1.3548 | -0.2845 | -0.4743 | -0.4734 | -0.1954 | -0.4886 | -0.3479 | -0.5397 | 0 | 0.076929 | -0.8234 | 0.2181 | -1.24773 | -1.380 | -1.2529 | 0.7321 | 1 | 1.870 | 2.311 | 4.937 | 3.759 | 5.490 | 3.548 | 3.376 | 5.467 | 0.1349 | 0.8164 | 0.2245 |
-1.717 | -1.5652 | -0.9994 | 1.3548 | 0.4077 | 0.4326 | -0.4734 | 0.2317 | 0.4205 | 0.8250 | -0.5397 | 0 | -0.000208 | 1.5823 | 0.2699 | 0.10503 | -1.536 | -1.6788 | 0.7321 | 1 | 2.835 | 3.547 | 5.352 | 3.058 | 4.090 | 3.970 | 4.338 | 3.846 | 0.1763 | -0.7429 | -0.2048 |
-1.714 | 0.8269 | -0.9994 | -0.7373 | 0.4077 | -0.4743 | -0.4734 | -0.1954 | -0.4861 | -0.3479 | -0.5397 | 1 | 0.106421 | -1.2160 | -0.7344 | -0.09900 | -1.500 | -0.7327 | 0.1677 | 0 | 2.866 | 1.148 | 5.064 | 2.921 | 5.567 | 3.463 | 2.689 | 5.594 | -0.1934 | 0.8127 | 0.2245 |
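As a rough mental model of the behaviour shown above, a union-style wrapper could be sketched as follows. This is only an illustration under assumptions: `UnionAppendSketch`, `AppendRowSum`, and `AppendColumnCount` are hypothetical names made up here, the data is a plain dict of column lists rather than a DataFrame, and the actual DataLiner implementation may work differently.

```python
class AppendRowSum:
    """Toy step: appends the per-row sum of every input column."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        n_rows = len(next(iter(X.values())))
        sums = [sum(col[i] for col in X.values()) for i in range(n_rows)]
        return {**X, "row_sum": sums}


class AppendColumnCount:
    """Toy step: appends how many columns it received as input."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        n_rows = len(next(iter(X.values())))
        return {**X, "n_columns": [len(X)] * n_rows}


class UnionAppendSketch:
    """Fit every step on the same input, then merge only the new columns."""

    def __init__(self, steps):
        self.steps = steps

    def fit(self, X, y=None):
        for step in self.steps:
            step.fit(X, y)  # every step is fitted on the same base features
        return self

    def transform(self, X):
        result = dict(X)
        for step in self.steps:
            out = step.transform(X)  # each step sees only the base columns
            result.update({k: v for k, v in out.items() if k not in X})
        return result


X_toy = {"a": [1.0, 2.0], "b": [10.0, 20.0]}
union = UnionAppendSketch([AppendRowSum(), AppendColumnCount()])
combined = union.fit(X_toy).transform(X_toy)
print(combined)
```

In this toy version `n_columns` is 2 for every row, because `AppendColumnCount` never sees the `row_sum` column added by the other step; the base columns come first in the result, followed by each step's new columns in list order.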
The test data is processed in the usual way.
process.transform(X_test)
PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Cluster_Number | Anomaly_Score | Principal_Component_0 | Principal_Component_1 | Principal_Component_2 | Principal_Component_3 | Principal_Component_4 | cluster_mean | Predicted_RandomForestClassifier | Cluster_Distance_0 | Cluster_Distance_1 | Cluster_Distance_2 | Cluster_Distance_3 | Cluster_Distance_4 | Cluster_Distance_5 | Cluster_Distance_6 | Cluster_Distance_7 | Age_multiply_SibSp | PassengerId_multiply_SibSp | SibSp_multiply_Parch |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1.733 | 0.8269 | -0.9994 | -0.7373 | 0.3692 | -0.4743 | -0.4734 | -0.1954 | -0.4905 | -0.3479 | 0.06949 | 6 | 0.08314 | -0.92627 | -1.0572 | 0.17814 | 1.514 | 0.78456 | 0.1087 | 0 | 3.095 | 2.724 | 5.232 | 2.986 | 5.397 | 3.045 | 1.1646 | 5.405 | -0.17512 | -0.8219 | 0.2245 |
1.737 | 0.8269 | -0.9994 | 1.3548 | 1.3306 | 0.4326 | -0.4734 | -0.1954 | -0.5072 | -0.3479 | -0.53969 | 0 | 0.01921 | -0.45407 | -0.2239 | 0.40615 | 1.531 | -0.36302 | 0.7321 | 0 | 2.677 | 3.744 | 5.022 | 3.451 | 5.503 | 3.924 | 2.7926 | 5.414 | 0.57556 | 0.7513 | -0.2048 |
1.741 | -0.3692 | -0.9994 | -0.7373 | 2.4843 | -0.4743 | -0.4734 | -0.1954 | -0.4531 | -0.3479 | 0.06949 | 3 | 0.02651 | 0.04527 | -2.0548 | 1.70715 | 1.119 | 0.49872 | 0.2277 | 0 | 4.047 | 3.880 | 6.207 | 2.345 | 5.441 | 3.955 | 2.9554 | 5.527 | -1.17825 | -0.8256 | 0.2245 |
1.745 | 0.8269 | -0.9994 | -0.7373 | -0.2076 | -0.4743 | -0.4734 | -0.1954 | -0.4737 | -0.3479 | -0.53969 | 6 | 0.11329 | -1.17022 | -0.7993 | 0.02809 | 1.770 | 0.37658 | 0.1087 | 0 | 3.011 | 2.615 | 5.099 | 3.238 | 5.522 | 3.420 | 0.9194 | 5.466 | 0.09846 | -0.8275 | 0.2245 |
1.749 | 0.8269 | -0.9994 | 1.3548 | -0.5921 | 0.4326 | 0.7672 | -0.1954 | -0.4008 | -0.3479 | -0.53969 | 0 | 0.02122 | -0.63799 | 1.2879 | -0.38498 | 1.920 | -0.06859 | 0.7321 | 1 | 2.269 | 3.601 | 3.888 | 4.139 | 5.238 | 3.679 | 2.6813 | 5.425 | -0.25613 | 0.7563 | 0.3319 |
This completes the implementation of the features and preprocessing classes that were initially planned. From here on, I will fix bugs as I find them and add new preprocessing as I think of it.