I wrote this about two years ago and left it sitting on our in-house site, but since I went to the trouble of writing it, I am publishing it here.
It is based on the following slides by HJ van Veen, "Feature Engineering" (an adaptation rather than a translation, perhaps, since the structure has changed considerably): https://www.slideshare.net/HJvanVeen/feature-engineering-72376750
In terms of coverage, I think these slides are more comprehensive than many short books. However, being slides, the descriptions are quite terse and many parts are hard to follow on their own; since I have not heard the original talk, I may also have misread the author's intent in places. For the details of each topic, a quick search will usually turn up explanations that are easy to understand, so I have neither written them up nor supplemented them here. I added supplementary notes only for topics on which little information is available or which are hard to understand.
Also, although the slides list many techniques, they say little about specific usage situations or about advantages and disadvantages. I will leave that as it is this time, but I would like to do something about it eventually.
The following Python modules can be used for the feature engineering techniques introduced below.
Note: The module information may be a bit out of date as it was written a long time ago.
pandas
Useful for general processing of tabular data. `get_dummies()` looks usable as a one-hot encoder, but it cannot handle missing values, so ~~I think it is better to write a wrapper as a scikit-learn transformer class~~ scikit-learn's `OneHotEncoder` is easier to use.

scikit-learn
The various transformers in `preprocessing`. Basically everything is converted through the `fit()` and `transform()` methods, which keeps the source code easy to read. They can also be organized with `Pipeline`. Ideally everything would be organized with `Pipeline`, but not that many classes are implemented.
http://scikit-learn.org/stable/modules/preprocessing.html
This area also relates, for example, to a post I wrote earlier: "To make data analysis work in Python smart".
Category Encoders
https://contrib.scikit-learn.org/categorical-encoding/
It provides various additional encoder classes that can be used together with scikit-learn. It also accepts pandas.Series. On the other hand, it is not implemented with speed in mind, so processing large data is not efficient (it is very slow).
**Addendum: It is a scikit-learn-contrib project, but a recent update has made it difficult to use in pipelines. I think it is easier to use the scikit-learn native transformers or to implement the encoders yourself.**
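For reference, a minimal usage sketch of Category Encoders (the column and data are made up; as the addendum notes, recent versions may be awkward to use inside pipelines):

```python
import pandas as pd
import category_encoders as ce  # pip install category_encoders

# Tiny made-up example: target-encode a single categorical column.
X = pd.DataFrame({"city": ["tokyo", "osaka", "tokyo", "nagoya"]})
y = pd.Series([1, 0, 1, 0])

enc = ce.TargetEncoder(cols=["city"])  # replaces each category with a smoothed target mean
X_enc = enc.fit_transform(X, y)
print(X_enc)
```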
Mlxtend
A more advanced module designed to work with scikit-learn.
https://rasbt.github.io/mlxtend/
**Addendum: With a recent update, scikit-learn itself also provides classes that make heterogeneous ensemble learning (stacking) easy. I have not looked into how their usability compares.**
imbalanced-learn
A module built to work with scikit-learn, for handling imbalanced data [^ imbpost].
http://contrib.scikit-learn.org/imbalanced-learn/stable/
[^ imbpost]: Resampling imbalanced data distorts the posterior probabilities and (obviously) introduces bias into predictions if the resampled model is used for classification as is. This problem and a solution have been shown theoretically ([10.1109/SSCI.2015.33](https://doi.org/10.1109/SSCI.2015.33)), but the idea in that paper had been mentioned many times before. Takuya Kitazawa's personal page has a summary of the ideas in this area for reference.
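A minimal resampling sketch with imbalanced-learn (synthetic data only for illustration; as the footnote warns, the predicted probabilities of a model trained on the resampled data need correcting before use):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

# Synthetic imbalanced data just for illustration.
X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# SMOTE synthesizes new minority-class samples by interpolating between neighbors.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```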
fancyimpute
A module for missing value imputation (I have never used it).
https://github.com/iskandr/fancyimpute
Before the main part, let me add a few notes.
[^ one-hot-sparse]: Translator's note: there are implementations such as `pandas.get_dummies` and `sklearn.preprocessing.OneHotEncoder`, and both can return the result as a scipy sparse matrix: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.get_dummies.html, http://contrib.scikit-learn.org/categorical-encoding/onehot.html, http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html ~~However, since sklearn is numpy-based, the input data must be numeric~~ Currently it also supports object dtype, so it is quite easy to use.
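A small sketch of the two implementations mentioned in the footnote, both producing sparse output (the example column is made up):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "blue", "red", "green"]})

# pandas: sparse one-hot columns (backed by a SparseDtype).
dummies = pd.get_dummies(df["color"], sparse=True)

# scikit-learn: string/object input is fine; the result is a scipy sparse matrix by default.
enc = OneHotEncoder(handle_unknown="ignore")
sparse_matrix = enc.fit_transform(df[["color"]])
print(dummies.shape, sparse_matrix.shape)
```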
Also called the hashing trick or feature hashing [^ hash-trick].
[^ hash-trick]: In the original paper (DOI: 10.1145/1553374.1553516), the hashing trick refers to the combination of hash conversion and the kernel trick. Recently, many people use the term for just the hash conversion.
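A minimal sketch of feature hashing with scikit-learn's `FeatureHasher` (the `"city="` prefix is just an illustrative naming convention, not something the slides prescribe):

```python
from sklearn.feature_extraction import FeatureHasher

# Hash each categorical value into a fixed number of columns; no fitted vocabulary is kept,
# so categories unseen at training time are handled naturally.
hasher = FeatureHasher(n_features=8, input_type="string")
X = hasher.transform([["city=tokyo"], ["city=osaka"], ["city=unknown_place"]])
print(X.toarray())
```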
" 1 "
Note that in the case of a linear model, this is only effective [^ yuukou] if the objective variable has some linear correlation with the count.
[^ yuukou]: Translator's note: as usual, "effective" here means "can sometimes be effective".
[^ tie]: What to do in the case of a tie in the ranking?
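A rough sketch of count encoding and rank (label-count) encoding in pandas; the data is made up, and the tie-breaking rule shown is only one possible answer to the question in the footnote above:

```python
import pandas as pd

s = pd.Series(["a", "b", "a", "c", "a", "b"])

# Count encoding: replace each category with how often it appears.
counts = s.map(s.value_counts())

# Label-count (rank) encoding: rank categories by frequency; ties are the issue
# the footnote alludes to (here they are broken arbitrarily via method="first").
ranks = s.map(s.value_counts().rank(method="first"))
print(pd.DataFrame({"value": s, "count": counts, "rank": ranks}))
```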
[^ embed]: As usual, this is not a universal property; there are reports (arXiv:1604.06737) that it is effective when the data has a "more complicated structure", i.e. situations like the so-called Swiss roll problem.
**Translator's note**: Polynomial expansion here means using combinations of categorical variables as features. These can be pairs, or combinations of three or more variables. One-hot encoding of categorical variables often produces very high-dimensional data to begin with, so adding their combinations causes a "combinatorial explosion". If there are many categorical variables, you can instead use Factorization Machines (10.1109/ICDM.2010.127), an algorithm whose learning includes the polynomial (interaction) terms.
[^ fs]: Do they mean feature selection?
[^ vw]: I am not sure about this one. Vowpal Wabbit's feature selection algorithm?
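A minimal sketch of the combination-of-categories idea from the translator's note above (made-up columns; with many categories the one-hot result blows up combinatorially, which is where Factorization Machines become attractive):

```python
import pandas as pd

df = pd.DataFrame({
    "browser": ["chrome", "firefox", "chrome", "safari"],
    "os":      ["win",    "mac",     "mac",    "mac"],
})

# Second-order interaction: treat each (browser, os) pair as a new categorical variable.
df["browser_x_os"] = df["browser"] + "_" + df["os"]

# One-hot encode the pair; with many categories this is where the
# "combinatorial explosion" mentioned above comes from.
pairs_onehot = pd.get_dummies(df["browser_x_os"])
print(pairs_onehot)
```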
Conversion of numerical information such as continuous variables and counts, which algorithms can handle more readily than categorical variables.
Converting a numeric variable to fit in a particular range.
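For example, a minimal scaling sketch with scikit-learn (made-up values):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [5.0], [10.0], [100.0]])  # made-up values

# Rescale to [0, 1]; sensitive to outliers such as the 100 above.
print(MinMaxScaler().fit_transform(X).ravel())

# Standardize to zero mean / unit variance instead.
print(StandardScaler().fit_transform(X).ravel())
```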
[^ pairwise]: Translator's note: pairwise deletion?
[^ estimate]: Translator's note: a Tobit model or multiple imputation?
**Translator's note**: For algorithms that cannot handle missing values these may be the only options that even work, but the first three methods are not generally valid. For the fourth, "estimate with a model", the slide warns to "note that the bias of the estimated values is added in", but it should be equally clear that imputing with the mean or the median also imposes an arbitrary model on the distribution of the feature. So if those methods count as effective, then, for example, uniformly converting missing values to zero can be said to be effective too.
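One way to act on the note above is to keep the fill rule simple and expose the missingness itself as a feature; a minimal sketch with scikit-learn's `SimpleImputer` (not something the slides prescribe):

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])

# Median imputation plus an explicit "was missing" indicator column, so the model
# can still see that a value was imputed rather than observed.
imp = SimpleImputer(strategy="median", add_indicator=True)
print(imp.fit_transform(X))
```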
[^ intui]: It's like a "belief"
How to improve the fit of a linear model through non-linear transformations [^ nonlin].
**Translator's note**: There is no specific explanation of **leaf coding**, but presumably a decision-tree algorithm such as random forest is fitted and, instead of its predicted values, the information about which leaf node each sample falls into is used as a feature. Since random forests can learn complex structure, the suggestion is that the leaf-node information can turn a non-linear relationship into something a linear model can use. I do not know who originated it, but 10.1145/2648584.2648589, for example, contains a usage example.
[^ nonlin]: Most of the methods listed here are computationally heavy, or their cost explodes with the number of dimensions, so many of them are impractical when there are many variables. For this area, Ako's "Kernel Multivariate Analysis", for example, is a useful reference.
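A minimal sketch of the leaf-coding idea described above, using a random forest's leaf indices as one-hot features (synthetic data; the hyperparameters are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import OneHotEncoder

X, y = make_classification(n_samples=200, random_state=0)

# Fit a small forest, then record which leaf each sample falls into in each tree.
rf = RandomForestClassifier(n_estimators=10, max_depth=3, random_state=0).fit(X, y)
leaves = rf.apply(X)  # shape: (n_samples, n_trees)

# One-hot encode the leaf indices; the result can feed a linear model (the GBDT+LR idea).
leaf_features = OneHotEncoder().fit_transform(leaves)
print(leaf_features.shape)
```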
Variables that represent dates and the like often need careful checking. They are a place where mistakes happen easily, but also where significant improvements are often possible.
**Translator's note**: Although the author does not mention it, there is a classic "circular transform" that uses the finite Fourier series (trigonometric polynomial expansion) employed to approximate periodic functions. This is, for example, how prophet expresses periodicity (if you do not know prophet, see what I wrote about it and the links there). Prophet uses many other techniques that you can easily implement yourself, which can be helpful when dealing with time variables. Of course, more simply, information such as the day of the week, the day of the month, or the day of the year can in some cases be expressed as a categorical variable via one-hot encoding.
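A minimal sketch of the first-order circular (Fourier) transform mentioned above, applied to the hour of day (made-up values):

```python
import numpy as np
import pandas as pd

hours = pd.Series([0, 6, 12, 18, 23])

# First-order Fourier (circular) encoding: hour 23 and hour 0 end up close together.
hour_sin = np.sin(2 * np.pi * hours / 24)
hour_cos = np.cos(2 * np.pi * hours / 24)
print(pd.DataFrame({"hour": hours, "sin": hour_sin, "cos": hour_cos}))
```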
Spatial location: GPS coordinates, cities, countries / regions, addresses, etc.
[^ kriging]: Translator's note: do they really mean doing kriging? I think it would be fine just to try adding a variogram to the features.
Scrutinize the data. Find ideas for data quality, outliers, noise, and feature extraction.
Feature engineering is work you repeat many times, so set up your workflow so that it can be iterated quickly.
The label / target / objective variable itself can be used as a feature, and vice versa.
**Translator's note**: This item is too terse to understand on its own, but I think it refers to three main techniques.
The first technique is to improve the fit by transforming the objective variable. For example, a simple linear regression model tends to fit poorly when the distribution of the objective variable is asymmetric and skewed; if you transform the objective variable, for example with a logarithm or a square, so as to reduce the asymmetry, you may find that the linear regression fits better. A typical formalization of this idea is the **generalized linear model** (GLM) (a minimal sketch follows after this list of three techniques).
The second point is that adding the objective variable, transformed in some other way, as a feature of such a model may work even better. Of course, because the objective variable itself is used as a feature, such a model cannot be used for actual prediction; but it can be used to check where the poor fit comes from, in other words, which tendencies of the objective variable the model currently under test fails to capture. By inspecting such a model with a residual histogram or a Q-Q plot, we can get hints for improving the transformation of the objective variable mentioned in the first technique. I think this refers to what is called residual diagnostics.
The third is the content of the last two bullets, "binary variables ~" and "cannot be used in test data ~". As the name suggests, a binary variable takes only two values, so it carries little information. The idea is that which of the two values the variable takes is governed by something unobserved, as in a **latent variable model**: there is an invisible probability behind it. "Scoring" means creating a variable that represents the probability corresponding to the value of the binary variable, and **replacing the classification problem with a regression problem on that score variable**. Since this score variable cannot be created without some thought (as usual, simply choosing the model that best fits the data is equivalent to solving the ordinary classification problem), you need information from outside the data, so-called "domain knowledge", on how the probability is determined. "Cannot be used in test data ~" means predicting the future values of a feature, that is, building another prediction model that uses the remaining features and the objective variable as inputs. Both of these can be thought of as **missing value imputation in a broad sense**. These two techniques require building another prediction model, which takes time, but of course such an approach can be effective.
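As a sketch of the first technique, transforming a skewed objective variable and fitting a linear model on the transformed scale (synthetic data; `log1p`/`expm1` is just one possible choice):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.exp(0.5 * X.ravel() + rng.normal(scale=0.2, size=200))  # skewed, positive target

# Fit the linear model on log1p(y) and invert the transform at prediction time.
model = TransformedTargetRegressor(
    regressor=LinearRegression(), func=np.log1p, inverse_func=np.expm1
)
model.fit(X, y)
print(model.predict(X[:3]))
```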
**Translator's note**: This section is a very rudimentary list of topics without specific explanations, so you may prefer to read a proper textbook (such as Corona-sha's natural language processing series).
[^ pcanlp]: Translator's note: the basis for this number is unknown.
[^ dl]: HAHAHA, it's not April Fools' Day today!
[^ mean-feature]: In the first place, isn't one purpose of feature extraction precisely to mitigate ill-conditioned problems and stabilize the optimization?
Leakage/Golden Features
[^ leak]: Leakage means improving the fit on the test data by using information that would not actually be available at training time. It apparently sometimes occurs on kaggle through a mistake by the competition host, so it is not practically useful in itself; on the other hand, the reverse-engineering and rule-mining techniques listed under it can find uses in all sorts of situations.
These quotes are inserted in various places in the slides, but I will collect them here.
Andrew Ng "The book on the application of machine learning that'difficult to capture features, wastes time, and has expertise'is in feature engineering." Domingos "Machine learning has both successes and failures. What's the difference? Simply put, the most important factor is the features used." Locklin: "Feature engineering is something else that isn't enough to be featured in peer-reviewed papers or textbooks, but it's absolutely essential to the success of machine learning .... Many successful machine learning cases are actually features. Return to engineering " Miel: "Make the input data understandable by the algorithm" Hal Daume III "What most papers say: Feature engineering is hard and time-consuming, but we've found a new way to do similar neural networks in this eight-page paper. It is written" Francois Chollet "Developing a good model requires repeating the original idea over and over until near the deadline. There is always the potential to improve the model. The final model usually addresses the problem first. It's almost different from the outlook at the time, because a priori schedules, in principle, can't survive the experimental conflict with reality. "