I wanted to plug my own function into sklearn.pipeline, and while researching how to do that I ran into the question of what a sklearn-compliant model actually is. Below is a summary of the official documentation [^1]. I hope it helps anyone who is thinking of using sklearn.pipeline, sklearn.model_selection.GridSearchCV, and so on.
An object that qualifies as a sklearn-compliant model needs to be built as follows. Let's look at the requirements in order.
The fit method trains the model on the training data. In sklearn, the training method is always named fit. As a result, Pipeline and GridSearchCV can train any sklearn-compliant model simply by calling fit on it. The set_params method follows the same idea: it is called when parameters are tuned, for example by GridSearchCV.
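As a minimal sketch of this shared interface, the snippet below tunes a pipeline built from stock sklearn estimators. The step names `scale` and `clf`, the iris dataset, and the candidate values for `C` are my own choices for illustration and do not come from the official document.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Step names "scale" and "clf" are arbitrary labels chosen for this sketch.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# set_params returns the estimator itself; "<step>__<param>" reaches into a step.
same_object = pipe.set_params(clf__C=0.5)

# GridSearchCV uses set_params to apply each candidate and fit to train it.
search = GridSearchCV(pipe, param_grid={"clf__C": [0.1, 1.0, 10.0]}, cv=3)
search.fit(X, y)
print(search.best_params_)
```

Because every step exposes the same fit and set_params, GridSearchCV needs no knowledge of what the steps actually do.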
Some of the features provided by sklearn (for example, GridSearchCV and cross_val_score) behave differently depending on the model type. For example, when a classifier is trained, the data is split with stratified sampling. The model type is recorded in the _estimator_type attribute; typical values are shown below.
--Classification: classifier
--Regression: regressor
--Clustering: clusterer
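The effect of the model type on splitting can be observed with sklearn.model_selection.check_cv, the helper that turns an integer `cv` into a splitter object. This is only a sketch: the toy label array and the two stock models are assumptions made for the example.

```python
import numpy as np
from sklearn.base import is_classifier
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import check_cv

# Toy binary labels, made up for this sketch.
y = np.array([0, 0, 0, 1, 1, 1])

splitters = {}
for model in (LogisticRegression(), LinearRegression()):
    # is_classifier inspects the model type declared by the estimator.
    cv = check_cv(cv=3, y=y, classifier=is_classifier(model))
    splitters[type(model).__name__] = type(cv).__name__

print(splitters)
# {'LogisticRegression': 'StratifiedKFold', 'LinearRegression': 'KFold'}
```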
The _estimator_type attribute is set automatically by inheriting the matching Mixin class (for example, ClassifierMixin) from sklearn.base. In addition, sklearn recommends that a sklearn-compliant model inherit from both sklearn.base.BaseEstimator and the Mixin class suited to the model.
--BaseEstimator: provides methods such as set_params that would be boilerplate if written from scratch.
--Mixin: provides the methods used by each _estimator_type.
Now write the code based on sections 1.1 and 1.2. Note that set_params is not written here because the BaseEstimator class already provides it.
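To make the division of labour concrete, here is a small sketch of what each base class contributes. `MajorityClassifier` and `dummy_param` are hypothetical names invented for this example. The Mixin is listed before BaseEstimator, which is the order the sklearn developer guide asks for.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, is_classifier


class MajorityClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical toy classifier that always predicts the most frequent label."""

    def __init__(self, dummy_param=0):
        self.dummy_param = dummy_param

    def fit(self, X, y):
        labels, counts = np.unique(y, return_counts=True)
        self.classes_ = labels
        self.majority_ = labels[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)


X = np.zeros((4, 1))
y = np.array([1, 1, 1, 0])
model = MajorityClassifier().fit(X, y)

print(is_classifier(model))  # True: the type comes from ClassifierMixin
print(model.score(X, y))     # 0.75: score() (accuracy) comes from ClassifierMixin
print(model.get_params())    # {'dummy_param': 0}: comes from BaseEstimator
```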
from sklearn.base import BaseEstimator, ClassifierMixin

# The Mixin goes to the left of BaseEstimator, as the sklearn developer guide advises.
class Classifier(ClassifierMixin, BaseEstimator):
    def __init__(self):
        pass

    def fit(self, X, y):
        pass
from sklearn.base import BaseEstimator, ClassifierMixin

class Classifier(ClassifierMixin, BaseEstimator):
    # __init__ only stores the parameters under the same names; no other logic goes here.
    def __init__(self, params1=0, params2=None):
        self.params1 = params1
        self.params2 = params2

    def fit(self, X, y):
        pass
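Storing each constructor argument under an attribute of the same name is what lets BaseEstimator supply get_params, set_params, and cloning without further code. A minimal sketch, assuming a hypothetical class `Demo` that mirrors the constructor above:

```python
from sklearn.base import BaseEstimator, clone


class Demo(BaseEstimator):
    """Hypothetical estimator that only stores its constructor arguments."""

    def __init__(self, params1=0, params2=None):
        self.params1 = params1
        self.params2 = params2


est = Demo(params1=3)
print(est.get_params())  # {'params1': 3, 'params2': None}

# set_params returns the estimator itself, so calls can be chained.
returned = est.set_params(params2="a")

# clone builds a new, unfitted estimator with the same parameters.
copy = clone(est)
print(copy.get_params())  # {'params1': 3, 'params2': 'a'}

# Unknown parameter names are rejected by BaseEstimator.set_params.
try:
    est.set_params(no_such_param=1)
    rejected = False
except ValueError:
    rejected = True
```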
--fit takes y=None as the second argument (so that a pipeline can chain unsupervised feature generation into supervised learning, for example)
--The return value is self
--Attributes estimated from the data end with a trailing underscore (e.g. coef_)
from sklearn.base import BaseEstimator, ClassifierMixin

class Classifier(ClassifierMixin, BaseEstimator):
    def __init__(self, params1=0, params2=None):
        self.params1 = params1
        self.params2 = params2

    def fit(self, X, y=None):
        print('The process of learning data is described here.')
        return self
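The three fit conventions can be seen working together in a pipeline. The sketch below uses a hypothetical unsupervised step, `MeanCenterer`, followed by a stock classifier. The class name, the `mean_` attribute, and the toy data are assumptions for illustration only.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


class MeanCenterer(TransformerMixin, BaseEstimator):
    """Hypothetical unsupervised step: subtracts the column means learned in fit."""

    def fit(self, X, y=None):
        # y is accepted so the pipeline can pass it through, but it is ignored.
        self.mean_ = np.asarray(X, dtype=float).mean(axis=0)  # learned, so trailing underscore
        return self  # returning self allows chaining

    def transform(self, X):
        return np.asarray(X, dtype=float) - self.mean_


X = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 5.0], [3.0, 7.0]])
y = np.array([0, 0, 1, 1])

# Chaining works because fit returns self.
centered = MeanCenterer().fit(X).transform(X)

# The pipeline calls fit(X, y) on every step, including the unsupervised one.
pipe = make_pipeline(MeanCenterer(), LogisticRegression()).fit(X, y)
predictions = pipe.predict(X)
```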
Other points to note are listed below.
--X.shape[0] and y.shape[0] must be equal (check with sklearn.utils.validation.check_X_y).
--set_params takes the parameters as a dictionary (passed as keyword arguments) and returns self.
--get_params takes no required arguments (only the optional deep keyword) and returns the parameters as a dictionary.
--For classifiers, store the list of labels in the classes_ attribute (use sklearn.utils.multiclass.unique_labels).
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.utils.validation import check_X_y
from sklearn.utils.multiclass import unique_labels

class Classifier(ClassifierMixin, BaseEstimator):
    def __init__(self, params1=0, params2=None):
        self.params1 = params1
        self.params2 = params2

    def fit(self, X, y=None):
        # Validates X and y and checks that they have the same number of samples.
        X, y = check_X_y(X, y)
        self.classes_ = unique_labels(y)
        print('The process of learning data is described here.')
        return self

    # BaseEstimator already provides get_params and set_params;
    # they are written out here only to show what they do.
    def get_params(self, deep=True):
        return {"params1": self.params1, "params2": self.params2}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self
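As a quick illustration of the two helpers used in fit, the sketch below runs them on toy data of my own (not taken from the official document).

```python
from sklearn.utils.multiclass import unique_labels
from sklearn.utils.validation import check_X_y

# Toy data, made up for this sketch.
X = [[1, 2], [3, 4], [5, 6]]
y = [2, 1, 2]

# check_X_y converts the inputs to arrays and validates them.
X_checked, y_checked = check_X_y(X, y)
print(X_checked.shape)           # (3, 2)
print(unique_labels(y_checked))  # [1 2]

# A mismatch between X.shape[0] and y.shape[0] is rejected with a ValueError.
try:
    check_X_y(X, y[:2])
    rejected = False
except ValueError as exc:
    rejected = True
    print("rejected:", exc)
```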
The code basically follows PEP8, but sklearn has additional coding conventions of its own, so I list them here.
They are unnecessary unless you plan to contribute to sklearn.
--Separate words with underscores, except in class names (for example, n_samples)
--Do not put multiple statements on one line (start a new line after if and for statements)
--Import modules inside sklearn with relative imports (in test code, use absolute imports)
--`import *` is not used
--Docstrings follow the numpy style [^3]
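For reference, a numpy-style docstring combined with the underscore naming rule might look like the sketch below. The function `count_samples` is a hypothetical helper written only to show the layout.

```python
import numpy as np


def count_samples(X):
    """Return the number of samples in X.

    Parameters
    ----------
    X : array-like of shape (n_samples, n_features)
        Input data.

    Returns
    -------
    n_samples : int
        Number of rows in ``X``.
    """
    n_samples = np.asarray(X).shape[0]
    return n_samples


print(count_samples([[1, 2], [3, 4], [5, 6]]))  # 3
```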
sklearn provides a check_estimator function to check whether a model is sklearn-compliant. Depending on the _estimator_type attribute, it runs a set of tests to verify compliance. If you do not implement the fit method, you get the error `AttributeError: 'Classifier' object has no attribute 'fit'`. A template for sklearn-compliant models is also available on GitHub, so I think the best approach is to implement your model with that as a reference and run check_estimator once it is finished. An execution example is shown below.
from sklearn.base import BaseEstimator
from sklearn.utils.estimator_checks import check_estimator
from sklearn.utils.multiclass import unique_labels
from sklearn.utils.validation import check_X_y

# With the code implemented so far, an error occurs because the predict method required of a classifier is not defined.
# Implemented like the template estimator, without inheriting ClassifierMixin, no error should occur
# (the exact set of checks depends on the sklearn version).
class Estimator(BaseEstimator):
    def __init__(self, params1=0, params2=None):
        self.params1 = params1
        self.params2 = params2

    def fit(self, X, y=None):
        X, y = check_X_y(X, y)
        self.classes_ = unique_labels(y)
        self.is_fitted_ = True
        return self

    def get_params(self, deep=True):
        return {"params1": self.params1, "params2": self.params2}

    def set_params(self, **parameters):
        for parameter, value in parameters.items():
            setattr(self, parameter, value)
        return self

# Recent sklearn versions expect an estimator instance here, not the class.
check_estimator(Estimator())
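Since check_estimator raises on the first failed check, one way to use it during development is to wrap the call and print the failure. This is only a sketch: `NoFit` and `first_failure` are hypothetical names, and the exact exception type and message depend on the installed sklearn version.

```python
from sklearn.base import BaseEstimator
from sklearn.utils.estimator_checks import check_estimator


class NoFit(BaseEstimator):
    """Hypothetical estimator that forgets to implement fit."""

    def __init__(self, params1=0):
        self.params1 = params1


def first_failure(estimator):
    """Return a description of the first failed check, or None if all checks pass."""
    try:
        check_estimator(estimator)
    except Exception as exc:  # the exception type varies between checks and versions
        return f"{type(exc).__name__}: {exc}"
    return None


failure = first_failure(NoFit())
print(failure)
```

An estimator that passes every check would make `first_failure` return None.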