scikit-learn is a convenient machine learning library for Python that works seamlessly with NumPy, SciPy, and Matplotlib. It becomes much easier to use once you know its API design patterns.
In this post, I will explain the appeal of scikit-learn based on the paper by the scikit-learn authors.
scikit-learn objects are designed according to a few patterns that keep the API consistent. **By understanding these patterns, you can use any object without trouble.**
## Estimator
scikit-learn is built around an interface called Estimator.
**An Estimator learns some model (parameters) from data.**
It always has a method called `fit`, and learning is performed by passing data as arguments to `fit`.
You can also set the hyperparameters required for training (in the constructor or via the `set_params` method).
The `LogisticRegression` class, which performs logistic regression, is one such Estimator.
```python
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(penalty="l1", solver="liblinear")  # Set hyperparameters (liblinear supports the l1 penalty)
clf.fit(X_train, y_train)  # Learn the model from the training data
```
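Hyperparameters can also be inspected or changed after construction. A minimal sketch using `get_params` and `set_params`, both part of the Estimator interface:

```python
print(clf.get_params())    # Dict of the current hyperparameters
clf.set_params(C=0.5)      # Change a hyperparameter; returns the estimator itself
clf.fit(X_train, y_train)  # Refit with the new setting
```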
## Predictor
Many Estimators also implement the Predictor interface.
**A Predictor makes predictions (outputs) based on the model learned by `fit`.**
Passing data to the `predict` method returns predictions.
It also has a method called `score`, which lets you evaluate the model by passing it a dataset and the corresponding labels.
For example, `LogisticRegression` is a Predictor, so you can use its `predict` and `score` methods without any problems.
```python
clf.predict(X_test)        # Predict on the test data
clf.score(X_test, y_test)  # Compare predictions with the actual labels on the test data
```
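Here is a minimal self-contained sketch of the whole Predictor workflow; the synthetic dataset is just for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression()
clf.fit(X_train, y_train)
print(clf.predict(X_test)[:5])    # Predicted class labels
print(clf.score(X_test, y_test))  # Mean accuracy (the default score for classifiers)
```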
## Transformer
In addition to Predictor, there is also an interface called Transformer.
As the name implies, **a Transformer can transform data.**
It is used more often in data-processing APIs than in machine learning models.
The `transform` method returns the transformed data.
There is also a `fit_transform` method, which performs learning and transformation in a single step.
The example below uses `StandardScaler`, which standardizes a dataset. Rather than a complex model, `StandardScaler` learns the mean and variance of each feature.
```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # Fit on the training data and transform it
X_test = scaler.transform(X_test)        # Transform the test data without fitting (uses the training data's mean / variance)
```
A single class can also implement both the Predictor and Transformer interfaces at the same time.
The hyperparameters you set and the learned parameters are stored on the object (names of learned parameters end with an underscore). For how to access parameters and hyperparameters, see the "Attributes" section of each object's documentation.
Example: get the mean and variance learned by `StandardScaler`:

```python
# Continuing from the example above
mean = scaler.mean_
variance = scaler.var_
```
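The hyperparameters (the values set in the constructor) are available via `get_params`; a small sketch, continuing with the same scaler:

```python
print(scaler.get_params())  # e.g. {'copy': True, 'with_mean': True, 'with_std': True}
```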
Therefore, with any Estimator you can easily build a workflow by following the same procedure: `fit`, `predict`, `score`, `transform`, and so on.
Everything from data processing to model training and evaluation can be implemented with Estimators. Since all Estimators share the same methods, **you can easily combine multiple Estimators.**
Use `Pipeline` for sequential processing and `FeatureUnion` for parallel processing.
For example, if you want to standardize your data and then perform logistic regression, a `Pipeline` lets you implement the process neatly.
```python
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ('std_scaler', StandardScaler()),
    ('log_reg', LogisticRegression()),  # Receives the transformed data from the transformer
])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
```
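`FeatureUnion` is used the same way, but it concatenates the outputs of several transformers side by side. A sketch combining standardized features with two principal components (the step names here are arbitrary):

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import StandardScaler

union = FeatureUnion([
    ('std_scaler', StandardScaler()),  # Standardized features
    ('pca', PCA(n_components=2)),      # Two principal components
])
X_combined = union.fit_transform(X_train)  # Outputs of both transformers, concatenated column-wise
```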
scikit-learn also lets you tune hyperparameters with classes such as `GridSearchCV` and `RandomizedSearchCV`.
These also implement the Estimator interface and are trained with `fit`.
Example: Find the best hyperparameters for logistic regression using Grid Search
```python
from sklearn.model_selection import GridSearchCV

clf = GridSearchCV(
    estimator=LogisticRegression(),
    param_grid={
        'C': [1, 3, 10, 30, 100]
    }
)
clf.fit(X_train, y_train)       # Trains one model for each hyperparameter setting in param_grid
best_clf = clf.best_estimator_  # Get the best Estimator!
```
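`RandomizedSearchCV` works the same way, but it samples a fixed number of candidates from distributions instead of trying every combination. A sketch assuming SciPy's `loguniform` distribution is available:

```python
from scipy.stats import loguniform
from sklearn.model_selection import RandomizedSearchCV

clf = RandomizedSearchCV(
    estimator=LogisticRegression(),
    param_distributions={'C': loguniform(1e-2, 1e2)},  # Sample C on a log scale
    n_iter=10,       # Number of sampled candidates
    random_state=0,
)
clf.fit(X_train, y_train)
print(clf.best_params_)  # Best hyperparameters found
```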
By creating a class that defines the methods of these interfaces, such as `fit`, **you can easily use it in pipelines and for validation.**
When creating an Estimator, inherit from `BaseEstimator`; when also creating a Transformer or the like, additionally inherit the appropriate Mixin.
Transformer example:
```python
from sklearn.base import BaseEstimator, TransformerMixin

class MyTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, param_1=None, param_2=None):
        # Store the hyperparameters as attributes of the same name
        self.param_1 = param_1
        self.param_2 = param_2

    def fit(self, X, y=None):
        # Learn anything the transformer needs from the data here
        return self

    def transform(self, X, y=None):
        # Process the NumPy matrix here
        # X = ...
        return X

# fit_transform is provided automatically by TransformerMixin
transformer = MyTransformer()
X_transformed = transformer.fit_transform(X_train)
```
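Because `MyTransformer` follows the interface, it can be dropped straight into a `Pipeline` alongside built-in estimators (continuing with the imports from the earlier examples):

```python
pipe = Pipeline([
    ('my_transformer', MyTransformer()),
    ('log_reg', LogisticRegression()),
])
pipe.fit(X_train, y_train)
```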
scikit-learn provides objects that implement a wide range of machine learning methods, but even without understanding their internals, you can use all of them once you understand the Estimator, Predictor, and Transformer design patterns. The scikit-learn API is attractive because it is highly consistent, which lets you proceed with machine learning smoothly.