With the release of Windows 95 in 1995 and the spread of consumer hardware, the Internet became a tool that anyone could easily use. I think of this as the "development of Internet infrastructure."
The same thing is now happening with machine learning. Services like DataRobot and Azure Machine Learning are typical examples. In the past, data analysis with machine learning was the **exclusive domain** of specialists such as engineers and data scientists. With the advent of AutoML, however, the wave of "democratization of machine learning" has begun.
This time, the goal is to build a (simple) AutoML of our own.
Before talking about AutoML, let me start with the question: what is machine learning (ML)?
The English version of Wikipedia describes it as follows.
Machine learning (ML) is the study of computer algorithms that improve automatically through experience. It is seen as a subset of artificial intelligence. Machine learning algorithms build a mathematical model based on sample data, known as "training data", in order to make predictions or decisions without being explicitly programmed to do so.
In other words, machine learning **predicts** the future from past experience (data) without explicit human programming. See the figure below: you prepare raw data and predict the "target" through the whole flow of "preprocessing" → "feature engineering" → "training" → "model selection / scoring". This entire process is what we call **machine learning**.
For example, suppose you want to predict tomorrow's weather. You could probably predict it from information such as yesterday's and today's weather, temperature, humidity, and wind direction. Here, "tomorrow's weather" is the "target", and past information such as yesterday's and today's observations is the "raw data". (Since this is time-series data, there are actually many other things to consider.)
Two typical tasks solved by machine learning are the "classification problem" and the "regression problem". This time, I will focus on the classification problem.
AutoML
So what is AutoML? It refers to machine learning that automates the "preprocessing" → "feature engineering" steps described above.
Definitions of AutoML vary, but the goal this time is to develop an AutoML that provides the following functions and compares the accuracy of each model. A proper AutoML would also automate tasks such as hyperparameter tuning, but please forgive me for skipping that this time.
- Load data from a data path
- One-hot encoding
- Imputation of missing values with "mean", "median", or "mode"
- Feature selection
- Grid search
- Random search
- Confusion matrix
- ROC curve
The code is available on GitHub.
This time, we will use the familiar Titanic dataset.
aml
|----data
| |---train.csv
| |---test.csv
|
|----model
| |---The model is saved here
|
|----myaml.ipynb
Let me show the usage code first; it corresponds to an API example.
model_data = 'data/train.csv'
scoring_data = 'data/test.csv'

aml = MyAML(model_data, scoring_data, onehot_columns=None)
aml.drop_cols(['Name', 'Ticket', 'Cabin'])  # do not use the Name, Ticket, and Cabin columns

# Preprocessing and feature engineering (feature selection)
aml.preprocessing(target_col='Survived', index_col='PassengerId', feature_selection=False)

# Training and model comparison, with results displayed (holdout method)
aml.holdout_method(pipelines=pipelines_pca, scoring='auc')
|          | test     | train    |
|----------|----------|----------|
| gb       | 0.754200 | 0.930761 |
| knn      | 0.751615 | 0.851893 |
| logistic | 0.780693 | 0.779796 |
| rf       | 0.710520 | 0.981014 |
| rsvc     | 0.766994 | 0.837220 |
| tree     | 0.688162 | 1.000000 |
Preprocessing here refers to the following two steps. In the code below, I show only the important parts.
- One-hot encoding
- Imputation of missing values with "mean", "median", or "mode"
def _one_hot_encoding(self, X: pd.DataFrame) -> pd.DataFrame:
    ...
    # one_hot_encoding
    if self.ohe_columns is None:  # one-hot encode only the object or category columns
        X_ohe = pd.get_dummies(X,
                               dummy_na=True,    # NULL is also turned into a dummy variable
                               drop_first=True)  # exclude the first category
    else:  # one-hot encode only the columns specified by self.ohe_columns
        X_ohe = pd.get_dummies(X,
                               dummy_na=True,    # NULL is also turned into a dummy variable
                               drop_first=True,  # exclude the first category
                               columns=self.ohe_columns)
    ...
When the MyAML class is initialized, the argument `onehot_columns` (stored in an instance variable) receives a list of the column names to one-hot encode. If nothing is specified, the columns of the received data frame whose dtype is object or category are one-hot encoded.
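To see what these settings do, here is a minimal standalone sketch (the toy columns below only mimic the Titanic data and are not part of the MyAML code):

import pandas as pd

# Toy frame with one categorical column containing a missing value,
# mimicking the Titanic "Embarked" column.
X = pd.DataFrame({'Embarked': ['S', 'C', None, 'Q'],
                  'Age': [22.0, 38.0, 26.0, 35.0]})

# Same get_dummies settings as in _one_hot_encoding above:
# NULL gets its own dummy column and the first category is dropped.
X_ohe = pd.get_dummies(X, dummy_na=True, drop_first=True)
print(X_ohe.columns.tolist())
# e.g. ['Age', 'Embarked_Q', 'Embarked_S', 'Embarked_nan']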
def _impute_null(self, impute_null_strategy):
    """
    Fill missing values according to impute_null_strategy
    Types of impute_null_strategy:
        mean ... fill with the mean
        median ... fill with the median
        most_frequent ... fill with the mode
    """
    self.imp = SimpleImputer(strategy=impute_null_strategy)
    self.X_model_columns = self.X_model.columns.values
    self.X_model = pd.DataFrame(self.imp.fit_transform(self.X_model),
                                columns=self.X_model_columns)
Missing values are filled using scikit-learn's `SimpleImputer` class.
`impute_null_strategy` is the argument that specifies how missing values are filled. The supported strategies are as follows.
- `mean` ... fill with the mean
- `median` ... fill with the median
- `most_frequent` ... fill with the mode
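As a quick illustration, here is a minimal standalone sketch of SimpleImputer with strategy='mean' (the toy data is made up for this example and is not part of the MyAML code):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X = pd.DataFrame({'Age':  [22.0, np.nan, 26.0],
                  'Fare': [7.25, 71.28, np.nan]})

# strategy='mean' replaces each NaN with the mean of its column,
# just as _impute_null does when impute_null_strategy='mean'.
imp = SimpleImputer(strategy='mean')
X_filled = pd.DataFrame(imp.fit_transform(X), columns=X.columns)
print(X_filled)  # Age NaN -> 24.0, Fare NaN -> 39.265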
Feature engineering is a deep topic in itself, but this time we keep it simple and consider "feature selection with a **random forest**".
def _feature_selection(self, estimator=RandomForestClassifier(n_estimators=100, random_state=0), cv=5):
    """
    Feature selection
    @param estimator: learner used to perform the feature selection
    @param cv: number of cross-validation folds
    """
    self.selector = RFECV(estimator=estimator, step=.05, cv=cv)
    self.X_model = pd.DataFrame(self.selector.fit_transform(self.X_model, self.y_model),
                                columns=self.X_model_columns[self.selector.support_])
    self.selected_columns = self.X_model_columns[self.selector.support_]
The first line initializes the `RFECV` class; here `RandomForestClassifier` is specified as the default estimator. The next line keeps only the features judged most important. Finally, the **names of the selected features** are stored in the instance variable `selected_columns`.
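For reference, here is a minimal self-contained sketch of RFECV with the same estimator, step, and cv settings, run on synthetic data (the data and shapes are made up for illustration):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

# Synthetic data: 10 features, only 4 of which are informative.
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)

selector = RFECV(estimator=RandomForestClassifier(n_estimators=100, random_state=0),
                 step=.05, cv=5)
X_selected = selector.fit_transform(X, y)

# support_ is a boolean mask over the original columns: True means "kept".
print(selector.support_)
print(X_selected.shape)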
The holdout method is used to compare how well each model fits the data. It splits the data into training data (used to train the model) and test data (held back for validation and never used for training), so the training data is always used only for training and the test data only for evaluation.
Cross-validation is also implemented as another way to compare how well a model fits the data, but I will omit the explanation here.
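For reference only, a minimal sketch of what cross-validation could look like with scikit-learn's cross_val_score (assuming the `pipelines_pca` dictionary defined later and the `X_model` / `y_model` attributes that `holdout_method` below also uses):

import numpy as np
from sklearn.model_selection import cross_val_score

# 5-fold cross-validation on one of the pipelines defined later;
# every sample is used for both training and validation across the folds.
scores = cross_val_score(pipelines_pca['logistic'],
                         aml.X_model, np.ravel(aml.y_model),
                         cv=5, scoring='roc_auc')
print(scores.mean(), scores.std())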
def holdout_method(self, pipelines=pipelines_pca, scoring='acc'):
    """
    Check the accuracy of each model with the holdout method
    @param pipelines: pipelines (dictionary of models to try)
    @param scoring: evaluation metric
        acc: accuracy
        auc: area under the ROC curve
    """
    X_train, X_test, y_train, y_test = train_test_split(self.X_model,
                                                        self.y_model,
                                                        test_size=.2,
                                                        random_state=1)
    y_train = np.reshape(y_train, (-1))
    y_test = np.reshape(y_test, (-1))

    scores = {}
    for pipe_name, pipeline in pipelines.items():
        pipeline.fit(X_train, y_train)
        joblib.dump(pipeline, './model/' + pipe_name + '.pkl')
        if scoring == 'acc':
            scoring_method = accuracy_score
        elif scoring == 'auc':
            scoring_method = roc_auc_score
        scores[(pipe_name, 'train')] = scoring_method(y_train, pipeline.predict(X_train))
        scores[(pipe_name, 'test')] = scoring_method(y_test, pipeline.predict(X_test))
    display(pd.Series(scores).unstack())
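The feature list at the top also mentions the confusion matrix and the ROC curve. They are not computed inside `holdout_method`, but a minimal sketch of obtaining them for one trained pipeline might look like this (reusing the `pipeline`, `X_test`, and `y_test` variables from `holdout_method`, and assuming the model supports `predict_proba`):

from sklearn.metrics import confusion_matrix, roc_curve, auc

# Confusion matrix on the held-out test data.
y_pred = pipeline.predict(X_test)
print(confusion_matrix(y_test, y_pred))

# ROC curve from the predicted probability of the positive class.
y_proba = pipeline.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
print('AUC:', auc(fpr, tpr))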
Returning to `holdout_method`: the variable `pipelines` passed to it has the following format.
# make pipelines for PCA
# Each entry has the form:
#   'model name': Pipeline([('scl', standardization class)
#                           , ('pca', principal component analysis class)
#                           , ('est', model)])
pipelines_pca = {
    'knn': Pipeline([('scl', StandardScaler())
                     , ('pca', PCA(random_state=1))
                     , ('est', KNeighborsClassifier())]),
    'logistic': Pipeline([('scl', StandardScaler())
                          , ('pca', PCA(random_state=1))
                          , ('est', LogisticRegression(random_state=1))]),
    ...
}
Each of the three classes wrapped in the `Pipeline` performs the following role.
- 'scl': standardization
- 'pca': principal component analysis
- 'est': the model
Therefore, when `pipeline.fit(X_train, y_train)` is called, the flow "standardization" → "feature extraction by principal component analysis" → "training" runs in sequence.
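The feature list at the top also mentions grid search. It is not covered in the excerpts above, but a minimal sketch of tuning one of these pipelines with GridSearchCV could look like this (the parameter names follow Pipeline's `step__parameter` convention; the grid values are my own illustration, and `X_train` / `y_train` are assumed to come from the holdout split):

import numpy as np
from sklearn.model_selection import GridSearchCV

# Hypothetical search grid over the PCA dimensionality and the
# regularization strength of the logistic regression step.
param_grid = {
    'pca__n_components': [2, 4, 6],
    'est__C': [0.01, 0.1, 1.0, 10.0],
}

gs = GridSearchCV(pipelines_pca['logistic'], param_grid, cv=5, scoring='roc_auc')
gs.fit(X_train, np.ravel(y_train))
print(gs.best_params_, gs.best_score_)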
My Dream
I have a dream: just as anyone can use the Internet today, I want to realize a society where anyone can easily build machine learning and deep learning models. As a first step toward such an AI infrastructure, I implemented a system in which a whole series of machine learning steps can be run just by passing a data path. There is still a long way to go, but I will keep working on it.