Pipeline is convenient because it lets you chain several preprocessing steps and an estimator concisely. This time I found that **putting many pipelines into one collection and running them all in a single loop** works very well, so I am leaving this as a memo.
Download the demo dataset **HR Analytics** from Kaggle.
Prepare an **input folder, output folder, and model folder** in the current directory, and save the downloaded dataset **HR_comma_sep.csv** in the **input folder**.

HR_comma_sep.csv is a dataset for predicting whether a person will leave the company (the `left` column) from 9 features, and it has 14,999 rows in total.
As in a Kaggle competition, let's treat 10,000 rows as train and the remaining 4,999 rows as test (called valid in the code below). A model is trained on train and then used to predict the results for test.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
# -------------Creating a dataset------------------
#Read the dataset
df = pd.read_csv('./input/HR_comma_sep.csv')
#Shuffle rows, reset index, add ID
df = df.sample(frac=1, random_state=1)  
df = df.reset_index(drop=True) 
df = df.reset_index()  
df = df.rename(columns={'index':'ID'})
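#Split by row count into train (first 10,000 rows) and valid (the rest)
train = df[0:10000]
valid = df[10000:]
#One-hot encoding of categorical variables
df_train = pd.get_dummies(train)
#Align valid to the train columns in case a category is missing from valid
df_valid = pd.get_dummies(valid).reindex(columns=df_train.columns, fill_value=0)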
#Divided into correct labels and features
y = df_train['left']
X = df_train.drop(['ID','left'], axis=1)
y_valid = df_valid['left']
X_valid = df_valid.drop(['ID','left'], axis=1)
#Split into training data and evaluation (test) data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print('X_train.shape = ', X_train.shape)
print('y_train.shape =  ', y_train.shape)
print('X_test.shape = ', X_test.shape)
print('y_test.shape = ', y_test.shape)
print('X_valid.shape = ', X_valid.shape)
print('y_valid.shape = ', y_valid.shape)
print()
After shuffling the rows of the dataset, it is divided into train and valid, the categorical variables are one-hot encoded, and each part is separated into correct labels (y, y_valid) and features (X, X_valid).
Furthermore, X and y, which are used to create the training model, are divided by **train_test_split** into training data (X_train, y_train) and evaluation data (X_test, y_test). This completes the preparation.
This time, **prepare 8 pipelines, each a training model with preprocessing, and collect them in one dictionary**. By doing this, you can run the eight pipelines one after another.
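One point to watch in this preparation: when `pd.get_dummies` is applied to train and valid separately, the two results only have the same columns if every category appears in both parts. The following is a minimal sketch of how the columns can be aligned with `reindex`. The toy frames and their column names and values are made up for illustration and are not taken from HR_comma_sep.csv.

```python
import pandas as pd

# Toy data (hypothetical): the category 'support' appears only in the train part
toy_train = pd.DataFrame({'hours': [150, 200, 180],
                          'sales': ['IT', 'hr', 'support']})
toy_valid = pd.DataFrame({'hours': [160, 210],
                          'sales': ['IT', 'hr']})

# Encoding each part separately gives different column sets
enc_train = pd.get_dummies(toy_train)
enc_valid = pd.get_dummies(toy_valid)
print(list(enc_train.columns))
print(list(enc_valid.columns))

# Align valid to the train columns; dummy columns missing from valid become 0
enc_valid_aligned = enc_valid.reindex(columns=enc_train.columns, fill_value=0)
print(list(enc_valid_aligned.columns))
```

With 10,000 and 4,999 shuffled rows a missing category is unlikely, but the alignment costs little and keeps the feature matrices compatible with the fitted model.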
# --------Pipeline settings-------- 
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
pipelines = {    
    'KNN':
        Pipeline([('scl',StandardScaler()),
                  ('est',KNeighborsClassifier())]), 
    'Logistic':
        Pipeline([('scl',StandardScaler()),
                  ('est',LogisticRegression(solver='lbfgs', random_state=1))]), 
    'SVM':
        Pipeline([('scl',StandardScaler()),
                  ('est',SVC(C=1.0, kernel='linear', class_weight='balanced', random_state=1, probability=True))]),
    'K-SVM':
        Pipeline([('scl',StandardScaler()),
                  ('est',SVC(C=1.0, kernel='rbf', class_weight='balanced', random_state=1, probability=True))]),
    'Tree':
        Pipeline([('scl',StandardScaler()),
                  ('est',DecisionTreeClassifier(random_state=1))]),
    'RandomF':
        Pipeline([('scl',StandardScaler()),
                  ('est',RandomForestClassifier(n_estimators=100, random_state=1))]), 
    'GBoost':
        Pipeline([('scl',StandardScaler()),
                  ('est',GradientBoostingClassifier(random_state=1))]),    
    'MLP':
        Pipeline([('scl',StandardScaler()),
                  ('est',MLPClassifier(hidden_layer_sizes=(3,3),
                                       max_iter=1000,
                                       random_state=1))]), 
    }
After that, if you write **for pipe_name, pipeline in pipelines.items():**, the key of each entry (for example, 'KNN') is assigned to **pipe_name** and the corresponding pipeline instance is assigned to **pipeline**, one entry at a time. In other words:
**pipeline.fit(X_train, y_train)** creates the training model
**pipeline.predict(X_test)** predicts with the training model
**pickle.dump** saves the training model to the file file_name
Used like this, it is very convenient.
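The loop pattern described above can be tried in isolation with a small sketch like the following. It assumes synthetic data from `make_classification` instead of the HR dataset, and only two pipelines; the names `demo_pipelines`, `demo_scores`, `X_tr`, and so on are chosen here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the HR dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=1)

# Dictionary of pipelines: key = display name, value = Pipeline instance
demo_pipelines = {
    'KNN': Pipeline([('scl', StandardScaler()),
                     ('est', KNeighborsClassifier())]),
    'Logistic': Pipeline([('scl', StandardScaler()),
                          ('est', LogisticRegression(solver='lbfgs', random_state=1))]),
}

demo_scores = {}
for pipe_name, pipeline in demo_pipelines.items():
    pipeline.fit(X_tr, y_tr)          # fits the scaler, then the estimator
    y_pred = pipeline.predict(X_te)   # applies the fitted scaler, then predicts
    demo_scores[pipe_name] = accuracy_score(y_te, y_pred)

print(demo_scores)
```

Because the scaler is part of each pipeline, it is fitted only on the training data and reused unchanged at prediction time.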
# -------Pipeline processing------
from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss
import pickle
scores = {}
for pipe_name, pipeline in pipelines.items():
    
    #Learning
    pipeline.fit(X_train, y_train)
    
    #Indicator calculation
    scores[(pipe_name,'test_log')] = log_loss(y_test, pipeline.predict_proba(X_test))
    scores[(pipe_name,'valid_log')] = log_loss(y_valid, pipeline.predict_proba(X_valid))
    scores[(pipe_name,'test_acc')] = accuracy_score(y_test, pipeline.predict(X_test))
    scores[(pipe_name,'valid_acc')] = accuracy_score(y_valid, pipeline.predict(X_valid))
    
    #Submit save(output folder) 
    ID=df_valid['ID']
    preds = pipeline.predict_proba(X_valid)  #Predicted probability
    submission = pd.DataFrame({'ID': ID, 'left':preds[:, 1]})  
    submission.to_csv('./output/'+pipe_name+'.csv', index=False) 
    
    #Save model(model folder); the with block closes the file after writing
    file_name = './model/'+pipe_name+'.pkl'
    with open(file_name, 'wb') as f:
        pickle.dump(pipeline, f)
#Display of metrics (one row per pipeline, sorted by test accuracy)
df_scores = pd.Series(scores).unstack()
df_scores = df_scores.sort_values('test_acc', ascending=False)
print(df_scores)
Here, **training, metric calculation (accuracy, log loss), saving of the submission file (predicted probabilities), and saving of the model** are performed for each of the eight pipelines. **pipeline** is super convenient when you want to apply the same processing to many models at once.
By the way, on Kaggle the labels corresponding to y_valid are kept secret, so valid_acc and valid_log could not be calculated there. This time I know them, so I include them. ^^
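A saved model can later be loaded back and used without retraining. The following is a minimal sketch under some assumptions: it uses synthetic data and a temporary directory instead of the model folder, and the file name `Logistic.pkl` only mirrors the naming used above.

```python
import os
import pickle
import tempfile

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: not the HR dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=1)

pipe = Pipeline([('scl', StandardScaler()),
                 ('est', LogisticRegression(random_state=1))])
pipe.fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    file_name = os.path.join(tmp_dir, 'Logistic.pkl')
    # Save the fitted pipeline (scaler and estimator together)
    with open(file_name, 'wb') as f:
        pickle.dump(pipe, f)
    # Load it back as a new object
    with open(file_name, 'rb') as f:
        loaded = pickle.load(f)

# The loaded pipeline gives the same predictions as the original
same = (loaded.predict(X_demo) == pipe.predict(X_demo)).all()
print(same)
```

Note that a pickle file should only be loaded from a trusted source, and ideally with the same scikit-learn version that wrote it.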