How to bundle many pipelines together and process them all at once

1. Introduction

A Pipeline is convenient because it lets you chain several preprocessing steps and a model in concise code. This time I found that **bundling many pipelines into one collection and processing them all at once** works very well, so I am leaving this as a memo.

2. Preparation

Download the demo dataset from Kaggle's **HR Analytics**.

Prepare an **input folder, output folder, and model folder** in the current directory, and save the downloaded dataset **HR_comma_sep.csv** in the **input folder**.

HR_comma_sep.csv is a dataset for predicting whether an employee will leave the company (the left column) from 9 features, and it has 14,999 rows in total.

As in a Kaggle competition, let's treat 10,000 rows (after shuffling) as train and the remaining 4,999 rows as test: a model is trained on train and used to predict the test results. In the code below, this held-out test portion is named valid.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# -------------Creating a dataset------------------
#Read the dataset
df = pd.read_csv('./input/HR_comma_sep.csv')

#Shuffle rows, reset index, add ID
df = df.sample(frac=1, random_state=1)  
df = df.reset_index(drop=True) 
df = df.reset_index()  
df = df.rename(columns={'index':'ID'})

#Split by row count into train and valid (the Kaggle-style test set)
train = df[0:10000]
valid = df[10000:]

#One-hot encoding of categorical variables
df_train = pd.get_dummies(train)
df_valid = pd.get_dummies(valid)

#Split into target labels and features
y = df_train['left']
X = df_train.drop(['ID','left'], axis=1)
y_valid = df_valid['left']
X_valid = df_valid.drop(['ID','left'], axis=1)

#Split train further into training and evaluation sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)    

print('X_train.shape = ', X_train.shape)
print('y_train.shape =  ', y_train.shape)
print('X_test.shape = ', X_test.shape)
print('y_test.shape = ', y_test.shape)
print('X_valid.shape = ', X_valid.shape)
print('y_valid.shape = ', y_valid.shape)
print()

After shuffling the rows of the dataset, it is split into train and valid, the categorical variables are one-hot encoded, and each part is separated into target labels (y, y_valid) and features (X, X_valid).

Furthermore, X and y, which are used to build the model, are split with **train_test_split** into training (X_train, y_train) and evaluation (X_test, y_test) sets. This completes the preparation.
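One caveat when pd.get_dummies is called separately on train and valid: if a category happens to be missing from one part, the two frames end up with different columns. Below is a minimal sketch of one way to guard against this with DataFrame.reindex. The toy columns salary and hours and their values are made up for illustration and are not taken from the code above.

```python
import pandas as pd

# Toy data (hypothetical): 'medium' and 'high' never appear in valid
train = pd.DataFrame({'salary': ['low', 'medium', 'high'],
                      'hours': [150, 200, 250]})
valid = pd.DataFrame({'salary': ['low', 'low'],
                      'hours': [160, 210]})

df_train = pd.get_dummies(train)
df_valid = pd.get_dummies(valid)

# Align valid to train's columns; categories absent from valid
# become all-zero columns, and the column order matches train
df_valid = df_valid.reindex(columns=df_train.columns, fill_value=0)

print(list(df_train.columns))
print(list(df_valid.columns))
```

With the real dataset both parts will probably contain every category, so this step may change nothing, but it makes the assumption explicit.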

3. Pipeline settings

This time, we **prepare 8 pipelines, each a model with preprocessing, and bundle them into one big collection (a dictionary)**. This lets you run the eight pipelines one after another.

# --------Pipeline settings-------- 
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline

pipelines = {    
    'KNN':
        Pipeline([('scl',StandardScaler()),
                  ('est',KNeighborsClassifier())]), 
    'Logistic':
        Pipeline([('scl',StandardScaler()),
                  ('est',LogisticRegression(solver='lbfgs', random_state=1))]), 
    'SVM':
        Pipeline([('scl',StandardScaler()),
                  ('est',SVC(C=1.0, kernel='linear', class_weight='balanced', random_state=1, probability=True))]),
    'K-SVM':
        Pipeline([('scl',StandardScaler()),
                  ('est',SVC(C=1.0, kernel='rbf', class_weight='balanced', random_state=1, probability=True))]),
    'Tree':
        Pipeline([('scl',StandardScaler()),
                  ('est',DecisionTreeClassifier(random_state=1))]),
    'RandomF':
        Pipeline([('scl',StandardScaler()),
                  ('est',RandomForestClassifier(n_estimators=100, random_state=1))]), 
    'GBoost':
        Pipeline([('scl',StandardScaler()),
                  ('est',GradientBoostingClassifier(random_state=1))]),    
    'MLP':
        Pipeline([('scl',StandardScaler()),
                  ('est',MLPClassifier(hidden_layer_sizes=(3,3),
                                       max_iter=1000,
                                       random_state=1))]), 
    }
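Since every pipeline above shares the same StandardScaler step, the dictionary could also be built with a dict comprehension. The following is only a sketch of that idea, not the code used in this article. The names estimators and pipelines_compact are made up for illustration, and only two of the eight models are shown.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical helper dict: model name -> unfitted estimator
estimators = {
    'Logistic': LogisticRegression(solver='lbfgs', random_state=1),
    'Tree': DecisionTreeClassifier(random_state=1),
}

# Each pipeline gets its own StandardScaler instance
pipelines_compact = {
    name: Pipeline([('scl', StandardScaler()), ('est', est)])
    for name, est in estimators.items()
}

for name, pipe in pipelines_compact.items():
    print(name, [step_name for step_name, _ in pipe.steps])
```

The explicit dictionary is easier to adapt when each model needs different preprocessing; the comprehension is shorter when the preprocessing is identical.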

4. Pipeline processing

After that, with **for pipe_name, pipeline in pipelines.items():**, the key of each entry (for example, 'KNN') goes into **pipe_name** and the corresponding pipeline instance goes into **pipeline**, one entry at a time. In other words:

- **pipeline.fit(X_train, y_train)** trains the model
- **pipeline.predict(X_test)** predicts with the trained model
- **pickle.dump(pipeline, open(file_name, 'wb'))** saves the trained model

Being able to use it like this is very convenient.

# -------Pipeline processing------
from sklearn.metrics import accuracy_score
from sklearn.metrics import log_loss
import pickle

scores = {}
for pipe_name, pipeline in pipelines.items():
    
    #Training
    pipeline.fit(X_train, y_train)
    
    #Compute metrics (log loss and accuracy)
    scores[(pipe_name,'test_log')] = log_loss(y_test, pipeline.predict_proba(X_test))
    scores[(pipe_name,'valid_log')] = log_loss(y_valid, pipeline.predict_proba(X_valid))
    scores[(pipe_name,'test_acc')] = accuracy_score(y_test, pipeline.predict(X_test))
    scores[(pipe_name,'valid_acc')] = accuracy_score(y_valid, pipeline.predict(X_valid))
    
    #Save submission file (output folder)
    ID = df_valid['ID']
    preds = pipeline.predict_proba(X_valid)  #Predicted probabilities
    submission = pd.DataFrame({'ID': ID, 'left': preds[:, 1]})
    submission.to_csv('./output/' + pipe_name + '.csv', index=False)
    
    #Save model (model folder); the with block closes the file reliably
    file_name = './model/' + pipe_name + '.pkl'
    with open(file_name, 'wb') as f:
        pickle.dump(pipeline, f)

#Display the metrics, one row per pipeline, best test accuracy first
score_table = pd.Series(scores).unstack()
score_table = score_table.sort_values('test_acc', ascending=False)
print(score_table)

Here, **training, metric calculation (accuracy, log loss), saving the submission (predicted probabilities), and saving the model** are performed for each of the eight pipelines. **Pipeline** is super convenient when you want to run similar processing in bulk.
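A saved model can later be restored with pickle.load and used for prediction without retraining. Below is a minimal round-trip sketch that assumes a tiny synthetic dataset (X_toy, y_toy) and a temporary directory in place of the model folder; all of these names are made up for illustration.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic toy data (hypothetical), just to have something to fit
rng = np.random.RandomState(1)
X_toy = rng.normal(size=(100, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

pipe = Pipeline([('scl', StandardScaler()),
                 ('est', LogisticRegression(solver='lbfgs', random_state=1))])
pipe.fit(X_toy, y_toy)

# Save the fitted pipeline, then load it back from disk
with tempfile.TemporaryDirectory() as tmp_dir:
    file_name = os.path.join(tmp_dir, 'Logistic.pkl')
    with open(file_name, 'wb') as f:
        pickle.dump(pipe, f)
    with open(file_name, 'rb') as f:
        loaded = pickle.load(f)

# The restored pipeline includes the fitted scaler and the fitted model
same = np.allclose(pipe.predict_proba(X_toy), loaded.predict_proba(X_toy))
print('identical predictions:', same)
```

Because the scaler is pickled together with the estimator, new data only needs the same columns as the training data; no separate preprocessing object has to be stored.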

By the way, on Kaggle y_valid is kept secret (that is the whole point of Kaggle), so valid_acc and valid_log could not be calculated there; this time I know the labels, so I include them. ^^
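The metrics table comes from a dictionary keyed by (pipeline name, metric name) tuples. pd.Series turns those tuple keys into a two-level index, and unstack moves the second level into columns. Here is a small sketch of just that step; the numbers in scores_toy are made up and are not results from this article.

```python
import pandas as pd

# Hypothetical scores keyed by (pipeline name, metric name)
scores_toy = {
    ('KNN', 'test_acc'): 0.90,
    ('KNN', 'test_log'): 0.30,
    ('Tree', 'test_acc'): 0.95,
    ('Tree', 'test_log'): 0.20,
}

# Tuple keys become a MultiIndex; unstack pivots metrics into columns
table = pd.Series(scores_toy).unstack()

# One row per pipeline, sorted by test accuracy (best first)
table = table.sort_values('test_acc', ascending=False)
print(table)
```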
