Article summary

When I passed cuML's SVM (SVC) as an estimator to scikit-learn's GridSearchCV, an error occurred, so I am leaving the solution here. This is my first post on Qiita, so I would appreciate it if you could point out any mistakes or anything that is hard to follow.

Only the conclusion first

Cause of error

- The return value of scikit-learn's sklearn.svm.SVC.predict() is a numpy array
- The return value of cuML's cuml.svm.SVC.predict() is a cuDF Series

scikit-learn's GridSearchCV assumes that estimator.predict() returns a numpy array. Because cuML's SVC.predict() returns a cuDF Series, an error occurs inside GridSearchCV.

Without GridSearchCV, you can solve this by converting the return value to a numpy array after each call. With GridSearchCV that is not possible, because you hand the SVC instance itself to GridSearchCV, which then calls predict internally.

Solution

- Create a class that inherits from cuml.svm.SVC
- Override the predict method so that it converts the return value to a numpy array before returning it
- Use an instance of that class as the estimator

Implementation example

As an example, we will use SVM to classify the digits "5" and "8" in the MNIST dataset, for the following reasons.

- MNIST is easy to obtain and format, and the amount of data is about right
- cuML's SVC currently supports only two-class classification
- "5" and "8" appear to be the hardest pair to tell apart (Reference)

Execution environment

- OS: Ubuntu 18.04.2 LTS
- CPU: Intel Xeon W-2133
- GPU: GeForce RTX 2080 Ti
- Python: 3.6.5
- CUDA version: 10.0.130
- cuML version: 0.12
- scikit-learn version: 0.22.1

Data set creation

First, create the dataset. Extract only the 5s and 8s from MNIST and change the labels to binary (5 → 0, 8 → 1).

`dataset_maker.py`


import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

def dataset_maker():
    mnist = fetch_openml('mnist_784', version=1)
    data_58 = []
    label_58 =[]
    for data,target in zip(mnist.data, mnist.target):
        if target=='5':
            data_58.append(data/255)
            label_58.append(0)
        elif target=='8':
            data_58.append(data/255)
            label_58.append(1)

    data_58 = np.array(data_58)
    label_58 = np.array(label_58)
    X_train, X_test, y_train, y_test = train_test_split(data_58, label_58)

    return X_train, X_test, y_train, y_test
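
The filtering and relabeling step above can also be written with boolean masks instead of a Python loop. The following is only a sketch: `select_two_digits` is a hypothetical helper name, and the tiny arrays stand in for `mnist.data` and `mnist.target` so that it runs without downloading anything. It assumes the labels arrive as strings, as they do in the loop above.

```python
import numpy as np

def select_two_digits(data, target, negative="5", positive="8"):
    """Keep two digit classes and relabel them as 0 (negative) and 1 (positive)."""
    data = np.asarray(data, dtype=np.float64)
    target = np.asarray(target).astype(str)   # labels are compared as strings
    mask = (target == negative) | (target == positive)
    X = data[mask] / 255.0                    # scale pixel values to [0, 1]
    y = (target[mask] == positive).astype(int)
    return X, y

# Synthetic stand-in for mnist.data / mnist.target (not real MNIST rows)
fake_data = np.array([[0, 255], [255, 0], [128, 128], [255, 255]])
fake_target = np.array(["5", "3", "8", "5"])

X, y = select_two_digits(fake_data, fake_target)
print(X.shape, y)  # (3, 2) [0 1 0]
```

Because the inputs go through `np.asarray`, the same function should also accept a DataFrame, which some scikit-learn versions may return from `fetch_openml`; I have not checked that against every version.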

Difference between sklearn.svm.SVC.predict() and cuml.svm.SVC.predict()

Let's check the difference in the return value of the predict method, which is the cause of the error. As the code below shows, cuML's SVC can be used in the same way as scikit-learn's SVC, which makes it pleasantly easy to adopt.

`sklearn_vs_cuML.py`


from sklearn.svm import SVC as skSVC
from cuml.svm import SVC as cuSVC

from dataset_maker import dataset_maker

def classify_sklearn(X_train, X_test, y_train, y_test):
    clf = skSVC()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    print("skSVC output_type:{}".format(type(y_pred)))
    print("skSVC y_pred:{}".format(y_pred[0:10]))

def classify_cuml(X_train, X_test, y_train, y_test):
    clf = cuSVC()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)

    print("cuSVC output_type:{}".format(type(y_pred)))
    print("cuSVC y_pred:{}".format(y_pred[0:10]))


if __name__ == "__main__":
    X_train, X_test, y_train, y_test = dataset_maker()
    classify_sklearn(X_train, X_test, y_train, y_test)
    classify_cuml(X_train, X_test, y_train, y_test)

When you do this, the output will look like this:

skSVC output_type:<class 'numpy.ndarray'>
skSVC y_pred:[0 0 0 1 0 0 0 0 1 0]
cuSVC output_type:<class 'cudf.core.series.Series'>
cuSVC y_pred:0    0.0
1    0.0
2    0.0
3    1.0
4    0.0
5    0.0
6    0.0
7    0.0
8    1.0
9    0.0
dtype: float64

As described above, the return values differ. In summary:

- The output type differs
  - sklearn: numpy.ndarray
  - cuML: cudf.core.series.Series
- The element type of the output differs
  - sklearn: int
  - cuML: float64

Because of the first difference, passing the return value of cuml.svm.SVC.predict() directly to a scikit-learn evaluation function raises ValueError: Expected array-like (array or non-string sequence). [^1]

[^1]: The second difference seems to be handled by an implicit cast, so fixing only the first one is enough to make things work. Still, relying on that feels uncomfortable, so the code below explicitly casts to int.

This by itself is easy to fix: when classifying with cuML's SVC, convert the return value of the predict method to a numpy array with [cudf.core.series.Series.to_array()](https://rapidsai.github.io/projects/cudf/en/latest/api.html#cudf.core.series.Series.to_array) and then pass it to scikit-learn's evaluation function. [^2]

[^2]: Of course, cuML's own evaluation functions accept a cuDF Series, but only a few are available at present, so I expect scikit-learn's evaluation functions are used in most practical cases.
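
The conversion step described above can be sketched as a small helper. `to_numpy_labels` and `FakeSeries` are hypothetical names introduced here for illustration; `FakeSeries` only imitates the `to_array()` method of a cuDF Series so the example runs without a GPU. The `to_numpy` branch is an assumption: I believe later cuDF releases provide that method, but this article was written against cuML 0.12.

```python
import numpy as np

def to_numpy_labels(y_pred):
    """Return predictions as a NumPy int array, whatever container they came in."""
    if hasattr(y_pred, "to_numpy"):     # assumption: newer cuDF, also pandas
        y_pred = y_pred.to_numpy()
    elif hasattr(y_pred, "to_array"):   # cuDF Series method used in this article
        y_pred = y_pred.to_array()
    # Cast float64 labels such as 0.0 / 1.0 to int explicitly
    return np.asarray(y_pred).astype(int)

class FakeSeries:
    """Hypothetical stand-in for a cuDF Series returned by predict()."""
    def __init__(self, values):
        self._values = list(values)

    def to_array(self):
        return np.array(self._values, dtype=np.float64)

labels = to_numpy_labels(FakeSeries([0.0, 1.0, 0.0]))
print(type(labels).__name__, labels.tolist())  # ndarray [0, 1, 0]
```

With real cuML, the call would look like `to_numpy_labels(clf.predict(X_test))` before handing the result to a scikit-learn metric.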

Use GridSearchCV with cuML

Now for the main topic. If you want to tune SVC's hyperparameters by grid search, the first thing that comes to mind is probably scikit-learn's GridSearchCV. First, let's try it with scikit-learn's SVC as the estimator.

`classify_sklearn_grid.py`


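from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC as skSVC

from dataset_maker import dataset_maker

def classify_sklearn_grid(X_train, X_test, y_train, y_test):
    # Hyperparameter grid: 2 kernels x 4 values of C x 3 values of gamma
    parameters = {'kernel': ['linear', 'rbf'],
                  'C': [0.1, 1, 10, 100],
                  'gamma': [0.1, 1, 10]}

    clf = GridSearchCV(skSVC(), parameters, scoring='accuracy', verbose=2)
    clf.fit(X_train, y_train)
    # Predict on the test set with the best parameter combination
    y_pred = clf.best_estimator_.predict(X_test)

    return y_pred

if __name__ == "__main__":
    X_train, X_test, y_train, y_test = dataset_maker()
    pred_sk_grid = classify_sklearn_grid(X_train, X_test, y_train, y_test)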

It would look something like this.

Since cuML's SVC is a class that provides the necessary methods such as .fit() and .predict(), it meets GridSearchCV's requirements for an estimator.

In practice, however, the predict method returns a cuDF Series, which causes an error when the results are evaluated. Because the SVC instance itself has to be passed to GridSearchCV, you cannot call to_array on the result each time predict is called.

To solve this, override the predict method so that it returns a numpy array.

It is simple: just define a new class like this:

`MySVC.py`


from cuml.svm import SVC

class MySVC(SVC):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
    def predict(self, X):
        y_pred = super().predict(X).to_array().astype(int)
        return y_pred

You can pass this MySVC to GridSearchCV instead of cuML's SVC. It probably goes without saying, but the code looks like this:

`classify_MySVC.py`


from sklearn.model_selection import GridSearchCV

from dataset_maker import dataset_maker
from MySVC import MySVC

def classify_cuml_grid(X_train, X_test, y_train, y_test):

    parameters = {'kernel': ['linear', 'rbf'],
                  'C': [0.1, 1, 10, 100],
                  'gamma': [0.1, 1, 10]}

    clf = GridSearchCV(MySVC(), parameters, scoring='accuracy', verbose=2)
    clf.fit(X_train, y_train)
    y_pred = clf.best_estimator_.predict(X_test)

    return y_pred

if __name__ == "__main__":
    X_train, X_test, y_train, y_test = dataset_maker()
    pred_cu_grid = classify_cuml_grid(X_train, X_test, y_train, y_test)

That's all there is to it. You should now be able to use GridSearchCV with cuML! This turned into a long post, so thank you for reading!
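
As a possible alternative to subclassing, GridSearchCV's `scoring` parameter also accepts a callable with the signature `(estimator, X, y)`, so the conversion could in principle happen inside the scorer. The sketch below is not what this article tested: `numpy_accuracy` and `FakeEstimator` are hypothetical names, and the fake estimator only imitates a predict method whose result has `to_array()`.

```python
import numpy as np

def numpy_accuracy(estimator, X, y_true):
    """Accuracy scorer that converts predictions to NumPy before comparing."""
    y_pred = estimator.predict(X)
    if hasattr(y_pred, "to_array"):      # cuDF Series method used in this article
        y_pred = y_pred.to_array()
    y_pred = np.asarray(y_pred).astype(int)
    return float(np.mean(y_pred == np.asarray(y_true)))

class FakeEstimator:
    """Hypothetical stand-in whose predict() returns a Series-like object."""
    class _Prediction:
        def __init__(self, values):
            self._values = values

        def to_array(self):
            return np.array(self._values, dtype=np.float64)

    def predict(self, X):
        # Toy rule: label 1 when the first feature exceeds 0.5
        return self._Prediction([float(row[0] > 0.5) for row in X])

score = numpy_accuracy(FakeEstimator(), [[0.1], [0.9], [0.8]], [0, 1, 0])
print(score)
```

Under this assumption the call would be `GridSearchCV(cuSVC(), parameters, scoring=numpy_accuracy)`. Note that `best_estimator_.predict()` would still return a cuDF Series in that case, so the MySVC approach above keeps the rest of the code simpler.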

Bonus

While I'm at it, here is the difference in execution time between scikit-learn and cuML.

- scikit-learn: 1348.87 s (about 22.5 minutes)
- cuML: 270.06 s (about 4.5 minutes)

Each measurement was run only once, so treat these as rough reference values, but scikit-learn took about 5 times longer. cuML really is fast!
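
For reference, a comparison like the one above can be measured and summarized with a few lines of standard-library code. This is a minimal sketch: `timed` is a hypothetical helper, not the code used for the measurements in this article, and the two durations are simply the values reported above.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Durations reported in this article, in seconds
sklearn_seconds = 1348.87
cuml_seconds = 270.06

speedup = sklearn_seconds / cuml_seconds
print(f"scikit-learn: {sklearn_seconds / 60:.1f} min")
print(f"cuML:         {cuml_seconds / 60:.1f} min")
print(f"speedup:      {speedup:.2f}x")

# Example use with any callable; here a trivial stand-in workload
total, elapsed = timed(sum, range(1000))
```

In practice you would wrap the two grid-search functions, for example `timed(classify_cuml_grid, X_train, X_test, y_train, y_test)`, and repeat each run several times to reduce noise.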

How to use cuML SVC as a Gridsearch CV classifier
