If you pass cuML's SVM classifier (cuml.svm.SVC) as the estimator to scikit-learn's GridSearchCV, an error occurs, so I am writing down a workaround. This is my first post on Qiita, so I would appreciate it if you could point out any mistakes or anything that is hard to follow.
- The return value of scikit-learn's svm.SVC.predict() is a numpy array
- The return value of cuML's cuml.svm.SVC.predict() is a cuDF Series
scikit-learn's GridSearchCV assumes that estimator.predict() returns a numpy array. Because cuML's SVC.predict() returns a cuDF Series, an error occurs inside GridSearchCV.
Without GridSearchCV, you can work around this by converting the return value to a numpy array after each call. With GridSearchCV that is not possible, because you hand the SVC instance itself to GridSearchCV and it calls predict internally.
- Create a class that inherits from cuml.svm.SVC
- Override the predict method so that it converts the return value to a numpy array before returning it
- Use an instance of that class as the estimator
As an example, we will use an SVM to classify the digits "5" and "8" in the MNIST dataset, for the following reasons.
- MNIST is easy to obtain and preprocess, and the dataset size is about right
- cuML's SVC currently supports only two-class classification
- "5" versus "8" appears to be the hardest pair to tell apart (Reference)
First, create the dataset. Extract only the 5s and 8s from MNIST and convert the labels to binary (5 → 0, 8 → 1).
dataset_maker.py
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split


def dataset_maker():
    # as_frame=False returns numpy arrays, so rows can be iterated directly
    mnist = fetch_openml('mnist_784', version=1, as_frame=False)
    data_58 = []
    label_58 = []
    for data, target in zip(mnist.data, mnist.target):
        # the targets in this dataset are strings
        if target == '5':
            data_58.append(data / 255)  # scale pixel values to [0, 1]
            label_58.append(0)
        elif target == '8':
            data_58.append(data / 255)
            label_58.append(1)
    data_58 = np.array(data_58)
    label_58 = np.array(label_58)
    X_train, X_test, y_train, y_test = train_test_split(data_58, label_58)
    return X_train, X_test, y_train, y_test
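The same filtering can also be done with a boolean mask instead of a Python loop. The following is only a sketch, not the code used in this post: `select_two_digits` is a hypothetical helper name, and it assumes the images and labels are already numpy arrays with string labels. A tiny synthetic array stands in for MNIST so that it runs without a download.

```python
import numpy as np


def select_two_digits(data, target, negative="5", positive="8"):
    """Keep only two digit classes and relabel them 0 (negative) and 1 (positive)."""
    data = np.asarray(data, dtype=np.float64)
    target = np.asarray(target)
    # True for rows whose label is one of the two digits we want
    mask = (target == negative) | (target == positive)
    X = data[mask] / 255.0                      # scale pixel values to [0, 1]
    y = (target[mask] == positive).astype(int)  # negative -> 0, positive -> 1
    return X, y


# Synthetic stand-in for MNIST: 4 "images" of 3 pixels each
demo_data = np.array([[0, 255, 0], [255, 255, 255], [0, 0, 0], [51, 0, 0]])
demo_target = np.array(["5", "3", "8", "8"])
X_demo, y_demo = select_two_digits(demo_data, demo_target)
```

The row labelled "3" is dropped, and the remaining rows keep their original order.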
Next, check the difference in the return value of the predict method, which is the cause of the error. As the code below shows, cuML's SVC can be used in the same way as scikit-learn's SVC, which makes it pleasantly easy to adopt.
sklearn_vs_cuML.py
from sklearn.svm import SVC as skSVC
from cuml.svm import SVC as cuSVC

from dataset_maker import dataset_maker


def classify_sklearn(X_train, X_test, y_train, y_test):
    clf = skSVC()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # show the container type and the first 10 predictions
    print("skSVC output_type:{}".format(type(y_pred)))
    print("skSVC y_pred:{}".format(y_pred[0:10]))


def classify_cuml(X_train, X_test, y_train, y_test):
    clf = cuSVC()
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    print("cuSVC output_type:{}".format(type(y_pred)))
    print("cuSVC y_pred:{}".format(y_pred[0:10]))


if __name__ == "__main__":
    X_train, X_test, y_train, y_test = dataset_maker()
    classify_sklearn(X_train, X_test, y_train, y_test)
    classify_cuml(X_train, X_test, y_train, y_test)
When you do this, the output will look like this:
skSVC output_type:<class 'numpy.ndarray'>
skSVC y_pred:[0 0 0 1 0 0 0 0 1 0]
cuSVC output_type:<class 'cudf.core.series.Series'>
cuSVC y_pred:0 0.0
1 0.0
2 0.0
3 1.0
4 0.0
5 0.0
6 0.0
7 0.0
8 1.0
9 0.0
dtype: float64
As noted above, the return values differ. In summary:
- The output type differs
  - sklearn: numpy.ndarray
  - cuML: cudf.core.series.Series
- The element type of the output differs
  - sklearn: int
  - cuML: float64
Because of the first difference, passing the return value of cuml.svm.SVC.predict() directly to one of scikit-learn's metric functions raises ValueError: Expected array-like (array or non-string sequence).[^1]
[^1]: The second difference seems to be cast implicitly, so fixing only the first one is enough to make things work. Relying on that feels uncomfortable, though, so the code below casts explicitly to int.
This by itself can be solved by converting to a numpy array. When classifying with cuML's SVC, convert the return value of the predict method to a numpy array with [cudf.core.series.Series.to_array()](https://rapidsai.github.io/projects/cudf/en/latest/api.html#cudf.core.series.Series.to_array) and then pass it to scikit-learn's metric function.[^2]
[^2]: cuML's own metric functions do of course accept cuDF Series, but there are few of them at present, so I expect scikit-learn's metric functions to be used in most practical cases.
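The conversion step can be sketched without a GPU as follows. This is an illustration under stated assumptions, not cuML or cuDF code: `to_host_labels` is a hypothetical helper, `FakeSeries` is a hypothetical stand-in that only mimics the `to_array()` method used in this post, and the `to_numpy` branch assumes that newer cuDF Series (like pandas Series) expose that method instead.

```python
import numpy as np


def to_host_labels(y_pred):
    """Return predictions as a 1-D numpy int array, whatever container they came in."""
    if hasattr(y_pred, "to_numpy"):    # assumption: newer cuDF Series, pandas Series
        y_pred = y_pred.to_numpy()
    elif hasattr(y_pred, "to_array"):  # the cuDF method used in this post
        y_pred = y_pred.to_array()
    return np.asarray(y_pred).astype(int).ravel()


class FakeSeries:
    """Hypothetical stand-in for a cuDF Series so the sketch runs on CPU only."""

    def __init__(self, values):
        self._values = list(values)

    def to_array(self):
        return np.array(self._values, dtype=np.float64)


y_true = np.array([0, 0, 1, 1])
y_pred = to_host_labels(FakeSeries([0.0, 0.0, 1.0, 0.0]))
# Plain-numpy accuracy; a scikit-learn metric would receive the same array
accuracy = float(np.mean(y_true == y_pred))
```

Once the predictions are a plain numpy integer array, they can be handed to any function that expects array-like labels.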
Now for the main topic. If you want to tune SVC's hyperparameters by grid search, the first thing that comes to mind is probably scikit-learn's GridSearchCV. First, let's try it with scikit-learn's SVC as the estimator.
classify_sklearn_grid.py
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC as skSVC

from dataset_maker import dataset_maker


def classify_sklearn_grid(X_train, X_test, y_train, y_test):
    parameters = {'kernel': ['linear', 'rbf'],
                  'C': [0.1, 1, 10, 100],
                  'gamma': [0.1, 1, 10]}
    clf = GridSearchCV(skSVC(), parameters, scoring='accuracy', verbose=2)
    clf.fit(X_train, y_train)
    # predict on the test set with the best estimator found by the search
    y_pred = clf.best_estimator_.predict(X_test)
    return y_pred


if __name__ == "__main__":
    X_train, X_test, y_train, y_test = dataset_maker()
    pred_sk_grid = classify_sklearn_grid(X_train, X_test, y_train, y_test)
With scikit-learn's SVC, the code looks something like the above.
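To get a feel for how much work this grid implies, the candidates can be enumerated with the standard library alone. This is a rough sketch, not scikit-learn's internal code; `n_folds = 5` is an assumption based on the default `cv` of recent scikit-learn versions (0.22 and later).

```python
from itertools import product

parameters = {"kernel": ["linear", "rbf"],
              "C": [0.1, 1, 10, 100],
              "gamma": [0.1, 1, 10]}
n_folds = 5  # assumption: default number of CV folds in scikit-learn 0.22+

# Cartesian product of all parameter values, one dict per candidate
names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]
n_fits = len(candidates) * n_folds

print(len(candidates), "candidates,", n_fits, "fits")
```

Note that `gamma` has no effect on the linear kernel, so some of these candidates are effectively duplicates. GridSearchCV also accepts a list of dictionaries as the parameter grid, which is one way to avoid that.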
Since cuML's SVC is a class that provides the required methods such as .fit() and .predict(), it satisfies GridSearchCV's requirements for an estimator.
In practice, however, the predict method returns a cuDF Series, which causes an error when the results are scored. Because GridSearchCV takes the SVC instance itself and calls predict internally, there is no opportunity to call to_array on each result.
To solve this, override the predict method so that it returns a numpy array.
It is simple: just define a new class like this:
MySVC.py
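from cuml.svm import SVC


class MySVC(SVC):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)

    def predict(self, X):
        # cuML returns a cuDF Series; convert it to a numpy int array for scikit-learn.
        # to_array() is the cuDF method available when this was written; newer cuDF
        # releases provide to_numpy() instead.
        y_pred = super().predict(X).to_array().astype(int)
        return y_pred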
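The same override pattern can be tried on a machine without a GPU by swapping in a dummy base class. Everything in the sketch below is hypothetical and only illustrates the inheritance mechanics: `FakeGpuClassifier` and `FakeSeries` stand in for cuml.svm.SVC and a cuDF Series, and `NumpyPredictMixin` is one possible way to reuse the conversion across estimators.

```python
import numpy as np


class FakeSeries:
    """Hypothetical stand-in for a cuDF Series."""

    def __init__(self, values):
        self._values = np.asarray(values, dtype=np.float64)

    def to_array(self):
        return self._values.copy()


class FakeGpuClassifier:
    """Hypothetical stand-in for cuml.svm.SVC: predicts 1 when the row sum is positive."""

    def fit(self, X, y):
        return self

    def predict(self, X):
        return FakeSeries((np.asarray(X).sum(axis=1) > 0).astype(float))


class NumpyPredictMixin:
    """Convert the parent's Series-like predict() result to a numpy int array."""

    def predict(self, X):
        return np.asarray(super().predict(X).to_array()).astype(int)


class HostClassifier(NumpyPredictMixin, FakeGpuClassifier):
    pass


pred = HostClassifier().fit([[0, 0]], [0]).predict([[1, 2], [-3, 1], [0, 5]])
```

Because the mixin comes first in the base-class list, its `predict` runs first and `super().predict(X)` reaches the wrapped classifier.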
You can pass this MySVC to GridSearchCV in place of cuML's SVC. It may not need spelling out, but the code is as follows.
classify_MySVC.py
from sklearn.model_selection import GridSearchCV

from dataset_maker import dataset_maker
from MySVC import MySVC


def classify_cuml_grid(X_train, X_test, y_train, y_test):
    parameters = {'kernel': ['linear', 'rbf'],
                  'C': [0.1, 1, 10, 100],
                  'gamma': [0.1, 1, 10]}
    # MySVC.predict returns a numpy array, so scoring inside GridSearchCV works
    clf = GridSearchCV(MySVC(), parameters, scoring='accuracy', verbose=2)
    clf.fit(X_train, y_train)
    y_pred = clf.best_estimator_.predict(X_test)
    return y_pred


if __name__ == "__main__":
    X_train, X_test, y_train, y_test = dataset_maker()
    pred_cu_grid = classify_cuml_grid(X_train, X_test, y_train, y_test)
That is how it looks. You should now be able to use GridSearchCV with cuML! This turned out long, but thank you for reading!
While I am at it, here is the difference in execution time between scikit-learn and cuML.
- scikit-learn: 1348.87 s (about 22.5 minutes)
- cuML: 270.06 s (about 4.5 minutes)
Each was run only once, so treat the numbers as a rough reference, but scikit-learn took about 5 times longer. cuML really is fast!
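For reference, the ratio can be checked from the two timings above, and a comparison like this can be taken with a small wall-clock helper. This is just a sketch: `timed` is a hypothetical helper, not the method used to obtain the numbers in this post.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Timings reported in this post, in seconds
sklearn_seconds = 1348.87
cuml_seconds = 270.06
speedup = sklearn_seconds / cuml_seconds

# Example use of the helper on a trivial function
result, elapsed = timed(sum, [1, 2, 3])
print(f"speedup: x{speedup:.2f}")
```

In practice, `timed` would wrap calls such as `classify_sklearn_grid(...)` and `classify_cuml_grid(...)`. Repeating each measurement several times would make the comparison more reliable.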