Kaggle's Digit Recognizer (MNIST) competition is a task of classifying 28x28 images of handwritten digits. For image data the natural choice is a convolutional neural network (CNN), but here I deliberately tried a support vector machine to see how accurate it could get.
The accuracy is inferior to a CNN (at least with my machine-learning skill), but even searching the net I found few examples of tackling Kaggle / MNIST with a support vector machine, so I am recording this as a data point for roughly what accuracy you can expect. The final accuracy was ***0.98375***.
I tried three things to improve the accuracy, and I will explain each of them below.
This article is intended for:
- People who have touched Kaggle (i.e. have read in data, processed it somehow, and submitted a result)
- People who know roughly what a support vector machine is
First, read the data into `train_data` and `test_data` as usual, and save the number of rows of each in `train_data_len` and `test_data_len`.
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# load data
train_data = pd.read_csv("/kaggle/input/digit-recognizer/train.csv")
test_data = pd.read_csv("/kaggle/input/digit-recognizer/test.csv")
train_data_len = len(train_data)
test_data_len = len(test_data)
print("Length of train_data ; {}".format(train_data_len))
print("Length of test_data ; {}".format(test_data_len))
# Length of train_data ; 42000
# Length of test_data ; 28000
Next, split the training data as follows:
- `train_data_y` : the labels
- `train_data_x` : the original data with the label column dropped
train_data_y = train_data["label"]
train_data_x = train_data.drop(columns="label")
If you look at the digit images (***it is important to look at the raw data!***), you can see, for example, that "1" and "8" cover very different numbers of pixels (a very different stroke area). So the ***mean*** of all pixel values can serve as a new feature, and likewise the ***standard deviation*** of all pixel values.
The mean and standard deviation of sub-areas, such as the upper and lower halves of the 28x28 image, can also be features. I therefore computed the mean and standard deviation over the following areas and added them as new features:
- the whole 28x28 image
- upper half / lower half
- left half / right half
- 1/4 areas
- 1/9 areas
- 1/16 areas
Extracting the pixels of each area is easy with `pandas.DataFrame.query()`, but to use it this way I first ***swap the rows and columns*** of the original data, and then add a serially numbered `no` column to select pixels with.
df = pd.concat([train_data_x, test_data]) #Process training data and test data at once
df_T = df.T #Swap rows and columns
df_T["no"] = range(len(df_T)) #Add column for no
Then the mean and standard deviation over the whole 28x28 image are calculated to create the first new features (`a_mean`, `a_std`).
df_T.loc["a_mean"] = df_T.mean()
df_T.loc["a_std"] = df_T.std()
Next, calculate the mean and standard deviation of the upper and lower halves. The extraction condition can be written as a plain string, which is convenient.
# horizontal split: upper half / lower half
for i in range(2):
    q = 'no < 28*28/2' if i == 0 else 'no >= 28*28/2'
    df_T.loc["b{}_mean".format(i)] = df_T[:784].query(q).mean()
    df_T.loc["b{}_std".format(i)] = df_T[:784].query(q).std()
Then calculate the mean and standard deviation of the left and right halves. The extraction condition gets a little more complicated: whether the remainder of `no` divided by 28 is less than 14 or not decides whether a pixel belongs to the left half or the right half.
# vertical split: left half / right half
for i in range(2):
    q = 'no % 28 < 14' if i == 0 else 'no % 28 >= 14'
    df_T.loc["c{}_mean".format(i)] = df_T[:784].query(q).mean()
    df_T.loc["c{}_std".format(i)] = df_T[:784].query(q).std()
Calculate the mean and standard deviation for the 1/4, 1/9, and 1/16 areas in the same way.
# mean and std of 1/4 area
for i in range(2):
    qi = 'no < 28*28/2' if i == 0 else 'no >= 28*28/2'
    for j in range(2):
        qj = 'no % 28 < 14' if j == 0 else 'no % 28 >= 14'
        q = qi + " & " + qj
        num = i * 2 + j
        df_T.loc["d{}_mean".format(num)] = df_T[:784].query(q).mean()
        df_T.loc["d{}_std".format(num)] = df_T[:784].query(q).std()

# mean and std of 1/9 area
for i in range(3):
    if i == 0:
        qi = 'no < 262'
    elif i == 1:
        qi = '262 <= no < 522'
    else:
        qi = '522 <= no < 784'
    for j in range(3):
        if j == 0:
            qj = 'no % 28 < 9'
        elif j == 1:
            qj = '9 <= no % 28 < 18'
        else:
            qj = '18 <= no % 28'
        q = qi + " & " + qj
        num = i * 3 + j
        df_T.loc["e{}_mean".format(num)] = df_T[:784].query(q).mean()
        df_T.loc["e{}_std".format(num)] = df_T[:784].query(q).std()

# mean and std of 1/16 area
for i in range(4):
    qi = '{0} <= no < {1}'.format(28*28/4*i, 28*28/4*(i+1))
    for j in range(4):
        qj = '{0} <= no % 28 < {1}'.format(28/4*j, 28/4*(j+1))
        q = qi + " & " + qj
        num = i * 4 + j
        df_T.loc["f{}_mean".format(num)] = df_T[:784].query(q).mean()
        df_T.loc["f{}_std".format(num)] = df_T[:784].query(q).std()
Finally, drop the `no` column and transpose back to restore the original rows and columns.
df_T.drop(columns="no", inplace=True)
df = df_T.T
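At this point the combined frame should hold the 784 original pixel columns plus the 68 aggregate features added above (2 + 4 + 4 + 8 + 18 + 32). A quick sanity check; the expected numbers are my own tally, not output from the original notebook:

```python
# 42000 training rows + 28000 test rows = 70000 rows,
# 784 pixel columns + 68 mean/std features = 852 columns (before any columns are dropped)
print(df.shape)  # expected: (70000, 852)
```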
Looking at the digit images again, the pixel in the upper-left corner, for example, appears to be ***0 for every digit***. Pixels that have the same value in every image carry no information, so I decided to drop them (with a CNN you cannot do this, because the relationship with the surrounding pixels matters).
Concretely, I computed the maximum and minimum of each column and dropped the columns where the two are equal (this is stricter than, say, dropping columns with small variance, which might throw away useful information). 65 columns were dropped.
# drop columns if all values are the same
drop_col = []  # prepare an empty list
for c in df.columns:  # for each column
    col_max = df[c].max()
    col_min = df[c].min()
    if col_max == col_min:  # drop if the maximum and minimum are equal
        drop_col.append(c)

print("# of dropping columns ; {}".format(len(drop_col)))
df.drop(drop_col, axis=1, inplace=True)
# number of dropping columns ; 65
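As an aside, the same constant columns can be found without the explicit loop by comparing the column-wise maxima and minima directly; a minimal equivalent sketch:

```python
# Columns whose maximum equals their minimum are constant across all images.
drop_col = df.columns[df.max() == df.min()].tolist()
print("# of dropping columns ; {}".format(len(drop_col)))  # should also report 65
```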
Before tuning the hyperparameters, scale the data values into the range 0 to 1; otherwise the later calculations take a very long time. I used `sklearn.preprocessing.MinMaxScaler` here. Its `transform()` method returns an ndarray, so the result is converted back into a `pandas.DataFrame`.
# scaling
from sklearn import preprocessing
mmscaler = preprocessing.MinMaxScaler()
mmscaler.fit(df)
df_scaled = pd.DataFrame(mmscaler.transform(df), columns=df.columns, index=df.index)
# separate df_scaled into train_data and test data
train_data_x = df_scaled[:train_data_len]
test_data = df_scaled[train_data_len:]
For the support vector machine I use `sklearn.svm.SVC()`. The hyperparameters are tuned with `sklearn.model_selection.GridSearchCV()` on a small subset of the data, gradually narrowing down the ranges of `C` and `gamma`.
Besides these there are other hyperparameters such as `kernel` and `decision_function_shape`; I used the following values:
kernel="rbf"
decision_function_shape="ovo"
An example script is as follows.
# obtain small size of data
from sklearn.model_selection import train_test_split
train_data_x_sub, x_test, train_data_y_sub, y_test = train_test_split(train_data_x, train_data_y,
                                                                       train_size=3000, test_size=100,
                                                                       random_state=1)

from sklearn.model_selection import GridSearchCV
from sklearn import svm
from sklearn import metrics

param_grid = {"C": [10 ** i for i in range(-5, 6, 2)],
              "gamma": [10 ** i for i in range(-5, 6, 2)]}

model_grid = GridSearchCV(estimator=svm.SVC(kernel="rbf", decision_function_shape="ovo", random_state=1),
                          param_grid=param_grid,
                          scoring="accuracy",  # metrics
                          verbose=2,
                          cv=4)  # cross-validation

model_grid.fit(train_data_x_sub, train_data_y_sub)

model_grid_best = model_grid.best_estimator_  # best estimator
print("Best Model Parameter: ", model_grid.best_params_)
# Best model parameter : {'C': 10, 'gamma': 0.001}

print('Train score: {}'.format(model_grid_best.score(train_data_x_sub, train_data_y_sub)))
print('Cross Validation score: {}'.format(model_grid.best_score_))
# Train score: 0.9593333333333334
# Cross Validation score: 0.9063333333333333

# check calculation results
means = model_grid.cv_results_['mean_test_score']
stds = model_grid.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, model_grid.cv_results_['params']):
    print("%0.4f (+/-%0.04f) for %r" % (mean, std * 2, params))

prediction = model_grid_best.predict(train_data_x_sub)
co_mat = metrics.confusion_matrix(train_data_y_sub, prediction)
print(co_mat)

print('Total Train score: {}'.format(model_grid_best.score(train_data_x, train_data_y)))
# Total Train score: 0.9251904761904762
In the script above, the following part prints the cross-validation result for every parameter combination. Looking at this output, I narrowed down the ranges of `C` and `gamma` for the next run.
# check calculation results
means = model_grid.cv_results_['mean_test_score']
stds = model_grid.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, model_grid.cv_results_['params']):
    print("%0.4f (+/-%0.04f) for %r" % (mean, std * 2, params))
For reference, this is how I narrowed down the parameters in my case. A random search might save more time (a sketch follows below the table).
| No | Number of data | cv | C | gamma | Optimal C | Optimal gamma | CV score |
|---|---|---|---|---|---|---|---|
| 1 | 3000 | 4 | [10 ** i for i in range(-5, 6, 2)] | [10 ** i for i in range(-5, 6, 2)] | 10 | 0.001 | 0.9063 |
| 2 | 3000 | 5 | [0.3, 1, 3, 10, 30, 100, 300] | [0.0001, 0.0003, 0.001, 0.003, 0.01, 0.03] | 3 | 0.03 | 0.9443 |
| 3 | 8000 | 5 | [3, 10, 30, 100, 300] | [0.025, 0.03, 0.04, 0.05] | 10 | 0.025 | 0.9695 |
| 4 | 8000 | 5 | [5, 8, 10, 15, 20] | [0.015, 0.02, 0.024, 0.028] | 5 | 0.028 | 0.9694 |
Looking at it now, the CV score was actually a little better in round 3 than in the final round... :-p
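About the random-search idea mentioned above: a minimal sketch using `sklearn.model_selection.RandomizedSearchCV` with log-uniform sampling. The distributions and `n_iter` below are my own illustrative choices, not settings I actually ran:

```python
from scipy.stats import loguniform  # requires scipy >= 1.4
from sklearn.model_selection import RandomizedSearchCV
from sklearn import svm

# Sample C and gamma on a log scale instead of scanning a fixed grid.
param_dist = {"C": loguniform(1e-2, 1e3),
              "gamma": loguniform(1e-4, 1e-1)}
model_rand = RandomizedSearchCV(estimator=svm.SVC(kernel="rbf", decision_function_shape="ovo", random_state=1),
                                param_distributions=param_dist,
                                n_iter=20,           # number of sampled parameter settings
                                scoring="accuracy",
                                cv=4,
                                random_state=1)
model_rand.fit(train_data_x_sub, train_data_y_sub)
print("Best Model Parameter: ", model_rand.best_params_)
print("Cross Validation score: {}".format(model_rand.best_score_))
```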
Using the optimal `C` and `gamma` obtained above, I train on all of the training data.
from sklearn import svm
from sklearn import metrics
clf = svm.SVC(C=5, gamma=0.028, decision_function_shape="ovo", kernel="rbf", verbose=2)
clf.fit(train_data_x, train_data_y)
prediction = clf.predict(train_data_x)
accuracy_score = metrics.accuracy_score(train_data_y, prediction)
print(accuracy_score)
# Accuracy : 0.9999761904761905
co_mat = metrics.confusion_matrix(train_data_y, prediction)
print(co_mat)
prediction = clf.predict(test_data)
output = pd.DataFrame({"ImageId" : np.arange(1, 28000+1), "Label":prediction})
output.head()
output.to_csv('digit_recognizer_SVM7a.csv', index=False)
print("Your submission was successfully saved!")
The submission scored an accuracy of 0.98375, which is slightly below the top 50%.
For comparison, `sklearn.svm.SVC()` with its default hyperparameters scores 0.97571, and optimizing only the hyperparameters without adding or dropping any features scores 0.98207. So the feature engineering helps, but only a little (which is a bit painful).
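As a rough idea of what that default baseline looks like (a sketch only; the defaults shown in the comment are those of recent scikit-learn versions, and I did not re-run this here):

```python
from sklearn import svm

# Default hyperparameters: C=1.0, kernel="rbf", gamma="scale"
baseline = svm.SVC()
baseline.fit(train_data_x, train_data_y)
baseline_prediction = baseline.predict(test_data)
```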
To push the accuracy any further, it is probably better to just move on to a CNN.
References:
- Script uploaded to GitHub (digit-recognition_SVM7a.py)
- `pandas.DataFrame.query` for extracting rows of a DataFrame conditionally
- Official documentation of `sklearn.preprocessing.MinMaxScaler`
- MNIST database (Wikipedia); it lists an error rate of 0.56 as an example for a support vector machine, but I am not sure about the details.