To automate that step, I implemented a process that determines whether a scanned image is a color page or a black-and-white page. (The scanner does have automatic discrimination, but I want to keep the original data in color.)
To cut to the chase, Python + SVM gave simple and reasonably accurate results. I referred to https://qiita.com/kazuki_hayakawa/items/18b7017da9a6f73eba77 when implementing it; thank you very much.
```python
import sklearn.model_selection
import sklearn.metrics
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
import cv2
import numpy as np


class IsColorPageSVM:
    def __init__(self, kernel='linear', random_state=None):
        self.svm = SVC(kernel=kernel, random_state=random_state)
        self.scaler = StandardScaler()

    def learn(self, vec_array, label_array, test_size=0.3, random_state=None):
        # Split into training data and test data
        vec_train, vec_test, label_train, label_test = sklearn.model_selection.train_test_split(
            vec_array, label_array, test_size=test_size, random_state=random_state)
        # Fit the standardization parameters on the training data
        self.scaler.fit(vec_train)
        # Train the model on the standardized training data
        self.svm.fit(self.scaler.transform(vec_train), label_train)
        # Return the accuracy on the training data and on the test data
        return self.test(vec_train, label_train), self.test(vec_test, label_test)

    # Given feature vectors and their labels, return the accuracy
    def test(self, vec_array, label_array):
        vec_array_std = self.scaler.transform(vec_array)
        pred = self.svm.predict(vec_array_std)
        return sklearn.metrics.accuracy_score(label_array, pred)
```
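As a quick sanity check, the class can be exercised with a few handmade feature vectors before wiring in the real image statistics. The numbers below are made up for illustration and are not data from this article:

```python
import numpy as np

# Toy feature vectors (2-dimensional here, purely illustrative); label 0 = color, 1 = black and white
vec_array = np.array([[45.0, 120.0], [50.0, 130.0], [15.0, 20.0], [18.0, 25.0],
                      [42.0, 110.0], [20.0, 22.0], [48.0, 125.0], [16.0, 18.0]])
label_array = [0, 0, 1, 1, 0, 1, 0, 1]

model = IsColorPageSVM()
train_acc, test_acc = model.learn(vec_array, label_array, test_size=0.25, random_state=0)
print('Accuracy (training/test): {0:.3f} / {1:.3f}'.format(train_acc, test_acc))
```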
I chose SVM because it specializes in two-class classification and is said to be quite accurate, and also because it looked like something I could get working quickly in Python. Using the trendy deep learning might give different results.
If you feed the image data in as-is, the number of dimensions varies from image to image, so some conversion is needed. Reshaping everything to 128 * 128 might be one option, but I found that for color judgment there is a better approach. The Q&A site Oshiete! goo had the following post.
> I calculated the standard deviation of each HSV channel with cvAvgSdv. As far as I tried with the images I have, the standard deviation of saturation (S) came out roughly as follows. Color: around 40 to 50 for pages dominated by a single color; otherwise usually over 80 or even over 100. Black and white: around 20. Yellowed pages: around 20 to 30. (I did not have many samples, so treat this as a rough guide.) Based on this, it looks like pages can be judged by setting a threshold around 30 to 40.
In that case, I could probably just judge by this threshold without doing any machine learning. That is what I thought, and when I tried it on the images I had it was mostly correct, but badly yellowed (sunburnt) pages were judged as color, and when I raised the threshold, subtly colored pages were judged as black and white. To handle those you also need to look at the hue, and a single threshold on one value simply is not enough. That is why I decided to do machine learning.
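For reference, here is a minimal sketch of that simple threshold approach, assuming OpenCV. The threshold of 35 and the file name are placeholders of mine, not values from the article:

```python
import cv2

def is_color_by_threshold(image_path, s_std_threshold=35.0):
    """Judge color vs. black and white from the standard deviation of saturation alone."""
    img = cv2.imread(image_path)                # BGR image
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV)  # convert to HSV
    _, std = cv2.meanStdDev(hsv)                # per-channel mean and standard deviation
    s_std = std[1][0]                           # index 1 is the saturation channel
    return s_std > s_std_threshold

print(is_color_by_threshold('sample_page.png'))
```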
So the plan: convert the image to HSV, take the standard deviation of each of the H, S, and V channels, and then also add the mean, maximum, and minimum values and see how it goes.
In the end, I used the following 12-dimensional vector as the data format.
(standard deviation of H, standard deviation of S, standard deviation of V, mean of H, mean of S, mean of V, maximum of H, maximum of S, maximum of V, minimum of H, minimum of S, minimum of V)
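For reference, here is a sketch of how such a 12-dimensional vector could be computed per page with OpenCV. The article itself does not show this extraction step, so the helper name `hsv_features` and the way the values are saved are my own assumptions:

```python
import cv2
import numpy as np

def hsv_features(image_path):
    """Return the 12-dimensional vector
    (std. dev., mean, max, min of each of the H, S, V channels)."""
    img = cv2.imread(image_path)
    hsv = cv2.cvtColor(img, cv2.COLOR_BGR2HSV).astype(np.float64)
    channels = [hsv[:, :, c] for c in range(3)]   # H, S, V planes
    stds = [ch.std() for ch in channels]
    means = [ch.mean() for ch in channels]
    maxs = [ch.max() for ch in channels]
    mins = [ch.min() for ch in channels]
    return np.array(stds + means + maxs + mins)

# One row per scanned page could then be collected and saved with np.savetxt(..., delimiter=',')
# to produce CSV files like the ones read in main() below.
```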
The color-page and black-and-white-page data were read as in the code below, and the accuracy was averaged over 100 training runs. To measure how much each part of the 12-dimensional vector contributes to learning, I also ran the same test on slices with only some of the dimensions extracted.
Item | Contents |
---|---|
Environment | Python 3.8 |
Number of color-page training samples | 267 |
Number of black-and-white-page training samples | 267 |
Number of trials | 100 |
```python
def main():
    # Read the 2D feature arrays from the CSV files
    color_vec_array = np.loadtxt('color_page.csv', delimiter=',')
    gray_vec_array = np.loadtxt('gray_page.csv', delimiter=',')
    # There are more black-and-white samples than color ones, so trim to the same size
    gray_vec_array = gray_vec_array[:len(color_vec_array), :]
    # Merge into one 2D array
    vec_array = np.append(color_vec_array, gray_vec_array, axis=0)
    # Label color images as 0 and black-and-white images as 1
    label_array = [0] * len(color_vec_array) + [1] * len(gray_vec_array)
    # i:j is the slice of vector dimensions to extract
    for i in range(0, vec_array.shape[1]):
        for j in range(i + 1, vec_array.shape[1] + 1):
            train_score, test_score = 0.0, 0.0
            # Average the results over 100 training runs
            test_num = 100
            print('Data for dimensions {0:2}:{1:2}'.format(i, j), end=' ')
            for n in range(test_num):
                data_list = vec_array[:, i:j]
                model = IsColorPageSVM()
                train, test = model.learn(data_list, label_array)
                train_score += train
                test_score += test
            print('Accuracy (training/test): {0:0.3} - {1:0.3}'.format(
                train_score / test_num, test_score / test_num))


if __name__ == '__main__':
    main()
```
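Once the accuracy looks acceptable, the trained model could be applied to a newly scanned page. The following is only a sketch under my own assumptions: it reuses the hypothetical `hsv_features` helper from the earlier sketch, the same CSV files as `main()`, and a placeholder file name:

```python
# Train on all 12 dimensions, then classify one new scan (sketch, not from the article)
color_vec_array = np.loadtxt('color_page.csv', delimiter=',')
gray_vec_array = np.loadtxt('gray_page.csv', delimiter=',')
gray_vec_array = gray_vec_array[:len(color_vec_array), :]
vec_array = np.append(color_vec_array, gray_vec_array, axis=0)
label_array = [0] * len(color_vec_array) + [1] * len(gray_vec_array)

model = IsColorPageSVM()
model.learn(vec_array, label_array)

new_vec = hsv_features('new_scan.png').reshape(1, -1)   # one sample, 12 features
prediction = model.svm.predict(model.scaler.transform(new_vec))[0]
print('black and white page' if prediction == 1 else 'color page')
```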
The table below shows the accuracy for each extracted slice of dimensions. The "Dimensions used" column indicates which part of the 12-dimensional vector was used: for example, 3: 6 means the elements from index 3 up to, but not including, index 6. Looking at it, the standard deviation of S by itself already achieves a considerable accuracy (consistent with the quoted post in the data format section above).
The highest test accuracy, 99%, was shared by the 0: 4 slice (standard deviation of H, S, V plus the mean of H) and the 3: 9 slice (mean of H, S, V plus the maximum of H, S, V). What surprised me is that 3: 9, which does not use the standard deviation at all, also ties for first place (and since its training score is higher than that of 0: 4, it is arguably the better of the two). As I touched on in the data format section, this may be because I labeled pages printed in a single color (for example, a page printed only in blue) as color pages.
Naturally, slices with only maximum/minimum values or only brightness information are almost unable to discriminate (with two classes, an accuracy around 50% is essentially random). The minimum values become 0 as soon as a page contains any black pixels, and brightness does not distinguish color pages from black-and-white ones.
Dimensions used | Accuracy (training data) | Accuracy (test data) | Remarks |
---|---|---|---|
0: 1 | 0.774 | 0.777 | Standard deviation of H only |
0: 2 | 0.958 | 0.96 | Standard deviation of H, S |
0: 3 | 0.97 | 0.968 | Standard deviation of H, S, V |
0: 4 | 0.99 | 0.99 (tied for 1st place) | Standard deviation of H, S, V + mean of H |
0: 5 | 0.99 | 0.987 | Standard deviation of H, S, V + mean of H, S |
0: 6 | 0.991 | 0.984 | Standard deviation of H, S, V + mean of H, S, V |
0: 7 | 0.991 | 0.985 | Standard deviation of H, S, V + mean of H, S, V + maximum of H |
0: 8 | 0.993 | 0.986 | Standard deviation of H, S, V + mean of H, S, V + maximum of H, S |
0: 9 | 0.996 | 0.989 | Standard deviation of H, S, V + mean of H, S, V + maximum of H, S, V |
0:10 | 0.995 | 0.988 | Standard deviation of H, S, V + mean of H, S, V + maximum of H, S, V + minimum of H |
0:11 | 0.995 | 0.986 | Standard deviation of H, S, V + mean of H, S, V + maximum of H, S, V + minimum of H, S |
0:12 | 0.995 | 0.987 | All elements |
1: 2 | 0.953 | 0.952 | Standard deviation of S only |
1: 3 | 0.968 | 0.964 | |
1: 4 | 0.99 | 0.987 | |
1: 5 | 0.989 | 0.989 | |
1: 6 | 0.991 | 0.986 | |
1: 7 | 0.992 | 0.983 | |
1: 8 | 0.992 | 0.988 | |
1: 9 | 0.995 | 0.989 | |
1:10 | 0.996 | 0.987 | |
1:11 | 0.994 | 0.987 | |
1:12 | 0.995 | 0.985 | |
2: 3 | 0.602 | 0.588 | Standard deviation of V only |
2: 4 | 0.832 | 0.831 | |
2: 5 | 0.928 | 0.92 | |
2: 6 | 0.941 | 0.937 | |
2: 7 | 0.943 | 0.935 | |
2: 8 | 0.991 | 0.986 | |
2: 9 | 0.994 | 0.988 | |
2:10 | 0.994 | 0.989 | |
2:11 | 0.994 | 0.987 | |
2:12 | 0.994 | 0.986 | |
3: 4 | 0.814 | 0.816 | Mean of H only |
3: 5 | 0.925 | 0.925 | |
3: 6 | 0.937 | 0.935 | |
3: 7 | 0.938 | 0.939 | |
3: 8 | 0.99 | 0.988 | |
3: 9 | 0.994 | 0.99 (tied for 1st place) | Mean of H, S, V + maximum of H, S, V |
3:10 | 0.994 | 0.989 | |
3:11 | 0.993 | 0.986 | |
3:12 | 0.993 | 0.986 | |
4: 5 | 0.772 | 0.775 | Mean of S only |
4: 6 | 0.761 | 0.75 | |
4: 7 | 0.762 | 0.754 | |
4: 8 | 0.961 | 0.96 | |
4: 9 | 0.97 | 0.964 | |
4:10 | 0.97 | 0.961 | |
4:11 | 0.975 | 0.964 | |
4:12 | 0.981 | 0.97 | |
5: 6 | 0.525 | 0.491 | Mean of V only |
5: 7 | 0.529 | 0.492 | |
5: 8 | 0.961 | 0.954 | |
5: 9 | 0.969 | 0.962 | |
5:10 | 0.969 | 0.964 | |
5:11 | 0.974 | 0.966 | |
5:12 | 0.982 | 0.971 | |
6: 7 | 0.516 | 0.477 | Maximum value of H only |
6: 8 | 0.957 | 0.961 | |
6: 9 | 0.97 | 0.963 | |
6:10 | 0.969 | 0.964 | |
6:11 | 0.973 | 0.966 | |
6:12 | 0.98 | 0.969 | |
7: 8 | 0.961 | 0.957 | Maximum value of S only |
7: 9 | 0.97 | 0.965 | |
7:10 | 0.968 | 0.969 | |
7:11 | 0.973 | 0.97 | |
7:12 | 0.978 | 0.968 | |
8: 9 | 0.521 | 0.487 | Maximum value of V only |
8:10 | 0.522 | 0.484 | |
8:11 | 0.533 | 0.493 | |
8:12 | 0.903 | 0.897 | |
9:10 | 0.512 | 0.477 | Minimum value of H only |
9:11 | 0.518 | 0.477 | |
9:12 | 0.896 | 0.891 | |
10:11 | 0.518 | 0.481 | Minimum value of S only |
10:12 | 0.895 | 0.891 | |
11:12 | 0.894 | 0.894 | Minimum value of V only |
As for color judgment, what remains for the future is to try feeding the reshaped image data in as-is, or to try other methods such as deep learning. After that, I also want to judge whether a page is a novel's illustration page or a text page, so I may do that as well (I got a decent result with the same approach, so I may write about it). That's it. Thank you for reading to the end.