Thanks to Elasticsearch as a search engine, it has become relatively easy to extract work information from search terms. With Elasticsearch you can readily implement keyword-based recommendation that also takes into account the genre and tags of the work a user is reading.
However, mine is a manga site, so elements like art style and personal taste also matter. Back when I had money to spare, the content was secondary and I usually bought on the strength of the cover alone. So this is an attempt to somehow capture those oddly taste-driven aspects.
As for how to try it, my understanding is that you can take manga in as image data and classify (cluster) it somehow. But then, what exactly should I do? **I have no idea whether this is on the right track at all, so corrections are very welcome.** I once took an interest in the history of artificial intelligence and dabbled in it a little, so going from that: **wouldn't it be possible to express an image as what's called a feature vector, and cluster using that?** I'll start from that understanding.
When I asked Google, it turns out that local features called **SURF** are often used in image pattern recognition. The idea is to generate features from points that stay the same even when the image's brightness changes or the image is scaled or rotated. Since many such points can be extracted, they are **local** features: one image does not yield just one.
Let's start with that, and for clustering the features, quickly apply classification by k-means, which also seems to be commonly used. To be honest, there are a lot of terms I don't understand yet, but I think I can look up how each piece works as needed.
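The pipeline just described (extract many local descriptors per image, cluster all of them with k-means to build a "visual vocabulary", then represent each image as a histogram of cluster assignments) can be sketched as follows. This is only a minimal illustration: random arrays stand in for real SURF descriptors, and the array sizes and vocabulary size are made-up numbers.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-ins for SURF output: each "image" yields many 64-dimensional local descriptors.
rng = np.random.RandomState(0)
descriptors_per_image = [rng.rand(rng.randint(50, 100), 64) for _ in range(5)]

# Learn a visual vocabulary by clustering all descriptors from all images together.
vocab_size = 16
km = KMeans(n_clusters=vocab_size, n_init=10, random_state=0)
km.fit(np.concatenate(descriptors_per_image))

# Each image becomes a fixed-length histogram counting how often each
# visual word (cluster) appears among its descriptors.
histograms = []
for d in descriptors_per_image:
    c = km.predict(d)
    histograms.append(np.bincount(c, minlength=vocab_size))
histograms = np.array(histograms)
print(histograms.shape)  # (5, 16)
```

With real data the descriptors would come from running SURF on each grayscale image, and the histograms would feed a second k-means run, which is exactly what the two scripts below do.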
The environment is as follows.

OS: Mac
Language: Python 2.7
Development environment: PyCharm Community Edition 2017.1
Machine learning library: scikit-learn
Image processing library: mahotas
Numerical library: NumPy
The reason for this environment is simply that it is the setup I learned in an online course. It may change as needed in the future. On a Mac, Python 2.7 came installed by default, so I used that as-is.
Code
Learning phase
```python
# coding:utf-8
import numpy as np
from sklearn import cluster
from sklearn.externals import joblib
import mahotas as mh
from mahotas import surf
from datetime import datetime
import cStringIO
import urllib

datetime_format = "%Y/%m/%d %H:%M:%S"

# Parameters
feature_category_num = 512

# Read the image URLs from a text file.
url_list = []
list_file = open("list.txt")
for l in list_file:
    url_list.append(l.rstrip())
list_file.close()

# Image processing: extract SURF descriptors for each image.
base = []
for url in url_list:
    f = cStringIO.StringIO(urllib.urlopen(url).read())
    im = mh.imread(f, as_grey=True)
    im = im.astype(np.uint8)
    base.append(surf.surf(im))
concatenated = np.concatenate(base)
del base

# Compute the base features (visual vocabulary) with k-means.
km = cluster.KMeans(feature_category_num)
km.fit(concatenated)

# Save the base features; the filename must match the one loaded in the classification phase.
joblib.dump(km, "km-cluster-surf.pk1")
```
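As an aside, this script targets Python 2.7, and `sklearn.externals.joblib` has since been removed from scikit-learn (around version 0.23). A minimal sketch of the same save/load roundtrip under Python 3 with the standalone `joblib` package; the random array is just a stand-in for the concatenated SURF descriptors:

```python
import numpy as np
from sklearn.cluster import KMeans
import joblib  # standalone package; replaces sklearn.externals.joblib

rng = np.random.RandomState(0)
X = rng.rand(200, 64)  # stand-in for the concatenated SURF descriptors

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X)
joblib.dump(km, "km-cluster-surf.pk1")

# Reloading gives back a model with identical predictions.
km2 = joblib.load("km-cluster-surf.pk1")
assert (km2.predict(X) == km.predict(X)).all()
```

Similarly, under Python 3 the `cStringIO.StringIO` and `urllib.urlopen` calls would become `io.BytesIO` and `urllib.request.urlopen`.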
Classification phase
```python
# coding:utf-8
import numpy as np
from sklearn import cluster
from sklearn.externals import joblib
import mahotas as mh
from mahotas import surf
import cStringIO
import urllib

# Parameters
feature_category_num = 512
picture_category_num = 25

# Load the trained model.
km = joblib.load("km-cluster-surf.pk1")

# Read the image URLs from a text file.
url_list = []
list_file = open("list2.txt")
for l in list_file:
    url_list.append(l.rstrip())
list_file.close()

# Image processing: extract SURF descriptors for each image.
base = []
for url in url_list:
    f = cStringIO.StringIO(urllib.urlopen(url).read())
    im = mh.imread(f, as_grey=True)
    im = im.astype(np.uint8)
    base.append(surf.surf(im))

# Turn each image's descriptors into a histogram of visual-word counts.
features = []
for d in base:
    c = km.predict(d)
    features.append(np.array([np.sum(c == ci) for ci in range(feature_category_num)]))
features = np.array(features)

# Cluster the images by their histograms.
km = cluster.KMeans(n_clusters=picture_category_num, verbose=1)
km.fit(features)

# Print the results per category.
url_list = np.array(url_list)
for i in range(picture_category_num):
    print('Image category {0}'.format(i))
    challenge = url_list[km.labels_ == i]
    for c in challenge:
        print(c)
```
To state the conclusion first: the classification itself completed. But I couldn't find meaningful groupings in the results. There are probably various reasons for that, which I'd like to think about next time.