First Deep Learning ~ Preparation ~

Nice to meet you, I'm @best_not_best. The other day I talked about Chainer at an in-house study session and got a surprisingly good response, so I would like to write up the details in this article.

What I want to do

Everyone has a favorite celebrity or two, right? (I'll proceed on the assumption that you do.) But the chances of ever meeting that person face to face are slim. What if there were someone nearby who resembles that person ... and what if you could get to know them ...

Caution

The steps below scrape an internal company site, but this article does not recommend doing so. Please read it purely as a story. I take no responsibility for any damage caused by actually carrying out what is described here. Follow your own company's information security rules and enjoy your work.

Environment

MacBook Pro 15-inch
OS X Yosemite 10.10.5
Python 2.7.9
chainer 1.3.0
lxml 3.4.4
selenium 2.47.1
numpy 1.9.2

Procedure

1. Collect images of employees
2. Cut out the face part of the collected employee images
3. Collect training images (favorite celebrities)
4. Cut out the face part of the training images
5. Create a classifier by training on the images from step 4 with Python + Chainer
6. Have the classifier classify the images from step 2

Practice

1. Collect images of employees

Your company's intranet site probably has an employee search function. Search for some employee there and look at the URL of the employee's image. With luck the employee ID is included in the URL, as in http://hogehoge.co.jp/image/12345.jpg. Depending on the company, the ID may be hashed with MD5 or similar. Either way, work out the relationship between the employee ID and the image URL. (If you can't find one, give up ...)
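As a minimal sketch of that ID-to-URL relationship, the helper below builds an image URL either from the raw ID or from its MD5 hex digest. The function name `image_url` and the `hashed` flag are hypothetical, and hashing the bare ID string without any salt is only an assumption about how such a site might work.

```python
import hashlib

URL_FMT = 'http://hogehoge.co.jp/image/%s.jpg'


def image_url(member_id, hashed=False):
    """Build the image URL for an employee ID (hypothetical helper)."""
    key = str(member_id)
    if hashed:
        # Assumption: the site uses the MD5 hex digest of the ID string
        key = hashlib.md5(key.encode('utf-8')).hexdigest()
    return URL_FMT % key
```

If the URLs on your site contain 32 hexadecimal characters instead of a number, comparing them with `image_url(some_id, hashed=True)` for a known ID is a quick way to check the guess.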

Next, look for a list of employee IDs. If you press the search button without entering anything in the search form, a list may appear. Scrape that list page.

`abstraction_id.py`


#!/usr/bin/env python
# -*- coding: UTF-8 -*-

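import lxml.html
from selenium import webdriver

TARGET_URL = 'http://hogehoge.co.jp/list.html'

# Render the list page with a headless browser and parse the resulting HTML
driver = webdriver.PhantomJS()
driver.get(TARGET_URL)
root = lxml.html.fromstring(driver.page_source)
driver.quit()

# 'p.class' is a placeholder: use the selector of the element holding the ID
links = root.cssselect('p.class')
for link in links:
    if link.text is None:
        continue
    # Keep only entries that consist solely of digits
    if link.text.isdigit():
        print link.text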

Execute it with the following command.

$ python abstraction_id.py > member_id.txt

`TARGET_URL = 'http://hogehoge.co.jp/list.html'` can also be a local file path, so you can save the page first and scrape the saved copy. Pass the selector of the HTML element that contains the employee ID to `root.cssselect()`. In this case the selector matched elements in several parts of the HTML, so the loop filters them with a condition. Here an entry is accepted when it consists only of digits, but replace this with a regular expression as appropriate.
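Swapping the digits-only check for a regular expression could look like the following sketch. The pattern (one uppercase letter followed by five digits) and the helper name `filter_ids` are purely illustrative assumptions; adapt them to your own ID format.

```python
import re

# Hypothetical ID format: one uppercase letter followed by five digits
ID_PATTERN = re.compile(r'^[A-Z][0-9]{5}$')


def filter_ids(texts, pattern=ID_PATTERN):
    """Return the stripped texts that match the pattern in full."""
    ids = []
    for text in texts:
        # Elements without text come back as None, just like link.text
        if text is None:
            continue
        candidate = text.strip()
        if pattern.match(candidate):
            ids.append(candidate)
    return ids
```

In `abstraction_id.py` this would take the place of the `isdigit()` check, for example by calling `filter_ids(link.text for link in links)` and printing each result.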

Next, download the images to your machine using the ID list you obtained.

`image_crawler.py`


#!/usr/bin/env python
# -*- coding: UTF-8 -*-

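from urllib2 import Request, urlopen, URLError
import os
import time

ID_LIST = './member_id.txt'
URL_FMT = 'http://hogehoge.co.jp/image/%s.jpg'
OUTPUT_FMT = './photos/%s.jpg'

# Create the output directory if it does not exist yet
output_dir = os.path.dirname(OUTPUT_FMT)
if not os.path.isdir(output_dir):
    os.makedirs(output_dir)

for line in open(ID_LIST, 'r'):
    member_id = line.strip()
    if not member_id:
        continue
    url = URL_FMT % member_id
    output = OUTPUT_FMT % member_id

    try:
        response = urlopen(Request(url))
    except URLError, e:
        # Report the failure and move on to the next ID
        err = getattr(e, 'reason', getattr(e, 'code', e))
        print 'skipped %s: %s' % (url, err)
    else:
        # Save the response that was already fetched (no second request)
        with open(output, 'wb') as f:
            f.write(response.read())

    # Pause between requests so as not to overload the server
    time.sleep(0.1)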

Execute it with the following command.

$ python image_crawler.py

`time.sleep()` is there just in case, to avoid putting load on the server. `OUTPUT_FMT` determines the directory the images are stored in, so change it as appropriate.
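`urllib2` exists only in Python 2; in Python 3 its functions live in `urllib.request` and `urllib.error`. For readers on Python 3, a sketch of the same download step might look like this. The function names `download` and `crawl` and the injectable `opener` argument are my own additions for illustration, not part of the original script.

```python
import os
import time
from urllib.error import URLError
from urllib.request import urlopen

URL_FMT = 'http://hogehoge.co.jp/image/%s.jpg'


def download(url, path, opener=urlopen):
    """Save the body of url to path; return False if the request fails."""
    try:
        response = opener(url)
    except URLError:
        return False
    with open(path, 'wb') as f:
        f.write(response.read())
    return True


def crawl(ids, output_dir, opener=urlopen, delay=0.1):
    """Download one image per ID and return the IDs that were saved."""
    os.makedirs(output_dir, exist_ok=True)
    saved = []
    for member_id in ids:
        path = os.path.join(output_dir, '%s.jpg' % member_id)
        if download(URL_FMT % member_id, path, opener):
            saved.append(member_id)
        # Pause between requests, as in the original script
        time.sleep(delay)
    return saved
```

Passing the opener as an argument keeps the network access replaceable, so the loop can be tried against a stub before it is pointed at a real server.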

2. Cut out the face part of the collected employee images

I cut out the faces using OpenCV, referring to the following article: Py-opencv Cut out a part of the image and save it (Symfoware)

`cutout_face.py`


#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import os
import cv2

CASCADE_PATH = '/usr/local/opt/opencv/share/OpenCV/haarcascades/haarcascade_frontalface_alt.xml'
INPUT_DIR_PATH = './photos/'
OUTPUT_DIR_PATH = './cutout/'
OUTPUT_FILE_FMT = '%s%s_%d%s'

# cv2.imwrite does not create directories, so make sure the output one exists
if not os.path.isdir(OUTPUT_DIR_PATH):
    os.makedirs(OUTPUT_DIR_PATH)

# Load the cascade classifier once, outside the loop
cascade = cv2.CascadeClassifier(CASCADE_PATH)

for file_name in os.listdir(INPUT_DIR_PATH):
    input_image_path = INPUT_DIR_PATH + file_name

    # Read the file; cv2.imread returns None when it cannot decode the file
    image = cv2.imread(input_image_path)
    if image is None:
        continue

    # Grayscale conversion
    try:
        image_gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    except cv2.error:
        continue

    # Object recognition (face detection)
    facerect = cascade.detectMultiScale(image_gray, scaleFactor=1.1, minNeighbors=1, minSize=(1, 1))

    # Save each detected region as <name>_<n><ext>
    name, ext = os.path.splitext(os.path.basename(file_name))
    for i, rect in enumerate(facerect, 1):
        print rect
        x, y, w, h = rect
        output_image_path = OUTPUT_FILE_FMT % (OUTPUT_DIR_PATH, name, i, ext)
        # Rows (y) come first when slicing the image array
        cv2.imwrite(output_image_path, image[y:y+h, x:x+w])

Execute it with the following command.

$ python cutout_face.py

`INPUT_DIR_PATH` is the directory the images were stored in during the previous section, and `OUTPUT_DIR_PATH` is the directory the cropped files are saved to, so change them as appropriate. If you get `ImportError: No module named cv2`, I think it can be avoided by rewriting

import cv2

as

import sys
sys.path.append('/usr/local/lib/python2.7/site-packages')
import cv2
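Rather than always extending `sys.path`, the fallback can be applied only when the plain import fails. The sketch below shows that pattern with a hypothetical helper `import_with_fallback`; the site-packages path is the one quoted above and will differ between installations.

```python
import importlib
import sys


def import_with_fallback(name, extra_path):
    """Import a module, retrying once with extra_path added to sys.path."""
    try:
        return importlib.import_module(name)
    except ImportError:
        # Only touch sys.path when the normal import did not work
        if extra_path not in sys.path:
            sys.path.append(extra_path)
        return importlib.import_module(name)
```

In `cutout_face.py` it would be used as `cv2 = import_with_fallback('cv2', '/usr/local/lib/python2.7/site-packages')`. If the module is not in the extra directory either, the second import raises `ImportError` as usual.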

The face is cut out correctly in most images, but in some cases a necktie gets recognized as a face. This is an issue for the future.
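As a sketch of one possible mitigation, which the original script does not do, you could keep only the largest detected rectangle per photo. This rests on the assumption that each employee photo contains a single face and that the face is larger than a false detection such as a tie. The helper name `largest_rect` is hypothetical.

```python
def largest_rect(rects):
    """Return the (x, y, w, h) rectangle with the largest area, or None."""
    best = None
    best_area = 0
    for rect in rects:
        x, y, w, h = rect
        area = w * h
        # Keep the first rectangle that beats the current best area
        if area > best_area:
            best = (x, y, w, h)
            best_area = area
    return best
```

In `cutout_face.py` this would be applied to `facerect` before cropping. Raising `minNeighbors` or `minSize` in `detectMultiScale` is another knob that typically reduces false detections, at the cost of missing some faces.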

Next time

That's all for this time. (Sorry for stopping halfway ...) To be continued in the day 21 article of the Intelligence Advent Calendar 2015!

Postscript

Solved! → First Deep Learning ~ Solution ~ (Qiita)