Nice to meet you, I'm @best_not_best. The other day I gave a talk about Chainer at an in-house study session, and it got a surprisingly good response. This article summarizes what I covered.
You all have a favorite celebrity or two, right? (I'll proceed on the assumption that you do.) But the chances of actually meeting that person are slim. If only there were someone close to you who looked like them ... and if only you could get to know that person ...
The steps below scrape our internal company site, but this article does not endorse doing so. Please read it purely as a story. I take no responsibility for any damage caused by actually doing what is described here. Follow your own company's information security rules and enjoy your work.
I think your company's intranet site has an employee search function. Search for any employee there and look up the URL of that employee's photo. Ideally the employee ID is included in the URL, as in `http://hogehoge.co.jp/image/12345.jpg`.
Depending on the company, the ID may be hashed with MD5 or something similar. Either way, work out the relationship between the employee ID and the image URL. (If you can't find one, give up ...)
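The relationship between an ID and its image URL can be sketched as follows. This is only a minimal sketch in Python 3: `image_url` is a hypothetical helper, the URL pattern is the placeholder used above, and MD5 over the plain ID string is just one assumption about how a site might obscure the ID. You have to confirm the real rule by inspecting your own site.

```python
import hashlib

# Placeholder URL pattern from the article; the real one depends on your site.
URL_FMT = 'http://hogehoge.co.jp/image/%s.jpg'


def image_url(employee_id, hashed=False):
    """Build an image URL from an employee ID (hypothetical helper).

    With hashed=True the ID is replaced by the MD5 hex digest of its
    string form, which is only one possible way a site might hide the ID.
    """
    key = str(employee_id)
    if hashed:
        key = hashlib.md5(key.encode('utf-8')).hexdigest()
    return URL_FMT % key


print(image_url(12345))
print(image_url(12345, hashed=True))
```

If neither variant matches the URLs you actually see, the ID is probably combined with something else (a salt, a different hash), and guessing further is unlikely to pay off.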
Next, look for a list of employee IDs. If you press the search button with the search form left empty, a full list may appear. Scrape that list page.
abstraction_id.py
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import lxml.html
from selenium import webdriver

TARGET_URL = 'http://hogehoge.co.jp/list.html'

# Render the page in a headless browser and parse the resulting HTML
driver = webdriver.PhantomJS()
driver.get(TARGET_URL)
root = lxml.html.fromstring(driver.page_source)

# Select the elements that may contain an employee ID
links = root.cssselect('p.class')
for link in links:
    if link.text is None:
        continue
    # Keep only text made up entirely of digits
    if link.text.isdigit():
        print(link.text)
Execute it with the following command.
$ python abstraction_id.py > member_id.txt
`TARGET_URL = 'http://hogehoge.co.jp/list.html'` can also point to a local file, so you can save the page first and scrape the saved copy.
Pass `root.cssselect()` the selector for the HTML element that contains the employee ID.
In this case, matching elements appeared in several places in the HTML, so the script filters them with a condition.
It treats text consisting only of digits as an employee ID; replace the check with a regular expression as appropriate.
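Swapping the digit check for a regular expression might look like this. It is a minimal sketch in Python 3 that assumes, purely for illustration, that employee IDs are exactly five digits; `ID_PATTERN` and `extract_ids` are hypothetical names, and the pattern has to be adjusted to your company's ID format.

```python
import re

# Assumption: employee IDs are exactly five digits; adjust to your format.
ID_PATTERN = re.compile(r'\d{5}')


def extract_ids(texts):
    """Return the stripped strings that match ID_PATTERN in full."""
    ids = []
    for text in texts:
        if text is None:
            continue  # elements without text are skipped
        candidate = text.strip()
        if ID_PATTERN.fullmatch(candidate):
            ids.append(candidate)
    return ids


print(extract_ids(['12345', None, 'Sales', ' 00042 ', '123456']))
# prints ['12345', '00042']
```

Using `fullmatch` rather than `search` matters here: it rejects longer numbers such as `123456` that merely contain five digits.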
Use the acquired ID list to download the images to your local machine.
image_crawler.py
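#!/usr/bin/env python
# -*- coding: UTF-8 -*-
from urllib2 import Request, urlopen, URLError
import os
import time

ID_LIST = './member_id.txt'
URL_FMT = 'http://hogehoge.co.jp/image/%s.jpg'
OUTPUT_FMT = './photos/%s.jpg'

# Create the output directory if it does not exist yet
output_dir = os.path.dirname(OUTPUT_FMT)
if not os.path.isdir(output_dir):
    os.makedirs(output_dir)

for line in open(ID_LIST, 'r'):
    member_id = line.strip()
    url = URL_FMT % member_id
    output = OUTPUT_FMT % member_id
    req = Request(url)
    try:
        response = urlopen(req)
    except URLError as e:
        # Report the failure and move on to the next ID
        if hasattr(e, 'reason'):
            err = e.reason
        elif hasattr(e, 'code'):
            err = e.code
        else:
            err = e
        print('%s: %s' % (url, err))
    else:
        # Reuse the response instead of requesting the image a second time
        with open(output, 'wb') as f:
            f.write(response.read())
    # Wait between requests
    time.sleep(0.1)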
Execute it with the following command.
$ python image_crawler.py
Just in case, the script calls `time.sleep()` between requests.
`OUTPUT_FMT` determines where the images are saved, so set it as appropriate.
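The same download loop can be sketched for Python 3 with the network access factored out. This is a sketch under stated assumptions, not the script above: `download_images` and its `fetch` parameter are hypothetical names, and `fetch` stands for any callable that returns the image bytes for an ID or raises `OSError` (in Python 3, `urllib.error.URLError` is a subclass of `OSError`).

```python
import os
import time


def download_images(ids, fetch, out_dir, delay=0.1):
    """Save fetch(id) as <out_dir>/<id>.jpg for every ID, pausing in between.

    Returns the list of IDs that could not be downloaded.
    """
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for raw in ids:
        member_id = raw.strip()
        if not member_id:
            continue  # skip blank lines in the ID list
        try:
            data = fetch(member_id)
        except OSError:
            failed.append(member_id)
        else:
            with open(os.path.join(out_dir, member_id + '.jpg'), 'wb') as f:
                f.write(data)
        time.sleep(delay)  # pause between requests
    return failed
```

For real use, `fetch` could be something like `lambda i: urllib.request.urlopen(URL_FMT % i).read()`, and `ids` could be the open `member_id.txt` file. Keeping the fetcher separate also makes it easy to try the loop with a fake function before touching the network.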
Next, cut out the face regions using OpenCV. I referred to the following article: Py-opencv Cut out a part of the image and save it - Symfoware
cutout_face.py
#!/usr/bin/env python
# -*- coding: UTF-8 -*-
import numpy
import os
import cv2

CASCADE_PATH = '/usr/local/opt/opencv/share/OpenCV/haarcascades/haarcascade_frontalface_alt.xml'
INPUT_DIR_PATH = './photos/'
OUTPUT_DIR_PATH = './cutout/'
OUTPUT_FILE_FMT = '%s%s_%d%s'
COLOR = (255, 255, 255)

# Create the output directory if it does not exist yet
if not os.path.isdir(OUTPUT_DIR_PATH):
    os.makedirs(OUTPUT_DIR_PATH)

# Load the cascade classifier once, before the loop
cascade = cv2.CascadeClassifier(CASCADE_PATH)

files = os.listdir(INPUT_DIR_PATH)
for file in files:
    input_image_path = INPUT_DIR_PATH + file
    # Read the file
    image = cv2.imread(input_image_path)
    # Convert to grayscale; files that cannot be converted are skipped
    try:
        image_gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    except cv2.error:
        continue
    # Run object detection (face detection)
    facerect = cascade.detectMultiScale(image_gray, scaleFactor=1.1, minNeighbors=1, minSize=(1, 1))
    if len(facerect) > 0:
        # Save each detected region as its own file
        i = 1
        for rect in facerect:
            print(rect)
            x = rect[0]
            y = rect[1]
            w = rect[2]
            h = rect[3]
            path, ext = os.path.splitext(os.path.basename(file))
            output_image_path = OUTPUT_FILE_FMT % (OUTPUT_DIR_PATH, path, i, ext)
            cv2.imwrite(output_image_path, image[y:y+h, x:x+w])
            i += 1
Execute it with the following command.
$ python cutout_face.py
`INPUT_DIR_PATH` is the directory where the images were saved in the previous section, and `OUTPUT_DIR_PATH` is the directory where the cut-out files are saved, so set them as appropriate.
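How the output file name is put together can be sketched on its own. This is a minimal illustration of the naming scheme, with `cutout_path` as a hypothetical helper name; it produces the same `<name>_<index><extension>` pattern as `OUTPUT_FILE_FMT` but uses `os.path.join`, so the directory does not need a trailing slash.

```python
import os


def cutout_path(out_dir, src_name, index):
    """Return <out_dir>/<stem>_<index><ext> for the index-th face of src_name."""
    stem, ext = os.path.splitext(os.path.basename(src_name))
    return os.path.join(out_dir, '%s_%d%s' % (stem, index, ext))


print(cutout_path('./cutout/', '12345.jpg', 1))
# prints ./cutout/12345_1.jpg
```

Because the index is part of the name, a photo in which several regions are detected yields several files instead of overwriting one.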
If you get `ImportError: No module named cv2`, I think you can avoid it by rewriting
import cv2
to
import sys
sys.path.append('/usr/local/lib/python2.7/site-packages')
import cv2
I think the face can be cut out from most images, but in some cases another region, such as a necktie, is recognized as a face. This is an issue for the future.
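One possible way to reduce such false positives, which is not what the script above does, is to keep only the largest detected rectangle per photo. This is a heuristic sketch that assumes each employee photo contains exactly one face and that the face is the largest detection; `largest_rect` is a hypothetical helper and the numbers below are made up.

```python
def largest_rect(rects):
    """Return the (x, y, w, h) rectangle with the largest area, or None."""
    best = None
    for x, y, w, h in rects:
        if best is None or w * h > best[2] * best[3]:
            best = (x, y, w, h)
    return best


# A face and a smaller false positive lower in the frame (made-up numbers).
detections = [(60, 40, 120, 120), (95, 210, 30, 30)]
print(largest_rect(detections))
# prints (60, 40, 120, 120)
```

Raising `minNeighbors` or `minSize` in `detectMultiScale` is another knob worth trying, since the values used above (`1` and `(1, 1)`) accept even weak, tiny detections.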
That's all for this time. (Sorry to stop halfway ...) The story continues in the day-21 article of the Intelligence Advent Calendar 2015!
Solved! → First Deep Learning ~ Solution ~ - Qiita