BOOTH is known as a marketplace where many avatars are sold. As of December 9, 2019, the "3D model" tag contains 11,527 items. Of course, this is not the number of avatars as it stands, because the tag also includes many materials unrelated to avatars. The VRC model database published by KingYoSun has about 1,600 models registered, and I think that is the most reasonable figure at the moment.
Can avatars be told apart from other items by their thumbnail images? Face recognition seems like one way to do it, but is it also possible to extract just the face as a separate image?
So I start with scraping. Posting code that works as-is with copy and paste would be a problem, so only the URL is hidden.
import urllib.request as ur
from bs4 import BeautifulSoup
import requests
images = []
def img_save(img_url, title):
    # The "data/" and "labeled_data/" directories must already exist.
    url = img_url
    file_name = str(len(images)) + ".jpg"
    labeled_name = str(len(images)) + "___" + title + ".jpg"
    response = requests.get(url)
    image = response.content
    # Saved under the serial number only
    with open("data/" + file_name, "wb") as o:
        o.write(image)
    # Saved under the serial number plus the title
    with open("labeled_data/" + labeled_name, "wb") as o:
        o.write(image)

def img_search(url_data):
    # Open one item page and save its first thumbnail image.
    url = url_data
    html = ur.urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    title = str(soup.title.text)
    # Remove characters that cause trouble in file names, and the site suffix
    char_list = ["/", "'", '"', "*", "|", "<", ">", "?", "\\", " - BOOTH"]
    for c in char_list:
        title = title.replace(c, "")
    print(title)
    for s in soup.find_all("img"):
        if str(s).find("market") > 0:
            img_url = s.get("src")
            if img_url is not None:
                print(img_url)
                images.append(img_url)
                img_save(img_url, title)
                # Only the first matching image is needed
                break

def page_access(page_number):
    # page_number is the URL of one search result page.
    url = page_number
    html = ur.urlopen(url)
    soup = BeautifulSoup(html, "html.parser")
    for s in soup.find_all("a"):
        if str(s).find("item-card__title-anchor") > 0:
            print(s.get("href"))
            url = s.get("href")
            img_search(url)

for i in range(1, 240):
    url = "***I can't put it***" + str(i)
    page_access(url)
The result obtained in this way is as follows.
There are about 11,000 images.
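The title cleanup in img_search above removes a fixed list of characters one at a time. As a possible alternative, here is a minimal sketch that does the same job with a regular expression. The name `sanitize_title`, the `max_length` limit, and the "untitled" fallback are my own assumptions and are not part of the original code. It also removes the colon, which is not allowed in Windows file names, and it keeps apostrophes, unlike the original list.

```python
import re

# Characters not allowed in Windows file names, plus control characters.
_INVALID_CHARS = re.compile(r'[\\/:*?"<>|\x00-\x1f]')
_SUFFIX = " - BOOTH"  # page-title suffix that the original code also removes


def sanitize_title(title, max_length=100):
    """Return a version of title that is safe to use inside a file name.

    Hypothetical helper: max_length and the "untitled" fallback are
    illustrative choices, not taken from the article.
    """
    if title.endswith(_SUFFIX):
        title = title[: -len(_SUFFIX)]
    title = _INVALID_CHARS.sub("", title)
    # Leading or trailing spaces and dots also cause trouble on Windows.
    title = title.strip(" .")
    return title[:max_length] or "untitled"


print(sanitize_title('Sample: "Avatar" <v1.0> - BOOTH'))
```

In img_search, the loop over char_list could then be replaced by a single call such as `title = sanitize_title(title)`.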
Face detection is performed using the OpenCV library.
import cv2
sample = 11000
# Load the cascade once instead of on every iteration
cascade = cv2.CascadeClassifier("haarcascade_frontalface_default.xml")
color = (0, 0, 0)
for i in range(sample):
    file_name = 'data/' + str(i+1) + '.jpg'
    img = cv2.imread(file_name)
    if img is None:
        # imread returns None when the file is missing or unreadable
        continue
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(img_gray, minSize=(100, 100))
    print(faces)
    if len(faces) > 0:
        # Draw a rectangle around every detected face
        for rect in faces:
            cv2.rectangle(img, tuple(rect[0:2]), tuple(rect[0:2]+rect[2:4]), color, thickness=10)
        output_path = "face_detect/" + str(i+1) + ".jpg"
        cv2.imwrite(output_path, img)
The face detection model needs to be downloaded separately and placed locally. In the code above it is haarcascade_frontalface_default.xml. You can download it from the OpenCV GitHub repository.
The result is below.
The accuracy is not good at all! It misses faces, or on the contrary detects something else as a face.
This is because the face detection model assumes photographs of real faces. When I searched, I found that someone had created a model for anime face detection. Are they a god? So I'll try again.
import cv2
sample = 11000
# Load the cascade once instead of on every iteration
cascade = cv2.CascadeClassifier("lbpcascade_animeface.xml")  # This line changed
color = (0, 0, 0)
for i in range(sample):
    file_name = 'data/' + str(i+1) + '.jpg'
    img = cv2.imread(file_name)
    if img is None:
        # imread returns None when the file is missing or unreadable
        continue
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(img_gray, minSize=(100, 100))
    print(faces)
    if len(faces) > 0:
        # Draw a rectangle around every detected face
        for rect in faces:
            cv2.rectangle(img, tuple(rect[0:2]), tuple(rect[0:2]+rect[2:4]), color, thickness=10)
        output_path = "face_detect/real" + str(i+1) + ".jpg"
        cv2.imwrite(output_path, img)
Execution result.
The accuracy is too high!
Crop the faces based on this detection result.
import cv2
sample = 11000
count = 1
# Load the cascade once instead of on every iteration
classifier = cv2.CascadeClassifier("lbpcascade_animeface.xml")
for i in range(sample):
    file_name = 'data/' + str(i+1) + '.jpg'
    img = cv2.imread(file_name)
    if img is None:
        # imread returns None when the file is missing or unreadable
        continue
    img_gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    faces = classifier.detectMultiScale(img_gray, minSize=(100, 100))
    print(faces)
    if len(faces) > 0:
        # Save each detected face as its own image, numbered by count
        for x, y, w, h in faces:
            face_image = img[y:y+h, x:x+w]
            output_path = 'face_trim/' + str(count) + '.jpg'
            cv2.imwrite(output_path, face_image)
            count += 1
Execution result.
...... I'm dizzy because there are too many avatars.
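The cropping code above cuts out exactly the detected rectangle. If that turns out to be tighter than you want (for example, if you would like some hair around the face), one option is to enlarge the box before slicing. The following is only a sketch: `expand_box` and its `margin` parameter are hypothetical names I made up, and the 20% default is an arbitrary choice.

```python
def expand_box(x, y, w, h, img_w, img_h, margin=0.2):
    """Grow a face box by margin (a fraction of its size) on every side.

    The result is clamped to the image so it can be used directly for
    slicing. Returns (left, top, right, bottom).
    """
    dx = int(w * margin)
    dy = int(h * margin)
    left = max(x - dx, 0)
    top = max(y - dy, 0)
    right = min(x + w + dx, img_w)
    bottom = min(y + h + dy, img_h)
    return left, top, right, bottom


# Example with made-up numbers: a 100x100 box near the top-left corner
left, top, right, bottom = expand_box(10, 20, 100, 100, img_w=620, img_h=620)
print(left, top, right, bottom)
```

With an OpenCV image, the crop would then be `img[top:bottom, left:right]`, where `img_h, img_w = img.shape[:2]`.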
I now have a lot of face icons, but the method I tried the other day could only produce ghost-like images, so I am not using methods such as GANs this time. It does not look like an interesting picture would come out. I will study more.
Approximately 3,000 images were generated, but since some thumbnails contain multiple faces, and a good number of costumes are also sold (whose thumbnails also contain faces), the actual number of avatars should be lower. About half of that, roughly the 1,600 mentioned at the beginning, seems a reasonable figure. I thought it would be interesting to combine this with character recognition (thumbnails contain a lot of sales copy), but I will leave that as future work.
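One way to separate "number of faces" from "number of thumbnails that contain a face" would be to record how many faces were detected per thumbnail and count both. This is a small sketch of that bookkeeping only. `summarize` is a hypothetical helper, and the counts in the example are made up for illustration, not results from this experiment.

```python
def summarize(detections):
    """Summarize face detections.

    detections maps a thumbnail number to the number of faces detected in it.
    Returns (total number of faces, number of thumbnails with a face).
    """
    total_faces = sum(detections.values())
    thumbnails_with_faces = sum(1 for n in detections.values() if n > 0)
    return total_faces, thumbnails_with_faces


# Made-up counts for five thumbnails
example = {1: 0, 2: 1, 3: 3, 4: 0, 5: 2}
print(summarize(example))  # (6, 3)
```

In the cropping loop, the dictionary could be filled with `detections[i+1] = len(faces)` for each image. Costume thumbnails would still be counted, so this only removes the double counting caused by multiple faces in one thumbnail.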
Also, it would be interesting to create a web service that displays only faces at random, making it easy to find avatars with faces you like among the large number of avatars for sale.
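For such a service, each cropped face would have to be traceable back to the thumbnail it came from, which the sequential numbering in face_trim/ does not record by itself. The sketch below shows one possible way to keep that mapping and pick faces at random. `build_index` and `pick_random_faces` are hypothetical names, and the input format is an assumption of mine; only the face_trim/<count>.jpg naming follows the cropping code above.

```python
import random


def build_index(faces_per_thumbnail):
    """Map each crop file name to the thumbnail number it came from.

    faces_per_thumbnail is a list of (thumbnail_number, face_count) pairs
    in processing order, mirroring how the crop loop increments count.
    """
    index = {}
    count = 1
    for thumbnail_number, face_count in faces_per_thumbnail:
        for _ in range(face_count):
            index["face_trim/" + str(count) + ".jpg"] = thumbnail_number
            count += 1
    return index


def pick_random_faces(index, k, seed=None):
    """Return up to k random (crop_file, thumbnail_number) pairs."""
    rng = random.Random(seed)
    names = sorted(index)
    chosen = rng.sample(names, min(k, len(names)))
    return [(name, index[name]) for name in chosen]


# Made-up example: thumbnail 2 has two faces, thumbnail 3 has one
index = build_index([(1, 0), (2, 2), (3, 1)])
print(index)
print(pick_random_faces(index, 2, seed=0))
```

A web page could show the chosen crops and link each one to the item whose thumbnail number is stored next to it.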
[Explanation for beginners] OpenCV face detection mechanism and practice (detectMultiScale)
Anime face detection with OpenCV