I wanted to use pictures of the characters' faces when plotting the results of my Aikatsu! series analysis, but there are a lot of characters and doing it by hand is tedious. So I scraped the character pages with Beautiful Soup, took screenshots with Selenium, and automated everything up to cropping each character's face out of the screenshot with OpenCV.
from urllib import request
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import time
import os
import shutil
import itertools
#OpenCV does not allow Japanese filenames, so load the mapping file
df=pd.read_csv("C:/XXXX/aikatsu_name_romaji_mapping.tsv", sep='\t', engine='python', encoding="utf-8")
#Load Chrome driver
driver = webdriver.Chrome("C:/XXXX/chromedriver/chromedriver.exe")
#Tuple of URLs to scrape
character_urls =(
"http://www.aikatsu.net/01/character/index.html",
"http://www.aikatsu.net/02/character/index.html",
"http://www.aikatsu.net/03/character/index.html",
"http://www.aikatsu.net/aikatsustars_01/character/index.html",
"http://www.aikatsu.net/aikatsustars_02/character/index.html",
"http://www.aikatsu.net/aikatsufriends_01/character/",
"http://www.aikatsu.net/aikatsufriends_02/character/",
"http://www.aikatsu.net/character/"
)
#Create the directory for storing the screenshots (recreate it if it already exists)
target_dir = "C:/XXXX/download/"
if os.path.isdir(target_dir):
    shutil.rmtree(target_dir)
    time.sleep(1)
os.mkdir(target_dir)
In hindsight, the directory-creation part would have been better as a function.
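As a rough sketch of what such a function could look like: `reset_dir` is a hypothetical name, not something from the original script, and the one-second sleep the script uses after deleting is left out here. If deleting and recreating in quick succession causes trouble on Windows, the sleep may need to come back.

```python
import os
import shutil
import tempfile


def reset_dir(path):
    """Delete `path` if it exists, then create it empty (hypothetical helper)."""
    if os.path.isdir(path):
        shutil.rmtree(path)
    os.makedirs(path)
    return path


# Usage: a temporary location stands in for target_dir here.
base = tempfile.mkdtemp()
demo_dir = reset_dir(os.path.join(base, "download"))
open(os.path.join(demo_dir, "old.png"), "w").close()
reset_dir(demo_dir)  # the stale file is gone afterwards
remaining = os.listdir(demo_dir)
```

The same helper could then be called once for the screenshot directory and once for the directory that holds the cropped faces.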
The mapping is as simple as this.
I used pandas only because there are about 67 target characters, so there is no need for a database or anything elaborate, and I could put it together quickly with what I already know.
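For a table this small, the lookup itself is just a name-to-romaji dictionary. The following is a minimal sketch that uses the standard `csv` module instead of pandas, purely so it runs without extra packages. The column names `character` and `romaji` are the ones the script uses; the two sample rows and the helper names `load_mapping` and `to_romaji` are made up for illustration.

```python
import csv
import io

# Made-up sample rows; the real file has about 67 characters.
SAMPLE_TSV = (
    "character\tromaji\n"
    "星宮いちご\thoshimiya_ichigo\n"
    "友希あいね\tyuki_aine\n"
)


def load_mapping(tsv_file):
    """Return {character name: romaji} from a tab-separated file object."""
    reader = csv.DictReader(tsv_file, delimiter="\t")
    return {row["character"]: row["romaji"] for row in reader}


def to_romaji(mapping, name):
    """Look up a name, failing with a readable message when it is missing."""
    try:
        return mapping[name]
    except KeyError:
        raise KeyError("no romaji mapping for {!r}".format(name)) from None


mapping = load_mapping(io.StringIO(SAMPLE_TSV))
ichigo = to_romaji(mapping, "星宮いちご")
```

A missing name then fails with a message that names the character, which is easier to act on than the bare IndexError that `.values[0]` raises on an empty selection.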
When using Selenium, the driver location is usually set through an environment variable, but since this is a throwaway tool it does not need to be that elaborate. So, referring to the following article, I hard-coded the driver path. Introduction to Selenium starting with just 3 lines of python
In addition, the following error occurred at the time of execution.
WebDriverError: unknown error: Runtime.executionContextCreated has invalid
This happens because the driver version differs from the Chrome version. Matching the chromedriver version to the installed Chrome version fixes it.
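If a session starts at all, the two versions can be compared in code. The sketch below assumes the capabilities dictionary (`driver.capabilities` in Selenium) reports the browser under `browserVersion` (or `version` in older sessions) and the driver under `chrome` / `chromedriverVersion`. Those key names are an assumption about what chromedriver reports, and the example values are hand-written, not output from a real session.

```python
def major(version):
    """Return the leading number of a version string such as '80.0.3987.16'."""
    return int(version.split(".")[0])


def versions_match(capabilities):
    """Compare Chrome and chromedriver major versions from a capabilities dict."""
    browser = capabilities.get("browserVersion") or capabilities.get("version")
    # The driver string may carry build info after a space; keep the number only.
    chromedriver = capabilities["chrome"]["chromedriverVersion"].split(" ")[0]
    return major(browser) == major(chromedriver)


# Hand-written example values (assumed key layout).
example_caps = {
    "browserVersion": "80.0.3987.132",
    "chrome": {"chromedriverVersion": "79.0.3945.36 (hypothetical build info)"},
}
mismatch_detected = not versions_match(example_caps)
```

With a live driver, the call would be `versions_match(driver.capabilities)`.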
for character_url in character_urls:
    html = request.urlopen(character_url)
    soup = BeautifulSoup(html, "html.parser")
    #Get information about each character
    characters=soup.find_all("a")
    idol_names = [i.find('img') for i in characters]
    urls = [i.get('href') for i in characters]
    character_url_prefix=character_url.split("index.html")
    for i, j in zip(idol_names, urls):
        #Skip links that have no img tag, since there is no alt text to read
        if i is None:
            continue
        #Skip links that are not character pages
        if j.startswith("http") or j.startswith("../") or j.startswith("index"):
            continue
        #Remove spaces from the name (the first replace presumably targeted the full-width space U+3000)
        idol_name = i.get("alt").replace("\u3000","").replace(" ","")
        print(idol_name)
        #Load the page in Selenium and adjust the window size and zoom
        driver.get(character_url_prefix[0]+j)
        driver.set_window_size(1250, 1036)
        driver.execute_script("document.body.style.zoom='90%'")
        #Shirayuri Kaguya has empty alt text, so set a fixed value
        if idol_name == "":
            idol_name = "Shirayuri Kaguya"
        #OpenCV cannot read Japanese file names, so convert the name to romaji
        idol_name_romaji = df[df["character"]==idol_name]["romaji"].values[0]
        file_name="{}{}.png".format(target_dir, idol_name_romaji)
        #If a file with the same name already exists, add a number to the name
        if os.path.exists(file_name):
            for n in itertools.count(1):
                newname = '{} ({})'.format(idol_name_romaji, n)
                file_name="{}{}.png".format(target_dir, newname)
                #Stop at the first name that is not taken yet
                if not os.path.exists(file_name):
                    break
        #Sleep a little longer so the page finishes loading before the screenshot
        time.sleep(5)
        driver.save_screenshot(file_name)
driver.quit()
You can get the data like this. Originally I put the Japanese name from the alt attribute into the file name, but OpenCV cannot read such files, so I deliberately convert the names to romaji. (The romanization is rough, so it may contain mistakes.)
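The script also numbers a screenshot when a file with the same romaji name already exists. That step can be pulled out on its own, roughly as follows. `unique_png_path` is a hypothetical helper name and the romaji stem is made up; the numbering format `name (1).png` is the one the script uses.

```python
import itertools
import os
import tempfile


def unique_png_path(directory, stem):
    """Return 'stem.png', or 'stem (1).png', 'stem (2).png', ... if taken."""
    candidate = os.path.join(directory, "{}.png".format(stem))
    if not os.path.exists(candidate):
        return candidate
    for n in itertools.count(1):
        candidate = os.path.join(directory, "{} ({}).png".format(stem, n))
        if not os.path.exists(candidate):
            return candidate


# Usage with a temporary directory and a made-up romaji name.
demo_dir = tempfile.mkdtemp()
first = unique_png_path(demo_dir, "hoshimiya_ichigo")
open(first, "w").close()
second = unique_png_path(demo_dir, "hoshimiya_ichigo")
open(second, "w").close()
third = unique_png_path(demo_dir, "hoshimiya_ichigo")
names = [os.path.basename(p) for p in (first, second, third)]
```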
Also, the list of character names (idol_names) contains None for links that have no image, as shown below. Because each character's URL and name are looped over together with zip, both lists must have the same number of elements, so I filter out the unwanted entries inside the loop rather than before it.
[<img alt="Aikatsu on Parade!" src="../images/logo.png"/>,
<img alt="Aikatsu on Parade! communication" src="../images/bt-aikatsuonparadecom.png"/>,
<img alt="Aikatsu on Parade! What is" src="../images/bt-aikatsuonparade.png"/>,
<img alt="Broadcast information" src="../images/bt-tvinfo.png"/>,
<img alt="character" src="../images/bt-character.png"/>,
<img alt="Story" src="../images/bt-story.png"/>,
<img alt="CD" src="../images/bt-cd.png"/>,
<img alt="BD/DVD" src="../images/bt-bddvd.png"/>,
<img alt="NEWS" src="../images/bt-news.png"/>,
<img alt="TOP" src="../images/bt-top.png"/>,
<img alt="Raki Kiseki" src="images/bt-raki.png"/>,
<img alt="Yuki Aine" src="images/bt-aine.png"/>,
<img alt="Mio Minato" src="images/bt-mio.png"/>,
<img alt="Hoshimiya Ichigo" src="images/bt-ichigo.png"/>,
<img alt="Akari Ozora" src="images/bt-akari.png"/>,
<img alt="Yume Nijino" src="images/bt-yume.png"/>,
<img alt="BANDAINAMCO Pictures" height="53" src="../images/bnp.png" width="118"/>,
None]
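The filtering applied to this list can be sketched without Beautiful Soup, using plain strings in place of the img tags. The skip rules (no image, or an href starting with `http`, `../` or `index`) are the ones the script uses. The function names are hypothetical, and the hrefs are made up because the listing shows only the images.

```python
def is_character_link(href):
    """Mirror the script's rule: keep only relative links to character pages."""
    return not href.startswith(("http", "../", "index"))


def character_pages(alts, hrefs):
    """Pair alt texts with hrefs and drop the entries the script skips."""
    pages = []
    for alt, href in zip(alts, hrefs):
        if alt is None:  # link without an <img>, so no alt text
            continue
        if not is_character_link(href):
            continue
        pages.append((alt, href))
    return pages


# Alt texts taken from the listing above; the hrefs are made up.
alts = ["TOP", "Raki Kiseki", "Yuki Aine", None]
hrefs = ["../index.html", "raki.html", "aine.html", "#"]
pages = character_pages(alts, hrefs)
```

Filtering inside the loop keeps each name paired with its own URL, which is the reason the two lists are not filtered separately beforehand.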
import os
import shutil
import time
import cv2
from pathlib import Path
#Create the output directory (recreate it if it already exists)
download_dir = '{0}parse/'.format(target_dir)
if os.path.isdir(download_dir):
    shutil.rmtree(download_dir)
    time.sleep(1)
os.mkdir(download_dir)
#Create a classifier from the cascade feature file
classifier = cv2.CascadeClassifier('C:/XXX/lbpcascade_animeface.xml')
#Get the screenshots in the scraped directory
p = Path(target_dir)
for png_path in list(p.glob("*.png")):
    #Load the screenshot
    image = cv2.imread(png_path.as_posix())
    #Create one directory per screenshot, named after the file without its extension
    parse_dir = '{0}{1}/'.format(download_dir, png_path.stem)
    os.mkdir(parse_dir)
    #Convert to grayscale and detect faces
    gray_image = cv2.cvtColor(image,cv2.COLOR_BGR2GRAY)
    faces = classifier.detectMultiScale(gray_image)
    for n, (x,y,w,h) in enumerate(faces):
        #Cut out the faces one by one. Extend the top edge by 50 px to make the crop rectangular
        #max() keeps the start index from going negative near the top of the image
        face_image = image[max(y-50, 0):y+h, x:x+w]
        output_path = '{0}{1}.png'.format(parse_dir, n)
        #Write the cropped face
        cv2.imwrite(output_path ,face_image)
I wanted the crops to be rectangular (taller than wide), so I only tweaked the coordinates a little; otherwise the code is the same as in the following article. Anime face detection with OpenCV
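One thing to watch with this coordinate tweak: subtracting 50 from y goes negative when a face sits near the top of the screenshot, and a negative start index in a NumPy slice counts from the end of the array, which can produce an empty or wrong crop. A minimal sketch of clamping the box to the image follows. `crop_bounds` is a hypothetical helper, and the box and image size in the usage lines are made up.

```python
def crop_bounds(x, y, w, h, image_height, image_width, pad_top=50):
    """Return (top, bottom, left, right) for a face box extended upward.

    pad_top=50 mirrors the script's y-50; all values are clamped to the image.
    """
    top = max(y - pad_top, 0)
    bottom = min(y + h, image_height)
    left = max(x, 0)
    right = min(x + w, image_width)
    return top, bottom, left, right


# A face detected 20 px from the top of a made-up 1036 x 1250 image.
top, bottom, left, right = crop_bounds(400, 20, 120, 120, 1036, 1250)
# With a NumPy image this would be used as image[top:bottom, left:right].
```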
You can get the faces like this. Because of their poses, some characters were not detected by the classifier. Since only a few were missed, I will probably have to crop those by hand.