Scraping Wikipedia to see how many years into their voice-acting careers the cast joined the "Aikatsu!" series

This is the day-18 article of the Aikatsu! Advent Calendar 2019. Yesterday's article was "Aikatsu! Let Spotify tell you the musical features of the song" by gecko655.

What I want to do

I think the voice actors are one of the elements that support the "Aikatsu!" series. To see how experienced they were when they became "Aikatsu!" voice actors, I scrape Wikipedia to find out in which year of their career each of them was first cast in a role in the "Aikatsu!" series.

The implementation is roughly divided into two steps.

1. Obtain the voice actor names listed in the character information of the anime.
2. Acquire and process the cast information for the voice actor names acquired in step 1.

Implementation

1. Obtain the voice actor name listed in the character information of the anime

import re
import urllib.parse
from collections import OrderedDict

import requests
from bs4 import BeautifulSoup

# Get voice actor names from a Wikipedia character-list page
def get_voice_actor(soup):
    list_voice_actors_tmp = []
    # Everything from the first match of these patterns onward is cut off:
    # singing-voice credit, footnote marker, parenthetical note, line break
    replace_list = ['/歌-', r'\[', '（', '\n']
    # Separators between several cast members of one role (recast or shared role)
    split_pattern = '[→、]'

    for target in [i.text for i in soup.find_all('dd')]:
        # Remove spaces first so the label matches however it is spaced
        voice_actor = target.replace(' ', '')
        # On ja.wikipedia.org the cast line starts with the label 声 ("voice")
        if voice_actor.startswith('声-'):
            voice_actor = voice_actor.replace('声-', '', 1)

            # Exclude unnecessary trailing strings
            for i in replace_list:
                m = re.search(i, voice_actor)
                if m:
                    voice_actor = voice_actor[0:m.start()]

            # Split roles with several cast members, for example after a recast.
            # Duplicates are left in here and removed in one go later.
            list_voice_actors_tmp.extend(
                name for name in re.split(split_pattern, voice_actor) if name)
    return list_voice_actors_tmp

# Get a list of Aikatsu voice actors from the character pages of each series
target_work_characters = ['アイカツ!の登場人物一覧', 'アイカツスターズ!', 'アイカツフレンズ!']

list_voice_actors = []
for character in target_work_characters:
    html = requests.get('https://ja.wikipedia.org/wiki/{}'.format(urllib.parse.quote_plus(character, encoding='utf-8')))
    list_voice_actors.extend(get_voice_actor(BeautifulSoup(html.text, "lxml")))

This collects the voice actor names listed for the characters of the "Aikatsu!" series and strips the suffixes and other extra text attached to each name.

As a result, the following list is output.

['Sumire Morohoshi',
 'Azusa Tadokoro',
 'Ayaka Ohashi',
 'Tomoyo Kurosawa',
 'Manami Numakura',
 'Kiyono Yasuno',
 'Yuna Mimura',
 'Asami Seto',
 'Satomi Moriya',
 'Shino Shimoji',
　　　:
　　　:
 'Yu Wakui',
 'Misako Tomioka',
 'Nami Tanaka',
 'Yuri Yamaoka',
 'Mitsuki Saiga',
 'Kumpei Sakamoto',
 'Shinya Takahashi',
 'Takashi Onozuka',
 'Nanako Mori']

The same cast member can appear more than once in the output. This is because, as implemented, people such as Sumire Morohoshi who are cast in several works are collected once per work. Duplicates are not removed here; they are removed in the subsequent processing.

Some roles have more than one cast member, for example because of a recast, so I split those entries and extend the list with each name.
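The cleaning and splitting steps can be tried offline on a single string. This is a minimal sketch rather than the article's code: `clean_cast_line` is a hypothetical helper, the "声 - name / 歌 - name" layout is an assumption about how the character lists are written, and the sample lines use placeholder names.

```python
import re

# Patterns after which the rest of the line is discarded (assumed layout):
# singing-voice credit, footnote marker, full-width parenthetical, line break.
CUT_PATTERNS = ['/歌-', r'\[', '（', '\n']


def clean_cast_line(raw):
    """Return the voice actor names found in one raw cast line."""
    text = raw.replace(' ', '')
    if not text.startswith('声-'):
        return []
    text = text[len('声-'):]
    for pattern in CUT_PATTERNS:
        match = re.search(pattern, text)
        if match:
            text = text[:match.start()]
    # '→' marks a recast and '、' a shared role; both yield several names.
    return [name for name in re.split('[→、]', text) if name]


print(clean_cast_line('声 - A子 / 歌 - B美[1]'))      # ['A子']
print(clean_cast_line('声 - C郎 → D介（第2話から）'))  # ['C郎', 'D介']
```

Splitting on both separators in one pass avoids adding half-split strings when a line contains both a recast and a shared role.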

2. Acquire and process the cast information of the voice actor names acquired in 1

# Get the cast information (year, title, character) from a voice actor's page
def get_cast(html):
    soup = BeautifulSoup(html, "lxml")

    list_cast = []
    # Stays None until the first year heading has been seen
    year = None
    # Loop over the definition lists (one per anime / game section)
    for extract_dl in soup.find_all('dl'):
        for extract_dl_child in extract_dl.children:
            # Skip line breaks
            if extract_dl_child.string == '\n':
                continue
            # An entry made of plain text is the year heading, so remember it
            if extract_dl_child.string is not None:
                year = extract_dl_child.string
            # Otherwise the entry holds the works and roles for that year
            elif year is not None:
                # Loop over the works
                for extract_li in extract_dl_child.find_all('li'):
                    extract_a = extract_li.find('a')
                    # Entries without a link have no title, so skip them
                    if extract_a is not None:
                        title = extract_a.text
                        # The character name is inside full-width parentheses
                        title_char = extract_li.get_text()
                        character = title_char[title_char.find('（')+1:title_char.find('）')]
                        # Build one record: year,title,character
                        list_cast.append("{},{},{}".format(year, title.replace(" ",""), character))
    return list_cast

# Fetch a voice actor's Wikipedia page and return the parsed cast information
def get_html(target):
    # Title suffixes to try when the plain name has no page:
    # "_(声優)" (voice actor) first, then "_(俳優)" (actor).
    # Built from a list of pairs so the order is fixed on every Python version.
    suffix_dict = OrderedDict([
        ("Voice actor", "_(%E5%A3%B0%E5%84%AA)"),
        ("Actor", "_(%E4%BF%B3%E5%84%AA)"),
    ])

    url = 'https://ja.wikipedia.org/wiki/{}'.format(urllib.parse.quote_plus(target, encoding='utf-8'))

    res = requests.get(url)
    if res.status_code == 200:
        return get_cast(res.text)

    for suffix in suffix_dict.values():
        res = requests.get(url + suffix)
        if res.status_code == 200:
            return get_cast(res.text)

    # No page was found under any title, so there is nothing to parse
    return []

# Work out each voice actor's career at the time of the target work
def get_target_carrer(target_work, list_voice_actors):
    list_carrer = []
    # set() removes the duplicate names collected in step 1
    for target in set(list_voice_actors):
        # Get the voice actor's cast information from Wikipedia
        list_cast = get_html(target)

        # If no information could be obtained, go to the next voice actor
        if len(list_cast) == 0:
            continue

        # Walk through the roles in chronological order to find the debut year,
        # the year of the first role in the target work, and the number of roles so far
        cast_year = None
        debut_year = None
        count_cast_num = 1
        for str_cast in sorted(list_cast):
            fields = str_cast.split(',')
            year, title = fields[0], fields[1]
            # Decade headings such as "2010年代" are too vague to use as a year
            is_decade = year.endswith('年代')

            # The first entry with an exact year gives the debut year
            if not is_decade and debut_year is None:
                debut_year = year

            # The first entry for the target work gives the casting year
            if title == target_work and not is_decade:
                cast_year = year

            # Once the casting year is known, record the result and move on to the next voice actor
            if cast_year is not None:
                # Years look like "2012年"; the debut year counts as year 1
                target_work_carrer = int(cast_year.replace('年', '')) - int(debut_year.replace('年', '')) + 1
                list_carrer.append("{},{},{},{},{},{}".format(target_work,
                                                              target,
                                                              debut_year,
                                                              cast_year,
                                                              '{}年目'.format(target_work_carrer),  # career year, e.g. 14年目
                                                              '{}作品目'.format(count_cast_num)     # role count, e.g. 152作品目
                                                             )
                                  )
                break
            count_cast_num = count_cast_num + 1
    return list_carrer

target_works_list = ['アイカツ!', 'アイカツスターズ!', 'アイカツフレンズ!']
list_va_carrer = []
for target_work in target_works_list:
    list_va_carrer.extend(get_target_carrer(target_work, list_voice_actors))

Here, for each voice actor name, the anime and game roles that the voice actor has been cast in are acquired and sorted by year. From these I work out how many years after their debut they were first cast in the "Aikatsu!" series, and how many works they had been cast in up to that point.

The output looks like this.

['Aikatsu!,Akemi Kanda,2000,2013,14th year,152 works',
 'Aikatsu!,Ayaka Ohashi,2012,2012,1st year,3 works',
 'Aikatsu!,Mamiko Noto,1998,2012,15th year,405 works',
 'Aikatsu!,Hisako Kanemoto,2009,2015,7th year,209 works',
 'Aikatsu!,Mari Yokoo,1980,2013,34th year,155 works',
 'Aikatsu!,Hiroshi Yanaka,1986,2012,27th year,152 works',
 'Aikatsu!,Miki Hase,2010,2014,5th year,11 works',
 'Aikatsu!,Aya Suzaki,2010,2014,5th year,61 works',
 'Aikatsu!,Nanako Mori,2013,2015,3rd year,6 works',
 'Aikatsu!,Satomi Moriya,2008,2012,5th year,28 works',
　　　                 :
　　　                 :
 'Aikatsu Friends!,Yuri Yamaoka,2009,2018,10th year,66 works',
 'Aikatsu Friends!,Risa Kubota,2015,2018,4th year,26 works',
 'Aikatsu Friends!,Makoto Yasumura,2000,2018,19th year,206 works',
 'Aikatsu Friends!,Ikumi Hasegawa,2016,2018,3rd year,27 works',
 'Aikatsu Friends!,Tomokazu Sugita,1999,2019,21st year,672 works',
 'Aikatsu Friends!,Mitsuki Nakae,2014,2018,5th year,68 works',
 'Aikatsu Friends!,Yuki Kuwahara,2013,2018,6th year,93 works']

The voice actor history is calculated by treating the year of the first cast role as the debut year and subtracting it from the year of the first casting in the "Aikatsu!" series, then adding one so that the debut year counts as the first year.
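As a worked example of that calculation, here is a minimal sketch on made-up records. `career_year` and `first_cast` are hypothetical names, the years are assumed to be plain integers already, and the sample works are placeholders.

```python
def career_year(debut_year, cast_year):
    """Return which year of the career the casting falls in (debut year = year 1)."""
    return cast_year - debut_year + 1


def first_cast(records, target_work):
    """Find the first role in target_work among 'year,title,character' records.

    Returns (debut year, casting year, career year, role count), or None
    when the target work does not appear.
    """
    debut = None
    for count, record in enumerate(sorted(records), start=1):
        year_text, title, _character = record.split(',', 2)
        year = int(year_text)
        if debut is None:
            debut = year
        if title == target_work:
            return debut, year, career_year(debut, year), count
    return None


sample = ['2012,Work C,Role c', '2010,Work A,Role a', '2011,Work B,Role b']
print(first_cast(sample, 'Work C'))  # (2010, 2012, 3, 3)
print(career_year(2000, 2013))       # 14
```

A debut in 2000 and a first casting in 2013 gives the 14th year, which matches the first row of the output above.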

Some pages cannot be found under the voice actor's name alone because the page title has a form like "XXXX_(voice actor)", so the implementation also tries such titles.
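The order in which page titles are tried can be sketched as follows. `candidate_urls` is a hypothetical helper and "XXXX" is a placeholder name; the two suffixes are the percent-decoded forms of the ones used in the code above, "(声優)" for voice actor and "(俳優)" for actor.

```python
from urllib.parse import quote

BASE_URL = 'https://ja.wikipedia.org/wiki/'
# Title suffixes tried in order: none, "(声優)" = voice actor, "(俳優)" = actor.
SUFFIXES = ['', '_(声優)', '_(俳優)']


def candidate_urls(name):
    """Return the page URLs to try for a name, plain title first."""
    # Parentheses are kept as-is; non-ASCII characters are percent-encoded.
    return [BASE_URL + quote(name + suffix, safe='()') for suffix in SUFFIXES]


for url in candidate_urls('XXXX'):
    print(url)
# https://ja.wikipedia.org/wiki/XXXX
# https://ja.wikipedia.org/wiki/XXXX_(%E5%A3%B0%E5%84%AA)
# https://ja.wikipedia.org/wiki/XXXX_(%E4%BF%B3%E5%84%AA)
```

Keeping the candidates in a list makes the lookup order explicit, so the first URL that returns status 200 can be used.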

One remaining issue is that people such as Rin Aira, who has only just debuted as a voice actor, or Karen Miyama, who is also active as an actor, have Wikipedia pages whose format differs from that of a typical voice actor page, so they cannot be picked up. Handling those cases would probably require many one-off implementations, so I decided not to do it, but I think it would be better to at least make the implementation report who could not be picked up.
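Reporting who could not be picked up might look like the following. This is only a sketch under assumptions: `collect_casts` is a hypothetical helper, `fetch_cast` stands for any function that returns a list of cast records for a name (empty when nothing could be parsed), and the sample data is a stand-in so the code runs offline.

```python
def collect_casts(names, fetch_cast):
    """Split names into those with cast data and those without."""
    found = {}
    not_found = []
    # set() drops duplicate names; sorting keeps the report order stable.
    for name in sorted(set(names)):
        records = fetch_cast(name)
        if records:
            found[name] = records
        else:
            not_found.append(name)
    return found, not_found


# Stand-in for the real scraper: only 'Actor A' has parsable cast data.
fake_pages = {'Actor A': ['2010,Work A,Role a']}
found, not_found = collect_casts(['Actor A', 'Actor B', 'Actor A'],
                                 lambda name: fake_pages.get(name, []))
print(not_found)  # ['Actor B']
```

Since `get_html` returns a list of records, it could presumably be passed as `fetch_cast`, and the `not_found` list could then be checked by hand.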

List

The list of results was produced by processing the data with pandas, writing it out as CSV, and adjusting it in Excel.

import pandas as pd

# Each record is one comma-separated string, so split it into columns
series = pd.Series(list_va_carrer)
df = series.str.split(',', expand=True)
df.columns = ["Title of work", "Voice actor name", "Debut year", "Casting year", "From debut to casting (years)", "From debut to casting (roles)"]
df.to_csv("aikatsu.csv")

In closing

The Aikatsu! Advent Calendar 2019 still has a few open days, so feel free to join us!