Thank you for your hard work. This is @ibusan. This time, as the title suggests, I implemented a program that automatically downloads images by scraping. I am leaving it in this article as a memorandum so that I can look back on it when I forget.
There are various possible targets for scraping, but this time I decided to automatically collect the fan kit of the game "Princess Connect! Re: Dive" (https://priconne-redive.jp/fankit02/), which I am addicted to. It would be a lot of work to download everything by hand.
First, prepare the environment for scraping. The environment built this time is as follows.
Anaconda is a platform that provides Python packages for data science. You can install it from the link above. ChromeDriver is the driver required to operate Chrome programmatically. If Anaconda is installed, ChromeDriver can be installed with:
pip install chromedriver-binary=='Driver version'
The following site is helpful when installing ChromeDriver: ChromeDriver installation procedure
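As a quick sanity check (a minimal sketch, not part of the main script): if chromedriver-binary was installed with pip as above, simply importing it adds the bundled chromedriver to PATH, so webdriver.Chrome() can be called without an explicit driver path. The fan kit URL is only used here to confirm the browser starts.

import chromedriver_binary  # importing this adds the bundled chromedriver to PATH
from selenium import webdriver

browser = webdriver.Chrome()  # no explicit driver path needed thanks to the import above
browser.get("https://priconne-redive.jp/fankit02/")
print(browser.title)  # if a title is printed, Chrome and the driver are working
browser.quit()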
The libraries used are as follows. All of them can be installed with pip.
This time, we will proceed with the implementation according to the following procedure.
Now that the policy has been decided, we can start coding.
from selenium import webdriver
import time
import os
from bs4 import BeautifulSoup
import requests
First, import the libraries: the five listed in Preparation above.
#Launch Google Chrome
browser = webdriver.Chrome("/Users/ibuki_sakata/opt/anaconda3/lib/python3.7/site-packages/chromedriver_binary/chromedriver")
browser.implicitly_wait(3)
Then use ChromeDriver and selenium to launch Chrome. The second line launches the browser; the path in parentheses is the path to ChromeDriver. The third line sets an implicit wait, so Selenium waits up to 3 seconds for elements to appear instead of failing immediately when the page has not finished loading.
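As a side note, if you do not want a browser window popping up while the script runs, Chrome can also be launched in headless mode. This is a minimal sketch assuming the Selenium 3 style constructor used in this article; the driver path is the same one as above, only the options differ.

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run Chrome without opening a window
options.add_argument("--window-size=1280,1024")

browser = webdriver.Chrome(
    "/Users/ibuki_sakata/opt/anaconda3/lib/python3.7/site-packages/chromedriver_binary/chromedriver",
    options=options,
)
browser.implicitly_wait(3)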
#Go to URL
url_pricone = "https://priconne-redive.jp/fankit02/"
browser.get(url_pricone)
time.sleep(3)
The first line specifies the URL of the fan kit top page, and the second line navigates the browser to that URL. The browser's get method is analogous to the GET method of HTTP communication. The time.sleep(3) on the third line pauses the program for 3 seconds to give the page time to load.
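Incidentally, instead of a fixed time.sleep(3), Selenium can wait until a specific element actually appears. This is a hedged sketch using WebDriverWait and the fankit-list class that appears later in this article; adjust the selector if the page structure differs.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser.get(url_pricone)
# Wait up to 10 seconds for the fan kit list (ul.fankit-list) to be present,
# then continue immediately instead of always sleeping for a fixed 3 seconds.
WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "ul.fankit-list"))
)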
# Get the URL of all fan kit web pages
current_url = browser.current_url
html = requests.get(current_url)
bs = BeautifulSoup(html.text, "html.parser")
fankitPage = bs.find("ul", class_="page-nav").find_all("li")
page = []
for li_tag in fankitPage:
    a_tag = li_tag.find("a")
    if a_tag.get("class"):
        # An <a> with a class attribute is the currently displayed page,
        # so store the current URL instead of its href.
        page.append(current_url)
    else:
        page.append(a_tag.get("href"))
Here we collect the URLs of all the pages (the first page, the second page, and so on). BeautifulSoup is used to extract the URLs. There are many sites that explain how to use BeautifulSoup in detail, so I will not explain it here.
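For reference, here is a tiny standalone example of the find / find_all pattern used above, run against a made-up HTML snippet shaped roughly like the page navigation (the class names follow the code above, but the real markup and the page/2/ URL are only illustrative):

from bs4 import BeautifulSoup

sample_html = """
<ul class="page-nav">
  <li><a class="current">1</a></li>
  <li><a href="https://priconne-redive.jp/fankit02/page/2/">2</a></li>
</ul>
"""

bs = BeautifulSoup(sample_html, "html.parser")
for li_tag in bs.find("ul", class_="page-nav").find_all("li"):
    a_tag = li_tag.find("a")
    # The anchor with a class is the current page; the others carry hrefs.
    print(a_tag.get("class"), a_tag.get("href"))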
# Download fan kit
for p in page:
    html = requests.get(p)
    browser.get(p)
    time.sleep(1)
    bs = BeautifulSoup(html.text, "html.parser")
    ul_fankit_list = bs.find("ul", class_="fankit-list")
    li_fankit_list = ul_fankit_list.find_all("li")
    fankit_url = []
    for li_tab in li_fankit_list:
        a_tab = li_tab.find("a")
        fankit_url.append(a_tab.get("href"))
    for url in fankit_url:
        browser.get(url)
        time.sleep(1)
        html_fankit = requests.get(url)
        bs_fankit = BeautifulSoup(html_fankit.text, "html.parser")
        h3_tag = bs_fankit.find("h3")
        title = h3_tag.text
        os.makedirs(title, exist_ok=True)  # one folder per fan kit, named after its title
        ul_dl_btns = bs_fankit.find_all("ul", class_="dl-btns")
        for i, ul_tag in enumerate(ul_dl_btns, start=0):
            li_tag = ul_tag.find("li")
            a_tag = li_tag.find("a")
            img_url = a_tag.get("href")
            browser.get(img_url)
            time.sleep(1)
            print(img_url)
            img = requests.get(img_url)
            with open(title + "/{}.jpg".format(i), "wb") as f:
                f.write(img.content)
            browser.back()
Here the fan kit is downloaded. The basics are the same as before: get the HTML source with requests, parse it with BeautifulSoup, and extract the desired tags. Each image is downloaded by opening a file in binary mode and writing the image data obtained with requests.
In this way, the images are downloaded for each fan kit.
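As a possible refinement (not part of the original script), the file extension could be taken from the image URL instead of always using .jpg, and the HTTP status could be checked before writing. A minimal sketch of such a hypothetical download helper:

import os
from urllib.parse import urlparse

import requests


def download_file(file_url, save_dir, index):
    """Hypothetical helper: save file_url into save_dir, keeping its extension."""
    response = requests.get(file_url)
    response.raise_for_status()  # stop on 4xx/5xx instead of writing an error page to disk

    # Fall back to .jpg when the URL path has no extension.
    ext = os.path.splitext(urlparse(file_url).path)[1] or ".jpg"
    save_path = os.path.join(save_dir, "{}{}".format(index, ext))

    with open(save_path, "wb") as f:
        f.write(response.content)
    return save_path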