This is the article for day 13 of the Kinki University Advent Calendar 2019.
First of all, an important note: scraping is a last resort and should be avoided when it isn't necessary. This time we will scrape Qiita's tag ranking. Qiita does have an API, but there was no endpoint for the tag ranking (as of December 8, 2019), so I resorted to scraping. If you can get the information you want through an API, use the API. Also, when you do scrape, put a wait between connections so you don't place unnecessary load on the server.
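As a concrete example, here is a minimal sketch of what "wait between connections" means in practice. The fetch_pages helper and the one-second interval are my own illustrative choices, not from any official guideline:

import time
import urllib.request

def fetch_pages(urls, interval=1.0):
    """Fetch each URL in turn, sleeping between requests so as not to hammer the server."""
    pages = []
    for url in urls:
        with urllib.request.urlopen(url) as res:
            pages.append(res.read())
        time.sleep(interval)  # be polite: pause before the next request
    return pages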
Docker
The version I'm using is below.
Docker version 19.03.5, build 633a0ea
docker-compose.yml
docker-compose.yml
version: '3'
services:
  selenium-hub:
    image: selenium/hub
    container_name: selenium-hub
    ports:
      - "4444:4444"
  chrome:
    image: selenium/node-chrome-debug
    depends_on:
      - selenium-hub
    environment:
      - HUB_PORT_4444_TCP_ADDR=selenium-hub
      - HUB_PORT_4444_TCP_PORT=4444
  python:
    build: .
    container_name: python
    volumes:
      - .:/workspace
    command: /bin/bash
    tty: true
    stdin_open: true
Dockerfile
FROM python:3.7
WORKDIR /workspace
RUN pip install \
    selenium \
    beautifulsoup4
I won't explain here what docker-compose and Dockerfiles are; if you are unfamiliar with them, please refer to "I tried using Elixir's Phoenix and PostgreSQL on Docker", where they are covered in detail.
Let's start with docker-compose.yml.
Selenium is the standard choice here because it can scrape dynamic, JavaScript-rendered sites. We create containers from the `selenium/hub` and `selenium/node-chrome-debug` images. Note that `environment` is set on `selenium/node-chrome-debug`: these variables tell the Chrome node where to find the hub, and you cannot scrape without them.
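If the node never registers, `webdriver.Remote` will simply error out later, so it can help to check the grid first. Here is a minimal sketch that polls the hub's `/wd/hub/status` endpoint (part of Selenium Grid 3; the `wait_for_grid` helper is my own, and the exact JSON shape can differ between grid versions):

import json
import time
import urllib.request

def wait_for_grid(url='http://selenium-hub:4444/wd/hub/status', timeout=60):
    """Poll the hub until it reports ready, or give up after `timeout` seconds."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with urllib.request.urlopen(url) as res:
                status = json.load(res)
            if status.get('value', {}).get('ready'):
                return True
        except OSError:
            pass  # hub not reachable yet
        time.sleep(2)
    return False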
The Dockerfile builds the Python environment. The RUN instruction installs the required libraries (selenium and beautifulsoup4).
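To confirm the build worked, you can check from inside the python container that both libraries are importable. This is just a quick sanity check, not part of the article's code:

import selenium
import bs4

# Print the installed versions of the two libraries pulled in by the Dockerfile.
print(selenium.__version__)
print(bs4.__version__)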
Place these files and the code below in the same directory and run `docker-compose up -d --build` to launch the containers.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup
import pprint


class QiitaGetRanking():
    """
    Class to get ranking data from Qiita.
    """

    def get_tag_ranking(self, browser: webdriver.Remote) -> dict:
        """
        Get information about the tag ranking from Qiita.

        Parameters
        ----------
        browser: webdriver.Remote
            WebDriver object for scraping.

        Returns
        -------
        tag_ranking_data: dict
            Dictionary containing the tag ranking.
        """
        html = browser.page_source.encode('utf-8')
        soup = BeautifulSoup(html, "html.parser")
        ra_tag_names = soup.find_all(class_='ra-Tag_name pr-1')
        tag_ranking_data = {}
        for i, ra_tag_name in enumerate(ra_tag_names):
            tag_ranking_data[i + 1] = [ra_tag_name.text,
                                       'https://qiita.com/tags/%s' % (ra_tag_name.text.lower())]
        return tag_ranking_data


if __name__ == "__main__":
    # Main routine. Quit the browser as soon as the HTML has been
    # retrieved, and likewise when an error occurs.
    browser = webdriver.Remote(
        command_executor='http://selenium-hub:4444/wd/hub',
        desired_capabilities=DesiredCapabilities.CHROME)
    try:
        print("start scrape")
        browser.get('https://qiita.com')
        # Wait until the tag ranking has been rendered by JavaScript.
        # Time out if it has not appeared after 15 seconds.
        WebDriverWait(browser, 15).until(
            EC.presence_of_element_located((By.CLASS_NAME, 'ra-Tag_name')))
        print("generate object")
        qgr = QiitaGetRanking()
        ranking_data = qgr.get_tag_ranking(browser)
        pprint.pprint(ranking_data)
    finally:
        browser.quit()
I think the code is fairly simple. When scraping with Docker, running a Selenium server in its own container is easier and lighter than building a full Selenium environment on Ubuntu.
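To isolate just the parsing step: given HTML containing elements with the class the ranking uses, `find_all(class_=...)` pulls them out directly. Here is a minimal sketch on hand-written HTML (the markup below is illustrative, not Qiita's real page):

from bs4 import BeautifulSoup

html = """
<div class="ra-Tag_name pr-1">Python</div>
<div class="ra-Tag_name pr-1">Docker</div>
"""
soup = BeautifulSoup(html, "html.parser")
for i, tag in enumerate(soup.find_all(class_='ra-Tag_name pr-1')):
    print(i + 1, tag.text)  # -> "1 Python", then "2 Docker"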
For `webdriver.Remote`, refer to "2.5. Using Selenium with Remote WebDriver" in the Selenium with Python documentation.
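In its simplest form, `webdriver.Remote` only needs the hub URL and the desired capabilities. Here is a minimal standalone sketch (separate from the article's class) that connects to the hub, loads a page, and prints its title:

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# Connect to the Selenium hub container instead of a local browser binary.
browser = webdriver.Remote(
    command_executor='http://selenium-hub:4444/wd/hub',
    desired_capabilities=DesiredCapabilities.CHROME)
try:
    browser.get('https://qiita.com')
    print(browser.title)  # title of the fetched page
finally:
    browser.quit()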
Make sure `docker ps` shows all three containers, then run the program with `docker exec -it python python qiita.py`.
start scrape
generate object
{1: ['Python', 'https://qiita.com/tags/python'],
2: ['JavaScript', 'https://qiita.com/tags/javascript'],
3: ['AWS', 'https://qiita.com/tags/aws'],
4: ['Rails', 'https://qiita.com/tags/rails'],
5: ['Ruby', 'https://qiita.com/tags/ruby'],
6: ['Beginner', 'https://qiita.com/tags/Beginner'],
7: ['Docker', 'https://qiita.com/tags/docker'],
8: ['PHP', 'https://qiita.com/tags/php'],
9: ['Vue.js', 'https://qiita.com/tags/vue.js'],
10: ['Go', 'https://qiita.com/tags/go']}
If you see output like this, it's a complete victory. Thank you for your hard work.
This time I showed you how to scrape a dynamic site using Docker. Let's enjoy scraping, while staying within the rules!