This time, I would like to try web scraping with Python and Selenium, which I am currently studying. The target site is Google: the script searches for the specified keywords, collects the specified number of results, and writes three items, "Title, URL, Summary", for each result. The ultimate goal is to write that information to an SQLite database.
I am using Python 3.7.1, with Visual Studio 2019 as the development environment. For Firefox, I downloaded geckodriver as the WebDriver.
The source code is as follows. **The code below works as of April 23, 2020, but note that it may stop working if the site's markup changes in the future.**
google.py
```python
import urllib.parse

import records
from selenium.webdriver import Firefox, FirefoxOptions
from sqlalchemy.exc import IntegrityError

# Search using the following keywords
keywd = ['Python', 'Machine learning']

# The name of the database file to save the retrieved data
db = records.Database('sqlite:///google_search.db')
db.query('''CREATE TABLE IF NOT EXISTS items (
                url text PRIMARY KEY,
                title text,
                summary text NULL)''')


def store_data(url, title, summary):
    try:
        db.query('''INSERT INTO items (url, title, summary)
                    VALUES (:url, :title, :summary)''',
                 url=url, title=title, summary=summary)
    except IntegrityError:
        # Skip this item because it is already registered
        print("It already exists.")
        return False
    return True


def visit_next_page(driver, url):
    driver.get(url)
    items = driver.find_elements_by_css_selector('#rso > div.g')
    for item in items:
        tag = item.find_element_by_css_selector('div.r > a')
        link = tag.get_attribute('href')
        title = tag.find_element_by_tag_name('h3').text.strip()
        summary = item.find_element_by_css_selector('div.s span.st').text.strip()
        if store_data(link, title, summary):
            print(title, link, sep='\n', end='\n\n')


def main():
    # Target site and total number of results (search_unit * search_loop)
    base_url = "https://www.google.co.jp/"
    search_unit = 20  # Results per page (specifying 100 or more seems not to work)
    search_loop = 5
    start = 0

    # Combine the keywords into a single string
    target = ' '.join(keywd)
    # URL-encode it (the default encoding is "utf-8")
    target = urllib.parse.quote(target)

    opt = FirefoxOptions()
    # Comment out the next line if you want to watch the browser work
    opt.add_argument('-headless')
    driver = Firefox(options=opt)
    # Set the implicit wait time
    driver.implicitly_wait(10)

    # Read the results page by page
    for i in range(search_loop):
        url = "{0}search?num={1}&start={2}&q={3}".format(
            base_url, search_unit, start, target)
        start += search_unit
        print("\npage count: {0}...".format(i + 1), end='\n\n')
        visit_next_page(driver, url)

    driver.quit()


if __name__ == '__main__':
    main()
```
Since the comments in the source cover most of it, I will explain only the central scraping part.
On Google's results page, the data for one result is contained in a `div.g` element directly under `#rso`. These are collected with the following code:

```python
items = driver.find_elements_by_css_selector('#rso > div.g')
```
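As a side note, the `find_elements_by_*` style helpers used throughout this article were removed in Selenium 4. If you are running a newer Selenium, the equivalent call uses a `By` locator instead; the CSS selector itself is unchanged:

```python
from selenium.webdriver.common.by import By

# Same query as above, written with the Selenium 4 locator API
items = driver.find_elements(By.CSS_SELECTOR, '#rso > div.g')
```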
The title and the link URL are inside the `div.r > a` tag of each item, and the summary text is inside `div.s span.st`. These are extracted with code like this:

```python
tag = item.find_element_by_css_selector('div.r > a')
link = tag.get_attribute('href')
title = tag.find_element_by_tag_name('h3').text.strip()
summary = item.find_element_by_css_selector('div.s span.st').text.strip()
```
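These selectors are tied to Google's markup as of the date above, so an individual item can fail to match even when the page as a whole still loads. As a hedge, one possible tweak (not in the original code) is to rewrite the loop in `visit_next_page` so that items missing an expected element are skipped instead of crashing the whole run:

```python
from selenium.common.exceptions import NoSuchElementException

for item in items:
    try:
        tag = item.find_element_by_css_selector('div.r > a')
        link = tag.get_attribute('href')
        title = tag.find_element_by_tag_name('h3').text.strip()
        summary = item.find_element_by_css_selector('div.s span.st').text.strip()
    except NoSuchElementException:
        # This item's markup differs from what we expect; skip it
        continue
    if store_data(link, title, summary):
        print(title, link, sep='\n', end='\n\n')
```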
To actually use the script, specify the keywords you want to search for and the database file name at the beginning of the source code:
```python
# Search using the following keywords
keywd = ['Python', 'Machine learning']

# The name of the database file to save the retrieved data
db = records.Database('sqlite:///google_search.db')
```
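The article does not show it, but since the results end up in an ordinary SQLite file, you can read them back through the same `records` database handle. The SELECT below is my own example, not part of the original script:

```python
# Query the saved results back out of google_search.db
rows = db.query('SELECT url, title, summary FROM items')
for row in rows:
    # records exposes each column as an attribute on the row
    print(row.title)
    print(row.url)
    print(row.summary, end='\n\n')
```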
Also, because `url` is the table's primary key, running the same program multiple times will not register the same URL twice; duplicates are simply skipped:
```python
def store_data(url, title, summary):
    try:
        db.query('''INSERT INTO items (url, title, summary)
                    VALUES (:url, :title, :summary)''',
                 url=url, title=title, summary=summary)
    except IntegrityError:
        # Skip this item because it is already registered
        print("It already exists.")
        return False
    return True
```
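Relying on the PRIMARY KEY constraint and catching `IntegrityError` keeps the duplicate handling in one place. An alternative worth knowing about (not used in the original) is SQLite's `INSERT OR IGNORE`, which silently drops duplicate rows instead of raising, though you then need an extra step if you want to know whether a row was actually inserted:

```python
def store_data_or_ignore(url, title, summary):
    # INSERT OR IGNORE never raises on a duplicate key; it just does nothing
    db.query('''INSERT OR IGNORE INTO items (url, title, summary)
                VALUES (:url, :title, :summary)''',
             url=url, title=title, summary=summary)
```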
At first I thought I could get about 1,000 results in a single page, but according to Google's specifications the limit seems to be about 100 results per page.
So the script scrapes while moving to the next page, using the two variables `search_unit` and `search_loop`.
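To make the pagination concrete, here is what the generated URLs look like with the defaults (`search_unit = 20`, `search_loop = 5`); the loop simply advances the `start` parameter by `search_unit` each time. The keyword string below is the URL-encoded form of the default keywords:

```python
base_url = "https://www.google.co.jp/"
search_unit, search_loop = 20, 5
target = "Python%20Machine%20learning"  # urllib.parse.quote('Python Machine learning')

start = 0
for i in range(search_loop):
    print("{0}search?num={1}&start={2}&q={3}".format(base_url, search_unit, start, target))
    start += search_unit
# .../search?num=20&start=0&q=Python%20Machine%20learning
# .../search?num=20&start=20&q=Python%20Machine%20learning
# ... and so on, up to start=80, for 100 results in total
```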
You might ask why I did not use Beautiful Soup. I wanted practice with Selenium, and since so many sites rely on JavaScript these days, it seems there will be more and more opportunities to use it, so I deliberately scraped this way.
You are free to use the source code introduced this time, but please do so at your own risk.
[Basics and Practice of Python Scraping / Seppe vanden Broucke et al.](https://www.amazon.co.jp/dp/4295005282)