Practice web scraping with Python and Selenium

Introduction

In this article I practice web scraping with Python and Selenium, which I am currently studying. The target site is Google: the script searches for the specified keywords, retrieves the specified number of results, and writes three items for each result (title, URL, and summary) to a database.

The following items are retrieved from each search result. (Figure: sele_01.png) The ultimate goal is to write this information to an SQLite database. (Figure: sele_02.png)

Development environment

I used Python 3.7.1, with Visual Studio 2019 as the development environment. For Firefox, I downloaded geckodriver, the WebDriver for that browser.
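
Before running the script, it can help to confirm that geckodriver is actually discoverable. The following is a minimal sketch, assuming the driver is expected to be on PATH (which is where Selenium looks when no explicit path is given). The helper name find_driver is hypothetical and not part of the original script.

```python
import shutil


def find_driver(name="geckodriver"):
    """Return the full path of the driver executable, or None if it is not on PATH."""
    # shutil.which searches the directories listed in the PATH environment variable
    return shutil.which(name)


path = find_driver()
if path:
    print("geckodriver found at:", path)
else:
    print("geckodriver was not found on PATH")
```

If the driver is not found, either add its folder to PATH or place the executable next to the script, depending on your setup.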

Code

The source code is as follows. Note: the code below worked as of April 23, 2020, but it may stop working if the structure of the site changes in the future.

google.py


import urllib.parse
import records
from selenium.webdriver import Firefox, FirefoxOptions
from sqlalchemy.exc import IntegrityError

#Search using the following keywords
keywd = ['Python','Machine learning']

#The name of the file to save the retrieved data
db = records.Database('sqlite:///google_search.db')

db.query('''CREATE TABLE IF NOT EXISTS items (
            url text PRIMARY KEY,
            title text,
            summary text NULL)''')

def store_data(url, title, summary):
    try:
        db.query('''INSERT INTO items (url, title, summary)
                    VALUES (:url, :title, :summary)''', 
                    url=url, title=title, summary=summary)
    except IntegrityError:
        #Skip this item because its URL is already registered
        print("Already exists; skipped.")
        return False

    return True

def visit_next_page(driver, url):
    driver.get(url)

    items = driver.find_elements_by_css_selector('#rso > div.g')

    for item in items:
        tag = item.find_element_by_css_selector('div.r > a')
        link = tag.get_attribute('href')
        title = tag.find_element_by_tag_name('h3').text.strip()

        summary = item.find_element_by_css_selector('div.s span.st').text.strip()

        if store_data(link, title, summary):
            print(title, link, sep='\n', end='\n\n')

def main():
    #Target site and total number of results to fetch (search_unit * search_loop)
    base_url = "https://www.google.co.jp/"
    search_unit = 20 #Number of results per page (Google seems to cap this at about 100)
    search_loop = 5
    start = 0

    #Combine keywords into one string
    target = ' '.join(keywd)

    #URL encoding (the default encoding is "utf-8")
    target = urllib.parse.quote(target)

    opt = FirefoxOptions()
    
    #To watch the browser while it runs, comment out the next line
    opt.add_argument('-headless')
    driver = Firefox(options=opt)

    #Wait up to 10 seconds for elements to appear (implicit wait)
    driver.implicitly_wait(10)

    #Read page by page
    for i in range(search_loop):
        url = "{0}search?num={1}&start={2}&q={3}".format(base_url, search_unit, start, target)
        start += search_unit

        print("\npage count: {0}...".format(i + 1), end='\n\n')
        visit_next_page(driver, url)

    driver.quit()

if __name__ == '__main__':
    main()

Commentary

Most of the code should be clear from its comments, so I will only explain the central part of the scraping.

Each search result is contained in an element like the following. (Figure: sele_03.png) It can be retrieved with code like this:

items = driver.find_elements_by_css_selector('#rso > div.g')

The title and the link URL are located as follows. (Figure: sele_04.png) The summary is located as follows. (Figure: sele_05.png)

Get these with code like this:

tag = item.find_element_by_css_selector('div.r > a')
link = tag.get_attribute('href')
title = tag.find_element_by_tag_name('h3').text.strip()

summary = item.find_element_by_css_selector('div.s span.st').text.strip()

How to use

To use the script, specify the keywords you want to search for and the name of the database file near the beginning of the source code.

#Search using the following keywords
keywd = ['Python', 'Machine learning']

#The name of the file to save the retrieved data
db = records.Database('sqlite:///google_search.db')

Also, even if the program is run multiple times, a URL that is already registered is skipped rather than inserted again, because url is the primary key of the table.

def store_data(url, title, summary):
    try:
        db.query('''INSERT INTO items (url, title, summary)
                    VALUES (:url, :title, :summary)''', 
                    url=url, title=title, summary=summary)
    except IntegrityError:
        #Skip this item because its URL is already registered
        print("Already exists; skipped.")
        return False

    return True
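
The duplicate-skipping behavior can be tried without a browser. The following is a minimal sketch that uses the standard sqlite3 module instead of the records library used in the script, with an in-memory database and made-up rows. The function name store_item and the example.com data are assumptions for illustration only.

```python
import sqlite3


def store_item(conn, url, title, summary):
    """Insert one row; return True if it was stored, False if the URL already existed."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "INSERT INTO items (url, title, summary) VALUES (?, ?, ?)",
                (url, title, summary),
            )
    except sqlite3.IntegrityError:
        # The PRIMARY KEY on url rejects a second row with the same URL
        return False
    return True


# In-memory database so the sketch leaves no file behind
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE IF NOT EXISTS items ("
    "url text PRIMARY KEY, title text, summary text NULL)"
)

first = store_item(conn, "https://example.com/", "Example", "A placeholder row")
second = store_item(conn, "https://example.com/", "Example again", "Same URL")
count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]

print(first, second, count)
```

The second call returns False and the table still holds a single row, which is the same effect the script relies on when it is run repeatedly.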

Supplement

At first I hoped to get about 1,000 results on a single page, but Google seems to limit a page to about 100 results. So the script pages through the results using two variables: search_unit (the number of results per page) and search_loop (the number of pages to read).

You may also wonder why I did not use Beautiful Soup. I wanted to practice using Selenium, and since many sites rely on JavaScript these days, I expect to have more opportunities to use Selenium, so I chose it for this scraping exercise.
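
The paging can be seen more concretely by generating the URLs on their own. The following is a minimal sketch, assuming the same num, start, and q parameters as in the script. The helper name build_search_urls is hypothetical, and it uses urllib.parse.urlencode, which encodes spaces as "+" rather than the "%20" produced by urllib.parse.quote in the script.

```python
from urllib.parse import urlencode


def build_search_urls(base_url, keywords, per_page, pages):
    """Return one search URL per result page."""
    query = " ".join(keywords)  # combine the keywords into one string
    urls = []
    for page in range(pages):
        params = {
            "num": per_page,           # results per page (search_unit)
            "start": page * per_page,  # offset of the first result on this page
            "q": query,
        }
        urls.append(base_url + "search?" + urlencode(params))
    return urls


urls = build_search_urls(
    "https://www.google.co.jp/", ["Python", "Machine learning"], 20, 5
)
for url in urls:
    print(url)
```

With search_unit = 20 and search_loop = 5, the start offsets are 0, 20, 40, 60, and 80, for up to 100 results in total.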

In closing

You are free to use the source code introduced this time, but please do so at your own risk.

Reference articles, reference books

[Basics and Practice of Python Scraping / Seppe vanden Broucke et al.](https://www.amazon.co.jp/Python%E3%82%B9%E3%82%AF%E3%83%AC%E3%82%A4%E3%83%94%E3%83%B3%E3%82%B0%E3%81%AE%E5%9F%BA%E6%9C%AC%E3%81%A8%E5%AE%9F%E8%B7%B5-%E3%83%87%E3%83%BC%E3%82%BF%E3%82%B5%E3%82%A4%E3%82%A8%E3%83%B3%E3%83%86%E3%82%A3%E3%82%B9%E3%83%88%E3%81%AE%E3%81%9F%E3%82%81%E3%81%AEWeb%E3%83%87%E3%83%BC%E3%82%BF%E5%8F%8E%E9%9B%86%E8%A1%93-impress-top-gear/dp/4295005282/ref=tmm_pap_swatch_0?_encoding=UTF8&qid=1587624962&sr=1-1)
