Scraping with Selenium in Python

Introduction

In a university course, I was assigned to collect the invention names returned for a search term on the Japan Platform for Patent Information (J-PlatPat) and analyze them with natural language processing. Other students copied the entire HTML source of the page and used the search function of Excel or a text editor to extract what they needed. I used Python to automate the whole process: fetch only the data I needed and write it to a text file. I am publishing that code here, partly as a memo for myself.

1. Introduction
2. What I wanted to do
3. Prerequisite knowledge
4. Preparation
5. Actual code
6. Summary
7. Reference document

What I wanted to do

Fetch the invention names that match a search term from the Japan Platform for Patent Information and save them to a text file.

Prerequisite knowledge

The following should be enough to read this code:

- Basic Python grammar
- Minimal knowledge of HTML

Preparation

If you don't have the libraries this program needs, install them.

requests
selenium

>> pip install requests
>> pip install selenium

You will also need ChromeDriver. If you don't have it, download the version that matches your installed Chrome and put it in the same directory as your program.
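
Before running the program, it can help to confirm that both libraries are importable and that the driver sits where the program expects it. The following is only a minimal sketch: `missing_packages` and `has_driver` are hypothetical helper names that are not part of the original code, and the check assumes the driver file is called `chromedriver` or `chromedriver.exe`.

```python
import importlib.util
from pathlib import Path


def missing_packages(names):
    """Return the names from `names` that cannot be imported in this environment."""
    # find_spec returns None when a top-level package is not installed
    return [name for name in names if importlib.util.find_spec(name) is None]


def has_driver(directory, filename="chromedriver"):
    """Report whether the driver binary (with or without .exe) exists in `directory`."""
    folder = Path(directory)
    candidates = (filename, filename + ".exe")
    return any((folder / candidate).is_file() for candidate in candidates)


if __name__ == "__main__":
    print("Missing packages:", missing_packages(["requests", "selenium"]))
    print("chromedriver found:", has_driver(Path.cwd()))
```

An empty list and `True` mean the setup described above is in place. Anything listed as missing can be installed with the `pip install` commands shown earlier.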

Actual code

The code is also available on GitHub.

The target page loads more results only when you scroll down, so to get all of them the page has to be scrolled to the bottom repeatedly. The scroll function does this: it keeps scrolling to the bottom until the page source stops changing. The main function then reads the number shown on the last result at the bottom of the page and fetches that many invention names. See the Selenium documentation for how to locate elements and read their text. Once all the names are collected, they are saved to a text file.
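
The scroll-until-nothing-changes idea can be tried without a browser. The sketch below is an illustration under assumptions, not the code from this article: `scroll_until_stable` adds a `max_rounds` safety cap that the original loop does not have, and `FakeDriver` is a hypothetical stand-in that only imitates the two WebDriver members the loop uses (`page_source` and `execute_script`).

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=50):
    """Scroll to the bottom repeatedly until the page source stops changing.

    Returns the number of scrolls performed. `max_rounds` is a hypothetical
    safety cap so the loop cannot run forever on a page that keeps changing.
    """
    previous = driver.page_source
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded rows time to appear
        current = driver.page_source
        if current == previous:
            return rounds  # nothing new was loaded, so the end was reached
        previous = current
    return max_rounds


class FakeDriver:
    """Hypothetical stand-in for a WebDriver: each scroll loads one more row."""

    def __init__(self, total_rows, initial_rows=1):
        self.total_rows = total_rows
        self.rows = initial_rows

    @property
    def page_source(self):
        return "<ul>" + "<li>row</li>" * self.rows + "</ul>"

    def execute_script(self, script):
        # Imitate lazy loading: one more row per scroll, up to the total
        self.rows = min(self.rows + 1, self.total_rows)


if __name__ == "__main__":
    fake = FakeDriver(total_rows=4)
    print("scrolls needed:", scroll_until_stable(fake, pause=0))
```

Comparing the whole page source is simple but can be fooled by pages that change on their own (clocks, ads). A fixed `time.sleep` is also a guess about load time, so on a slow connection the loop may stop early.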

`main.py`


"""
A program that fetches patent invention names from the Japan Platform for Patent Information
"""
import os
import time

import requests
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By


def scroll(driver):
    """
    Scroll to the bottom of the page until no new content is loaded.
    """
    html01 = driver.page_source
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # wait for the next batch of results to load
        html02 = driver.page_source
        if html01 == html02:
            break  # the page source stopped changing, so everything is loaded
        html01 = html02


def main():
    """
    Run a search on the patent information platform and
    save the invention names of the matching patents.
    """
    # chromedriver is expected in the current directory (chromedriver.exe on Windows)
    driver_name = 'chromedriver.exe' if os.name == 'nt' else 'chromedriver'
    driver_path = os.path.join(os.getcwd(), driver_name)
    # Selenium 4 takes the driver path through a Service object
    driver = webdriver.Chrome(service=Service(executable_path=driver_path))
    # Open the Patent Information Platform
    driver.get('https://www.j-platpat.inpit.go.jp/')

    # Ask for the search term
    print('What do you want to search for?')
    search_word = input()
    # Ask for the name of the output file
    print('Please type a file name')
    file_name = input()

    time.sleep(2)
    # Selenium 4 removed find_element_by_*; use find_element(By.<strategy>, value)
    search_box = driver.find_element(By.NAME, 's01_srchCondtn_txtSimpleSearch')
    search_box.click()
    search_box.send_keys(search_word)
    driver.find_element(By.NAME, 's01_srchBtn_btnSearch').click()
    time.sleep(5)

    # Scroll so that every result is loaded
    scroll(driver)

    # The number shown on the last row is the total number of results
    id_str = driver.find_elements(By.ID, 'patentUtltyIntnlSimpleBibLst_tableView_numberArea')[-1].text
    id_num = int(id_str)

    words = []
    for i in range(id_num):
        word = driver.find_element(By.ID, 'patentUtltyIntnlSimpleBibLst_tableView_invenName{}'.format(i)).text
        words.append(word)
        print(word)
    print(words)
    driver.quit()  # close the browser once the names are collected

    # Create a text file (UTF-8 so that Japanese names are written correctly)
    with open(file_name, 'w', encoding='utf-8') as f:
        f.write('\n'.join(words))


if __name__ == "__main__":
    main()
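
The last step of main, writing the collected names to a text file, can be tried on its own. This is a minimal sketch under assumptions: `save_lines` is a hypothetical helper, the sample names are made up for illustration, and skipping empty names is a choice of this sketch rather than something the program above does. The encoding is passed explicitly because the default used by `open()` depends on the platform, and invention names on this site are mostly Japanese.

```python
import tempfile
from pathlib import Path


def save_lines(path, lines, encoding="utf-8"):
    """Write one item per line and return the number of items written.

    Empty or whitespace-only items are skipped (an assumption of this sketch).
    """
    items = [line.strip() for line in lines if line.strip()]
    Path(path).write_text("\n".join(items), encoding=encoding)
    return len(items)


if __name__ == "__main__":
    # Made-up sample names; a real run would pass the scraped list instead
    sample = ["電池パック", "", "Image processing device"]
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "names.txt"
        count = save_lines(target, sample)
        print(count, target.read_text(encoding="utf-8").splitlines())
```

Reading the file back with the same encoding is a quick way to confirm that nothing was garbled before passing the text on to the natural language processing step.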

Summary

In this article, I showed how to get data from a web page and save it to a text file using Selenium in Python. I hope it helps anyone who is starting out with Selenium.

Reference document

Selenium