In a university lecture, I was assigned to retrieve the invention names matching a search term from the Patent Information Platform and analyze them with natural language processing. Other students copied the entire HTML source of the page and used the search function of Excel or a text editor to extract only what they needed. I automated the task in Python instead: the script fetches only the data I need and also creates the text file. I am publishing that code here, partly as a memo to myself.
1. Introduction
2. What I wanted to do
3. Prerequisite knowledge
4. Preparation
5. Actual code
6. Summary
7. Reference documents
Get the required items (the invention names in the search results) from the Patent Information Platform and save them to a text file.
To read this code, you should need roughly the following:

- Basic Python syntax
- Minimal HTML knowledge

If you do not have the libraries the program requires, install them.
>> pip install requests
>> pip install selenium
You will also need ChromeDriver. If you don't have it, download the version that matches your Chrome browser and place it in the same directory as your program.
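A missing driver binary is easier to diagnose before the browser is started. The following is a minimal sketch of such a check, not part of the original script; the helper name `find_chromedriver` is hypothetical, and it assumes the binary keeps its default file name (`chromedriver`, or `chromedriver.exe` on Windows).

```python
import os
import sys


def find_chromedriver(directory):
    """Return the path of the ChromeDriver binary inside `directory`.

    Raises FileNotFoundError when the binary is missing, so the problem
    shows up before Selenium tries to start the browser.
    """
    # Assumption: the binary has its default name for the current platform.
    name = 'chromedriver.exe' if sys.platform.startswith('win') else 'chromedriver'
    candidate = os.path.join(directory, name)
    if not os.path.isfile(candidate):
        raise FileNotFoundError('ChromeDriver not found: ' + candidate)
    return candidate
```

Calling `find_chromedriver(os.getcwd())` at the start of a script would then fail early with a readable message if the file is not next to the program.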
To get all the information on the target page, the page has to be scrolled to the bottom so that the remaining results load, so I wrote a scroll function that keeps scrolling to the bottom until the page source stops changing. The main function reads the number shown for the last result at the bottom of the page and then fetches that many invention names. For how to get values from HTML, see the reference documents. Once all the text has been collected, it is saved to a text file.
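The "read the last number, then fetch that many names, then save" steps can be sketched without a browser. This is only an illustration: the helper names `invention_name_ids` and `save_names` are hypothetical, the ID prefix is the one used in the article's code, and the zero-based numbering is an assumption taken from that code (the site's markup may have changed since).

```python
# ID prefix taken from the article's code; the site may no longer use it.
ID_PREFIX = 'patentUtltyIntnlSimpleBibLst_tableView_invenName'


def invention_name_ids(last_number_text):
    """Build the element IDs for rows 0 .. N-1, where N is the last row number."""
    count = int(last_number_text.strip())
    return [ID_PREFIX + str(i) for i in range(count)]


def save_names(file_name, names):
    """Write one invention name per line as UTF-8 text."""
    with open(file_name, 'w', encoding='utf-8') as f:
        f.write('\n'.join(names))
```

With a real driver, each ID from `invention_name_ids` would be looked up on the page and the `.text` of each element collected before calling `save_names`.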
main.py
# coding: utf-8
"""
A program that fetches patent invention names from the Japan Platform for Patent Information
"""
import os
import time

import requests
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By


def scroll(driver):
    """
    Scroll to the bottom of the page until no more content loads.
    """
    html01 = driver.page_source
    while True:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # Wait for additional results to load
        html02 = driver.page_source
        if html01 != html02:
            html01 = html02
        else:
            break


def main():
    """
    Run a search for any term on the Patent Information Platform
    and collect the invention names of the matching patents.
    """
    path = os.getcwd()  # Get the current directory
    # Set up the driver (Selenium 4 style: the chromedriver path is passed through Service;
    # on Windows the file is named chromedriver.exe)
    driver = webdriver.Chrome(service=Service(os.path.join(path, 'chromedriver')))
    # Access the Patent Information Platform
    driver.get('https://www.j-platpat.inpit.go.jp/')
    # Set the word to search for
    print('What do you want to search for?')
    search_word = input()
    # Set the name of the file to create
    print('Please type a file name')
    file_name = input()
    time.sleep(2)
    # Type the search word into the simple search box and run the search
    search_box = driver.find_element(By.NAME, 's01_srchCondtn_txtSimpleSearch')
    search_box.click()
    search_box.send_keys(search_word)
    driver.find_element(By.NAME, 's01_srchBtn_btnSearch').click()
    time.sleep(5)
    # Scroll the page so that every result is loaded
    scroll(driver)
    # Get the largest result number among the search results
    id_str = driver.find_elements(By.ID, 'patentUtltyIntnlSimpleBibLst_tableView_numberArea')[-1].text
    id_num = int(id_str)
    words = []
    for i in range(id_num):
        word = driver.find_element(By.ID, 'patentUtltyIntnlSimpleBibLst_tableView_invenName{}'.format(i)).text
        words.append(word)
        print(word)
    print(words)
    # Create the text file (UTF-8 so that Japanese invention names are written safely)
    with open(file_name, 'w', encoding='utf-8') as f:
        f.write('\n'.join(words))
    # Close the browser
    driver.quit()


if __name__ == "__main__":
    main()
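The scroll function in main.py loops until the page source stops changing, so it never returns on a page that keeps changing. The following is a sketch of a bounded variant, not the code I actually used: the name `scroll_until_stable` and the `pause` and `max_rounds` parameters are hypothetical, and `FakeDriver` is a stand-in that only imitates the two driver members the function touches (`page_source` and `execute_script`), so the logic can be tried without a browser.

```python
import time


def scroll_until_stable(driver, pause=2.0, max_rounds=50):
    """Scroll to the bottom until the page source stops changing.

    Returns the number of scrolls performed. Stops after `max_rounds`
    scrolls even if the page is still changing.
    """
    previous = driver.page_source
    for rounds in range(1, max_rounds + 1):
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(pause)  # Give lazily loaded rows time to arrive
        current = driver.page_source
        if current == previous:
            return rounds  # Nothing new was loaded by this scroll
        previous = current
    return max_rounds


class FakeDriver:
    """Stand-in for a Selenium driver: the page grows for a few scrolls, then stops."""

    def __init__(self, growth_steps):
        self.growth_steps = growth_steps
        self.page_source = '<html></html>'

    def execute_script(self, script):
        # Each scroll "loads" one more row until the growth budget is used up.
        if self.growth_steps > 0:
            self.growth_steps -= 1
            self.page_source += '<p>row</p>'
```

A real Selenium driver can be passed in place of `FakeDriver`, since only `page_source` and `execute_script` are used. The cap trades completeness for a guaranteed exit, so `max_rounds` would need to be large enough for the longest result list expected.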
In this post, I showed how to use Selenium in Python to get data from a web page and save it to a text file. I hope it helps anyone who is about to start using Selenium.