This time, I would like to try web scraping with Python and Selenium, which I am currently studying. The target site is Google: the script searches for the specified keywords, collects the specified number of results, and writes three items, "Title, URL, Summary", for each result. The ultimate goal is to write that information to an SQLite database.
I am using Python 3.7.1, with Visual Studio 2019 as the development environment. For Firefox, I downloaded geckodriver as the WebDriver.
The source code is as follows. **The code below works as of April 23, 2020, but note that it may stop working if the site's markup changes in the future.**
google.py
```python
import urllib.parse

import records
from selenium.webdriver import Firefox, FirefoxOptions
from sqlalchemy.exc import IntegrityError

# Search using the following keywords
keywd = ['Python', 'Machine learning']

# The name of the database file to save the retrieved data
db = records.Database('sqlite:///google_search.db')
db.query('''CREATE TABLE IF NOT EXISTS items (
                url text PRIMARY KEY,
                title text,
                summary text NULL)''')


def store_data(url, title, summary):
    try:
        db.query('''INSERT INTO items (url, title, summary)
                    VALUES (:url, :title, :summary)''',
                 url=url, title=title, summary=summary)
    except IntegrityError:
        # Skip this item because it is already registered
        print("It already exists.")
        return False
    return True


def visit_next_page(driver, url):
    driver.get(url)
    items = driver.find_elements_by_css_selector('#rso > div.g')
    for item in items:
        tag = item.find_element_by_css_selector('div.r > a')
        link = tag.get_attribute('href')
        title = tag.find_element_by_tag_name('h3').text.strip()
        summary = item.find_element_by_css_selector('div.s span.st').text.strip()
        if store_data(link, title, summary):
            print(title, link, sep='\n', end='\n\n')


def main():
    # Target site and total number of results (search_unit * search_loop)
    base_url = "https://www.google.co.jp/"
    search_unit = 20  # Results per page (specifying 100 or more seems not to work)
    search_loop = 5
    start = 0

    # Combine the keywords into a single string
    target = ' '.join(keywd)
    # URL-encode it (the default encoding is "utf-8")
    target = urllib.parse.quote(target)

    opt = FirefoxOptions()
    # Comment out the next line if you want to watch the browser work
    opt.add_argument('-headless')
    driver = Firefox(options=opt)
    # Set the implicit wait time
    driver.implicitly_wait(10)

    # Read the results page by page
    for i in range(search_loop):
        url = "{0}search?num={1}&start={2}&q={3}".format(
            base_url, search_unit, start, target)
        start += search_unit
        print("\npage count: {0}...".format(i + 1), end='\n\n')
        visit_next_page(driver, url)

    driver.quit()


if __name__ == '__main__':
    main()
```
Since the comments in the source cover most of it, I will explain only the central scraping part.
On Google's results page, the data for one result is contained in a `div.g` element directly under `#rso`. These are collected with the following code:

```python
items = driver.find_elements_by_css_selector('#rso > div.g')
```
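As a side note, the `find_elements_by_*` style helpers used throughout this article were removed in Selenium 4. If you are running a newer Selenium, the equivalent call uses a `By` locator instead; the CSS selector itself is unchanged:

```python
from selenium.webdriver.common.by import By

# Same query as above, written with the Selenium 4 locator API
items = driver.find_elements(By.CSS_SELECTOR, '#rso > div.g')
```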
The title and the link URL are inside the `div.r > a` tag of each item, and the summary text is inside `div.s span.st`. These are extracted with code like this:

```python
tag = item.find_element_by_css_selector('div.r > a')
link = tag.get_attribute('href')
title = tag.find_element_by_tag_name('h3').text.strip()
summary = item.find_element_by_css_selector('div.s span.st').text.strip()
```
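These selectors are tied to Google's markup as of the date above, so an individual item can fail to match even when the page as a whole still loads. As a hedge, one possible tweak (not in the original code) is to rewrite the loop in `visit_next_page` so that items missing an expected element are skipped instead of crashing the whole run:

```python
from selenium.common.exceptions import NoSuchElementException

for item in items:
    try:
        tag = item.find_element_by_css_selector('div.r > a')
        link = tag.get_attribute('href')
        title = tag.find_element_by_tag_name('h3').text.strip()
        summary = item.find_element_by_css_selector('div.s span.st').text.strip()
    except NoSuchElementException:
        # This item's markup differs from what we expect; skip it
        continue
    if store_data(link, title, summary):
        print(title, link, sep='\n', end='\n\n')
```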
To actually use the script, specify the keywords you want to search for and the database file name at the beginning of the source code:
```python
# Search using the following keywords
keywd = ['Python', 'Machine learning']

# The name of the database file to save the retrieved data
db = records.Database('sqlite:///google_search.db')
```
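The article does not show it, but since the results end up in an ordinary SQLite file, you can read them back through the same `records` database handle. The SELECT below is my own example, not part of the original script:

```python
# Query the saved results back out of google_search.db
rows = db.query('SELECT url, title, summary FROM items')
for row in rows:
    # records exposes each column as an attribute on the row
    print(row.title)
    print(row.url)
    print(row.summary, end='\n\n')
```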
Also, because `url` is the table's primary key, running the same program multiple times will not register the same URL twice; duplicates are simply skipped:
```python
def store_data(url, title, summary):
    try:
        db.query('''INSERT INTO items (url, title, summary)
                    VALUES (:url, :title, :summary)''',
                 url=url, title=title, summary=summary)
    except IntegrityError:
        # Skip this item because it is already registered
        print("It already exists.")
        return False
    return True
```
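Relying on the PRIMARY KEY constraint and catching `IntegrityError` keeps the duplicate handling in one place. An alternative worth knowing about (not used in the original) is SQLite's `INSERT OR IGNORE`, which silently drops duplicate rows instead of raising, though you then need an extra step if you want to know whether a row was actually inserted:

```python
def store_data_or_ignore(url, title, summary):
    # INSERT OR IGNORE never raises on a duplicate key; it just does nothing
    db.query('''INSERT OR IGNORE INTO items (url, title, summary)
                VALUES (:url, :title, :summary)''',
             url=url, title=title, summary=summary)
```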
At first I thought I could get about 1,000 results in a single page, but according to Google's specifications the limit seems to be about 100 results per page.
So the script scrapes while moving to the next page, using the two variables `search_unit` and `search_loop`.
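To make the pagination concrete, here is what the generated URLs look like with the defaults (`search_unit = 20`, `search_loop = 5`); the loop simply advances the `start` parameter by `search_unit` each time. The keyword string below is the URL-encoded form of the default keywords:

```python
base_url = "https://www.google.co.jp/"
search_unit, search_loop = 20, 5
target = "Python%20Machine%20learning"  # urllib.parse.quote('Python Machine learning')

start = 0
for i in range(search_loop):
    print("{0}search?num={1}&start={2}&q={3}".format(base_url, search_unit, start, target))
    start += search_unit
# .../search?num=20&start=0&q=Python%20Machine%20learning
# .../search?num=20&start=20&q=Python%20Machine%20learning
# ... and so on, up to start=80, for 100 results in total
```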
You might ask why I did not use Beautiful Soup. I wanted practice with Selenium, and since so many sites rely on JavaScript these days, it seems there will be more and more opportunities to use it, so I deliberately scraped this way.
You are free to use the source code introduced this time, but please do so at your own risk.
[Basics and Practice of Python Scraping / Seppe vanden Broucke et al.](https://www.amazon.co.jp/dp/4295005282)