Until now I had only been solving problems on paiza and AtCoder, and I wanted to try some actual programming. I was also inspired by this article (most of what you need to build a web service can be learned through scraping). This time, I scrape the companies listed on Rikunabi Direct and register them in a DB.
Rikunabi Direct is a service that, based on the industries of interest registered by a job seeker, picks out and introduces a handful of companies per week that it judges to be a good match. Job seekers cannot run searches themselves. A list of all listed companies is provided, but reading through more than 16,000 of them from one end to the other would be extremely time-consuming, if not reckless. So I decided to scrape the company information and register it in a DB to make it searchable.
Since scraping one company takes about 2.5 seconds, I launched two PhantomJS instances and sped things up with parallel processing.
main.py
import utils
import os
import MySQLdb
import time
import selenium
import settings
import numpy as np
from concurrent.futures import ThreadPoolExecutor
from selenium import webdriver
utils.py
import sys
import requests
import csv
import os
import MySQLdb
import settings
import time
import selenium
from selenium import webdriver
main.py
# login credentials for the two accounts
USER_IDs = [USER_ID, USER_ID_2]
PASS_WORDs = [PASS_WORD, PASS_WORD_2]

NUMBER_OF_BROWSERS = 2
browser = utils.generate_browser()
browser_2 = utils.generate_browser()
browser_list = [browser, browser_2]
browsers = []
for i in range(NUMBER_OF_BROWSERS):
    browsers.append([browser_list[i], USER_IDs[i], PASS_WORDs[i]])
utils.py
PHANTOMJS_PATH = '/usr/local/bin/phantomjs'

def generate_browser():
    browser = webdriver.PhantomJS(executable_path=PHANTOMJS_PATH)
    print('PhantomJS initializing')
    browser.implicitly_wait(3)
    return browser
implicitly_wait() makes the driver wait up to 3 seconds when it looks up an element, which also gives PhantomJS time to finish starting up and render the page.
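The implicit wait applies to every element lookup. If you instead want to wait for one specific element, selenium also offers explicit waits. A minimal sketch, assuming a placeholder URL (not the real login page):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.PhantomJS(executable_path='/usr/local/bin/phantomjs')
browser.get('https://example.com/login/')  # placeholder URL
# wait up to 10 seconds for the element named 'accountId' to appear in the DOM
element = WebDriverWait(browser, 10).until(
    EC.presence_of_element_located((By.NAME, 'accountId'))
)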
main.py
TIME_TO_WAIT = 5
for browser_param in browsers:
    utils.login(browser_param[1], browser_param[2], browser_param[0])
    utils.set_wait_time(TIME_TO_WAIT, browser_param[0])
    utils.check_current_url(browser_param[0])
utils.py
def set_wait_time(time, browser):
    browser.set_page_load_timeout(time)

def login(user, pass_word, browser):
    # Access the login page
    url_login = 'https://rikunabi-direct.jp/2020/login/'
    browser.get(url_login)
    # Access check: exit if the page cannot be reached
    test = requests.get(url_login)
    status_code = test.status_code
    if status_code == 200:
        print('HTTP status code ' + str(status_code) + ': reached the login page')
    else:
        print('HTTP status code ' + str(test.status_code) + ': could not reach the login page')
        sys.exit()
    time.sleep(5)
    # Enter the user ID and password
    # user ID
    element = browser.find_element_by_name('accountId')
    element.clear()
    element.send_keys(user)
    print('entered the user ID')
    # password
    element = browser.find_element_by_name('password')
    element.clear()
    element.send_keys(pass_word)
    print('entered the password')
    # Submit
    submit = browser.find_element_by_xpath("//img[@alt='Login']")
    submit.click()
    print('pressed the login button')
browser.get(url) navigates to the page at url, and browser.find_element_by_name() finds an element by the value of its name attribute. The element to find looks like this:
<input type="text" name="accountId" autocomplete="off" value="" ...
The element is fetched with browser.find_element_by_name('accountId'). clear() empties the fetched text box, and send_keys() types in the user ID. The password is entered the same way. The submit button is located by XPath; the XPath comes from Chrome's developer tools: inspect the element, then right click > Copy > Copy XPath. (At first I did not know this handy trick and wrote absolute paths like /html/body/..., which caused me a lot of trouble.) For the find_element_by_ methods, I referred to the links in the references below.
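Incidentally, the same element can usually be located in several equivalent ways; which locator to use is largely a matter of taste. A small sketch (only the accountId name comes from the actual page):

element = browser.find_element_by_name('accountId')                         # by name attribute
element = browser.find_element_by_xpath("//input[@name='accountId']")       # relative XPath
element = browser.find_element_by_css_selector("input[name='accountId']")   # CSS selector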
main.py
# If the csv does not exist, collect all URLs into an array and export them as csv
if not os.path.exists(URL_PATH):
    utils.move_to_company_list(browsers[0][0])
    url_arr = utils.get_url(NUMBER_OF_COMPANY, browsers[0][0])
    utils.export_csv(url_arr, URL_PATH)
    utils.browser_close(browsers[0][0])
else:
    # If the csv exists, read it and store the URLs in url_arr
    url_arr = utils.import_csv(URL_PATH)
utils.py
def move_to_company_list(browser):
    # Go to the page listing all companies
    element = browser.find_element_by_link_text('All listed companies')
    element.click()
    # The page opens in another tab, so switch to the second tab
    browser.switch_to_window(browser.window_handles[1])

# Get the URLs of all listed companies
def get_url(number_of_company, browser):
    url_arr = []
    for i in range(2, number_of_company):
        url_xpath = '/html/body/div/div/table/tbody/tr[{0}]/td/ul/li/a'.format(i)
        element = browser.find_element_by_xpath(url_xpath)
        url = element.get_attribute('href')
        url_arr.append(url)
        print(str(i))
        print(url)
    return url_arr

# Export an array to CSV
def export_csv(arr, csv_path):
    with open(csv_path, 'w') as f:
        writer = csv.writer(f, lineterminator='\n')
        writer.writerow(arr)

# Close the current tab and return to the first tab
def browser_close(browser):
    browser.close()
    browser.switch_to_window(browser.window_handles[0])

def import_csv(csv_path):
    if os.path.exists(csv_path):
        with open(csv_path, 'r') as f:
            data = list(csv.reader(f))  # two-dimensional array; element 0 is the array of URLs
        return data[0]
    else:
        print('csv does not exist')
        sys.exit()
Fetching the URLs takes a long time, so once they have been fetched they are written out to csv and reused from there on subsequent runs.
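Because writerow(arr) writes the whole array as a single CSV row, csv.reader later returns a two-dimensional list whose first row is that array, which is why import_csv() returns data[0]. A quick round-trip sketch with dummy values and a throwaway path:

import csv

urls = ['https://example.com/a', 'https://example.com/b']  # dummy URLs
with open('/tmp/url_list.csv', 'w') as f:
    csv.writer(f, lineterminator='\n').writerow(urls)      # one row holding every URL

with open('/tmp/url_list.csv', 'r') as f:
    data = list(csv.reader(f))   # [['https://example.com/a', 'https://example.com/b']]
print(data[0])                   # the original list of URLs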
main.py
url_arrs = list(np.array_split(url_arr, NUMBER_OF_BROWSERS))
for i in range(NUMBER_OF_BROWSERS):
    print('length of array{0} : '.format(i) + str(len(url_arrs[i])))
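np.array_split is convenient here because it also accepts a length that is not evenly divisible: it returns NUMBER_OF_BROWSERS chunks whose sizes differ by at most one. A tiny sketch with dummy values:

import numpy as np

urls = ['url{}'.format(i) for i in range(7)]   # 7 dummy URLs
chunks = list(np.array_split(urls, 2))
print(len(chunks[0]), len(chunks[1]))          # 4 3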
main.py
connector = MySQLdb.connect(
    unix_socket=DB_UNIX_SOCKET,
    host=DB_HOST, user=DB_USER, passwd=DB_PASS_WORD, db=DB_NAME
)
cursor = connector.cursor()
main.py
# Run the scraping process in each browser (parallel processing)
with ThreadPoolExecutor(max_workers=2, thread_name_prefix="thread") as executor:
    for i in range(NUMBER_OF_BROWSERS):
        executor.submit(utils.scraping_process, browsers[i][0], url_arrs[i], cursor, connector)
utils.py
def open_new_page(url, browser):
    try:
        browser.execute_script('window.open()')
        browser.switch_to_window(browser.window_handles[1])
        browser.get(url)
    except selenium.common.exceptions.TimeoutException:
        browser_close(browser)
        print('connection timeout')
        print('retrying ...')
        time.sleep(5)  # wait a little before retrying
        open_new_page(url, browser)

def content_scraping(cursor, connector, browser):
    # Locate the scraping targets
    name_element = browser.find_element_by_class_name('companyDetail-companyName')
    position_element = browser.find_element_by_xpath('//div[@class="companyDetail-sectionBody"]/p[1]')
    job_description_element = browser.find_element_by_xpath('//div[@class="companyDetail-sectionBody"]/p[2]')
    company_name = name_element.text
    position = position_element.text
    job_description = job_description_element.text
    url = browser.current_url
    casual_flag = is_exist_casual(browser)
    # ---------- DB registration below ---------- #
    # INSERT
    cursor.execute('INSERT INTO company_data_2 SET name="{0}", url="{1}", position="{2}", description="{3}", is_casual="{4}"'.format(company_name, url, position, job_description, casual_flag))
    connector.commit()

def scraping_process(browser, url_arr, cursor, connector):
    count = 0
    for url in url_arr:
        open_new_page(url, browser)
        print('{0} scraping start'.format(count))
        check_current_url(browser)
        try:
            content_scraping(cursor, connector, browser)
        except selenium.common.exceptions.NoSuchElementException:
            print('this company is no longer listed')
        except MySQLdb._exceptions.ProgrammingError:
            print('SQL programming error')
        browser_close(browser)
        print('{0} scraping process end.'.format(count))
        count += 1
The text inside a fetched element is available via element.text. In open_new_page(), if the expected exception (TimeoutException here) occurs, the function waits a little and then calls itself recursively to retry the connection. I learned about recursive functions while working through AtCoder problems; until now I had no idea where they would be used in real code, but this time I came up with a use for one in my own program.
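One caveat: as written, open_new_page() keeps retrying forever if a page never loads. A sketch of a variant with an upper bound on the retries; the max_retry parameter is my own addition, not part of the original code:

def open_new_page(url, browser, retry=0, max_retry=3):
    # hypothetical variant with a retry limit (max_retry is not in the original code)
    try:
        browser.execute_script('window.open()')
        browser.switch_to_window(browser.window_handles[1])
        browser.get(url)
    except selenium.common.exceptions.TimeoutException:
        browser_close(browser)
        if retry >= max_retry:
            print('giving up on ' + url)
            return
        print('connection timeout, retrying ...')
        time.sleep(5)
        open_new_page(url, browser, retry + 1, max_retry)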
There are five pieces of information to scrape: company name, URL, job type, job description, and whether casual dress is allowed at work. Each element is fetched by its XPath or by the value of its name/class attribute.
Some listed companies have since stopped posting. In that case find_element_by_ tries to fetch an element that does not exist and raises NoSuchElementException, so that exception is caught here. ProgrammingError is raised by MySQLdb when job_description and the like contain single or double quotes. Can someone tell me how to do something like PHP PDO's prepared statements?
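One way to get much the same effect as PDO's prepared statements with MySQLdb is to pass %s placeholders and a tuple of values to cursor.execute(); the driver then escapes the quotes itself. A sketch using the same table and columns as above:

# placeholder-style INSERT: quotes inside job_description no longer break the SQL
cursor.execute(
    'INSERT INTO company_data_2 (name, url, position, description, is_casual) '
    'VALUES (%s, %s, %s, %s, %s)',
    (company_name, url, position, job_description, casual_flag)
)
connector.commit()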
scraping_process(), which actually runs the job, follows the flow of opening the target page > locating and fetching the elements > registering them in the DB. The slowest part of this flow is from opening a page to being able to fetch the first element, because it takes a long time for the page to render after it is opened. To mitigate this slowdown, the work is parallelized with ThreadPoolExecutor from concurrent.futures: while browser 1 is waiting for a page to render, browser 2 can keep working. This made it considerably faster than processing with a single browser.
In the end I was able to scrape nearly 16,000 companies. I learned a lot from writing a recursive function on my own, studying the HTML structure to work out the XPaths, and wrapping the final processing in a function so that it could be parallelized.
References:
- Selenium Python Bindings 4. Find Elements
- [Python] Selenium usage memo
- Summary of how to select elements in Selenium
- Selenium API (reverse lookup)
- Check the existence of the file with python
- Take and verify XPath in Chrome
- Parallel task execution using concurrent.futures in Python
- I thoroughly investigated the parallel and concurrent processing of Python
- I operated MySQL from Python3 on Mac