The target site is built with React, so I use Selenium, which drives a real browser and can execute JavaScript.
Review the file structure:

```
├── app
│   ├── drivers          # place the Selenium driver here
│   └── source
│       └── scraping.py  # scraping / processing
└── tmp
    ├── files
    │   └── download     # files downloaded by scraping go here
    └── logs             # logs (Selenium log, etc.)
```
You need to decide which browser Selenium will drive. There are likely three candidates:

- PhantomJS: was once the dominant headless browser
- Chrome: headless mode was added recently and is becoming the mainstream choice
- Firefox: the browser Selenium supported first
At first I used Chrome, but it didn't play well with Xvfb (described later), so I settled on Firefox. Download the driver from the above URL and place it under /drivers/. [^1]
Also, to run Firefox's geckodriver, Firefox itself must be installed on the OS. If you haven't installed it yet, download it from the official site.
[^1]: I believe I downloaded the latest version (the macOS build)... Also, on macOS it seems that if you place the driver in a standard location you don't have to specify its path at startup, but I don't use that method this time.
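As a quick sanity check (my own addition, not part of the original steps), you can confirm the driver binary is in place and executable before launching Selenium. The path below is assumed from the directory tree above; adjust it to your layout.

```python
import os

def driver_ready(driver_path):
    """Return True if the driver binary exists and is executable."""
    return os.path.isfile(driver_path) and os.access(driver_path, os.X_OK)

# Path assumed from the directory tree above; adjust to your layout.
print(driver_ready(os.path.join('app', 'drivers', 'geckodriver')))
```

If this prints `False`, Selenium will fail at startup anyway, so it is cheaper to catch the problem here.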
Finally, the coding. To start, prepare the download destination folder.
scraping.py

```python
date = datetime.now().strftime('%Y%m%d')  # e.g. 20200101
dldir_name = os.path.abspath(__file__ + '/../../../tmp/files/download/{}/'.format(date))
dldir_path = Path(dldir_name)
dldir_path.mkdir(parents=True, exist_ok=True)  # create the dated folder (and parents) if missing
download_dir = str(dldir_path.resolve())
```
The import statements are collected together at the end. ... The code is a bit verbose, but it works, so I'm fine with it.
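For reference, the same folder preparation can be written more concisely with pathlib alone. In this sketch `base` stands in for the project root; in scraping.py it would be `Path(__file__).resolve().parents[2]` (app/source/ → app/ → root). `tempfile` is used here only so the sketch runs anywhere.

```python
from datetime import datetime
from pathlib import Path
import tempfile

# `base` is a stand-in for the project root so this sketch is runnable;
# in scraping.py it would be Path(__file__).resolve().parents[2].
base = Path(tempfile.mkdtemp())
download_dir = base / 'tmp' / 'files' / 'download' / datetime.now().strftime('%Y%m%d')
download_dir.mkdir(parents=True, exist_ok=True)  # creates the whole chain in one call
print(download_dir.is_dir())  # True
```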
Next, I'll describe getting as far as starting geckodriver from Selenium. With Firefox, a download dialog appears by default, so you need to set several options to suppress it.
scraping.py

```python
driver_path = os.path.abspath(__file__ + '/../../drivers/geckodriver')  # location of the driver
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.folderList", 2)  # 2 = download to a custom folder
fp.set_preference("browser.download.dir", download_dir)
fp.set_preference("browser.download.manager.showWhenStarting", False)  # not sure what this does; kept just in case
fp.set_preference("browser.helperApps.neverAsk.saveToDisk",
                  "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet")
log_path = os.path.abspath(__file__ + '/../../../tmp/logs/geckodriver.log')  # otherwise the log file is created next to the script
driver = webdriver.Firefox(firefox_profile=fp, executable_path=driver_path, service_log_path=log_path)
driver.maximize_window()  # at small window sizes the side menu disappears...
driver.implicitly_wait(10)  # error out if an element doesn't appear within 10 seconds
```
Be careful with the `helperApps.neverAsk.saveToDisk` option. Only the MIME types listed there skip the "Do you want to download?" dialog. The xls file downloaded this time turned out to be **application/vnd.openxmlformats-officedocument.spreadsheetml.sheet**. [^2]
[^2]: The official MIME type for xls is different, but you need to specify the MIME type of the file that is actually downloaded.
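The preference value is a comma-separated list, so several MIME types can be suppressed at once. A sketch; only the xlsx type comes from the setup above, the other two are illustrative additions:

```python
# Build the comma-separated value for browser.helperApps.neverAsk.saveToDisk.
# Only the first (xlsx) type comes from the setup above; the rest are examples.
mime_types = [
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",  # xlsx
    "application/vnd.ms-excel",  # classic xls
    "text/csv",
]
pref_value = ",".join(mime_types)
print(pref_value.count(",") + 1)  # 3 types packed into one preference string
```

The resulting string is then passed to `fp.set_preference("browser.helperApps.neverAsk.saveToDisk", pref_value)` exactly as in the profile setup above.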
By the way, with Chrome it was this easy:
scraping.py

```python
driver_path = os.path.abspath(__file__ + '/../../drivers/chromedriver')
options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {"download.default_directory": download_dir})
driver = webdriver.Chrome(executable_path=driver_path, options=options)
```
On this site, I was able to log in as follows.
scraping.py

```python
# Log in
driver.get(LOGIN_URL)
mail_address = driver.find_element_by_id("mail_address")
mail_address.send_keys(config.mail_address)
password = driver.find_element_by_id("password")
password.send_keys(config.password)
submit_button = driver.find_element_by_css_selector("button.submit")
submit_button.click()
```
... However, a reCAPTCHA appears on click, so it has to be cleared manually. I plan to solve it with 2captcha eventually, but I haven't gotten to that yet, so for now I clear it myself whenever it appears.
To make the process wait until the CAPTCHA is cleared, insert an explicit wait. The example below waits up to 100 seconds.
scraping.py

```python
try:
    WebDriverWait(driver, 100).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "<an element displayed after login>"))
    )
except TimeoutException:
    driver.quit()
    exit('Error: failed to log in')
```
Now it's finally time to get the data. I did it like this.
scraping.py

```python
# Search with the search word
search_box = driver.find_element_by_xpath(SEARCH_BOX_XPATH)
search_box.clear()
search_box.click()
search_box.send_keys(word)  # type into the search box
time.sleep(INTERVAL)  # wait for React to finish processing
search_box.send_keys(Keys.RETURN)  # Return key -> the first search result is selected

# Open the menu
try:
    driver.find_element_by_xpath(MENU_XPATH).click()
except ElementNotInteractableException:
    # The menu can't be clicked -> the accordion is not open yet
    driver.find_element_by_xpath(MENU_OPEN_XPATH).click()
    driver.find_element_by_xpath(MENU_XPATH).click()

# Download
driver.find_element_by_xpath(DOWNLOAD_XPATH).click()
```
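As an aside, the fixed `time.sleep(INTERVAL)` above can also be written as a small polling loop that retries until a condition holds, which wastes less time when the page finishes quickly. A generic sketch (pure Python, my own helper, not part of Selenium):

```python
import time

def wait_until(condition, timeout=10.0, interval=0.2):
    """Poll `condition` until it returns a truthy value or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = condition()
        if result:
            return result
        time.sleep(interval)
    raise TimeoutError('condition not met within {:.1f} seconds'.format(timeout))

# Usage sketch: the condition becomes true on the third poll.
state = {'polls': 0}
def page_ready():
    state['polls'] += 1
    return state['polls'] >= 3

print(wait_until(page_ready, timeout=1.0, interval=0.01))  # True
```

With a real driver, `condition` could be a lambda that checks the DOM, e.g. whether the search suggestions have rendered.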
- With recent reactive front-end frameworks, elements are hard to pin down with CSS selectors, so I basically used XPath to locate them.
- With React and the like, there were many cases where an operation failed unless I waited for the JavaScript to finish. Make good use of `time.sleep()`.
- When an element cannot be interacted with, `ElementNotInteractableException` is raised. There doesn't seem to be a method that checks "does this element exist?", so make good use of this exception.
- On an ordinary site you can often download files just by requesting the URL, without clicking anything... but on this site, clicking was required.
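On the "does this element exist?" point: one common workaround is `find_elements` (plural), which returns an empty list instead of raising when nothing matches. A sketch with a stub standing in for the real driver, so it runs without a browser; `MENU_XPATH` here is just a hypothetical placeholder:

```python
class StubDriver:
    """Minimal stand-in for a WebDriver, just to illustrate the pattern."""
    def find_elements_by_xpath(self, xpath):
        # A real driver returns [] when nothing matches, instead of raising.
        return []

MENU_XPATH = '//nav//a[text()="menu"]'  # hypothetical placeholder

driver = StubDriver()
elements = driver.find_elements_by_xpath(MENU_XPATH)
print(bool(elements))  # False -> the element is absent, no exception needed
```

The try/except approach in the code above works just as well; this is simply an alternative when you want a plain boolean check.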
Putting together the pieces introduced so far completes the download part! Finally, here are the import statements.
scraping.py

```python
import os
import time
from pathlib import Path
from datetime import datetime
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait, Select
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import ElementNotInteractableException, TimeoutException
```
More on that another time.