We sometimes embed tags in the HTML of web pages to collect specific data, and we use automated tests to check that the embedded tags are correct.
How to search HTML data using Beautiful Soup
For static pages, I could run these tests with Beautiful Soup alone, but I could not get the HTML data for screens with stronger security, such as SSL-enabled screens.
Therefore, when the HTML data cannot be obtained with Beautiful Soup, we decided to use Selenium to navigate to the target screen and get the HTML from there.
Below is a program that gets HTML data using Beautiful Soup and Selenium.
test.py
import re
import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

# Start Chrome
driver = webdriver.Chrome()
# Open a screen from which you can move to the screen that Beautiful Soup alone could not fetch
driver.get("test.html")
driver.find_element(By.CSS_SELECTOR, "test").click()
# Once on the target screen, hand the rendered HTML to Beautiful Soup
source = driver.page_source
soup = BeautifulSoup(source, 'html.parser')
# Find <script> tags whose content matches the pattern
elems = soup.find_all("script", string=re.compile("test"))
# Move to the next screen
driver.find_element(By.CSS_SELECTOR, "test").click()
…
To parse the HTML data, you can use Beautiful Soup exactly as you normally would.
source = driver.page_source
soup = BeautifulSoup(source,'html.parser')
All you need is Selenium's page_source attribute to get the HTML data.
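As a small illustration of the parsing step on its own, here is a minimal sketch that runs without a browser. SAMPLE_SOURCE is a made-up stand-in for driver.page_source, and find_scripts_containing is a hypothetical helper name, not something from the original program.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in for the string returned by driver.page_source
SAMPLE_SOURCE = """
<html><head>
<script>var tracker = "test-tag";</script>
<script>console.log("unrelated");</script>
</head><body><p>hello</p></body></html>
"""


def find_scripts_containing(source, pattern):
    """Return the <script> tags whose content matches the regex pattern."""
    soup = BeautifulSoup(source, "html.parser")
    return soup.find_all("script", string=re.compile(pattern))


matches = find_scripts_containing(SAMPLE_SOURCE, "test")
for tag in matches:
    print(tag.string)
```

In a real test, you would pass driver.page_source instead of SAMPLE_SOURCE and assert on the tags that come back.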
Write a program like the one above for each screen you need, and you're done. Note that the program above opens a Chrome window when it runs, so it may be better to run Chrome in headless mode. (I don't use headless mode much myself, because Selenium often stops with an error...)
Reference: I tried using Headless Chrome from Selenium