The Network tab of Chrome's developer tools (opened with Ctrl + Shift + I on Windows) is a handy tool that shows a timeline of the resources the browser fetched and lets you simulate different connection speeds.
This time, I will simply get the list of URLs of the files shown in this Network tab, using Python + Selenium.
- Chrome 79.0.3945.45 beta
- Python 3.7.3
- selenium 3.141.0
- chromedriver-binary 79.0.3945.36.0
- Debian GNU/Linux 9 (Docker container)
The code up to the point where Selenium loads the page is shown below. Set options such as headless mode as appropriate. The page is loaded with driver.get(); the following article was very helpful for the basics.
- Automatic operation of Chrome with Python + Selenium
netlogs.py
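import chromedriver_binary  # noqa: F401  (adds the bundled chromedriver to PATH)
from selenium.webdriver import Chrome, ChromeOptions
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# enable the performance log (copy so the shared class-level dict is not mutated)
caps = DesiredCapabilities.CHROME.copy()
caps["goog:loggingPrefs"] = {"performance": "ALL"}
# caps["loggingPrefs"] = {"performance": "ALL"}  # key name needed in some environments
# options
options = ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-gpu')
options.add_argument('--ignore-certificate-errors')
# _headers is a dict defined elsewhere that holds the User-Agent string to send
options.add_argument('--user-agent=' + _headers["User-Agent"])
# get driver
driver = Chrome(options=options, desired_capabilities=caps)
driver.implicitly_wait(5)
driver.get("https://qiita.com/")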
The log that contains the URLs is named "performance", so enable it through DesiredCapabilities [^1] and pass those capabilities when creating the driver [^2].
The key name to set in DesiredCapabilities depends on the environment.
In one case it only worked with "loggingPrefs" instead of "goog:loggingPrefs".
Perhaps it differs depending on the Chrome version...?
netlogs.py
time.sleep(2)
Wait until the page has finished loading. The conventional approach seems to be driver.implicitly_wait(), but I added a sleep because I could not reliably get the data I wanted that way. Please let me know if there is a smarter way...
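One possible alternative to a fixed sleep, offered only as a sketch: driver.implicitly_wait() affects element lookups, so it does not wait for network activity to settle. The hypothetical helper collect_logs_until_idle below polls driver.get_log("performance") and stops once no new entries have arrived for a while. It assumes that each get_log call returns only the entries recorded since the previous call; the idle, timeout, and poll values are arbitrary choices, not recommendations.

```python
import time


def collect_logs_until_idle(driver, idle=1.0, timeout=10.0, poll=0.2):
    """Gather performance-log entries until none arrive for `idle` seconds.

    Hypothetical helper: `driver` only needs a get_log(name) method.
    Gives up after roughly `timeout` seconds and returns what was collected.
    """
    collected = []
    start = time.monotonic()
    last_activity = start
    while True:
        batch = driver.get_log("performance")
        now = time.monotonic()
        if batch:
            collected.extend(batch)
            last_activity = now
        elif now - last_activity >= idle:
            break  # the log has been quiet for long enough
        if now - start >= timeout:
            break  # the page never went quiet; stop anyway
        time.sleep(poll)
    return collected
```

With this helper, `netLog = collect_logs_until_idle(driver)` would stand in for both time.sleep(2) and the driver.get_log("performance") call that follows it.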
netlogs.py
netLog = driver.get_log("performance")
The log returned by driver.get_log("performance") is in a JSON-like format and looks like the following.
performance
[
  {'level': 'INFO', 'message': '{
      "message": {
        "method": "Page.frameResized",
        "params": {}
      },
      "webview": "***"
    }', 'timestamp': ***
  },
  {'level': 'INFO', 'message': '{
  ...
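To make the nesting concrete, here is a small sketch that decodes one entry shaped like the output above. The sample entry is made up for illustration (the "webview" and "timestamp" values are placeholders, not real log data). The point is that the outer "message" value is a JSON string, which has to be decoded before the inner "message" dictionary with "method" and "params" can be read.

```python
import json

# Made-up entry mirroring the structure shown above; values are placeholders.
sample_entry = {
    "level": "INFO",
    "message": json.dumps({
        "message": {"method": "Page.frameResized", "params": {}},
        "webview": "placeholder-id",
    }),
    "timestamp": 0,
}

outer = json.loads(sample_entry["message"])  # JSON string -> dict
inner = outer["message"]                     # the DevTools event itself
print(inner["method"])  # Page.frameResized
print(inner["params"])  # {}
```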
We will extract only the necessary parts from the acquired performance log.
netlogs.py
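import json

def process_browser_log_entry(entry):
    # each entry's 'message' is itself a JSON string; decode it and unwrap the inner message
    response = json.loads(entry['message'])['message']
    return response

events = [process_browser_log_entry(entry) for entry in netLog]
# keep only events whose method name contains 'Network.response'
events = [event for event in events if 'Network.response' in event['method']]

detected_url = []
for item in events:
    # not every matching event has params.response.url, so check before reading it
    if "response" in item["params"]:
        if "url" in item["params"]["response"]:
            detected_url.append(item["params"]["response"]["url"])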
From the "message" properties, only those whose "method" name contains Network.responseReceived are extracted.
The extracted events are then a collection of items like the following. After that, I look for items that contain "url" under "params" => "response", extract it, and store it in detected_url.
network.response
[
  {
    "method": "Network.responseReceivedExtraInfo",
    "params": {
      "blockedCookies": [],
      "headers": {
        "cache-control": "max-age=0, private, must-revalidate",
        "content-encoding": "gzip",
        "content-type": "text/html; charset=utf-8",
        "date": "Sat, 23 Nov 2019 07:41:40 GMT",
        "etag": "W/\"***\"",
        "referrer-policy": "strict-origin-when-cross-origin",
        "server": "nginx",
        "set-cookie": "***",
        "status": "200",
        "strict-transport-security": "max-age=2592000",
        "x-content-type-options": "nosniff",
        "x-download-options": "noopen",
        "x-frame-options": "SAMEORIGIN",
        "x-permitted-cross-domain-policies": "none",
        "x-request-id": "***",
        "x-runtime": "***",
        "x-xss-protection": "1; mode=block"
      },
      "requestId": "***"
    }
  },
  {
  ...
netlogs.py
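import json
import time

import chromedriver_binary  # noqa: F401  (adds the bundled chromedriver to PATH)
from selenium.webdriver import Chrome, ChromeOptions
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

# enable the performance log
caps = DesiredCapabilities.CHROME.copy()
caps["goog:loggingPrefs"] = {"performance": "ALL"}

options = ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-gpu')
options.add_argument('--ignore-certificate-errors')
# _headers is a dict defined elsewhere that holds the User-Agent string to send
options.add_argument('--user-agent=' + _headers["User-Agent"])

driver = Chrome(options=options, desired_capabilities=caps)
driver.implicitly_wait(5)
driver.get("https://qiita.com/")
time.sleep(2)  # give the page time to finish its requests
netLog = driver.get_log("performance")

def process_browser_log_entry(entry):
    # each entry's 'message' is itself a JSON string; decode it and unwrap the inner message
    response = json.loads(entry['message'])['message']
    return response

events = [process_browser_log_entry(entry) for entry in netLog]
events = [event for event in events if 'Network.response' in event['method']]

detected_url = []
for item in events:
    if "response" in item["params"]:
        if "url" in item["params"]["response"]:
            detected_url.append(item["params"]["response"]["url"])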
It also seems possible to get the same information by executing a script in the page [^3].
netlogs_js.py
import json

# collect the page's Performance API entries via JavaScript and return them as a JSON string
scriptToExecute = "var performance = window.performance || window.mozPerformance || window.msPerformance || window.webkitPerformance || {}; var network = performance.getEntries() || {}; return JSON.stringify(network);"
netData = driver.execute_script(scriptToExecute)
netJson = json.loads(str(netData))
detected_url = []
for item in netJson:
    # for resource entries, "name" holds the URL
    detected_url.append(item["name"])
I was able to get the URL list with this method as well.
However, the file I wanted was sometimes missing, so it does not feel like a stable method. (I have not verified this properly.)
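One thing that may help when inspecting these results, as a sketch only: performance.getEntries() returns every kind of performance entry, not just network requests, so entries such as paint timings have a "name" that is not a URL. The hypothetical helper below keeps only entries whose "entryType" is "navigation" or "resource". The sample JSON is made up for illustration. Browsers also limit how many resource timing entries they buffer, which might be one reason a file goes missing, but that is a guess and not something confirmed here.

```python
import json


def urls_from_performance_entries(entries_json):
    """Return the names (URLs) of navigation and resource entries only."""
    entries = json.loads(entries_json)
    return [
        e["name"]
        for e in entries
        if e.get("entryType") in ("navigation", "resource")
    ]


# Made-up data in the shape JSON.stringify(performance.getEntries()) produces.
sample = json.dumps([
    {"name": "https://example.com/", "entryType": "navigation"},
    {"name": "first-paint", "entryType": "paint"},
    {"name": "https://example.com/style.css", "entryType": "resource"},
])

print(urls_from_performance_entries(sample))
# ['https://example.com/', 'https://example.com/style.css']
```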
Please point out if there is a better way!
[^1]: I referred to this (almost a copy): [Selenium - python. How to capture network traffic's response [duplicate]](https://stackoverflow.com/questions/52633697/selenium-python-how-to-capture-network-traffics-response)