When I crawled a web API that is called during page rendering, I was blocked by CORS

Background

I was running a Python 3 program that crawls a web page, and one day I got an error like this.

Access to XMLHttpRequest at 'https://target' from origin 'https://xxxxxxxxx' has been blocked by CORS policy: Response to preflight request doesn't pass access control check: No 'Access-Control-Allow-Origin' header is present on the requested resource.

The implementation at the time was as follows.

import requests
res = requests.get("https://target") #webapi URL

The page I was crawling loads and displays its data with bootstrap and similar libraries. My actual crawl target was the web API that is called from bootstrap while the page renders.

Solution

Crawl using selenium-wire. The plain selenium webdriver can only handle the rendered web page, but selenium-wire also gives you access to the requests made during rendering and their responses. https://pypi.org/project/selenium-wire/

from seleniumwire import webdriver

driver = webdriver.Chrome()
driver.get("https://target") #URL of TOP page

for request in driver.requests:
    # Narrow down to the URL whose result you want (the webapi URL);
    # skip requests that have not received a response
    if "xxxxx" in request.url and request.response:
        response_text = request.response.body.decode()
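
The filtering loop above can also be pulled into a small helper. The following is only a minimal sketch, not part of the original implementation: `collect_bodies` and its `url_fragment` parameter are hypothetical names, and the code relies only on the `url`, `response`, and `response.body` attributes used above. It assumes the body is uncompressed UTF-8 text, and it uses stand-in objects so that it runs without a browser.

```python
from types import SimpleNamespace


def collect_bodies(requests, url_fragment):
    """Return decoded response bodies of requests whose URL contains url_fragment.

    `requests` is any iterable of objects with `.url` and `.response`
    (for example `driver.requests` from selenium-wire).
    """
    bodies = []
    for request in requests:
        if url_fragment not in request.url:
            continue
        # A request that never received a response has response set to None.
        if request.response is None:
            continue
        # Assumption: the body is uncompressed UTF-8 bytes.
        bodies.append(request.response.body.decode("utf-8"))
    return bodies


if __name__ == "__main__":
    # Stand-in objects (hypothetical data) instead of a real browser session.
    fake_requests = [
        SimpleNamespace(url="https://target/top", response=SimpleNamespace(body=b"<html>")),
        SimpleNamespace(url="https://target/xxxxx", response=SimpleNamespace(body=b'{"a": 1}')),
        SimpleNamespace(url="https://target/xxxxx?page=2", response=None),
    ]
    print(collect_bodies(fake_requests, "xxxxx"))
```

With selenium-wire, this would presumably be called as `collect_bodies(driver.requests, "xxxxx")` after `driver.get(...)` has finished loading the page.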

Aside

What is CORS in the first place

I read up on it in this article. https://qiita.com/att55/items/2154a8aad8bf1409db2b I see, it is definitely necessary, because there are people like me who do this kind of thing.

Is it possible to handle CORS with python? → It seems that it cannot be done easily

I haven't researched it much, but it seems that it can't be done easily, so I gave up. With CORS, it seems that the browser first sends a preflight request, and only then sends the actual GET or POST. https://developer.mozilla.org/ja/docs/Glossary/Preflight_request

According to this article, preflight requests are issued automatically by the browser as needed, and front-end developers usually do not need to make such requests themselves. The browser does it for you, which I took to mean that the way to send one is hidden, so I gave up. Whatever I searched for, the articles that came up used the Fetch API or XMLHttpRequest, so it seems that it can only be done from javascript.

Can NodeJS do it? → Maybe it can (unverified)

Speaking of javascript, it seems that you can use the Fetch API with NodeJS. https://www.npmjs.com/package/node-fetch
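
For reference, the preflight exchange described above can be sketched in Python without any network access. This is a simplified illustration, not something verified against the target site: `build_preflight_headers` and `preflight_allows` are hypothetical helper names, and the check ignores credentials and the `Access-Control-Allow-Headers` comparison that a real browser also performs. It only shows which headers the browser attaches to the OPTIONS request and why a response without `Access-Control-Allow-Origin` fails, as in the error message above.

```python
# Methods that the Fetch standard treats as CORS-safelisted.
SAFELISTED_METHODS = {"GET", "HEAD", "POST"}


def build_preflight_headers(origin, method, request_headers=()):
    """Headers a browser attaches to the OPTIONS preflight request."""
    headers = {
        "Origin": origin,
        "Access-Control-Request-Method": method,
    }
    if request_headers:
        headers["Access-Control-Request-Headers"] = ", ".join(request_headers)
    return headers


def preflight_allows(response_headers, origin, method):
    """Simplified version of the browser's check on the preflight response."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {k.lower(): v for k, v in response_headers.items()}
    # Missing or mismatching Access-Control-Allow-Origin means "blocked".
    if lowered.get("access-control-allow-origin") not in ("*", origin):
        return False
    allowed_methods = {
        m.strip().upper()
        for m in lowered.get("access-control-allow-methods", "").split(",")
        if m.strip()
    }
    method = method.upper()
    return (
        method in SAFELISTED_METHODS
        or method in allowed_methods
        or "*" in allowed_methods
    )


if __name__ == "__main__":
    origin = "https://xxxxxxxxx"  # placeholder origin from the error message
    print(build_preflight_headers(origin, "GET", ["content-type"]))
    # No Access-Control-Allow-Origin header at all: the check fails.
    print(preflight_allows({}, origin, "GET"))
    # The server echoes the origin back: the check passes.
    print(preflight_allows({"Access-Control-Allow-Origin": origin}, origin, "GET"))
```

A real preflight would be an OPTIONS request carrying these headers. Whether sending one by hand from python helps with the webapi in this article is untested.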
