By the end of the previous post, the program could do what I needed. However, the actual requirement is to run it regularly, every day.
Normally you would just schedule a batch job with cron or the like, but this time it is not that easy.
First, dealing with reCAPTCHA. After some investigation, I found a Russian service called "2CAPTCHA".
It is a service that solves reCAPTCHA remotely. It is exceptionally cheap: a few hundred yen per 1,000 solves. I thought it was a bit suspicious, but I decided to use it.
Register an account with 2captcha and add money to your Balance. It is prepaid: rather than registering a credit card and being billed for what you use, you can only use the service up to the amount you have deposited.
I will skip the usage details, since others have already written them up.
https://tanuhack.com/pr-2captcha/
However, I couldn't use "PayU", so I paid with PayPal via "PayPro Global" and deposited 300 yen to start. At the current rate, that seems to be enough for about 3,000 solves.
First of all, you need to do three things:
- Get the 2captcha API key
- Get the reCAPTCHA google_site_key for the site
- Find "textarea#g-recaptcha-response" on the site
According to the site above, you can find google_site_key right away by searching for data-sitekey, but in my case it was embedded in JavaScript in the page source. I found it by searching for "recaptcha". (Conversely, to guard against this kind of service, it might help to make the key harder to find ...)
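A minimal sketch of that search, assuming you already have the page source as a string. The helper name find_site_key and both regular expressions are my own illustration, not something from the site above; real pages may embed the key in other ways.

```python
import re

# Hypothetical helper: look for a reCAPTCHA site key either in a
# data-sitekey attribute or in inline JavaScript such as {'sitekey': '...'}.
SITEKEY_PATTERNS = [
    re.compile(r'data-sitekey=["\']([\w-]+)["\']'),
    re.compile(r'sitekey["\']?\s*[:=]\s*["\']([\w-]+)["\']', re.IGNORECASE),
]


def find_site_key(page_source):
    """Return the first site key found in the HTML/JS source, or None."""
    for pattern in SITEKEY_PATTERNS:
        match = pattern.search(page_source)
        if match:
            return match.group(1)
    return None
```

With Selenium, something like find_site_key(driver.page_source) would be the natural way to call it. If it returns None, searching the source by hand for "recaptcha" is still the fallback.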
I found textarea#g-recaptcha-response right away. Because of how reCAPTCHA works, the site probably cannot change this ...
As the introduction site above explains, you cannot type into the textarea while it is invisible, so use JavaScript to make it visible.
Also, on my target site, the reCAPTCHA checkbox itself was hidden. The behavior was "press the login button to bring up reCAPTCHA (after solving it, press the login button again to log in)".
# hoge is a placeholder for the site-specific CSS selector
driver.execute_script('document.querySelector(hoge).style.height = "auto";')
driver.execute_script('document.querySelector(hoge).style.position = "inherit";')
driver.execute_script('document.getElementById("g-recaptcha-response").style.display="";')
First, get the captcha_id as follows. I have never had this return an error, but it presumably would during service maintenance.
# Submit the reCAPTCHA to 2captcha and get a captcha ID back
# (passing params lets requests URL-encode LOGIN_URL)
resp = requests.get("http://2captcha.com/in.php", params={
    "key": config.service_key,
    "method": "userrecaptcha",
    "googlekey": config.google_site_key,
    "pageurl": LOGIN_URL,
})
if resp.text[0:2] != 'OK':
    exit('2captcha service error. Error code: ' + resp.text)
captcha_id = resp.text[3:]  # a successful response looks like "OK|<captcha_id>"
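The slicing above relies on the reply being either "OK|<value>" or a bare status code. As a rough sketch of the same idea without magic indexes, the hypothetical helper below (my own name, not part of 2captcha or any library) splits the reply on the first "|".

```python
def parse_2captcha_response(text):
    """Split a plain-text reply into (ok, payload).

    "OK|12345"         -> (True, "12345")
    "CAPCHA_NOT_READY" -> (False, "CAPCHA_NOT_READY")
    """
    text = text.strip()
    status, sep, payload = text.partition("|")
    if status == "OK" and sep:
        return True, payload
    # Anything that is not "OK|..." is handed back unchanged as the error code.
    return False, text
```

Used in place of the slicing, this would read: ok, captcha_id = parse_2captcha_response(resp.text).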
Then use that captcha_id to poll for the solved token.
# Poll 2captcha until the solved token is ready
fetch_url = "http://2captcha.com/res.php?key=" + config.service_key + "&action=get&id=" + captcha_id
print('Requesting solution ...')
for __i in range(1, 10):
    time.sleep(5)  # wait 5 sec.
    resp = requests.get(fetch_url)
    if resp.text[0:2] == 'OK':
        break
print('Google response token: ', resp.text[3:])
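The same polling loop can be sketched as a small function that is easier to test. Everything here is an assumption for illustration: the name poll_for_token, the injectable fetch and sleep callables, and the choice to raise on any reply other than "OK|..." or "CAPCHA_NOT_READY" (the loop above just keeps retrying instead).

```python
import time


def poll_for_token(fetch, attempts=9, delay=5.0, sleep=time.sleep):
    """Call fetch() until it returns "OK|<token>" or attempts run out.

    fetch is any zero-argument callable returning the response text,
    for example: lambda: requests.get(fetch_url).text
    Returns the token, or None if the solution never became ready.
    """
    for _ in range(attempts):
        sleep(delay)  # give the solver some time before each check
        text = fetch()
        if text.startswith("OK|"):
            return text[3:]
        if text != "CAPCHA_NOT_READY":
            # Design choice of this sketch: treat other replies as hard errors.
            raise RuntimeError("2captcha error: " + text)
    return None
```

Passing sleep as a parameter is only there so the function can be exercised without actually waiting.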
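I haven't investigated the details, but if the solution is still not ready after all the retries, the response is still "CAPCHA_NOT_READY" (2captcha's own spelling), which shows up as "CHA_NOT_READY" once the first three characters are sliced off. It means the captcha has not been solved yet, presumably because a worker has not finished it. There is no token to use in that case, so I implemented it to start over from the beginning.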
# "CAPCHA_NOT_READY"[3:] == "CHA_NOT_READY": still unsolved after all retries
if resp.text[3:] == 'CHA_NOT_READY':
    print('Processing failed')
    driver.quit()
    if count == 0:
        exit('Error: 2captcha is not ready')
    else:
        # Start over
        return getLoginedDriver(config, count - 1)
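The start-over logic can also be written as a loop instead of recursion. This is only a sketch under my own naming: run_with_retries and attempt are hypothetical, and attempt is assumed to return None when the token never arrived (after quitting its driver) and a logged-in driver otherwise.

```python
def run_with_retries(attempt, retries):
    """Call attempt() up to retries + 1 times and return the first non-None result."""
    for _ in range(retries + 1):
        result = attempt()
        if result is not None:
            return result
    raise RuntimeError("gave up after %d attempts" % (retries + 1))
```

With retries set to the same value as count, this makes the same number of attempts as the recursive version before giving up.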
Once the token comes back successfully, put it in the textarea and log in.
# Enter the token in the textarea
from selenium.webdriver.common.by import By  # find_element_by_* was removed in Selenium 4.3

driver.find_element(By.ID, 'g-recaptcha-response').send_keys(resp.text[3:])
time.sleep(INTERVAL)
driver.execute_script('document.querySelector(hoge).style.visibility = "hidden";')  # On this site I needed this to be able to press the login button
submit_button = driver.find_element(By.CSS_SELECTOR, hoge)
submit_button.click()
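As an alternative to typing into the textarea, the token can be written with JavaScript, which does not require the element to be visible. This is a sketch, not what I actually ran: inject_token is a hypothetical helper, and whether a given site accepts a token set this way is an assumption you would need to check.

```python
# Script run in the page; arguments[0] is the first extra argument
# passed to execute_script.
SET_TOKEN_JS = (
    "document.getElementById('g-recaptcha-response').value = arguments[0];"
)


def inject_token(driver, token):
    """Write the token into the reCAPTCHA response textarea via JavaScript."""
    driver.execute_script(SET_TOKEN_JS, token)
```

It would be called as inject_token(driver, resp.text[3:]) right before clicking the login button.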
As you will see if you try it, this is amazing ... Thanks for the wonderful service.
However, since the system now depends on this service staying alive, it reminded me that I would rather not scrape at all if I can avoid it. I am negotiating with the site's operators to have an API provided, and I hope that goes well.
reCAPTCHA does not always appear, so once that case is handled as well, the program is complete. Now it works just by running it (without solving reCAPTCHA by hand).
All that is left is to run this on the server instead of my local PC ... That is another hurdle, because it doesn't work in headless mode.
More on that later.
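One way to handle the case where reCAPTCHA does not appear is to check for the widget before calling 2captcha at all. A minimal sketch, assuming a site-specific CSS selector for the widget; recaptcha_is_shown is a hypothetical helper name.

```python
def recaptcha_is_shown(driver, selector):
    """Return True if any element matching selector is currently displayed."""
    # "css selector" is the string value of Selenium's By.CSS_SELECTOR,
    # used directly here so the sketch needs no imports.
    elements = driver.find_elements("css selector", selector)
    return any(element.is_displayed() for element in elements)
```

If this returns False after pressing the login button, the token steps can be skipped entirely, which also avoids paying for a solve that is not needed.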