Requirements definition and Python environment setup
Create a site scraping mechanism
Process the downloaded file (xls) to create the final product (csv)
Download files from S3 / upload the created file to S3
**Implementing 2captcha**
Make it launchable in a Docker container
Register it with AWS Batch

Towards automatic execution

As of the previous article, the program can already achieve its purpose. However, this requirement calls for running it regularly, every day.

Normally a batch job can be scheduled with cron or similar, but this time it is not that easy.

- reCAPTCHA has to be solved (needs a screen and manual work)
- Selenium does not work in headless mode (no screen) either, because a click was required for the download

First, dealing with reCAPTCHA. After some investigation, I found a Russian service called "2captcha".

It is a service that solves reCAPTCHA remotely. It is exceptionally cheap: 1,000 solves cost only a few hundred yen. I thought it was a bit suspicious, but I decided to use it.

Registering with 2captcha and adding funds

Register an account with 2captcha and add money to your Balance. Registering and entering your credit card does not by itself incur any charge; you can use the service only up to the amount you have deposited.

I will omit the details of how to use it, because others have already written it up.

https://tanuhack.com/pr-2captcha/

However, I couldn't use "PayU", so I paid with PayPal via "PayPro Global" and deposited 300 yen to start. At the current rate, that seems to be enough for about 3,000 solves.

Implementing 2captcha

Advance preparation

First of all, you need to do three things:

- Get the 2captcha API key
- Get the reCAPTCHA google_site_key for the site
- Find "textarea#g-recaptcha-response" on the site

According to the article linked above, you can find google_site_key in one go by searching the page for data-sitekey, but in my case it was in the JavaScript in the page source. I believe I found it by searching for "recaptcha". (Conversely, to prevent people from getting through with this kind of service, it might be good to make the key hard to find ...)
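
The search described above can be sketched as a small helper. This is only an illustration: the two regular expressions are assumptions about how a site key is commonly embedded (as a data-sitekey attribute, or as a sitekey property in JavaScript), and find_site_key is a hypothetical name, so adjust the patterns to the actual page source.

```python
import re

# Assumed patterns; real sites may embed the key differently.
SITEKEY_PATTERNS = [
    # <div class="g-recaptcha" data-sitekey="...">
    r'data-sitekey=["\']([\w-]+)["\']',
    # e.g. {sitekey: "..."} or {'sitekey': '...'} inside JavaScript
    r'["\']?sitekey["\']?\s*[:=]\s*["\']([\w-]+)["\']',
]


def find_site_key(page_source):
    """Return the first site key found in the HTML/JS source, or None."""
    for pattern in SITEKEY_PATTERNS:
        match = re.search(pattern, page_source)
        if match:
            return match.group(1)
    return None
```

With Selenium, the input would typically be driver.page_source; if the key lives in an external script file, that file has to be searched separately.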

I found textarea#g-recaptcha-response right away. Because of how reCAPTCHA works, this cannot be changed ...

Make textarea visible

As the article linked above explains, you cannot type into the textarea while it is invisible, so use JavaScript to make it visible.

Also, on my target site the reCAPTCHA checkbox itself was hidden. The behavior was: "pressing the login button brings up reCAPTCHA (after solving it, press the login button again to log in)".

# hoge is a placeholder for a site-specific CSS selector
driver.execute_script('document.querySelector(hoge).style.height = "auto";')
driver.execute_script('document.querySelector(hoge).style.position = "inherit";')
# Clear display:none so that the response textarea accepts input
driver.execute_script('document.getElementById("g-recaptcha-response").style.display = "";')
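
The same steps can be wrapped in one function. This is a sketch, not the code used in the article: reveal_recaptcha and container_selector are hypothetical names, and the selector is passed through Selenium's arguments[0] mechanism instead of being pasted into the script string.

```python
def reveal_recaptcha(driver, container_selector):
    """Make a hidden reCAPTCHA container and its response textarea visible.

    driver is a Selenium WebDriver; container_selector is the site-specific
    CSS selector (an assumption, it differs per site).
    """
    # Values given after the script are available as arguments[0], [1], ...
    driver.execute_script(
        'var box = document.querySelector(arguments[0]);'
        'box.style.height = "auto";'
        'box.style.position = "inherit";',
        container_selector,
    )
    # Clear display:none on the textarea that receives the token.
    driver.execute_script(
        'document.getElementById("g-recaptcha-response").style.display = "";'
    )
```

Passing the selector as an argument avoids quoting problems when the selector itself contains quotes.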

Ask 2captcha to solve it

First, obtain the captcha_id as follows. I have never gotten an ERROR here, but it would presumably happen during service maintenance, for example.

import requests

# Submit the reCAPTCHA to 2captcha and receive a captcha_id
url = "http://2captcha.com/in.php?key=" + config.service_key + "&method=userrecaptcha&googlekey=" + config.google_site_key + "&pageurl=" + LOGIN_URL
resp = requests.get(url)
if resp.text[0:2] != 'OK':
    exit('2captcha service error. Error code: ' + resp.text)
captcha_id = resp.text[3:]  # the response has the form "OK|<captcha_id>"
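
As a variation, the URL can be built with urlencode so that special characters in the login URL are percent-encoded, and the reply can be split on the "|" separator. This is a sketch under assumptions: build_submit_url and parse_response are hypothetical helper names, and the "OK|<value>" reply format is inferred from the slicing in the code above.

```python
from urllib.parse import urlencode


def build_submit_url(api_key, site_key, page_url):
    """Build the in.php request URL, percent-encoding every parameter."""
    query = urlencode({
        "key": api_key,
        "method": "userrecaptcha",
        "googlekey": site_key,
        "pageurl": page_url,
    })
    return "http://2captcha.com/in.php?" + query


def parse_response(text):
    """Return the value from an 'OK|<value>' reply; raise on anything else."""
    status, _, value = text.partition("|")
    if status != "OK":
        raise RuntimeError("2captcha error: " + text)
    return value
```

Usage would look like captcha_id = parse_response(requests.get(build_submit_url(key, site_key, LOGIN_URL)).text). Raising an exception instead of calling exit() lets the caller decide whether to retry.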

Then use that captcha_id to request the solution.

import time

# Poll 2captcha until the solved token is ready (up to 9 tries, 5 sec. apart)
fetch_url = "http://2captcha.com/res.php?key=" + config.service_key + "&action=get&id=" + captcha_id
print('Waiting for 2captcha to solve ...')
for _ in range(1, 10):
    time.sleep(5)  # wait 5 sec.
    resp = requests.get(fetch_url)
    if resp.text[0:2] == 'OK':
        break
print('Google response token: ', resp.text[3:])
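
The polling loop can also be written as a function that takes the HTTP call as a parameter, which makes it easy to try without network access. This is a sketch: poll_for_token is a hypothetical name, and it assumes that 2captcha answers "CAPCHA_NOT_READY" while the captcha is still being solved. Unlike the loop above, it stops immediately on any other reply.

```python
import time

NOT_READY = "CAPCHA_NOT_READY"  # assumed "still working" reply from res.php


def poll_for_token(fetch, attempts=9, delay=5.0, sleep=time.sleep):
    """Call fetch() until it returns 'OK|<token>'.

    fetch is any zero-argument callable returning the response text,
    e.g. lambda: requests.get(fetch_url).text
    Returns the token, or None if it is still not ready after all attempts.
    """
    for _ in range(attempts):
        sleep(delay)
        text = fetch()
        if text.startswith("OK|"):
            return text[3:]
        if text != NOT_READY:
            # Anything else is treated as an error instead of being retried.
            raise RuntimeError("2captcha error: " + text)
    return None
```

The defaults of 9 attempts and 5 seconds mirror the loop above; a None result corresponds to the "not ready" case.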

I haven't investigated the details, but it seems the response can still be "CAPCHA_NOT_READY" after all the retries (the code below sees it as 'CHA_NOT_READY' because it strips the first three characters). Perhaps it happens when the workers have not finished solving it yet? That case would be a problem, so I implemented it to start over.

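# 'CAPCHA_NOT_READY'[3:] == 'CHA_NOT_READY': no token after all retries
if resp.text[3:] == 'CHA_NOT_READY':
    print('Processing failed')
    driver.quit()
    if count == 0:
        exit('Error: 2captcha is not ready')
    else:
        # Start over with a fresh driver and one fewer retry left
        return getLoginedDriver(config, count - 1)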
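
The "start over" logic can also be expressed as a loop instead of recursion. This is a sketch with hypothetical names: attempt_login stands for a function that opens a browser, tries to log in, and returns the driver on success or None when the token was not ready (after quitting its own driver).

```python
def login_with_retries(attempt_login, retries=3):
    """Call attempt_login() until it returns something other than None.

    Makes at most retries + 1 attempts, which matches starting the
    recursive version with count=retries.
    """
    for _ in range(retries + 1):
        driver = attempt_login()
        if driver is not None:
            return driver
    raise RuntimeError("2captcha is not ready")
```

A loop keeps the retry count in one place and avoids a growing call stack, but either form should behave the same for small retry counts.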

Once the token has been returned successfully, put it into the textarea and log in.

# Enter the token in the textarea
driver.find_element_by_id('g-recaptcha-response').send_keys(resp.text[3:])
time.sleep(INTERVAL)

# On this site, this was needed before the login button could be pressed
driver.execute_script('document.querySelector(hoge).style.visibility = "hidden";')
submit_button = driver.find_element_by_css_selector(hoge)
submit_button.click()
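
An alternative to typing the token with send_keys is to set the textarea value directly from JavaScript. This is a different technique from the one used above and has not been verified on the site in this article; inject_token is a hypothetical name.

```python
def inject_token(driver, token):
    """Write the token into the g-recaptcha-response textarea via JavaScript.

    driver is a Selenium WebDriver. The token is passed as arguments[0],
    so it never needs to be escaped inside the script string.
    """
    driver.execute_script(
        'document.getElementById("g-recaptcha-response").value = arguments[0];',
        token,
    )
```

Because the value is assigned by script rather than typed, this may work even while the textarea is still hidden, though that depends on the site.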

Run

Try it and you will see: this is amazing ... Thank you for the wonderful service.

However, the system now depends on this service staying alive, which made me think once again that I would rather not scrape at all if possible. We are negotiating with the operator of the target site to have an API provided, and I hope that goes well.

Complete

reCAPTCHA sometimes does not appear at all, so once that case is handled as well, the work is complete. Now the program works just by running it (no manual reCAPTCHA solving needed).

All that is left is to run this on a server instead of on my local PC ... That is another mountain to climb, because it doesn't work in headless mode.
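
Handling the "no reCAPTCHA" case could look roughly like this. It is a sketch under a strong assumption: that the absence of the g-recaptcha-response textarea reliably means no challenge is shown, which may not hold on every site. recaptcha_present, login, solve_captcha and submit are all hypothetical names.

```python
def recaptcha_present(driver):
    """Return True if the page contains the reCAPTCHA response textarea."""
    return bool(driver.execute_script(
        'return document.getElementById("g-recaptcha-response") !== null;'
    ))


def login(driver, solve_captcha, submit):
    """Solve the captcha only when one is present, then submit the form.

    solve_captcha and submit are callables that take the driver.
    """
    if recaptcha_present(driver):
        solve_captcha(driver)
    submit(driver)
```

Skipping the solve step when no challenge is present also avoids paying 2captcha for a token that is never used.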

Automatic scraping of reCAPTCHA site every day (5/7: 2captcha)