Automated scraping with Ruby + Chrome_Remote and "2Captcha", a service that breaks through CAPTCHAs

Introduction

If you scrape websites, you have probably had a program stop because a CAPTCHA appeared. (Presumably only people with that experience will read this article.) There are ways to avoid CAPTCHAs altogether, such as making the crawler behave less like a bot or distributing requests across IP addresses, but this time I will simply solve the CAPTCHA head-on. Of course, as an engineer, I want the program to solve it automatically rather than solving it by hand. Machine learning has high learning and setup costs, and I want something easier. A service called 2Captcha makes that possible. There are many similar services, so find the one that suits you best. I found a Python article on this but no Ruby one, so I wrote this.

What is 2Captcha?

2Captcha is a service for breaking through CAPTCHAs, and the verification can be automated through its API. It is a paid service, but for reCAPTCHA v2 it costs as little as $2.99 per 1,000 requests. To be clear, I have received no money from 2Captcha for promoting it.

What is Chrome_Remote?

Chrome_Remote is a library that lets you control a Chrome instance from Ruby. Please refer to the explanation page and the repository for detailed usage. As a prerequisite for scraping, you need to make CAPTCHAs less likely to appear in the first place. Unlike Selenium and similar tools, Chrome_Remote drives an ordinary Chrome instance as it is, so it is harder to identify as a bot. (I want to verify the difference soon.)

What we want to do

Break through the reCAPTCHA demo page. Please refer to an earlier author's article for creating a 2Captcha account and obtaining an API key.

Break through reCAPTCHA with "2Captcha" and Ruby + Chrome_Remote

Get a 2Captcha API key and save it in a file.

key.yaml


---
:2Capthca: YOUR_2CAPTCHA_API_KEY
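
As a minimal sketch of how this file could be read defensively, the helper below loads the key and fails with a clear message when the entry is missing. The name `load_api_key` is hypothetical and not part of the script in this article; the `:2Capthca` key name is kept spelled exactly as the script reads it.

```ruby
require 'yaml'

# Hypothetical helper (not in the original script): reads the API key stored
# under the :2Capthca symbol key and fails loudly when it is missing or blank.
def load_api_key(path = 'key.yaml')
  config = YAML.load_file(path)
  key = config.is_a?(Hash) ? config[:"2Capthca"] : nil
  if key.nil? || key.to_s.strip.empty?
    raise ArgumentError, "no :2Capthca entry found in #{path}"
  end
  key.to_s.strip
end
```

Calling `load_api_key` once at startup would surface a missing or misnamed key before any request is sent.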

Start Chrome with the remote debugging port enabled.

For Mac


/Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome --remote-debugging-port=9222 &
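
Before running the crawler it can help to confirm that Chrome is actually listening on that port. The following is a rough sketch using only Ruby's standard `socket` library; `debugging_port_open?` is a hypothetical name, and the host, port, and timeout defaults are assumptions that match the command above.

```ruby
require 'socket'

# Hypothetical helper: returns true when something accepts TCP connections on
# the given host and port (9222 matches --remote-debugging-port above).
def debugging_port_open?(host = '127.0.0.1', port = 9222, timeout = 2)
  Socket.tcp(host, port, connect_timeout: timeout) { true }
rescue SystemCallError, SocketError, IOError
  # Connection refused, timed out, or the host could not be resolved
  false
end
```

This only checks that the port accepts connections; it does not verify that the listener is Chrome.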

Install the required gems.

Gemfile


source "https://rubygems.org"
gem 'nokogiri'
gem 'chrome_remote'

bundle install

The Ruby program itself.

crawler.rb



require 'nokogiri'
require 'chrome_remote'
require 'yaml'

class CaptchaDetectedException < StandardError; end

class ChromeController
  def initialize
    @chrome = ChromeRemote.client

    # Enable events
    @chrome.send_cmd "Network.enable"
    @chrome.send_cmd "Page.enable"
  end
  
  def open(url)
    #Page access
    move_to url
    captcha_detect
  end
  
  def reload_page
    sleep 1
    @chrome.send_cmd "Page.reload", ignoreCache: false
    wait_event_fired
  end
  
  def execute_js(js)
    @chrome.send_cmd "Runtime.evaluate", expression: js
  end

  def wait_event_fired
    @chrome.wait_for "Page.loadEventFired"
  end
  
  #Page navigation
  def move_to(url)
    sleep 1
    @chrome.send_cmd "Page.navigate", url: url
    wait_event_fired
  end
  
  #Get HTML
  def get_html
    response = execute_js 'document.getElementsByTagName("html")[0].innerHTML'
    # innerHTML omits the <html> element itself, so wrap it again
    '<html>' + response['result']['value'] + '</html>'
  end
  
  def captcha_detect
    bot_detect_cnt = 0
    begin
      html = get_html
      # Naive check: any page whose HTML mentions "captcha" is treated as a CAPTCHA page
      raise CaptchaDetectedException, 'captcha detected' if html.include?("captcha")
    rescue CaptchaDetectedException => e
      p e
      bot_detect_cnt += 1
      p "Captcha breakthrough attempt: #{bot_detect_cnt}"
      doc = Nokogiri::HTML.parse(html, nil, 'UTF-8')
      return if captcha_solve(doc) == 'Successful release'
      # Reload and try again, up to 3 attempts in total
      reload_page
      retry if bot_detect_cnt < 3
      p 'Failed to break through the captcha. Exiting.'
      exit
    end
    p 'There was no captcha'
  end

  def captcha_solve(doc)
    # A successful submission returns "OK|<task id>"
    id = request_id(doc)[/OK\|(\d+)/, 1]
    return false unless id
    solution = request_solution(id)
    return false unless solution
    submit_solution(solution)
    p captcha_result
  end

  def request_id(doc)
    # Read the API key
    @key = YAML.load_file("key.yaml")[:"2Capthca"]
    # Get the value of the data-sitekey attribute
    googlekey = doc.at_css('#recaptcha-demo')["data-sitekey"]
    method = "userrecaptcha"
    pageurl = execute_js("location.href")['result']['value']
    request_url = "https://2captcha.com/in.php?key=#{@key}&method=#{method}&googlekey=#{googlekey}&pageurl=#{pageurl}"
    # Submit the captcha to 2Captcha; the response contains the task id
    fetch_url(request_url)
  end

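  def request_solution(id)
    action = "get"
    response_url = "https://2captcha.com/res.php?key=#{@key}&action=#{action}&id=#{id}"
    # Give the service some time to work before the first poll
    sleep 15
    retry_cnt = 0
    begin
      sleep 5
      # Fetch the captcha solution token
      response_str = fetch_url(response_url)
      # "CAPCHA_NOT_READY" (sic) is the literal string the API returns while it is still working
      raise 'Captcha not solved yet' if response_str.include?('CAPCHA_NOT_READY')
    rescue => e
      p e
      retry_cnt += 1
      p "retry: #{retry_cnt}"
      retry if retry_cnt < 10
      return false
    end
    # A finished task returns "OK|<token>"; anything else yields nil
    response_str.slice(/OK\|(.*)/, 1)
  end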

  def submit_solution(solution)
    #Enter the unlock code in the specified textarea
    execute_js("document.getElementById('g-recaptcha-response').innerHTML=\"#{solution}\";")
    sleep 1
    #Click the submit button
    execute_js("document.getElementById('recaptcha-demo-submit').click();")
  end

  def captcha_result
    sleep 1
    html = get_html
    doc = Nokogiri::HTML.parse(html, nil, 'UTF-8')
    doc.at_css('.recaptcha-success') ? 'Successful release' : 'Release failed'
  end


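  def fetch_url(url)
    sleep 1
    # Pass the URL as a separate argument so the shell never interprets it;
    # -s silences curl's progress output
    IO.popen(['curl', '-s', url], &:read)
  end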

end

crawler = ChromeController.new
url = 'https://www.google.com/recaptcha/api2/demo'
crawler.open(url)
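
The request URL in `request_id` is assembled by plain string interpolation, which works for the demo page but would break if the page URL itself contained characters such as `&` or `?`. As a hedged sketch, the query string could instead be built with the standard library's `URI.encode_www_form`. The helper name `build_request_url` is hypothetical; the endpoint and parameter names are the ones already used in the script.

```ruby
require 'uri'

# Hypothetical helper (not in the original script): builds the in.php request
# URL with every parameter percent-encoded.
def build_request_url(key:, googlekey:, pageurl:)
  query = URI.encode_www_form(
    key: key,
    method: 'userrecaptcha',
    googlekey: googlekey,
    pageurl: pageurl
  )
  "https://2captcha.com/in.php?#{query}"
end
```

For example, `build_request_url(key: 'K', googlekey: 'G', pageurl: 'https://example.com/?a=1&b=2')` keeps the inner `a=1&b=2` inside the `pageurl` parameter instead of leaking it into the outer query.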

When you run the program, it will access the reCAPTCHA demo page and try to break through the CAPTCHA.

bundle exec ruby crawler.rb
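
The polling in `request_solution` uses raise and retry for control flow. The same idea can be sketched as a plain loop, shown below under a few assumptions: `poll_solution` is a hypothetical name, `fetch` stands for any callable that returns the raw res.php response body, and, unlike the script above, any reply other than `OK|...` or `CAPCHA_NOT_READY` is treated as an error and raised immediately.

```ruby
# Hypothetical helper (not in the original script): polls until the answer is
# ready or the attempts run out. Returns the token, or nil when it never
# became ready.
def poll_solution(fetch, max_attempts: 10, interval: 5)
  max_attempts.times do
    body = fetch.call.to_s.strip
    # A finished task looks like "OK|<token>"
    return body.split('|', 2)[1] if body.start_with?('OK|')
    # Assumption: any other reply will not improve by waiting, so stop early
    raise "2Captcha returned: #{body}" unless body == 'CAPCHA_NOT_READY'
    sleep interval
  end
  nil
end
```

Inside the class this could be called as `poll_solution(-> { fetch_url(response_url) })`. Passing the fetcher in as a callable also makes the loop easy to exercise without network access.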

Finally

Depending on the purpose and manner of scraping and on how you handle the data you obtain, you risk violating copyright law or personal information protection law. I wish you all a happy scraping life.
