I mainly work on crawler development. My job involves collecting data from many different sites, and I run into access blocks every day. As a memo to myself, this article summarizes the causes of access blocks and the scraping techniques for avoiding them. The tools covered are Ruby's open-uri, the curl command, and Selenium. Needless to say, always comply with the rules of the site you crawl and with the relevant laws.

Items to check when access is blocked

Access blocks come in several types, and each one needs its own countermeasure.

IP

1: Access frequency. If you send requests to a site too frequently, you may be judged to be a robot. Solution: Increase the sleep time. Since number of crawls × sleep = required time, take the longest sleep that still fits within the allowable required time.

2: If you send requests at a constant interval, you may be judged to be a robot. Solution: Randomize the sleep time.

3: If requests are sent frequently from the same IP to the same site, the IP itself may be blocked. Solution: Check how many consecutive connections and how many seconds per IP are tolerated, and if necessary set up SSH servers as stepping stones and rotate IPs. Even across different sites, if both use Akamai as their CDN, an IP blocked by Akamai may be blocked on both sites.

# Ruby open-uri (requires the socksify gem)
require 'open-uri'
require 'socksify'
Socksify::proxy("127.0.0.1", port) { URI.open(url).read }

# curl
curl -x socks5h://localhost:port url

# Selenium (pass as a Chrome command-line argument)
selenium_options = ["--proxy-server=socks5://127.0.0.1:#{port}"]

4: Some sites block access from specific IPs. Solution: If your server runs in a data center (such as IDCF or AWS), try changing the IP provider, for example by using an SSH server on another network as a stepping stone.
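
The sleep budget from item 1 and the random sleep from item 2 can be sketched in Ruby roughly as follows. This is only an illustration: the method names `max_sleep_seconds` and `random_sleep_seconds` and the ±50% jitter are assumptions made for this example, not values from any particular site.

```ruby
# Longest fixed sleep (in seconds) that still finishes `crawl_count`
# requests within `time_budget` seconds: crawl_count * sleep <= time_budget.
def max_sleep_seconds(crawl_count, time_budget)
  raise ArgumentError, "crawl_count must be positive" unless crawl_count.positive?

  time_budget.to_f / crawl_count
end

# Random wait spread around `base` seconds by +/- `jitter` (a ratio, assumed
# here to be 0.5), so that requests do not arrive at a fixed pace.
def random_sleep_seconds(base, jitter: 0.5, rng: Random.new)
  low = base * (1 - jitter)
  high = base * (1 + jitter)
  rng.rand(low..high)
end

# Example: 1,000 pages within one hour allows about 3.6 seconds per request.
base = max_sleep_seconds(1_000, 3_600)
wait = random_sleep_seconds(base)
# In a real crawl loop, call sleep(wait) between requests.
```

Because the random wait is spread evenly around the base value, the average stays close to the budgeted sleep, so the total crawl time should remain near the allowable required time.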

HTTP Header

1: Default HTTP headers. A server may treat a request with unnatural HTTP headers as suspicious and block it. The default headers sent by open-uri and curl clearly identify a bot. Solution: Send the same User-Agent, Accept-Language, and other headers that a browser sends. To see the headers a browser sends, open Chrome → Developer Tools → Network tab, right-click the request, and choose Copy → Copy as cURL.

2: Access frequency from the same user. If the server determines from the HTTP headers (or the IP) that the same user is sending too many requests, it may block them. Solution: Try rotating the User-Agent and similar headers.
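
As a minimal sketch of both header countermeasures with open-uri: the User-Agent strings, the Accept and Accept-Language values, and the names `USER_AGENTS`, `browser_like_headers`, and `fetch` are all placeholders assumed for this example. In practice, copy the real values from your own browser as described above.

```ruby
require "open-uri"

# Hypothetical User-Agent pool; replace with strings copied from real browsers.
USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
].freeze

# Builds browser-like request headers. A random User-Agent is picked on each
# call unless one is given, which rotates the apparent "user".
def browser_like_headers(user_agent = USER_AGENTS.sample)
  {
    "User-Agent" => user_agent,
    "Accept" => "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language" => "ja,en-US;q=0.9,en;q=0.8" # assumed value for illustration
  }
end

# Fetches a page with the headers above instead of the open-uri defaults.
def fetch(url)
  URI.open(url, browser_like_headers).read
end
```

open-uri treats string keys in the options hash as request header fields, so no extra library is needed for this part.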

Captcha

Some sites detect bot-like behavior and present a challenge that humans can solve easily. Solution 1: Do not trigger the Captcha test. Try various conditions, such as lengthening the sleep, changing the IP, and, if you are using Selenium, removing the webdriver property or the headless option.

Solution 2: Solve the test automatically. Use a Captcha-solving API, or image recognition built with machine learning or deep learning that can pass the check.

Supplement

Remove the webdriver property

If you are using Selenium, the site can easily detect the automation with code like the following.

var isAutomated = navigator.webdriver;

if(isAutomated){
    blockAccess();
}

Run the following code to remove the webdriver property.

delete Object.getPrototypeOf(navigator).webdriver;
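
The removal only helps if it runs before the site's own detection script. One way to arrange that from Ruby, sketched here under the assumption that you drive Chrome with the selenium-webdriver gem, is to register the snippet through the Chrome DevTools Protocol command `Page.addScriptToEvaluateOnNewDocument`. The constant and method names are made up for this example, and whether this is enough to avoid detection depends on the site and the Chrome version.

```ruby
# JavaScript from the article: removes navigator.webdriver from the prototype.
HIDE_WEBDRIVER_JS = "delete Object.getPrototypeOf(navigator).webdriver;"

# Starts Chrome and registers the script so that it is evaluated in every new
# document before the page's own scripts run.
def start_chrome_without_webdriver_flag
  # Loaded lazily; needs the selenium-webdriver gem and a Chrome driver.
  require "selenium-webdriver"

  driver = Selenium::WebDriver.for(:chrome)
  driver.execute_cdp("Page.addScriptToEvaluateOnNewDocument", source: HIDE_WEBDRIVER_JS)
  driver
end

# Usage (opens a real browser):
#   driver = start_chrome_without_webdriver_flag
#   driver.get("https://example.com")
```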

Remove the headless option

If you are using Selenium, running it headless may get you blocked by the site. Headless mode is evidence that the visitor is not human, and a site can easily tell whether the browser is headless with the following code. Sites that use Distil or GeeTest bot prevention do not work in headless mode, so to run on a server you need a GUI environment.

navigator.permissions.query({name:'notifications'}).then(function(permissionStatus) {
    if(Notification.permission === 'denied' && permissionStatus.state === 'prompt') {
        console.log("Headless Chrome");
    } else {
        console.log("Not Headless Chrome");
    }
});

Check if the CDN is Akamai

Look up the IP address for a domain name with the dig command.

dig <domain name>

Example: dig www.armaniexchange.com

If the CDN is Akamai, you will find akamaiedge in the Answer section.
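
The same check can be sketched in Ruby with the standard resolv library by following the CNAME chain and looking for Akamai-owned names. The suffix list below is an assumption for illustration (commonly seen Akamai domains, not an exhaustive list), and the method names are made up for this example.

```ruby
require "resolv"

# Commonly seen Akamai domains; assumed for illustration, not exhaustive.
AKAMAI_SUFFIXES = %w[akamaiedge.net akamai.net edgekey.net edgesuite.net].freeze

# True if the hostname is one of the suffixes above or a subdomain of one.
def akamai_hostname?(hostname)
  name = hostname.to_s.downcase.chomp(".")
  AKAMAI_SUFFIXES.any? { |suffix| name == suffix || name.end_with?(".#{suffix}") }
end

# Follows the CNAME chain of a hostname (at most max_hops steps) and
# returns the canonical names found along the way.
def cname_chain(hostname, max_hops: 5)
  chain = []
  Resolv::DNS.open do |dns|
    current = hostname
    max_hops.times do
      record = dns.getresources(current, Resolv::DNS::Resource::IN::CNAME).first
      break unless record

      current = record.name.to_s
      chain << current
    end
  end
  chain
end

# True if any name in the CNAME chain looks like an Akamai edge host.
def akamai_cdn?(hostname)
  cname_chain(hostname).any? { |name| akamai_hostname?(name) }
end
```

Calling `akamai_cdn?("www.armaniexchange.com")` performs live DNS queries, so the result depends on the site's DNS configuration at the time.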

Look up a domain name from an IP address

Do a reverse lookup from an IP address with the dig command.

dig -x <IP address>

Check if port forwarding is done

Check your global IP address with the dig command, then do a reverse lookup of it with the -x option of dig. The domain name in the Answer section shows which network your traffic is leaving from, and therefore whether port forwarding is in effect. IDCF: idcfcloud.net. AWS EC2: amazonaws.com. FLET'S: mesh.ad.jp.
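
A small Ruby sketch of this classification, using the standard resolv library for the reverse lookup. The suffix table contains only the three domains listed above, and the method names `provider_for` and `provider_for_ip` are hypothetical.

```ruby
require "resolv"

# Reverse-lookup domain suffixes from the list above.
PROVIDER_SUFFIXES = {
  "idcfcloud.net" => "IDCF",
  "amazonaws.com" => "AWS EC2",
  "mesh.ad.jp" => "FLET'S"
}.freeze

# Maps a reverse-lookup hostname to a provider name, or nil if unknown.
def provider_for(hostname)
  name = hostname.to_s.downcase.chomp(".")
  match = PROVIDER_SUFFIXES.find do |suffix, _provider|
    name == suffix || name.end_with?(".#{suffix}")
  end
  match ? match.last : nil
end

# Reverse-resolves an IP address (like dig -x) and classifies the result.
# Returns nil when the address has no reverse record.
def provider_for_ip(ip)
  provider_for(Resolv.getname(ip))
rescue Resolv::ResolvError
  nil
end
```

If the provider returned for your current global IP is the one hosting the SSH server, the traffic is going out through the tunnel.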

Use the SSH server as a stepping stone

If you use an SSH server as a stepping stone, the destination web server sees the connection as coming from the SSH server's IP address. Use the SSH server's "dynamic forward" feature to create a tunnel to it, and use that tunnel as a SOCKS proxy.

ssh -D <port number> <username>@<hostname>

Randomly fetch SOCKS proxy ports

For IP rotation, you need to start several SOCKS proxies and pick one of their ports at random. The dynamic_ports method runs the Linux command pgrep -fal ssh to find the running ssh processes and returns their SOCKS proxy ports as an array, so pick just one at random with the sample method.

`dynamic_ports.rb`


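# Returns the local ports of running `ssh -D <port>` SOCKS proxies.
def dynamic_ports
  lines = `pgrep -fal ssh`.split("\n")
  ports = []
  lines.each do |line|
    opts = line.split
    d_option_index = opts.find_index("-D")
    # Skip processes started without the -D (dynamic forward) option.
    # Plain Ruby has no `present?` (that is ActiveSupport), so check for nil.
    next unless d_option_index&.positive?

    port = opts[d_option_index + 1].to_i
    # Skip a missing or non-numeric port argument.
    next if port <= 0

    ports << port
  end
  ports
end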
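
As a usage sketch, the array returned by dynamic_ports can feed a fetch that goes through a randomly chosen tunnel. To stay within the standard library, this example shells out to curl with the socks5h scheme; the method names `curl_command` and `fetch_via_random_proxy` are made up for illustration, and it assumes curl is installed.

```ruby
require "open3"

# Builds the curl invocation for one SOCKS proxy port. socks5h makes the
# proxy side (the SSH server) resolve the hostname.
def curl_command(url, port)
  ["curl", "--silent", "--show-error", "--proxy", "socks5h://127.0.0.1:#{port}", url]
end

# Fetches a URL through one randomly chosen SOCKS proxy port.
def fetch_via_random_proxy(url, ports)
  raise ArgumentError, "no SOCKS proxy ports available" if ports.empty?

  body, status = Open3.capture2(*curl_command(url, ports.sample))
  raise "curl exited with status #{status.exitstatus}" unless status.success?

  body
end

# Usage, with the tunnels already running:
#   html = fetch_via_random_proxy("https://example.com", dynamic_ports)
```

Passing the command as an array avoids shell interpolation of the URL.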

SOCKS proxy type

Local name resolution

curl --socks5 localhost:port

Name resolution on the server side

curl --socks5-hostname localhost:port
curl -x socks5h://localhost:port
chrome_option = [ --proxy-server="socks5://127.0.0.1:port" ]

References

10 Ways to hide your Bot Automation from Detection | How to make Selenium undetectable and stealth
Reliable wherever you go! How to access the Web using an SSH server as a stepping stone

In closing

Access blocks differ from site to site, and so do the countermeasures. I want to build the technical skill to be able to say, "There is no site I can't scrape!"

[RUBY] Be careful to avoid access blocking with scraping