This is the day-19 article of the Crawler / Web Scraping Advent Calendar 2016.
First of all, mitmproxy is **neither a scraping tool nor a crawling tool**, but the point of this article is to use it for scraping anyway.
As the name implies, mitmproxy is a **proxy**; the name is an abbreviation of "man-in-the-middle proxy". It is open-source software written in Python.
Diagrammed, it looks like this. So to speak, it is a proxy that performs a man-in-the-middle attack on yourself. (Having drawn the figure myself, I was a little surprised at how little it conveys.) With mitmproxy, you can:

- hook and process the contents of requests sent to external sites
- hook and process the contents of responses received from external sites
In other words, you can **automatically process every site you visit with a Python script**.
Being able to process responses in Python means you can save them, so mitmproxy can serve as a **crawling tool**.
Of course, mitmproxy is just a proxy tool, so it can't go off and crawl with an arbitrary strategy on its own. The strategy here is that **a human acts as the spider (bot)** and **the browsing results are saved and scraped**.
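As a concrete illustration, here is a minimal sketch of the two hook points a mitmdump script can define. The event names `request` and `response` are mitmproxy's; the header I set and the log format are just my own examples.

```python
def request(flow):
    # Called before the request is forwarded upstream; the script may
    # inspect or modify it. Here we ask servers for uncompressed bodies
    # (an illustrative choice, not something mitmproxy requires).
    flow.request.headers["Accept-Encoding"] = "identity"

def response(flow):
    # Called when the upstream response arrives; the script may inspect
    # or save it. Here we just log the URL and status code.
    print(flow.request.url, flow.response.status_code)
```

Save this as a file and pass it to mitmdump with `-s`; every request and response flowing through the proxy will then pass through these functions.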
Let me explain why scraping with mitmproxy is the strongest approach.
- A human decides which pages to download, and humans are smarter than bots (as of 2016)
- Humans behave more naturally than bots (as of 2016)
- As long as a proxy can be used, any kind of site can be saved
For example, there are use cases like the following:
- I want to crawl the API behind a JavaScript or Flash site
- I want to analyze the sites I viewed over the course of a day
Conversely, I don't think the following are good fits:
- I want to crawl on a regular schedule
- I want to crawl sites all over the world by myself
- I want to crawl the rendered results of sites where JavaScript rendering matters, such as SPAs
Installation is easy. If you get stuck in the installation, use the official Docker version.
```
pip install mitmproxy
```
mitmproxy ships with two tools: mitmproxy and mitmdump. The former is an interactive console (CUI) tool. The latter is non-interactive and well suited to scraping.
Starting it is simple: just type `mitmdump` into the shell. By default it listens on port 8080, so specify `127.0.0.1:8080` as your proxy (in Chrome, the [Proxy SwitchySharp](https://chrome.google.com/webstore/detail/proxy-switchysharp/dpplabbmogkhghncfbfdeeokoefdjegm?hl=ja) extension makes this easy).
Once the proxy settings are in place, open any site. If all goes well, mitmdump should show the received content like this:
Next, here is how to run a Python script with mitmproxy.
```
mitmdump -s path_to_script.py
```
The script to run looks like the following, for example. This one filters responses to the `text/html` Content-Type and saves them.
```python
def response(flow):
    # Called by mitmdump for every response received through the proxy
    content_type = flow.response.headers.get('Content-Type', '')
    # Build a (naive) filesystem-safe filename from the URL
    path = flow.request.url.replace('/', '_').replace(':', '_')
    if content_type.startswith('text/html'):
        with open(path, 'w') as f:
            f.write(flow.response.text)
```
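Incidentally, the `replace()`-based filename can collide and can exceed filesystem name-length limits for long URLs. A hedged alternative (my own helper, not part of mitmproxy) keeps a readable host prefix plus a short hash:

```python
import hashlib
from urllib.parse import urlparse

def url_to_filename(url):
    # Keep the host for readability, and a short SHA-1 digest of the
    # full URL so distinct URLs map to distinct, bounded-length names.
    host = urlparse(url).netloc or "no-host"
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:12]
    return f"{host}_{digest}.html"
```

The trade-off is that the original path is no longer recoverable from the filename, so you may also want to keep a mapping file of filename to URL.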
Of course, you can also extract elements with lxml or the like as appropriate.
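For instance, here is a small stdlib-only sketch (using `html.parser` instead of lxml, so it runs without extra dependencies) that pulls absolute link URLs out of saved HTML:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect absolute hrefs from <a> tags in an HTML document."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's URL
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html_text, base_url):
    parser = LinkExtractor(base_url)
    parser.feed(html_text)
    return parser.links
```

With lxml you would get the same result more concisely via `doc.make_links_absolute()` and `doc.iterlinks()`.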
Other examples covering mitmproxy's features are available on GitHub.
I have introduced a crawling technique using mitmproxy. Humans are the strongest!
I've omitted the explanation here, but it also supports HTTPS sites (please search for the details).
This method is, in effect, a man-in-the-middle attack. Intercepting someone else's traffic without their consent is a full-fledged crime, so do not do it. Use this only to save and analyze sites you visit yourself. Also, refrain from using private APIs that a site's Terms of Service prohibit.