Good morning everyone, this is @_akisato.
This post is written as the day-6 article of the Crawler / Web Scraping Advent Calendar (http://qiita.com/advent-calendar/2015/).
Today I would like to introduce how to scrape web pages that cannot be read unless JavaScript and cookies are enabled.
The implementation is available on GitHub: https://github.com/akisato-/pyScraper.
The flow is: (1) fetch the web page with requests, and (2) scrape it with BeautifulSoup4. Python's standard HTML parser is not very good, so we use lxml here. For the basic usage of BeautifulSoup4, see http://qiita.com/itkr/items/513318a9b5b92bd56185.
Install the required packages with pip.
pip install requests
pip install lxml
pip install beautifulsoup4
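Before the full script, here is a minimal sketch of the two steps above (the URL is just a placeholder):

import requests
from bs4 import BeautifulSoup

# step (1): fetch the page
response = requests.get("http://example.com")
# step (2): parse it with the lxml backend and pull out the title
soup = BeautifulSoup(response.content, "lxml")
print(soup.find("title").text)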
The full script looks like this. Given the URL of the page you want to scrape and an output file name, it writes the title and description of the page to the file in JSON format. The function scraping is the main body.
scraping.py
import sys
import json
import requests
from bs4 import BeautifulSoup
import codecs


def scraping(url, output_name):
    # get a HTML response
    response = requests.get(url)
    html = response.text.encode(response.encoding)  # prevent encoding errors
    # parse the response
    soup = BeautifulSoup(html, "lxml")
    # extract
    ## title
    header = soup.find("head")
    title = header.find("title").text
    ## description
    description = header.find("meta", attrs={"name": "description"})
    description_content = description.attrs['content']  # the attribute value is already a string
    # output
    output = {"title": title, "description": description_content}
    # write the output as a json file
    with codecs.open(output_name, 'w', 'utf-8') as fout:
        json.dump(output, fout, indent=4, sort_keys=True, ensure_ascii=False)


if __name__ == '__main__':
    # arguments
    argvs = sys.argv
    ## check
    if len(argvs) != 3:
        print("Usage: python scraping.py [url] [output]")
        exit()
    url = argvs[1]
    output_name = argvs[2]
    scraping(url, output_name)
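Note that this script assumes the page actually has a <meta name="description"> tag; if there is any chance it is missing, a small guard like the following (a hedged variant, not part of the original script) avoids an AttributeError:

description = header.find("meta", attrs={"name": "description"})
# find() returns None when the tag is absent, so fall back to an empty string
description_content = description.attrs['content'] if description is not None else ""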
The number of web pages that cannot be viewed without JavaScript enabled is increasing significantly. If you access such a page with the previous script, all you get is a page saying "Please enable JavaScript".
In order to support such pages, we replace the page acquisition that requests handled with a combination of Selenium and PhantomJS. Selenium is a tool for automating browser operations, and PhantomJS is a Qt-based headless browser.[^browser]
[^browser]: Since PhantomJS is used here simply as a browser, you can replace it with a commonly used web browser such as IE, Firefox, or Chrome. For details, see the official documentation: http://docs.seleniumhq.org/docs/03_webdriver.jsp#selenium-webdriver-s-drivers.
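For example, switching from PhantomJS to an ordinary browser only changes the webdriver constructor (a sketch, assuming the browser and its Selenium driver are already set up):

from selenium import webdriver

# any one of these can replace webdriver.PhantomJS(...) used below
driver = webdriver.Firefox()
# driver = webdriver.Chrome()
# driver = webdriver.Ie()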
Back to PhantomJS: on Mac and Linux it can be installed right away with a package manager such as brew or yum.
Mac
brew install phantomjs
CentOS
yum install phantomjs
On Windows, download the binary from http://phantomjs.org/download.html, place it in a suitable location, and add it to your PATH.
Selenium itself can be installed right away with pip.
pip install selenium
Using Selenium and PhantomJS, the scraping script is modified as follows. The procedure after acquiring the web page does not need to change: we configure a PhantomJS web driver via Selenium, obtain the HTML through that driver, and everything afterwards stays the same. If you want to record the driver's operation log, replace os.path.devnull with a file name.
scraping_js.py
import sys
import json
import os
from selenium import webdriver
from bs4 import BeautifulSoup
import codecs


def scraping(url, output_name):
    # Selenium settings
    driver = webdriver.PhantomJS(service_log_path=os.path.devnull)
    # get a HTML response
    driver.get(url)
    html = driver.page_source.encode('utf-8')  # more sophisticated methods may be available
    # parse the response
    soup = BeautifulSoup(html, "lxml")
    # extract
    ## title
    header = soup.find("head")
    title = header.find("title").text
    ## description
    description = header.find("meta", attrs={"name": "description"})
    description_content = description.attrs['content']
    # output
    output = {"title": title, "description": description_content}
    # write the output as a json file
    with codecs.open(output_name, 'w', 'utf-8') as fout:
        json.dump(output, fout, indent=4, sort_keys=True, ensure_ascii=False)


if __name__ == '__main__':
    # arguments
    argvs = sys.argv
    ## check
    if len(argvs) != 3:
        print("Usage: python scraping_js.py [url] [output]")
        exit()
    url = argvs[1]
    output_name = argvs[2]
    scraping(url, output_name)
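One caveat: the script above never shuts the browser down, so a phantomjs process is left behind after every run. A safer pattern (shown here only as a sketch of the acquisition part) is to wrap the driver use in try/finally and call driver.quit():

driver = webdriver.PhantomJS(service_log_path=os.path.devnull)
try:
    driver.get(url)
    html = driver.page_source.encode('utf-8')
finally:
    driver.quit()  # terminate the phantomjs process even if scraping fails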
Proxy settings can be passed as arguments to PhantomJS.
phantomjs_args = [ '--proxy=proxy.server.no.basho:0000' ]
driver = webdriver.PhantomJS(service_args=phantomjs_args, service_log_path=os.path.devnull)
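If the proxy also requires a specific type or authentication, PhantomJS accepts --proxy-type and --proxy-auth in the same argument list (the values below are placeholders):

phantomjs_args = [
    '--proxy=proxy.server.no.basho:0000',  # placeholder host:port
    '--proxy-type=http',                   # proxy protocol
    '--proxy-auth=username:password',      # placeholder credentials
]
driver = webdriver.PhantomJS(service_args=phantomjs_args, service_log_path=os.path.devnull)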
PhantomJS has cookies enabled by default. If you want to keep the cookie file on disk, you can specify it as a PhantomJS argument as well.
phantomjs_args = [ '--cookie-file={}'.format("cookie.txt") ]
driver = webdriver.PhantomJS(service_args=phantomjs_args, service_log_path=os.path.devnull)
Putting all of these features together, the script looks like this.
scraping_complete.py
import sys
import json
import os
from selenium import webdriver
from bs4 import BeautifulSoup
import codecs


def scraping(url, output_name):
    # Selenium settings: proxy and cookie file are passed as PhantomJS arguments
    phantomjs_args = [ '--proxy=proxy.server.no.basho:0000', '--cookie-file={}'.format("cookie.txt") ]
    driver = webdriver.PhantomJS(service_args=phantomjs_args, service_log_path=os.path.devnull)
    # get a HTML response
    driver.get(url)
    html = driver.page_source.encode('utf-8')  # more sophisticated methods may be available
    # parse the response
    soup = BeautifulSoup(html, "lxml")
    # extract
    ## title
    header = soup.find("head")
    title = header.find("title").text
    ## description
    description = header.find("meta", attrs={"name": "description"})
    description_content = description.attrs['content']
    # output
    output = {"title": title, "description": description_content}
    # write the output as a json file
    with codecs.open(output_name, 'w', 'utf-8') as fout:
        json.dump(output, fout, indent=4, sort_keys=True, ensure_ascii=False)


if __name__ == '__main__':
    # arguments
    argvs = sys.argv
    ## check
    if len(argvs) != 3:
        print("Usage: python scraping_complete.py [url] [output]")
        exit()
    url = argvs[1]
    output_name = argvs[2]
    scraping(url, output_name)
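Usage is the same as before; the URL and output file name below are just placeholders:

python scraping_complete.py http://www.example.com/some-javascript-page output.json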