Good morning everyone, this is @_akisato.
This post is written as the day-6 article of the Crawler / Web Scraping Advent Calendar (http://qiita.com/advent-calendar/2015/).
Today I would like to introduce how to scrape web pages that cannot be read unless JavaScript and cookies are enabled.
The implementation is available on GitHub: https://github.com/akisato-/pyScraper.
The flow is: (1) fetch the web page with requests, and (2) scrape it with BeautifulSoup4. Python's standard HTML parser is not very good, so we use lxml here. For the basic usage of BeautifulSoup4, see http://qiita.com/itkr/items/513318a9b5b92bd56185.
Install the required packages with pip.
pip install requests
pip install lxml
pip install beautifulsoup4
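Before the full script, here is a minimal sketch of the two steps above (the URL is just a placeholder):

import requests
from bs4 import BeautifulSoup

# step (1): fetch the page
response = requests.get("http://example.com")
# step (2): parse it with the lxml backend and pull out the title
soup = BeautifulSoup(response.content, "lxml")
print(soup.find("title").text)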
The full script looks like this. Given the URL of the page you want to scrape and an output file name, it writes the title and description of the page to the file in JSON format. The function scraping is the main body.
scraping.py
import sys
import json
import requests
from bs4 import BeautifulSoup
import codecs


def scraping(url, output_name):
    # get a HTML response
    response = requests.get(url)
    html = response.text.encode(response.encoding)  # prevent encoding errors
    # parse the response
    soup = BeautifulSoup(html, "lxml")
    # extract
    ## title
    header = soup.find("head")
    title = header.find("title").text
    ## description
    description = header.find("meta", attrs={"name": "description"})
    description_content = description.attrs['content']  # the attribute value is already a string
    # output
    output = {"title": title, "description": description_content}
    # write the output as a json file
    with codecs.open(output_name, 'w', 'utf-8') as fout:
        json.dump(output, fout, indent=4, sort_keys=True, ensure_ascii=False)


if __name__ == '__main__':
    # arguments
    argvs = sys.argv
    ## check
    if len(argvs) != 3:
        print("Usage: python scraping.py [url] [output]")
        exit()
    url = argvs[1]
    output_name = argvs[2]
    scraping(url, output_name)
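Note that this script assumes the page actually has a <meta name="description"> tag; if there is any chance it is missing, a small guard like the following (a hedged variant, not part of the original script) avoids an AttributeError:

description = header.find("meta", attrs={"name": "description"})
# find() returns None when the tag is absent, so fall back to an empty string
description_content = description.attrs['content'] if description is not None else ""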
The number of web pages that cannot be viewed without JavaScript enabled is increasing significantly. If you access such a page with the previous script, all you get is a page saying "Please enable JavaScript".
In order to support such pages, we replace the page acquisition that requests handled with a combination of Selenium and PhantomJS. Selenium is a tool for automating browser operations, and PhantomJS is a Qt-based headless browser.[^browser]
[^browser]: Since PhantomJS is used here simply as a browser, you can replace it with a commonly used web browser such as IE, Firefox, or Chrome. For details, see the official documentation: http://docs.seleniumhq.org/docs/03_webdriver.jsp#selenium-webdriver-s-drivers.
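For example, switching from PhantomJS to an ordinary browser only changes the webdriver constructor (a sketch, assuming the browser and its Selenium driver are already set up):

from selenium import webdriver

# any one of these can replace webdriver.PhantomJS(...) used below
driver = webdriver.Firefox()
# driver = webdriver.Chrome()
# driver = webdriver.Ie()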
Back to PhantomJS: on Mac and Linux it can be installed right away with a package manager such as brew or yum.
Mac
brew install phantomjs
CentOS
yum install phantomjs
On Windows, download the binary from http://phantomjs.org/download.html, place it in a suitable location, and add it to your PATH.
Selenium itself can be installed right away with pip.
pip install selenium
Using Selenium and PhantomJS, the scraping script is modified as follows. The procedure after acquiring the web page does not need to change: we configure a PhantomJS web driver via Selenium, obtain the HTML through that driver, and everything afterwards stays the same. If you want to record the driver's operation log, replace os.path.devnull with a file name.
scraping_js.py
import sys
import json
import os
from selenium import webdriver
from bs4 import BeautifulSoup
import codecs


def scraping(url, output_name):
    # Selenium settings
    driver = webdriver.PhantomJS(service_log_path=os.path.devnull)
    # get a HTML response
    driver.get(url)
    html = driver.page_source.encode('utf-8')  # more sophisticated methods may be available
    # parse the response
    soup = BeautifulSoup(html, "lxml")
    # extract
    ## title
    header = soup.find("head")
    title = header.find("title").text
    ## description
    description = header.find("meta", attrs={"name": "description"})
    description_content = description.attrs['content']
    # output
    output = {"title": title, "description": description_content}
    # write the output as a json file
    with codecs.open(output_name, 'w', 'utf-8') as fout:
        json.dump(output, fout, indent=4, sort_keys=True, ensure_ascii=False)


if __name__ == '__main__':
    # arguments
    argvs = sys.argv
    ## check
    if len(argvs) != 3:
        print("Usage: python scraping_js.py [url] [output]")
        exit()
    url = argvs[1]
    output_name = argvs[2]
    scraping(url, output_name)
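One caveat: the script above never shuts the browser down, so a phantomjs process is left behind after every run. A safer pattern (shown here only as a sketch of the acquisition part) is to wrap the driver use in try/finally and call driver.quit():

driver = webdriver.PhantomJS(service_log_path=os.path.devnull)
try:
    driver.get(url)
    html = driver.page_source.encode('utf-8')
finally:
    driver.quit()  # terminate the phantomjs process even if scraping fails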
Proxy settings can be passed as arguments to PhantomJS.
phantomjs_args = [ '--proxy=proxy.server.no.basho:0000' ]
driver = webdriver.PhantomJS(service_args=phantomjs_args, service_log_path=os.path.devnull)
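If the proxy also requires a specific type or authentication, PhantomJS accepts --proxy-type and --proxy-auth in the same argument list (the values below are placeholders):

phantomjs_args = [
    '--proxy=proxy.server.no.basho:0000',  # placeholder host:port
    '--proxy-type=http',                   # proxy protocol
    '--proxy-auth=username:password',      # placeholder credentials
]
driver = webdriver.PhantomJS(service_args=phantomjs_args, service_log_path=os.path.devnull)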
PhantomJS has cookies enabled by default. If you want to keep the cookie file on disk, you can specify it as a PhantomJS argument as well.
phantomjs_args = [ '--cookie-file={}'.format("cookie.txt") ]
driver = webdriver.PhantomJS(service_args=phantomjs_args, service_log_path=os.path.devnull)
Putting all of these features together, the script looks like this.
scraping_complete.py
import sys
import json
import os
from selenium import webdriver
from bs4 import BeautifulSoup
import codecs


def scraping(url, output_name):
    # Selenium settings: proxy and cookie file are passed as PhantomJS arguments
    phantomjs_args = [ '--proxy=proxy.server.no.basho:0000', '--cookie-file={}'.format("cookie.txt") ]
    driver = webdriver.PhantomJS(service_args=phantomjs_args, service_log_path=os.path.devnull)
    # get a HTML response
    driver.get(url)
    html = driver.page_source.encode('utf-8')  # more sophisticated methods may be available
    # parse the response
    soup = BeautifulSoup(html, "lxml")
    # extract
    ## title
    header = soup.find("head")
    title = header.find("title").text
    ## description
    description = header.find("meta", attrs={"name": "description"})
    description_content = description.attrs['content']
    # output
    output = {"title": title, "description": description_content}
    # write the output as a json file
    with codecs.open(output_name, 'w', 'utf-8') as fout:
        json.dump(output, fout, indent=4, sort_keys=True, ensure_ascii=False)


if __name__ == '__main__':
    # arguments
    argvs = sys.argv
    ## check
    if len(argvs) != 3:
        print("Usage: python scraping_complete.py [url] [output]")
        exit()
    url = argvs[1]
    output_name = argvs[2]
    scraping(url, output_name)
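Usage is the same as before; the URL and output file name below are just placeholders:

python scraping_complete.py http://www.example.com/some-javascript-page output.json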