Selenium, PhantomJS & BeautifulSoup4

Installation of required packages

$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=16.04
DISTRIB_CODENAME=xenial
DISTRIB_DESCRIPTION="Ubuntu 16.04 LTS"

$ sudo aptitude install phantomjs xvfb
$ pip install selenium pyvirtualdisplay

from selenium import webdriver
from pyvirtualdisplay import Display
display = Display(visible=0, size=(800, 600))
display.start()
# <Display cmd_param=['Xvfb', '-br', '-nolisten', 'tcp', '-screen', ' - snip -

driver = webdriver.PhantomJS()
driver.get("http://www.example.com")
type(driver.page_source)
# <class 'str'>

driver.page_source
# '<!DOCTYPE html><html itemscope="" itemtype="http://schema.org/Web - snip -

from bs4 import BeautifulSoup
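soup = BeautifulSoup(driver.page_source, "html.parser")
# Tag.get() returns None when the attribute is missing instead of raising KeyError
i = [ {"href": x.get("href"), "text": x.string, "class": x.get("class") } for x in soup.find_all("a") ]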
print(i)
# [{'class': None, 'text': 'MENU', 'href': 'javascript:;'}, {'class': None, 'text': 'top page', 'href': '/'}, {'class': None, 'text': 'platform', 'href': '/pf/'},  - snip -
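
The link extraction can also be tried without a browser. The following is only a minimal sketch on a static HTML string: the helper name `extract_links` and the sample markup are made up for illustration and are not part of the original session. It reads attributes with `Tag.get()`, so anchors that lack `href` or `class` yield None, and it uses `get_text()` so that text inside nested tags is still picked up.

```python
from bs4 import BeautifulSoup


def extract_links(html):
    """Return one dict per <a> tag, tolerating missing attributes."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a"):
        links.append({
            "href": a.get("href"),           # None when the anchor has no href
            "text": a.get_text(strip=True),  # also collects text of nested tags
            "class": a.get("class"),         # list of class names, or None
        })
    return links


# Hypothetical sample markup, standing in for driver.page_source.
sample = (
    '<a href="/">top page</a>'
    '<a class="nav main" href="/pf/"><b>platform</b></a>'
    '<a name="x">anchor</a>'
)
links = extract_links(sample)
print(links)
```

In the real session, `extract_links(driver.page_source)` would take the place of the sample string.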

As of September 2016, the problem reported at the link below still affects the packaged version, so when using PhantomJS on Ubuntu 16.04 it is better to install it by the normal procedure rather than from the Ubuntu package. https://bugs.launchpad.net/ubuntu/+source/phantomjs/+bug/1578444
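
If PhantomJS is installed outside the package manager, the script needs to know where the binary is. The sketch below is one possible way to look it up. The candidate paths and the helper name `find_phantomjs` are assumptions for illustration, not locations that any installer is guaranteed to use.

```python
import os
import shutil

# Hypothetical install locations; adjust to wherever PhantomJS was unpacked.
CANDIDATES = ("/usr/local/bin/phantomjs", "/opt/phantomjs/bin/phantomjs")


def find_phantomjs(candidates=CANDIDATES):
    """Return the first executable candidate, else the one on PATH, else None."""
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    # Fall back to whatever "phantomjs" resolves to on PATH (may be None).
    return shutil.which("phantomjs")


phantomjs_path = find_phantomjs()
print(phantomjs_path)
```

With the Selenium 2/3 releases that still ship the PhantomJS driver, the result could then be passed as `webdriver.PhantomJS(executable_path=phantomjs_path)`.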
