Writing web-scraping code for pages that require a POST, such as a login page, is troublesome. I used Selenium to eliminate that annoyance: it launches the browser automatically, automates the operations that require a POST, and then scrapes the page.
OS: Ubuntu 16.04 (Sakura VPS)
mkdir download
cd download
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome-stable_current_amd64.deb
rm google-chrome-stable_current_amd64.deb
(Reference URL) http://bit.ly/2bBK3Ku
Step2) Preparing to start Google Chrome
You can start Chrome by typing google-chrome on the command line, but starting it in this state caused two problems: broken package dependencies, and the lack of a display for Chrome to render to.
I fixed the dependency problem with the following commands.
sudo apt-get update
sudo apt-get -f install
You can install the GUI desktop with the following command, but I decided against it because it seemed likely to take a long time.
GUI desktop installation
sudo apt-get -y install ubuntu-desktop
The approach I took instead was to install a virtual display and run Chrome on it.
The specific work procedure is described in Step 3.
I installed the virtual display xvfb with the following command.
Install xvfb
sudo apt-get install xvfb
sudo apt-get install unzip
wget -N http://chromedriver.storage.googleapis.com/2.20/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
chmod +x chromedriver
sudo mv -f chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver
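As a quick sanity check (my own addition, not from the original article), you can confirm from Python that both binaries now resolve on the PATH, using only the standard library:

```python
import shutil


def on_path(command):
    """Return True if `command` resolves to an executable on the PATH."""
    return shutil.which(command) is not None


# Both should be True after the installation steps above.
print(on_path("google-chrome"), on_path("chromedriver"))
```

If either prints False, re-check the symlink and chmod steps above.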
To drive Chrome from Python, install the selenium package (for controlling Chrome) and pyvirtualdisplay (for controlling the virtual display xvfb).
Selenium is a testing tool for web applications: instead of a human operating the browser, Selenium operates it. pyvirtualdisplay is a package for controlling the xvfb virtual display from Python.
I installed both with the commands below. (pip3 was not installed yet, so it is installed first.)
sudo apt-get install python3-setuptools
sudo easy_install3 pip
pip3 install pyvirtualdisplay selenium
I ran the following code.
from pyvirtualdisplay import Display
from selenium import webdriver
display = Display(visible=0, size=(800, 600))
display.start()
browser = webdriver.Chrome()
browser.get('http://www.google.co.jp')
print(browser.title)
browser.quit()
display.stop()
I don't think the code above needs much explanation. Lines 1 and 2 import the virtual display and Selenium.
Line 3 defines the virtual display and line 4 starts it. Line 5 starts Chrome on the virtual display with webdriver.Chrome(). Line 6 fetches google.co.jp, and line 7 prints the title element of the fetched page. Lines 8 and 9 close the browser and stop the virtual display.
Now you have an environment where you can launch Chrome from the CLI alone.
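With this environment in place, the login automation mentioned at the beginning can be sketched roughly as below. This is a hypothetical example, not code from the article: the URL arguments and the form field names (`username`, `password`, `login`) are placeholders you would replace with the actual login form's, and the imports are deferred into the function so the sketch can be loaded without Selenium or a browser installed.

```python
def login_and_get_title(login_url, user_id, password, target_url):
    """Log in through a web form and return the title of a page behind the login."""
    # Deferred imports: only needed when the function is actually run.
    from pyvirtualdisplay import Display
    from selenium import webdriver

    display = Display(visible=0, size=(800, 600))
    display.start()
    browser = webdriver.Chrome()
    try:
        # The browser sends the POST for us when the form is submitted,
        # so cookies and hidden tokens are handled automatically.
        browser.get(login_url)
        browser.find_element_by_name('username').send_keys(user_id)   # placeholder field name
        browser.find_element_by_name('password').send_keys(password)  # placeholder field name
        browser.find_element_by_name('login').click()                 # placeholder button name

        # Fetch a page that is only visible after logging in.
        browser.get(target_url)
        return browser.title
    finally:
        browser.quit()
        display.stop()
```

The try/finally ensures the browser and virtual display are shut down even if an element lookup fails.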
For actual scraping, I use PhantomJS instead of Chrome. PhantomJS is a headless browser, so it needs no virtual display, and it can also scrape pages rendered with JavaScript, which is useful. If you want to scrape with PhantomJS, please check here.
That said, with Chrome you can test while watching how the browser actually behaves, so you may prefer it. If you want to scrape with Chrome, see the page here.
In that code, replace the
browser = webdriver.PhantomJS(executable_path='')
part with
browser = webdriver.Chrome()
and it will work ^^ (To repeat, note that JavaScript code cannot be scraped this way.)
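Since everything after the browser object is created is identical in both cases, the swap can also be isolated in a small helper. This is a sketch of my own, not code from the article; it assumes the older Selenium API used here (PhantomJS support was removed from newer Selenium releases), and the import is deferred so the helper can be defined without Selenium installed.

```python
def make_browser(engine="chrome", phantomjs_path=""):
    """Return a Selenium driver; pass engine="phantomjs" for the headless option."""
    # Deferred import: only needed when a browser is actually created.
    from selenium import webdriver

    if engine == "phantomjs":
        # Headless browser: no virtual display required.
        return webdriver.PhantomJS(executable_path=phantomjs_path)
    # Chrome needs a (possibly virtual) display to be running.
    return webdriver.Chrome()
```

The rest of the scraping code can then stay the same whichever engine you choose.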