Writing web-scraping code for pages that require a POST, such as a login page, is troublesome. I used Selenium to eliminate that annoyance: it launches the browser automatically, automates the operations that require a POST, and then scrapes the page.
OS: Ubuntu 16.04 (Sakura VPS)
mkdir download
cd download
wget https://dl.google.com/linux/direct/google-chrome-stable_current_amd64.deb
sudo dpkg -i google-chrome-stable_current_amd64.deb
rm google-chrome-stable_current_amd64.deb
(Reference URL) http://bit.ly/2bBK3Ku
Step2) Preparing to start Google Chrome
You can start Chrome by typing google-chrome on the command line, but starting it in this state caused two problems: broken package dependencies, and the lack of a display for Chrome to render to.
I fixed the dependency problem with the following commands.
sudo apt-get update
sudo apt-get -f install
You can install the GUI desktop with the following command, but I decided against it because it seemed likely to take a long time.
GUI desktop installation
sudo apt-get -y install ubuntu-desktop
The approach I took instead was to install a virtual display and run Chrome on it.
The specific work procedure is described in Step 3.
I installed the virtual display xvfb with the following command.
Install xvfb
sudo apt-get install xvfb
sudo apt-get install unzip
wget -N http://chromedriver.storage.googleapis.com/2.20/chromedriver_linux64.zip
unzip chromedriver_linux64.zip
chmod +x chromedriver
sudo mv -f chromedriver /usr/local/share/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/local/bin/chromedriver
sudo ln -s /usr/local/share/chromedriver /usr/bin/chromedriver
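As a quick sanity check (my own addition, not from the original article), you can confirm from Python that both binaries now resolve on the PATH, using only the standard library:

```python
import shutil


def on_path(command):
    """Return True if `command` resolves to an executable on the PATH."""
    return shutil.which(command) is not None


# Both should be True after the installation steps above.
print(on_path("google-chrome"), on_path("chromedriver"))
```

If either prints False, re-check the symlink and chmod steps above.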
To drive Chrome from Python, install the selenium package (for controlling Chrome) and pyvirtualdisplay (for controlling the virtual display xvfb).
Selenium is a testing tool for web applications: instead of a human operating the browser, Selenium operates it. pyvirtualdisplay is a package for controlling the xvfb virtual display from Python.
I installed both with the commands below. (pip3 was not installed yet, so it is installed first.)
sudo apt-get install python3-setuptools
sudo easy_install3 pip
pip3 install pyvirtualdisplay selenium
I ran the following code.
from pyvirtualdisplay import Display
from selenium import webdriver
display = Display(visible=0, size=(800, 600))
display.start()
browser = webdriver.Chrome()
browser.get('http://www.google.co.jp')
print(browser.title)
browser.quit()
display.stop()
I don't think the code above needs much explanation. Lines 1 and 2 import the virtual display and Selenium.
Line 3 defines the virtual display and line 4 starts it. Line 5 starts Chrome on the virtual display with webdriver.Chrome(). Line 6 fetches google.co.jp, and line 7 prints the title element of the fetched page. Lines 8 and 9 close the browser and stop the virtual display.
Now you have an environment where you can launch Chrome from the CLI alone.
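With this environment in place, the login automation mentioned at the beginning can be sketched roughly as below. This is a hypothetical example, not code from the article: the URL arguments and the form field names (`username`, `password`, `login`) are placeholders you would replace with the actual login form's, and the imports are deferred into the function so the sketch can be loaded without Selenium or a browser installed.

```python
def login_and_get_title(login_url, user_id, password, target_url):
    """Log in through a web form and return the title of a page behind the login."""
    # Deferred imports: only needed when the function is actually run.
    from pyvirtualdisplay import Display
    from selenium import webdriver

    display = Display(visible=0, size=(800, 600))
    display.start()
    browser = webdriver.Chrome()
    try:
        # The browser sends the POST for us when the form is submitted,
        # so cookies and hidden tokens are handled automatically.
        browser.get(login_url)
        browser.find_element_by_name('username').send_keys(user_id)   # placeholder field name
        browser.find_element_by_name('password').send_keys(password)  # placeholder field name
        browser.find_element_by_name('login').click()                 # placeholder button name

        # Fetch a page that is only visible after logging in.
        browser.get(target_url)
        return browser.title
    finally:
        browser.quit()
        display.stop()
```

The try/finally ensures the browser and virtual display are shut down even if an element lookup fails.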
For actual scraping, I use PhantomJS instead of Chrome. PhantomJS is a headless browser, so it needs no virtual display, and it can also scrape pages rendered with JavaScript, which is useful. If you want to scrape with PhantomJS, please check here.
That said, with Chrome you can test while watching how the browser actually behaves, so you may prefer it. If you want to scrape with Chrome, see the page here.
In that code, replace the
browser = webdriver.PhantomJS(executable_path='')
part with
browser = webdriver.Chrome()
and it will work ^^ (To repeat, note that JavaScript code cannot be scraped this way.)
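Since everything after the browser object is created is identical in both cases, the swap can also be isolated in a small helper. This is a sketch of my own, not code from the article; it assumes the older Selenium API used here (PhantomJS support was removed from newer Selenium releases), and the import is deferred so the helper can be defined without Selenium installed.

```python
def make_browser(engine="chrome", phantomjs_path=""):
    """Return a Selenium driver; pass engine="phantomjs" for the headless option."""
    # Deferred import: only needed when a browser is actually created.
    from selenium import webdriver

    if engine == "phantomjs":
        # Headless browser: no virtual display required.
        return webdriver.PhantomJS(executable_path=phantomjs_path)
    # Chrome needs a (possibly virtual) display to be running.
    return webdriver.Chrome()
```

The rest of the scraping code can then stay the same whichever engine you choose.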