Recap of last time

Last time I tried to fetch a page after logging in with Goutte, but I was defeated by image authentication. https://qiita.com/shioharu_/private/818154ac145c78076487

So this time I will change the method and scrape with Selenium + Python!

Installation

Using Vagrant and VirtualBox on Windows 10, I installed Selenium, Python, and ChromeDriver on CentOS 7.0 in a virtual machine.

I followed this guide for the setup: https://worklog.be/archives/3422

Trying it out

`sample.py`


from selenium import webdriver
from selenium.webdriver.chrome.options import Options
 
options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-gpu')
options.add_argument('--window-size=1280,1024')
 
driver = webdriver.Chrome(options=options)
driver.get('https://www.yahoo.co.jp/')
 
driver.save_screenshot('test.png')
driver.quit()

Run it

`python sample.py`

The Yahoo top page was captured without problems, so the sample seems to work!

The problem from last time

Last time, image authentication kept me from displaying the page after login. My first idea was to use Selenium's wait so that I could log in manually while the script paused and get past the image authentication. Then I found that if you specify a Chrome profile path, Chrome starts with the state of that profile. https://rabbitfoot.xyz/selenium-chrome-profile/

In other words, I only have to log in manually beforehand and then point Selenium at that profile. Nice and simple.

Since CentOS runs in a virtual machine this time, I figured that if I placed a link to the Windows profile inside the folder mounted into the guest, the profile could be referenced from there. (`mklink /J` creates a directory junction.)

Example

`mklink /J "C:\Users\[username]\Desktop\work\vagrant\User Data" "C:\Users\[username]\AppData\Local\Google\Chrome\User Data"`

Let's rewrite the sample and run it.

`sample2.py`


from selenium import webdriver
from selenium.webdriver.chrome.options import Options
 
options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-gpu')
options.add_argument('--window-size=1280,1024')
options.add_argument('--user-data-dir=Profile path with a symbolic link')
 
driver = webdriver.Chrome(options=options)
driver.get('https://p.eagate.573.jp/game/2dx/27/ranking/weekly.html')
 
driver.save_screenshot('test2.png')
driver.quit()

However

The capture mercilessly showed the page in its logged-out state ...

The cause was that the profile reference was not working as intended. The profile of the Chrome installed in the virtual machine and the profile of Chrome on the Windows side are different things ... So there is no particular reason to insist on the virtual environment, and I decided to install Python and Selenium on the Windows side and run the script there.
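Before launching, it can help to at least confirm that the path passed to `--user-data-dir` exists and looks like a Chrome user data directory. The sketch below is only an illustration: the helper name `looks_like_chrome_profile` is made up, and it assumes that a Chrome user data directory contains a file named `Local State`. It would not have detected the Linux/Windows mismatch described above, but it does rule out a plain wrong path.

```python
from pathlib import Path


def looks_like_chrome_profile(user_data_dir):
    """Return True if user_data_dir exists and contains a 'Local State' file.

    Assumption: Chrome keeps a 'Local State' file at the top of its
    user data directory. This is a sanity check, not a guarantee that
    the login state stored there can be reused.
    """
    root = Path(user_data_dir)
    return root.is_dir() and (root / "Local State").is_file()
```

For example, `looks_like_chrome_profile(r"C:\Users\[username]\AppData\Local\Google\Chrome\User Data")` could be checked before building the options, and the script could stop early with a clear message if it returns False.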

Setting up the environment on the Windows side

Reference: https://mylife8.net/install-selenium-and-run-on-windows/

Python https://www.python.org/downloads/ Nothing special to note; just follow the installer.

Selenium After installing Python, you can install it with pip from the command prompt (`pip install selenium`).

ChromeDriver https://sites.google.com/a/chromium.org/chromedriver/downloads Download the ChromeDriver that matches your Chrome version. chromedriver.exe can be placed anywhere, but I put it in the same folder as Python to keep things simple.

`\Users\[username]\AppData\Local\Programs\Python\Python38\chromedriver.exe`

I also added the folder above to the PATH environment variable.
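To confirm that the PATH setting took effect, one option is to ask Python where it would find the driver. This is a small sketch using the standard library's `shutil.which`; the function name `locate_driver` and its `search_path` parameter are my own, added only for illustration.

```python
import shutil


def locate_driver(name="chromedriver", search_path=None):
    """Return the full path of the driver executable, or None if not found.

    search_path is an optional PATH-style string; when it is None,
    the PATH environment variable of the current process is used.
    """
    return shutil.which(name, path=search_path)
```

Calling `locate_driver()` from a new command prompt should return the location of chromedriver.exe if PATH is set as intended, and None otherwise.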

Running from the Windows side

Log in beforehand in Chrome at https://p.eagate.573.jp/game/2dx/27/ranking/weekly.html, then close Chrome (a running Chrome keeps the profile in use). Rewrite the source as below and run it!

`sample3.py`


from selenium import webdriver
from selenium.webdriver.chrome.options import Options
 
options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-gpu')
options.add_argument('--window-size=1280,1024')
options.add_argument('--user-data-dir=C:\\Users\\[username]\\AppData\\Local\\Google\\Chrome\\User Data')
 
driver = webdriver.Chrome(options=options)
driver.get('https://p.eagate.573.jp/game/2dx/27/ranking/weekly.html')
 
driver.save_screenshot('test3.png')
driver.quit()

The logged-in page was captured successfully! (Screenshot: screencapture-p-eagate-573-jp-game-2dx-27-ranking-weekly-html-2020-05-10-13_26_24.png)
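The scripts so far differ only in the profile argument, the URL, and the output file name, so the shared setup could be gathered into helpers. The following is a minimal sketch under that assumption; `build_chrome_args` and `capture` are hypothetical names, not part of Selenium. It also wraps the capture in `try`/`finally` so that the browser is closed even when loading the page fails.

```python
def build_chrome_args(user_data_dir=None, window_size=(1280, 1024)):
    """Return the Chrome arguments shared by the sample scripts."""
    width, height = window_size
    args = [
        "--headless",
        "--no-sandbox",
        "--disable-gpu",
        f"--window-size={width},{height}",
    ]
    if user_data_dir:
        # Reuse the login state stored in an existing Chrome profile.
        args.append(f"--user-data-dir={user_data_dir}")
    return args


def capture(url, png_path, user_data_dir=None):
    """Open url in headless Chrome and save a screenshot to png_path."""
    # Imported here so build_chrome_args works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in build_chrome_args(user_data_dir):
        options.add_argument(arg)

    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        driver.save_screenshot(png_path)
    finally:
        # Always close the browser, even if loading or capturing fails.
        driver.quit()
```

With these helpers, sample3.py would shrink to a single call such as `capture('https://p.eagate.573.jp/game/2dx/27/ranking/weekly.html', 'test3.png', user_data_dir='C:\\Users\\[username]\\AppData\\Local\\Google\\Chrome\\User Data')`.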

What I actually want is the ranking section, so next I'll check whether I can reach it. I'll try clicking an element and adjusting the scroll position so that the desired part is displayed.

`sample4.py`


from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time

options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-gpu')
options.add_argument('--window-size=1280,1024')
options.add_argument('--user-data-dir=C:\\Users\\[username]\\AppData\\Local\\Google\\Chrome\\User Data')

driver = webdriver.Chrome(options=options)
driver.get('https://p.eagate.573.jp/game/2dx/27/ranking/weekly.html')
# Click the input located by its absolute XPath.
# find_element(By.XPATH, ...) is used because find_element_by_xpath no longer exists in Selenium 4.3+.
driver.find_element(By.XPATH, "/html/body/div/div[1]/div/div/div[2]/div/div[2]/form/div[2]/ul[1]/li[3]/input").click()
time.sleep(3)  # give the page time to update after the click

# Scroll down so that the ranking section comes into view.
driver.execute_script("window.scrollTo(0, 800)")
time.sleep(3)

driver.save_screenshot('sample.png')
driver.quit()

It looks okay!
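The fixed `time.sleep(3)` calls in sample4.py always cost three seconds and may still be too short on a slow connection. A common alternative is to poll until a condition holds. The helper below is a plain-Python sketch of that idea; `wait_until` is a name I made up. Selenium ships its own version of this as `WebDriverWait` in `selenium.webdriver.support.ui`, which is probably the better choice in a real script.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() until it returns a truthy value, then return that value.

    Raises TimeoutError if no truthy value is seen within timeout seconds.
    The condition is always checked at least once.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

In sample4.py, a call such as `wait_until(lambda: driver.find_elements(By.XPATH, "..."))` could stand in for a sleep, where the XPath is a placeholder for an element that only appears once the ranking has loaded. `find_elements` returns an empty list while nothing matches, so the loop keeps polling until the element exists.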

Summary

At last, I managed to scrape a page that sits behind image authentication ...
This time I only captured screenshots, but next time I will actually extract the data and process it.

Scraping with Selenium + Python Part 1
