Last time I tried to fetch a page that sits behind a login using Goutte, but I was defeated by image authentication. https://qiita.com/shioharu_/private/818154ac145c78076487
So this time I will change the method and scrape with Selenium + Python!
Using Vagrant and VirtualBox on Windows 10, I installed Selenium, Python, and ChromeDriver on CentOS 7.0 in a virtual environment.
I followed this guide for the setup: https://worklog.be/archives/3422
sample.py
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-gpu')
options.add_argument('--window-size=1280,1024')
driver = webdriver.Chrome(options=options)
driver.get('https://www.yahoo.co.jp/')
driver.save_screenshot('test.png')
driver.quit()
Execute
python sample.py
The Yahoo! JAPAN top page was captured without problems, so the sample seems to work!
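The samples in this post all pass the same Chrome flags, so one way to keep them in sync is to build the argument list in a small helper. This is only a sketch: `chrome_arguments` and `make_driver` are hypothetical names introduced here, not part of Selenium, and the Selenium import is deferred into `make_driver` so the argument list can be inspected without launching a browser.

```python
def chrome_arguments(user_data_dir=None, window_size=(1280, 1024)):
    """Return the command-line flags used by the samples as a list of strings."""
    width, height = window_size
    args = [
        "--headless",
        "--no-sandbox",
        "--disable-gpu",
        f"--window-size={width},{height}",
    ]
    if user_data_dir:
        # Optional: reuse an existing Chrome profile directory.
        args.append(f"--user-data-dir={user_data_dir}")
    return args


def make_driver(user_data_dir=None):
    """Start headless Chrome with the flags above (needs Selenium and ChromeDriver)."""
    # Imported here so chrome_arguments() can be used without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in chrome_arguments(user_data_dir):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

With this in place, sample.py would reduce to `driver = make_driver()` followed by the same `get`, `save_screenshot`, and `quit` calls.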
Last time, image authentication kept me from reaching the page behind the login. Selenium can be made to wait, so my first idea was to log in manually during that wait and get past the image authentication that way. Then I found that if you specify a Chrome profile path, Chrome starts with the state of that profile. https://rabbitfoot.xyz/selenium-chrome-profile/
In other words, all you have to do is log in manually beforehand and then point Selenium at that profile. Nice and simple.
Since CentOS runs in a virtual environment this time, I figured that if I placed a link to the Windows-side profile inside the folder mounted into the VM, Chrome in the VM could reference it from there. (Strictly speaking, mklink /J creates a directory junction rather than a symbolic link.)
mklink /J "C:\Users\[username]\Desktop\work\vagrant\User Data" "C:\Users\[username]\AppData\Local\Google\Chrome\User Data"
Now let's rewrite the sample to use that path and run it.
sample2.py
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-gpu')
options.add_argument('--window-size=1280,1024')
options.add_argument('--user-data-dir=Profile path with a symbolic link')
driver = webdriver.Chrome(options=options)
driver.get('https://p.eagate.573.jp/game/2dx/27/ranking/weekly.html')
driver.save_screenshot('test2.png')
driver.quit()
Mercilessly, the capture showed the page in its logged-out state...
The cause was that the profile path reference was not working properly: the profile of the Chrome installed in the virtual environment differs from the profile of Chrome on the Windows side. Since there is no particular reason to stay inside the virtual environment, I decided to install Python and Selenium on the Windows side and run the script there.
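A quick sanity check before launching can catch part of this kind of problem earlier. The sketch below assumes that a Chrome user data directory normally contains a `Local State` file and a `Default` profile folder; `looks_like_chrome_user_data` is a hypothetical helper, and passing the check does not guarantee that the login session stored there is usable by a different Chrome installation.

```python
from pathlib import Path


def looks_like_chrome_user_data(path):
    """Heuristic: does `path` look like a Chrome user data directory?"""
    root = Path(path)
    # Assumption: Chrome keeps a "Local State" file and a "Default" profile folder here.
    return (root / "Local State").is_file() and (root / "Default").is_dir()
```

Calling it on the path passed to `--user-data-dir` before starting the driver gives an early hint when the link or mount is not resolving the way you expect.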
Reference: https://mylife8.net/install-selenium-and-run-on-windows/
Python https://www.python.org/downloads/ Nothing special to note; just follow the installer.
Selenium After installing Python, install it by running the following from the command prompt.
pip install selenium
ChromeDriver https://sites.google.com/a/chromium.org/chromedriver/downloads Download the ChromeDriver release that matches your Chrome version. chromedriver.exe can live anywhere, but I put it in the same folder as Python to keep things simple.
\Users\[username]\AppData\Local\Programs\Python\Python38\chromedriver.exe
I also added the folder above to the PATH environment variable.
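Two things are easy to get wrong at this step: the driver version and the PATH entry. The sketch below is one way to check both. The helper names are hypothetical, the version strings in the usage note are made-up examples, and comparing only the major version number is an assumption about how closely Chrome and ChromeDriver need to match.

```python
import shutil


def major_version(version):
    """Return the major part of a dotted version string, e.g. '83' for '83.0.4103.39'."""
    return version.strip().split(".")[0]


def same_major(chrome_version, driver_version):
    """True when Chrome and ChromeDriver share the same major version."""
    return major_version(chrome_version) == major_version(driver_version)


def chromedriver_on_path():
    """Return the full path of chromedriver found via PATH, or None if it is not found."""
    return shutil.which("chromedriver")
```

For example, `same_major("83.0.4103.61", "83.0.4103.39")` is true, while `chromedriver_on_path()` returning `None` suggests that the PATH change has not taken effect in the current command prompt.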
Log in to https://p.eagate.573.jp/game/2dx/27/ranking/weekly.html from Chrome in advance, then close Chrome. Rewrite the source as below and run it!
sample3.py
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-gpu')
options.add_argument('--window-size=1280,1024')
options.add_argument('--user-data-dir=C:\\Users\\[username]\\AppData\\Local\\Google\\Chrome\\User Data')
driver = webdriver.Chrome(options=options)
driver.get('https://p.eagate.573.jp/game/2dx/27/ranking/weekly.html')
driver.save_screenshot('test3.png')
driver.quit()
This time I got the logged-in page!
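The hard-coded `[username]` in the profile path can be avoided by reading the `LOCALAPPDATA` environment variable, which on Windows normally points at the user's `AppData\Local` folder. A minimal sketch, assuming Chrome uses its default profile location; `default_user_data_dir` is a hypothetical helper.

```python
import ntpath
import os


def default_user_data_dir(env=None):
    """Build Chrome's default user data path on Windows from LOCALAPPDATA."""
    env = os.environ if env is None else env
    local_app_data = env.get("LOCALAPPDATA")
    if not local_app_data:
        raise RuntimeError("LOCALAPPDATA is not set; pass the profile path explicitly")
    # ntpath joins with backslashes regardless of the OS running this code.
    return ntpath.join(local_app_data, "Google", "Chrome", "User Data")
```

The result can then be passed as `options.add_argument('--user-data-dir=' + default_user_data_dir())`.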
What I actually want is the ranking section, so the next experiment is to see whether I can reach it. I try clicking and adjusting the scroll position to display the desired part.
sample4.py
import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
options = Options()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-gpu')
options.add_argument('--window-size=1280,1024')
options.add_argument('--user-data-dir=C:\\Users\\[username]\\AppData\\Local\\Google\\Chrome\\User Data')
driver = webdriver.Chrome(options=options)
driver.get('https://p.eagate.573.jp/game/2dx/27/ranking/weekly.html')
# find_element(By.XPATH, ...) works on Selenium 3 and 4; find_element_by_xpath was removed in Selenium 4.3
driver.find_element(By.XPATH, "/html/body/div/div[1]/div/div/div[2]/div/div[2]/form/div[2]/ul[1]/li[3]/input").click()
time.sleep(3)  # give the page time to update after the click
driver.execute_script("window.scrollTo(0, 800)")  # scroll down toward the ranking section
time.sleep(3)  # let rendering settle before capturing
driver.save_screenshot('sample.png')
driver.quit()
It looks okay!
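The fixed `time.sleep(3)` calls work, but they always cost the full three seconds and can still be too short on a slow day. One alternative is to poll for a condition with a timeout, which is the idea behind Selenium's `WebDriverWait`. The helper below is a plain-Python sketch of that idea, not Selenium's implementation; `wait_until` is a hypothetical name.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call `condition()` repeatedly until it is truthy or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)
```

For example, `wait_until(lambda: driver.execute_script("return document.readyState") == "complete")` could stand in for the sleep after the click, assuming the click triggers a page load.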