I wanted to see a list of books I bought from before, but it didn't work at all even after trial and error.
"Python Crawling & Scraping [Augmented and Revised Edition] -Practical Development Guide for Data Collection and Analysis"
I read, studied, and tried again [^ 1].
[^ 1]: I really read it, but the reference link was actually more useful. This book is useless because it also touches on laws and regulations.
from selenium import webdriver
import time
import dmm_user_pass
#Easy way to prevent leakage of user name and password
# (Dmm in the same directory_user_pass.Never publish py)
USER = dmm_user_pass.USER
PASS = dmm_user_pass.PASS
#Get a Phantom JS driver
browser = webdriver.PhantomJS()
browser.implicitly_wait(5)
#Access the login page(You should be taken to the purchased page immediately after logging in)
url_login = "https://accounts.dmm.com/service/login/password"
browser.get(url_login)
print("I visited the login page")
#Enter characters in the text box
e = browser.find_element_by_id("login_id")
e.clear()
e.send_keys(USER)
e = browser.find_element_by_id("password")
e.clear()
e.send_keys(PASS)
#Submit form
browser.find_element_by_xpath("//form[@name='loginForm']//input[@type='submit']").click()
print("I entered the information and pressed the login button")
time.sleep(10)
#View purchased pages for DMM e-books
#If page1 becomes a page that does not exist in loop, url will be automatically
# https://book.dmm.com/library/?age_limit=all&expired=Become 1(The page you want to transition to and the current url do not match)
url_purchased = 'https://book.dmm.com/library/?age_limit=all&expired=1&sort=old&page='
for i in range(1, 100):
page_i = url_purchased + str(i)
browser.get(page_i)
time.sleep(5)
if page_i != browser.current_url:
break #Processing to exit according to the comment above
#List purchased titles
links = browser.find_elements_by_css_selector(
".m-boxListBookProductBlock__main__info__ttl > a")
for a in links:
title = a.text
print("-", title)
/usr/local/lib/python3.7/site-packages/selenium/webdriver/phantomjs/webdriver.py:49: UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
I visited the login page
I entered the information and pressed the login button
-Mob Psycho 100
-tell me! Please Tell Me!
(Omitted below)
The warning will be described later.
I was able to get a list of books purchased with DMM e-books in chronological order.
I think it's possible to get a link or bring in the amount of money with a little more ingenuity, but I won't do it this time because it takes time and is troublesome. [^ 2]
[^ 2]: DMM e-books with multiple volumes are grouped together, so those with multiple volumes are linked to the page of the series, but those without multiple volumes are directly on the work details page. Since it is linked, it is necessary to make a conditional branch there when acquiring a link or bringing in an amount of money.
Originally I thought about crawling with the requests module, but I stopped because I couldn't log in, and I'm glad that I switched to Selenium and it worked.
Since selenium only causes the actual operation in the code, if you know the user name, password, and the position of the login button (specified by the class and id of the html tag), you should be able to log in regardless of what is being done internally. I feel that Selenium is fine for all crawling of sites that require login.
-[Python3] Scraping on site with login function [requests] [Beautiful Soup]
--I also referred to the following [Python3] scraping via browser (dynamic page etc.) [Selenium].
-Story of making a container to connect to daily pachinko
--Borrow the DMM login part
-I tried using Headless Chrome from Selenium
--UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
was warned, but it seems that PhantomJS has been developed. This time it worked, but I will use Chrome in the future.
--I used it as a reference when changing PhantomJS to Chrome. Especially the Preparation part.
--The actual code is almost unchanged
```Python
# #Get a Phantom JS driver
# browser = webdriver.PhantomJS()
# browser.implicitly_wait(5)
# ↓
#Get a Chrome driver
options = webdriver.ChromeOptions()
options.add_argument('--headless') #Option to hide screen
browser = webdriver.Chrome(options=options)
browser.implicitly_wait(5)
```
Recommended Posts