I want to scrape web articles using a headless Chrome driver with Python.
About the browser driver: in short, it is a tool for controlling the browser from the CUI instead of the GUI.
Relationship between the DNS server and the local hosts file: when you access a site by domain name in a browser, the browser asks a DNS server for that domain, receives the corresponding IP address, and uses it to reach the website, which is then displayed in the browser. However, if you add the domain and IP address to the Mac's hosts file, the IP address can be resolved locally without contacting a DNS server.
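The lookup described above can be tried directly from Python. As a minimal sketch, `socket.gethostbyname` performs the same name resolution the browser does; `localhost` is typically resolved through the local hosts file shown below rather than a DNS server.

```python
import socket

# Resolve a hostname to an IP address, the same lookup a browser performs.
# "localhost" is normally resolved via the local hosts file, not a DNS server.
ip = socket.gethostbyname("localhost")
print(ip)  # 127.0.0.1
```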
Reference articles:

- [Selenium and Google Spreadsheets (4) "Until you start using Chrome Driver"](https://bitwave.showcase-tv.com/selenium%E3%81%A8google-spreadsheets4-%E3%80%8Cchrome-driver%E3%82%92%E4%BD%BF%E3%81%84%E3%81%AF%E3%81%98%E3%82%81%E3%82%8B%E3%81%BE%E3%81%A7%E7%B7%A8%E3%80%8D/)
- For DNS servers: "[Illustrated] What is a DNS server? How to set, change, and check it"
- For the hosts file: "How to rewrite/edit the hosts file on Mac! What to do if changes are not reflected"
Open the hosts file:
$ sudo vi /etc/hosts
Next, check that the contents of the hosts file look like this.
##
# Host Database
#
# localhost is used to configure the loopback interface
# when the system is booting. Do not change this entry.
##
127.0.0.1 localhost
255.255.255.255 broadcasthost
::1 localhost
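As a sketch of the mechanism described earlier, to resolve a domain locally without asking a DNS server you would append a line mapping the domain to an IP address. The domain below is a placeholder, not from the original article:

```
# Added manually: resolve this domain locally instead of via DNS
127.0.0.1 example.test
```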
Also, download the driver whose version matches the Chrome version installed on your machine (in my case it was 78.0.3904.97): ChromeDriver - WebDriver for Chrome
# -*- coding:utf-8 -*-
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

if __name__ == '__main__':
    # URL of the site to scrape
    url = "Scraped site url"
    options = Options()
    # Run Chrome in headless mode (no GUI)
    options.add_argument('--headless')
    driver = webdriver.Chrome(executable_path='Absolute path to the directory where the chrome driver is', chrome_options=options)
    driver.get(url)
    # Encode the page source as UTF-8
    html = driver.page_source.encode('utf-8')
    # Instantiate the parser
    soup = BeautifulSoup(html, 'html.parser')
    driver.quit()
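Once `soup` is built, extracting data works the same as with any other HTML source. A minimal sketch, using a static HTML string in place of `driver.page_source` (the tag names and class here are hypothetical, not from the scraped site):

```python
from bs4 import BeautifulSoup

# Static HTML standing in for driver.page_source
html = """
<html><body>
  <h1 class="title">First article</h1>
  <h1 class="title">Second article</h1>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
# Collect the text of every h1 with class "title"
titles = [h1.get_text() for h1 in soup.find_all('h1', class_='title')]
print(titles)  # ['First article', 'Second article']
```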
I usually use urllib.request, but for sites with anti-scraping measures, using Selenium like this may be a way around them!
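For comparison, a common urllib.request pattern is to set a browser-like User-Agent header, since some sites reject the default one. A minimal sketch; the URL and User-Agent string are placeholders, and no request is actually sent here:

```python
import urllib.request

# Build a request with a browser-like User-Agent header.
# Some sites block urllib's default User-Agent.
req = urllib.request.Request(
    "https://example.com/",
    headers={"User-Agent": "Mozilla/5.0"},
)
# urllib stores header keys in capitalized form
print(req.get_header("User-agent"))  # Mozilla/5.0
```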