You will be an engineer in 100 days ――Day 75 ――Programming ――About scraping 6

See the earlier posts in this series for the material covered up to yesterday.

This post continues the topic of scraping.

If you have finished installing Selenium, you can continue.

How to operate Selenium

Load Selenium

Load the library. The examples assume that Google Chrome is the browser being driven.

from selenium import webdriver

# Driver settings
chromedriver = "Driver's full path"
driver = webdriver.Chrome(executable_path=chromedriver)

The location where the WebDriver is saved differs from person to person, so replace the path with your own. Running this code launches Google Chrome.

If you get an error message, check that the version of the WebDriver matches the version of Chrome. You may also need to set permissions so that the WebDriver can be executed, so read the error details and take appropriate action.

From this point on you can control the browser and perform various operations.

Once you open the browser, it stays open until you close it. Don't forget to close it, because leaving many browsers open consumes resources.
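The advice about closing the browser can be sketched as a small context manager. Note that `managed_driver` is a hypothetical helper name, not part of Selenium; it only assumes that the object returned by the factory has a `quit()` method, which Selenium's driver objects do.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with factory() and always quit it afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        # quit() closes every window and stops the driver process,
        # even when the body of the with-block raised an exception.
        driver.quit()


# Usage with Selenium (assumes Chrome and a matching driver are installed):
#
#     from selenium import webdriver
#     with managed_driver(webdriver.Chrome) as driver:
#         driver.get('http://www.otupy.net')
```

Because the cleanup sits in a `finally` clause, the browser is closed even if the scraping code fails halfway through.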

You can also run the browser in headless mode when using Selenium. Headless mode runs the browser in the background without displaying a window.

This is a very convenient mode because it saves resources and allows you to use Selenium on Linux servers.

To use it, create a variable that holds the browser options, add the headless setting to it, and pass it as an argument when creating the WebDriver.

Option variable = webdriver.ChromeOptions()
Option variable.add_argument('--headless')
Driver variable = webdriver.Chrome(options=Option variable)

from selenium import webdriver

# Driver settings
chromedriver = "Driver's full path"

# Option settings
options = webdriver.ChromeOptions()
options.add_argument('--headless')

# Launch the driver with the options applied
driver = webdriver.Chrome(executable_path=chromedriver,
                          options=options)

Access the website with Selenium

The browser is controlled through the variable that was assigned when Selenium was launched. Since we used the variable name driver earlier, it is called the driver variable from now on.

To access a website, call:

Driver variable.get(URL)

driver.get(URL)

As a trial, let's open my website.

driver.get('http://www.otupy.net')

Each call opens the URL you pass in. A page can take some time to be fully displayed, so it is better to wait a moment before performing any subsequent operations.
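One way to wait is to poll for a condition instead of sleeping for a fixed time. The following is a minimal sketch, assuming a hypothetical helper named `wait_until`; Selenium itself ships a `WebDriverWait` class in `selenium.webdriver.support.ui` that serves the same purpose.

```python
import time


def wait_until(driver, condition, timeout=10.0, poll_interval=0.5):
    """Call condition(driver) repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition(driver)
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll_interval)


# Usage with Selenium: wait until at least one link is present.
#
#     driver.get('http://www.otupy.net')
#     links = wait_until(driver, lambda d: d.find_elements("tag name", "a"))
```

Polling returns as soon as the content appears, so the script does not wait longer than necessary on fast pages.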

Scroll within the site

You can scroll within the site by running JavaScript. Scripts are executed with `execute_script`.

Driver variable.execute_script(JavaScript)

Pass the JavaScript as a string. Use window.scrollBy(0, Y) to scroll by Y pixels from the current position, or window.scrollTo(0, Y) to scroll to the absolute position Y.

window.scrollBy(0, window.innerHeight); scrolls down by one screen.

window.scrollTo(0, document.body.scrollHeight); scrolls to the bottom of the page.

Let's scroll.

# Scroll down by one screen
driver.execute_script("window.scrollBy(0, window.innerHeight);")

# Scroll to the bottom
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

Now you can scroll your browser around.
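On pages that load more content as you scroll, a single jump to the bottom may not reach the real end. The following is a rough sketch, assuming a hypothetical helper named `scroll_to_end`, that repeats the scroll shown above until the page height stops growing. The pause length and round limit are arbitrary values chosen for illustration.

```python
import time


def scroll_to_end(driver, pause=1.0, max_rounds=20):
    """Scroll to the bottom repeatedly until the page stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give lazily loaded content time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # the height did not change, so this is the end
        last_height = new_height
    return max_rounds
```

The `max_rounds` limit keeps the loop from running forever on pages that never stop loading new content.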

Find the element

To operate on a site, you first need to find the element you want to work with. You can search for elements on the page, such as input fields.

There are many ways to find an element. The Driver variable.find_element_by_XXXX methods let you search by the value of a particular attribute.

If an element is found, it is returned as an object of type WebElement.

**Search by id attribute**

Driver variable.find_element_by_id(value of id attribute)

**Search by name attribute**

Driver variable.find_element_by_name(value of name attribute)

**Search by class name**

Driver variable.find_element_by_class_name(class name)

**Search by tag name**

Driver variable.find_element_by_tag_name(tag name)

**Search by link text**

Driver variable.find_element_by_link_text(link text)

**Search by CSS selector**

Driver variable.find_element_by_css_selector(CSS selector)

**Search by XPath**

Driver variable.find_element_by_xpath(XPath expression)
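Note that the `find_element_by_XXXX` methods were removed in Selenium 4.3. Newer releases use `find_element(by, value)`, where `by` is a strategy string that is also available as a constant on `selenium.webdriver.common.by.By`. The following sketch, built around a hypothetical helper named `find`, shows how the method names above map to those strategy strings and works with either API style.

```python
# Locator strategy strings; these are the values of Selenium's By constants
# (By.ID, By.NAME, By.CLASS_NAME, By.TAG_NAME, By.LINK_TEXT,
#  By.CSS_SELECTOR, By.XPATH).
STRATEGIES = {
    "id": "id",
    "name": "name",
    "class_name": "class name",
    "tag_name": "tag name",
    "link_text": "link text",
    "css_selector": "css selector",
    "xpath": "xpath",
}


def find(driver, kind, value):
    """Find one element with either the old or the new Selenium API."""
    if kind not in STRATEGIES:
        raise ValueError("unknown locator kind: %r" % kind)
    legacy = getattr(driver, "find_element_by_" + kind, None)
    if legacy is not None:
        return legacy(value)  # older releases: find_element_by_XXXX(value)
    return driver.find_element(STRATEGIES[kind], value)  # newer releases


# Usage with a running driver:
#
#     heading = find(driver, "tag_name", "h1")
```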

Manipulate elements

You must find an element before you can work with it. Once you have found an element with one of the methods above, assign it to an element variable and you can perform the following operations on it.

Element variable = Driver variable.find_element_by_XXXX(value)
Element variable.operation method()

**Click an element**

Element variable.click()

**Enter characters in the element**

Element variable.send_keys(characters)

**Send special keys to the element**

Import the Keys class first.

from selenium.webdriver.common.keys import Keys

Then find the element and use send_keys to send the keys.

Element variable.send_keys(Keys.SPECIAL_KEY)

The main keys that can be handled are as follows.

Key	Keys constant
Enter key	Keys.ENTER
Alt key (combined with a normal key)	Keys.ALT, "key"
← key	Keys.LEFT
→ key	Keys.RIGHT
↑ key	Keys.UP
↓ key	Keys.DOWN
Ctrl key (combined with a normal key)	Keys.CONTROL, "key"
Delete key	Keys.DELETE
Home key	Keys.HOME
End key	Keys.END
Escape key	Keys.ESCAPE
Equals key	Keys.EQUALS
Command key	Keys.COMMAND
F1 key	Keys.F1
Shift key (combined with a normal key)	Keys.SHIFT, "key"
Page Down key	Keys.PAGE_DOWN
Page Up key	Keys.PAGE_UP
Space bar	Keys.SPACE
Return key	Keys.RETURN
Tab key	Keys.TAB
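The click, text input, and special key operations can be combined as in the following sketch. `replace_text` is a hypothetical helper; its `keys` parameter is expected to be Selenium's `Keys` class, passed in so that the function itself can be imported without Selenium. The field name `q` in the usage comment is likewise only an assumed example.

```python
def replace_text(element, new_text, keys):
    """Replace the contents of a text field and press Enter.

    `keys` is expected to be selenium.webdriver.common.keys.Keys.
    """
    element.click()                        # focus the field
    element.send_keys(keys.CONTROL, "a")   # select all (Keys.COMMAND on macOS)
    element.send_keys(keys.DELETE)         # remove the selected text
    element.send_keys(new_text)            # type the new text
    element.send_keys(keys.ENTER)          # submit


# Usage with Selenium ('q' is a hypothetical name attribute):
#
#     from selenium.webdriver.common.keys import Keys
#     field = driver.find_element("name", "q")
#     replace_text(field, "python", Keys)
```

Passing a modifier such as `Keys.CONTROL` together with a normal key in one `send_keys` call sends them as a combination, matching the "combined with a normal key" rows in the table.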

Extract the source code of the page

You can get the source code of the page as a string through the page_source attribute.

Driver variable.page_source

driver.page_source

Once you have the source, you can analyze it with a library such as BeautifulSoup.
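As a rough illustration of that analysis step, the sketch below extracts link targets from an HTML string. BeautifulSoup is a third-party package, so to stay dependency-free this example swaps in the standard library's `html.parser` instead. `LinkCollector` and `extract_links` are hypothetical names, and the sample HTML is made up for the example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def extract_links(html):
    """Return all link targets found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return collector.links


sample = ('<html><body><a href="/a">A</a><p>text</p>'
          '<a href="http://www.otupy.net">B</a></body></html>')
print(extract_links(sample))  # ['/a', 'http://www.otupy.net']
```

With a running browser, the same function can be called as `extract_links(driver.page_source)`.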

Summary

Selenium is convenient because it lets you easily obtain information that cannot be obtained with ordinary scraping techniques.

If you are having trouble getting data, try Selenium. Once you can use it, the amount of data you can collect grows enormously.

25 days until you become an engineer

Author information

Otsu py's website: http://www.otupy.net/

YouTube: https://www.youtube.com/channel/UCaT7xpeq8n1G_HcJKKSOXMw

Twitter: https://twitter.com/otupython