Cheat sheet when scraping with Google Colaboratory (Colab)

Table of contents

- [How to use Beautiful Soup](#how-to-use-beautiful-soup)
- [How to use Selenium](#how-to-use-selenium)
- [How to use Pandas](#how-to-use-pandas)
- [How to work with spreadsheets](#how-to-work-with-spreadsheets)
- Regular expression look-ahead and look-behind are described in another article.

How to use Beautiful Soup

How to eliminate garbled characters

When using requests, you would normally write it as follows,

from bs4 import BeautifulSoup
import requests

res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')

Some sites still come out garbled this way; explicitly specifying the parser and encoding, as below, eliminates most of the garbled characters.

from bs4 import BeautifulSoup
import requests

res = requests.get(url)
soup = BeautifulSoup(res.content, "lxml", from_encoding='utf-8')
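The reason `from_encoding` helps is that it overrides Beautiful Soup's automatic encoding detection, which can guess wrong on ambiguous pages. A minimal sketch of that behavior, using the built-in `html.parser` and an inline byte string in place of a live response:

```python
# from_encoding forces the decoder instead of relying on
# Beautiful Soup's automatic encoding detection.
from bs4 import BeautifulSoup

raw = '<html><body><p>日本語のテキスト</p></body></html>'.encode('utf-8')

soup = BeautifulSoup(raw, 'html.parser', from_encoding='utf-8')
print(soup.p.string)  # 日本語のテキスト
```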

find code list

| Description | Code example |
| --- | --- |
| Search for the first match | `soup.find('li')` |
| Search for all matching tags | `soup.find_all('li')` |
| Attribute search | `soup.find('li', href='http://www.google.com/')` |
| Get multiple elements | `soup.find_all(['a', 'p'])` |
| id search | `soup.find('a', id="first")` |
| class search | `soup.find('a', class_="first")` |
| Attribute acquisition | `first_link_element['href']` |
| Text search | `soup.find('dt', text='Search word')` |
| Partial text match search | `soup.find('dt', text=re.compile('Search word'))` |
| Get the parent element | `.parent` |
| Get the next sibling element | `.next_sibling` |
| Get all following sibling elements | `.next_siblings` |
| Get the previous sibling element | `.previous_sibling` |
| Get all preceding sibling elements | `.previous_siblings` |
| Get the text of an element | `.string` |
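The patterns in the table above can be sketched against a small inline HTML snippet (the HTML here is made up for illustration):

```python
# A minimal demonstration of find / find_all / attribute access.
from bs4 import BeautifulSoup

html = '''
<ul>
  <li class="first" id="top"><a href="http://example.com">Example</a></li>
  <li>second item</li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('li').get_text(strip=True))   # first match only
print(len(soup.find_all('li')))               # all matches: 2
print(soup.find('li', class_='first')['id'])  # class search, then attribute: top
print(soup.find('a')['href'])                 # attribute acquisition
```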

Select code list

| Description | Code example |
| --- | --- |
| Search for the first match | `soup.select_one('css selector')` |
| Search for all matches | `soup.select('css selector')` |

List of selector specification methods

| Description | Code example |
| --- | --- |
| id search | `soup.select('a#id')` |
| class search | `soup.select('a.class')` |
| Multiple class search | `soup.select('a.class1.class2')` |
| Attribute search 1 | `soup.select('a[class="class"]')` |
| Attribute search 2 | `soup.select('a[href="http://www.google.com"]')` |
| Attribute search 3 | `soup.select('a[href]')` |
| Get child elements | `soup.select('.class > a[href]')` |
| Get descendant elements | `soup.select('.class a[href]')` |

Change the attribute name according to the element you want to search for: `id`, `class`, `href`, `name`, `summary`, and so on. Insert `>` if you want to get only child elements (one level down), and insert a space if you want to get descendant elements (all levels down).
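The difference between `>` and a space can be sketched with a small made-up snippet:

```python
# '>' matches direct children only; a space matches all descendants.
from bs4 import BeautifulSoup

html = '''
<div class="outer">
  <a href="#1">direct child</a>
  <p><a href="#2">nested descendant</a></p>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

print(len(soup.select('.outer > a[href]')))  # 1: only the direct child
print(len(soup.select('.outer a[href]')))    # 2: all descendants
```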

How to use Selenium

Preparations for using Selenium

On Colab, Selenium is not pre-installed and a browser UI cannot be displayed, so the following setup is required.

#Download the libraries needed to use Selenium
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium

from selenium import webdriver

#Settings for using the driver without a UI
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome('chromedriver',options=options)
driver.implicitly_wait(10)

When using Selenium and Beautiful Soup

A typical use case: when an element cannot be obtained with Beautiful Soup alone, load the page with Selenium first and then extract the necessary information with Beautiful Soup.

driver.get(url)
html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, 'html.parser')

Selenium basic code

| Description | Code example |
| --- | --- |
| Open a URL | `driver.get('URL')` |
| Go back one page | `driver.back()` |
| Go forward one page | `driver.forward()` |
| Refresh the page | `driver.refresh()` |
| Get the current URL | `driver.current_url` |
| Get the current title | `driver.title` |
| Close the current window | `driver.close()` |
| Close all windows | `driver.quit()` |
| Get an element by class name | `driver.find_element_by_class_name('classname')` |
| Get an element by ID | `driver.find_element_by_id('id')` |
| Get an element by XPath | `driver.find_element_by_xpath('xpath')` |
| Text search with XPath | `driver.find_element_by_xpath('//*[text()="strings"]')` |
| Partial text match search with XPath | `driver.find_element_by_xpath('//*[contains(text(), "strings")]')` |
| Click an element | `driver.find_element_by_xpath('xpath').click()` |
| Enter text | `driver.find_element_by_id('id').send_keys('strings')` |
| Get text | `driver.find_element_by_id('id').text` |
| Get an attribute (`href` here) | `driver.find_element_by_id('id').get_attribute('href')` |
| Check whether the element is displayed | `driver.find_element_by_xpath('xpath').is_displayed()` |
| Check whether the element is enabled | `driver.find_element_by_xpath('xpath').is_enabled()` |
| Check whether the element is selected | `driver.find_element_by_xpath('xpath').is_selected()` |

When you want to select a dropdown

from selenium.webdriver.support.ui import Select

element = driver.find_element_by_xpath("xpath")
Select(element).select_by_index(indexnum)       # Select by index
Select(element).select_by_value("value")        # Select by the value attribute
Select(element).select_by_visible_text("text")  # Select by the visible text

List of XPath specification methods

| Description | Code example |
| --- | --- |
| Select all elements | `//*` |
| Select all `a` elements | `//a` |
| Select an attribute | `@href` |
| Select multiple elements | `[a or h2]` |
| Get an element by id | `//*[@id="id"]` |
| Get an element by class | `//*[@class="class"]` |
| Text search | `//*[text()="strings"]` |
| Partial text search | `//*[contains(text(), "strings")]` |
| Partial class match | `//*[contains(@class, "class")]` |
| Get the next sibling node | `/following-sibling::*[1]` |
| Get the second `a` element after | `/following-sibling::a[2]` |
| Get the previous sibling node | `/preceding-sibling::*[1]` |

How to get other nodes is described in the XPath reference articles listed at the end.

When changing tabs

Used when clicking opens a new tab instead of navigating within the current page.

handle_array = driver.window_handles
driver.switch_to.window(handle_array[1])

Wait until a specific element is displayed


from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Wait until all elements on the page are present (timeout after 15 seconds)
WebDriverWait(driver, 15).until(EC.presence_of_all_elements_located((By.XPATH, '//*')))

# Wait until the element with the specified ID is loaded (timeout after 15 seconds)
WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.ID, 'ID name')))

# Wait until the element with the specified CLASS name is loaded (timeout after 15 seconds)
WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.CLASS_NAME, 'CLASS name')))

# Wait until the element specified by XPath is loaded (timeout after 15 seconds)
WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.XPATH, 'xpath')))

What to do when you can't click

When a normal `.click()` fails (for example, the element is covered by another element), executing the click via JavaScript often works.

target = driver.find_element_by_xpath('xpath')
driver.execute_script("arguments[0].click();", target)

How to use Pandas

How to create a data frame and add data

import pandas as pd
columns = ['Item 1', 'Item 2', 'Item 3', 'Item 4', 'Item 5']
df = pd.DataFrame(columns=columns)

#Data acquisition process

se = pd.Series([data1, data2, data3, data4, data5], index=columns)
df = df.append(se, ignore_index=True)
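Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0. On a recent pandas, the same row-by-row pattern can be sketched with `pd.concat`; the `data1`–`data5` strings below are placeholders for scraped values:

```python
# Row-append with pd.concat, which replaces the removed DataFrame.append.
import pandas as pd

columns = ['Item 1', 'Item 2', 'Item 3', 'Item 4', 'Item 5']
df = pd.DataFrame(columns=columns)

# Placeholder values standing in for scraped data
row = pd.DataFrame([['data1', 'data2', 'data3', 'data4', 'data5']], columns=columns)
df = pd.concat([df, row], ignore_index=True)
print(df.shape)  # (1, 5)
```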

When downloading Pandas data

from google.colab import files

filename = 'filename.csv'
df.to_csv(filename, encoding = 'utf-8-sig') 
files.download(filename)
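The `utf-8-sig` encoding writes a UTF-8 byte-order mark (BOM) at the start of the file, which is what lets Excel open the CSV without garbling. A quick check of that behavior (the temp-file path is just for illustration):

```python
# Verify that encoding='utf-8-sig' prepends a UTF-8 BOM to the file.
import os
import tempfile
import pandas as pd

df = pd.DataFrame({'col': ['日本語']})
path = os.path.join(tempfile.gettempdir(), 'bom_check.csv')
df.to_csv(path, encoding='utf-8-sig')

with open(path, 'rb') as f:
    head = f.read(3)
print(head == b'\xef\xbb\xbf')  # True: the file starts with a BOM
```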

When saving Pandas data to My Drive

from google.colab import drive

# Mount Google Drive first
drive.mount('/content/drive')

filename = 'filename.csv'
path = '/content/drive/My Drive/' + filename

with open(path, 'w', encoding='utf-8-sig') as f:
  df.to_csv(f)

How to work with spreadsheets

Preparations for working with spreadsheets

#Download the library needed to work with spreadsheets
!pip install gspread

from google.colab import auth
from oauth2client.client import GoogleCredentials
import gspread

#Authentication process
auth.authenticate_user()
gc = gspread.authorize(GoogleCredentials.get_application_default())

Frequently used code

ss_id = 'Spreadsheet ID'
sht_name = 'Sheet name'

workbook = gc.open_by_key(ss_id)
worksheet = workbook.worksheet(sht_name)

#When acquiring data
worksheet.acell('B1').value
worksheet.cell(2, 1).value

#When updating
worksheet.update_cell(row, column, 'Update contents')

gspread code list

Workbook operations

| Description | Code example |
| --- | --- |
| Select a spreadsheet by ID | `gc.open_by_key('ID')` |
| Select a spreadsheet by URL | `gc.open_by_url('URL')` |
| Get the spreadsheet title | `workbook.title` |
| Get the spreadsheet ID | `workbook.id` |

Sheet operations

| Description | Code example |
| --- | --- |
| Get a sheet by sheet name | `workbook.worksheet('Sheet name')` |
| Get a sheet by index | `workbook.get_worksheet(index)` |
| Get all sheets as an array | `workbook.worksheets()` |
| Get the sheet name | `worksheet.title` |
| Get the sheet ID | `worksheet.id` |

Cell operations

| Description | Code example |
| --- | --- |
| Get data in A1 notation | `worksheet.acell('B1').value` |
| Get data in R1C1 notation | `worksheet.cell(1, 2).value` |
| Select multiple cells and get them as a one-dimensional array | `worksheet.range('A1:B10')` |
| Get data from the selected row | `worksheet.row_values(1)` |
| Get formulas from the selected row | `worksheet.row_values(1, 2)` |
| Get data from the selected column | `worksheet.col_values(1)` |
| Get formulas from the selected column | `worksheet.col_values(1, 2)` |
| Get all data | `worksheet.get_all_values()` |
| Update a cell value in A1 notation | `worksheet.update_acell('B1', 'Value to update')` |
| Update a cell value in R1C1 notation | `worksheet.update_cell(1, 2, 'Value to update')` |

[Reference sites]

- BeautifulSoup4 cheat sheet (selectors, etc.)
- Python3 memo: Beautiful Soup 4
- Basics of CSS selectors for web scraping
- Summary of frequently used Selenium WebDriver operations
- What is XPath? Learn the basics of XPath, indispensable for web scraping!
- Summary of XPath
- Summary of how to use the gspread library! Working with spreadsheets in Python
