- [How to use Beautiful Soup](#how-to-use-beautiful-soup)
- [How to use Selenium](#how-to-use-selenium)
- [How to use Pandas](#how-to-use-pandas)
- [How to handle spreadsheets](#how-to-handle-spreadsheets)

Regular-expression look-ahead and look-behind are covered in another article.
## How to use Beautiful Soup

When using requests, you would normally write it as follows:
```python
from bs4 import BeautifulSoup
import requests

res = requests.get(url)
soup = BeautifulSoup(res.content, 'html.parser')
```
Some sites come back garbled (mojibake) with this approach. Switching the parser and specifying the encoding, as below, eliminates most of the garbling:
```python
from bs4 import BeautifulSoup
import requests

res = requests.get(url)
soup = BeautifulSoup(res.content, 'lxml', from_encoding='utf-8')
```
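If garbling still persists, an alternative worth trying (a sketch, assuming requests' charset detection works for the site) is to let `requests` guess the encoding before parsing:

```python
# Let requests guess the encoding from the response body,
# then parse the already-decoded text
res = requests.get(url)
res.encoding = res.apparent_encoding
soup = BeautifulSoup(res.text, 'html.parser')
```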
The main `find` / `find_all` patterns:

Description | Code example |
---|---|
Find the first match | `soup.find('li')` |
Find all matching tags | `soup.find_all('li')` |
Search by attribute | `soup.find('li', href='http://www.google.com/')` |
Get multiple kinds of elements | `soup.find_all(['a', 'p'])` |
Search by id | `soup.find('a', id='first')` |
Search by class | `soup.find('a', class_='first')` |
Get an attribute | `first_link_element['href']` |
Search by exact text | `soup.find('dt', text='Search word')` |
Search by partial text match | `soup.find('dt', text=re.compile('Search word'))` |
Get the parent element | `.parent` |
Get the next sibling | `.next_sibling` |
Get all following siblings | `.next_siblings` |
Get the previous sibling | `.previous_sibling` |
Get all preceding siblings | `.previous_siblings` |
Get the text of an element | `.string` |
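As a quick check, here is a minimal sketch exercising a few of the calls above; the HTML snippet is made up for illustration:

```python
from bs4 import BeautifulSoup
import re

# Made-up sample HTML for illustration
html = """
<ul>
  <li><a href="https://example.com/" id="first" class="first">Example</a></li>
  <li><a href="https://example.org/">Other</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('li'))                           # first <li> only
print(len(soup.find_all('li')))                  # 2
print(soup.find('a', id='first').string)         # Example
print(soup.find('a', class_='first')['href'])    # https://example.com/
print(soup.find('a', text=re.compile('Exam')))   # partial text match
```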
You can also search with CSS selectors:

Description | Code example |
---|---|
Find the first match | `soup.select_one('css selector')` |
Find all matches | `soup.select('css selector')` |
Typical selector patterns:

Description | Code example |
---|---|
Search by id | `soup.select('a#id')` |
Search by class | `soup.select('a.class')` |
Search by multiple classes | `soup.select('a.class1.class2')` |
Attribute search 1 | `soup.select('a[class="class"]')` |
Attribute search 2 | `soup.select('a[href="http://www.google.com"]')` |
Attribute search 3 | `soup.select('a[href]')` |
Get child elements | `soup.select('.class > a[href]')` |
Get descendant elements | `soup.select('.class a[href]')` |
Change the attribute to match the element you want to search: `id`, `class`, `href`, `name`, `summary`, and so on. Insert `>` if you want to get only child elements (one level down), and a space if you want to get all descendant elements (any number of levels down), as in the sketch below.
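A minimal sketch of the `>` versus space difference, again on made-up HTML:

```python
from bs4 import BeautifulSoup

# Made-up sample HTML for illustration
html = """
<div class="menu">
  <a href="https://example.com/">direct child</a>
  <span><a href="https://example.org/">descendant</a></span>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

print(len(soup.select('.menu > a[href]')))  # 1: direct children only
print(len(soup.select('.menu a[href]')))    # 2: all descendants
```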
## How to use Selenium

On Colab, Selenium is not preinstalled and a browser with a UI cannot be used, so the following setup is required.
```python
# Download the libraries needed to use Selenium
!apt-get update
!apt install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver /usr/bin
!pip install selenium
```

```python
from selenium import webdriver

# Settings for using the driver without a UI (headless)
options = webdriver.ChromeOptions()
options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome('chromedriver', options=options)
driver.implicitly_wait(10)
```
A typical use case: when an element cannot be acquired with Beautiful Soup alone, load the page with Selenium and then extract the necessary information with Beautiful Soup.
```python
driver.get(url)
html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, 'html.parser')
```
Frequently used driver operations:

Description | Code example |
---|---|
Open a URL | `driver.get('URL')` |
Go back one page | `driver.back()` |
Go forward one page | `driver.forward()` |
Refresh the browser | `driver.refresh()` |
Get the current URL | `driver.current_url` |
Get the current title | `driver.title` |
Close the current window | `driver.close()` |
Close all windows | `driver.quit()` |
Get an element by class | `driver.find_element_by_class_name('classname')` |
Get an element by id | `driver.find_element_by_id('id')` |
Get an element by XPath | `driver.find_element_by_xpath('xpath')` |
Text search with XPath | `driver.find_element_by_xpath('//*[text()="strings"]')` |
Partial text match with XPath | `driver.find_element_by_xpath('//*[contains(text(), "strings")]')` |
Click an element | `driver.find_element_by_xpath('xpath').click()` |
Enter text | `driver.find_element_by_id('id').send_keys('strings')` |
Get text | `driver.find_element_by_id('id').text` |
Get an attribute (e.g. href) | `driver.find_element_by_id('id').get_attribute('href')` |
Check whether the element is displayed | `driver.find_element_by_xpath('xpath').is_displayed()` |
Check whether the element is enabled | `driver.find_element_by_xpath('xpath').is_enabled()` |
Check whether the element is selected | `driver.find_element_by_xpath('xpath').is_selected()` |
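A minimal sketch chaining a few of these operations with the `driver` created above; the URL, id, and button text are made-up placeholders:

```python
# Made-up URL and element names, for illustration only
driver.get('https://example.com/search')
box = driver.find_element_by_id('query')
box.send_keys('scraping')
driver.find_element_by_xpath('//button[text()="Search"]').click()
print(driver.current_url, driver.title)
```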
To operate a dropdown (a `select` element), use Selenium's `Select` helper:

```python
from selenium.webdriver.support.ui import Select

element = driver.find_element_by_xpath('xpath')
Select(element).select_by_index(indexnum)        # select by index
Select(element).select_by_value('value')         # select by the value attribute
Select(element).select_by_visible_text('text')   # select by displayed text
```
A quick XPath reference:

Description | Code example |
---|---|
Select all elements | `//*` |
Select all `a` elements | `//a` |
Select an attribute | `@href` |
Select multiple kinds of elements | `[a or h2]` |
Get an element by id | `//*[@id="id"]` |
Get elements by class | `//*[@class="class"]` |
Text search | `//*[text()="strings"]` |
Partial text match | `//*[contains(text(), "strings")]` |
Partial class match | `//*[contains(@class, "class")]` |
Get the next sibling node | `/following-sibling::*[1]` |
Get the second following `a` element | `/following-sibling::a[2]` |
Get the previous sibling node | `/preceding-sibling::*[1]` |
See the reference sites at the end for how to get other kinds of nodes.
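As a sketch, the same expressions can be passed to Selenium directly; the id, class, and heading text here are made-up placeholders:

```python
# Made-up locators, for illustration only
main = driver.find_element_by_xpath('//*[@id="main"]')
items = driver.find_elements_by_xpath('//a[contains(@class, "item")]')
next_block = driver.find_element_by_xpath('//h2[text()="News"]/following-sibling::*[1]')
```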
Used when clicking opens a new tab instead of navigating the current page:

```python
# Switch the driver to the newly opened tab (index 1)
handle_array = driver.window_handles
driver.switch_to.window(handle_array[1])
# Switch back to the original tab when done:
# driver.switch_to.window(handle_array[0])
```
```python
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Wait until all elements on the page are loaded (time out after 15 seconds)
WebDriverWait(driver, 15).until(EC.presence_of_all_elements_located)
# Wait until the element with the specified id is loaded (time out after 15 seconds)
WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.ID, 'ID name')))
# Wait until the element with the specified class name is loaded (time out after 15 seconds)
WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.CLASS_NAME, 'CLASS name')))
# Wait until the element specified by XPath is loaded (time out after 15 seconds)
WebDriverWait(driver, 15).until(EC.presence_of_element_located((By.XPATH, 'xpath')))
```
When a normal `.click()` fails (for example, because the element is hidden or overlapped), you can click it through JavaScript instead:

```python
target = driver.find_element_by_xpath('xpath')
driver.execute_script('arguments[0].click();', target)
```
## How to use Pandas

```python
import pandas as pd

columns = ['Item 1', 'Item 2', 'Item 3', 'Item 4', 'Item 5']
df = pd.DataFrame(columns=columns)

# Data acquisition process: data1..data5 come from your scraping loop
se = pd.Series([data1, data2, data3, data4, data5], index=columns)
df = df.append(se, ignore_index=True)
```
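Note that `DataFrame.append` was removed in pandas 2.0; on recent pandas the same accumulation can be written with `pd.concat`:

```python
# Equivalent on pandas >= 2.0, where DataFrame.append no longer exists
se = pd.Series([data1, data2, data3, data4, data5], index=columns)
df = pd.concat([df, se.to_frame().T], ignore_index=True)
```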
```python
from google.colab import files

# Write the DataFrame to CSV, then download it through the browser
filename = 'filename.csv'
df.to_csv(filename, encoding='utf-8-sig')
files.download(filename)
```
```python
from google.colab import drive

# Mount Google Drive first (prompts for authorization)
drive.mount('/content/drive')

filename = 'filename.csv'
path = '/content/drive/My Drive/' + filename
with open(path, 'w', encoding='utf-8-sig') as f:
    df.to_csv(f)
```
## How to handle spreadsheets

```python
# Download the library needed to work with spreadsheets
!pip install gspread
```

```python
from google.colab import auth
from oauth2client.client import GoogleCredentials
import gspread

# Authentication process
auth.authenticate_user()
gc = gspread.authorize(GoogleCredentials.get_application_default())

ss_id = 'Spreadsheet ID'
sht_name = 'Sheet name'
workbook = gc.open_by_key(ss_id)
worksheet = workbook.worksheet(sht_name)

# When acquiring data
worksheet.acell('B1').value
worksheet.cell(2, 1).value
# When updating
worksheet.update_cell(row, column, 'Update contents')
```
Workbook (spreadsheet) operations:

Description | Code example |
---|---|
Select a spreadsheet by ID | `gc.open_by_key('ID')` |
Select a spreadsheet by URL | `gc.open_by_url('URL')` |
Get the spreadsheet title | `workbook.title` |
Get the spreadsheet ID | `workbook.id` |
Worksheet operations:

Description | Code example |
---|---|
Get a sheet by sheet name | `workbook.worksheet('Sheet name')` |
Get a sheet by index | `workbook.get_worksheet(index)` |
Get all sheets as an array | `workbook.worksheets()` |
Get the sheet name | `worksheet.title` |
Get the sheet ID | `worksheet.id` |
Cell operations:

Description | Code example |
---|---|
Get data by A1 notation | `worksheet.acell('B1').value` |
Get data by R1C1 notation | `worksheet.cell(1, 2).value` |
Select multiple cells and get them as a one-dimensional array | `worksheet.range('A1:B10')` |
Get data of the selected row | `worksheet.row_values(1)` |
Get formulas of the selected row | `worksheet.row_values(1, 2)` |
Get data of the selected column | `worksheet.column_values(1)` |
Get formulas of the selected column | `worksheet.column_values(1, 2)` |
Get all data | `worksheet.get_all_values()` |
Update a cell value by A1 notation | `worksheet.update_acell('B1', 'Value to update')` |
Update a cell value by R1C1 notation | `worksheet.update_cell(1, 2, 'Value to update')` |
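As a sketch of a batch update, which is much faster than updating cell by cell (the range and values here are made up):

```python
# Fill A1:B2 with placeholder values in a single API call
cells = worksheet.range('A1:B2')
for i, cell in enumerate(cells):
    cell.value = f'value {i}'
worksheet.update_cells(cells)
```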
[Reference sites]
- BeautifulSoup4 cheat sheet (selectors, etc.)
- Python3 memo: Beautiful Soup4 notes
- Basics of CSS selectors for web scraping
- Summary of frequently used Selenium WebDriver operations
- What is XPath? Learn the basic knowledge of XPath, indispensable for web scraping!
- Summary of XPath
- Summary of how to use the gspread library! Work with spreadsheets in Python