Scraping with Beautiful Soup

Environment: Mac, Python 3

Preparation

Install Beautiful Soup and lxml

$ pip install beautifulsoup4
$ pip install lxml

An error appeared partway through, but the installation completed successfully, and I have had no problems so far.

Creating the soup object

from bs4 import BeautifulSoup
import urllib.request

# When getting HTML from the web
url = '××××××××××××'
req = urllib.request.Request(url)
response = urllib.request.urlopen(req)
html = response.read()
soup = BeautifulSoup(html, "lxml")

# When opening a local HTML file directly
soup = BeautifulSoup(open("index.html"), "lxml")
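
For quick experiments it can be handy to build the soup from an HTML string instead of a URL or a file. The following is a minimal sketch under some assumptions: it uses Python's built-in "html.parser" rather than lxml so that it runs without extra dependencies, and the HTML string is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = "<html><head><title>test title</title></head><body><p>hello</p></body></html>"

# "html.parser" ships with Python; "lxml" can be passed instead if it is installed
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)
print(soup.p.string)
```

When reading a local file, wrapping open() in a with statement closes the file automatically once the soup has been built.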

What to do next

Get the element by specifying the tag that contains the information you want.

Frequently used ways to specify elements


- Specify a class
   soup.find(class_='class_name')
   # class is a reserved word in Python, so the keyword is class_. Writing class= raises a SyntaxError.
- Specify an id
   soup.find(id="id_name")
   # id can be used as is.
- Specify the tag as well
   soup.find('li', class_='class_name')
   soup.find('div', id="id_name")
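
The three forms above can be tried on a small document. This is only a sketch: the HTML, the class and id names, and the use of "html.parser" are assumptions made for illustration. It also shows that find() returns None when nothing matches.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = """
<ul id="menu">
  <li class="item">first</li>
  <li class="item">second</li>
</ul>
<div id="footer">footer text</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_item = soup.find("li", class_="item")   # first <li class="item">
footer = soup.find("div", id="footer")        # the <div id="footer">
missing = soup.find("div", id="no_such_id")   # no match, so this is None

print(first_item.string)
print(footer.string)
print(missing)
```

Because a failed search gives None, it is safer to check the result before accessing attributes such as .string on it.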

find() returns only the first match. To get every match, use find_all().

images = soup.find_all('img')
for img in images:
    pass  # process each img here
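
As one possible form of the per-element processing, the sketch below collects the src of every img tag. The HTML and file names are hypothetical, and "html.parser" is assumed so that the example runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = """
<div>
  <img src="a.png"/>
  <img src="b.png"/>
  <img alt="no source"/>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

srcs = []
for img in soup.find_all("img"):
    src = img.get("src")  # get() returns None instead of raising KeyError
    if src is not None:
        srcs.append(src)

print(srcs)
```

Using img.get("src") rather than img["src"] avoids an exception for tags that lack the attribute.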

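You can also specify elements with CSS selectors by using select(), which returns a list of all matches.

soup.select("p > a")
soup.select('a[href="http://example.com/"]')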
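
To see what these two selectors match, here is a small sketch. The HTML is invented for illustration and "html.parser" is an assumption. "p > a" matches only a tags that are direct children of a p tag, and the attribute selector matches on the exact href value.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = """
<p><a href="http://example.com/">direct child</a></p>
<p><span><a href="http://example.org/">nested deeper</a></span></p>
"""
soup = BeautifulSoup(html, "html.parser")

children = soup.select("p > a")                        # only <a> directly under <p>
exact = soup.select('a[href="http://example.com/"]')   # exact attribute match
first = soup.select_one("p > a")                       # first match, or None

print([a.string for a in children])
print([a["href"] for a in exact])
print(first.string)
```

select() always returns a list, while select_one() returns a single element, much like find_all() and find().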

Execution sample

The samples below assume that the HTML has already been loaded into soup.

Sample 1: Get the text between the tags

`sample.html`


<html>
  <title>test title</title>
</html>

>>> soup.title
<title>test title</title>
>>> soup.title.string
'test title'

Appending .string to a tag returns the text inside it.
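
One caveat worth knowing: .string only works when the tag has a single child. The sketch below, using made-up HTML and assuming "html.parser", compares it with get_text(), which concatenates all the text inside a tag.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = "<p>Hello <b>world</b></p>"
soup = BeautifulSoup(html, "html.parser")

bold_text = soup.b.string      # <b> has a single text child
para_string = soup.p.string    # <p> has two children, so this is None
para_text = soup.p.get_text()  # all text inside <p>, nested tags included

print(bold_text)
print(para_string)
print(para_text)
```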

Sample 2: Extract the src of the img tag

`sample.html`


<html>
  <div id="hoge">
    <img class="fuga" src="http://××.com/sample.jpg "/>
  </div>
</html>

First, get the div tag with id="hoge".

>>> div = soup.find('div', id="hoge")
>>> div
<div id="hoge">
  <img class="fuga" src="http://××.com/sample.jpg "/>
</div>

Next, get the img tag with class="fuga" from the div.

>>> img = div.find('img', class_='fuga')
>>> img
<img class="fuga" src="http://××.com/sample.jpg "/>
>>> img['src']
'http://××.com/sample.jpg '

In this case you don't actually need to get the div first. I added it only because I wanted a sample that narrows down the search step by step.
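
Narrowing down does matter once the same class appears in more than one place. The following sketch extends the sample with a second, hypothetical div to show this. The example.com URLs, the "other" id, and the use of "html.parser" are assumptions for illustration. strip() is applied because the src value in the sample carries a trailing space.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: two divs that both contain an img with class "fuga"
html = """
<html>
  <div id="other">
    <img class="fuga" src="http://example.com/other.jpg"/>
  </div>
  <div id="hoge">
    <img class="fuga" src="http://example.com/sample.jpg "/>
  </div>
</html>
"""
soup = BeautifulSoup(html, "html.parser")

# Searching the whole document returns the first match, which is in the wrong div
unscoped = soup.find("img", class_="fuga")["src"]

# Searching inside the div with id "hoge" returns the intended image
div = soup.find("div", id="hoge")
img = div.find("img", class_="fuga")
src = img["src"].strip()  # remove surrounding whitespace from the attribute value

print(unscoped)
print(src)
```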

Reference: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup
