There is already plenty of material on scraping with Python, both on the web in general and on Qiita, and much of it says that pyquery is the easy choice. Personally, I'd like people to know how good Beautiful Soup is, so that is what I'll use here.
By the way, this entry is mostly a summary of the Beautiful Soup 4 documentation. See the documentation for more information.
English http://www.crummy.com/software/BeautifulSoup/bs4/doc/
Japanese http://kondou.com/BS4/
One argument for pyquery over Beautiful Soup is that it lets you work with HTML using CSS selectors, like jQuery, but **Beautiful Soup can do that too.** (I don't know about older versions.) I'll explain how below.
The current version is Beautiful Soup 4, so note that many commentary articles cover older versions. However, in most cases code written for Beautiful Soup 3 will still work if you simply switch it over to Beautiful Soup 4.
$ pip install beautifulsoup4
When dealing with HTML held as a plain string, it looks like this:
from bs4 import BeautifulSoup
html = """
<html>
...
</html>
"""
soup = BeautifulSoup(html)
Also, since Beautiful Soup cannot handle URLs directly, when scraping a website it is best to combine it with urllib2 or the like.
import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen("http://example.com")
soup = BeautifulSoup(html)
If you get a warning about the HTML parser here, follow the message and specify the parser explicitly. (For details, see the section about the HTML parser below.)
soup = BeautifulSoup(html, "html.parser")
To get all A tags from HTML
soup.find_all("a")
The object this returns, <class 'bs4.element.ResultSet'>, can be treated like a list.
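For example, the result can be indexed and iterated like an ordinary list (the HTML here is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<html><body><a href="/1">one</a><a href="/2">two</a></body></html>'
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a ResultSet, which behaves like a list
links = soup.find_all("a")
print(len(links))  # 2
for a in links:
    print(a.get("href"))  # /1, then /2
```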
To get only the first one instead of all tags
soup.find("a")
Or
soup.a
Both soup.find("a") and soup.a return None if the tag does not exist in the HTML.
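Because of that, it is safer to check for None before chaining further calls (the HTML here is made up for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>no links here</p>", "html.parser")

# find() returns None when nothing matches, so guard before using the result
a = soup.find("a")
if a is None:
    print("no a tag in this document")
else:
    print(a.get("href"))
```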
To get the attributes of the obtained tag
soup.a.get("href")
To get the characters in the obtained tag
soup.a.string
Of course you can also get nested tags
soup.p.find_all("a")
You can also narrow the search down by attribute values. For example, to get all a tags whose class is "link" and whose href is "/link", such as <a href="/link" class="link">:
soup.find_all("a", class_="link", href="/link")
Or
soup.find_all("a", attrs={"class": "link", "href": "/link"})
Note that class is a reserved word in Python, so the keyword argument is written class_.
Also, you do not have to specify the tag name.
soup.find_all(class_="link", href="/link")
soup.find_all(attrs={"class": "link", "href": "/link"})
To get all tags that start with b, such as B tags and BODY tags
import re
soup.find_all(re.compile("^b"))
To get all tags whose href attribute contains "link":
import re
soup.find_all(href=re.compile("link"))
To get all A tags that contain "hello" in the string inside the tag
import re
soup.find_all("a", text=re.compile("hello"))
If you use select instead of find_all, you can get tags using CSS selectors.
soup.select("#link1")
soup.select('a[href^="http://"]')
To add an attribute to a tag:
a = soup.find("a")
a["target"] = "_blank"
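Put together (the HTML here is made up for illustration), this looks like:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/link">spam</a>', "html.parser")

# setting an item on a Tag adds (or overwrites) an attribute
a = soup.find("a")
a["target"] = "_blank"
print(a)  # <a href="/link" target="_blank">spam</a>
```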
Use unwrap to remove a tag:
html = '''
<div>
<a href="/link">spam</a>
</div>
'''
soup = BeautifulSoup(html)
soup.div.a.unwrap()
soup.div
# <div>spam</div>
Conversely, if you want to add a new tag, create one with soup.new_tag and add it with wrap.
html = '''
<div>
<a href="/link">spam</a>
</div>
'''
soup = BeautifulSoup(html)
soup.div.a.wrap(soup.new_tag("p"))
There are also many other manipulation methods, such as insert and extract, so you can flexibly add and remove content and tags.
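For example, extract pulls a tag out of the tree and returns it (a minimal sketch with made-up HTML):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><a href="/link">spam</a>egg</div>', "html.parser")

# extract() removes the tag from the tree and returns it
removed = soup.div.a.extract()
print(soup.div)   # <div>egg</div>
print(removed)    # <a href="/link">spam</a>
```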
Calling prettify formats the document neatly and returns it as a string.
soup.prettify()
# <html>
# <head>
# <title>
# Hello
# </title>
# </head>
# <body>
# <div>
# <a href="/link">
# spam
# </a>
# </div>
# <div>
# ...
# </div>
# </body>
# </html>
soup.div.prettify()
# <div>
# <a href="/link">
# spam
# </a>
# </div>
The HTML parser defaults to Python's standard html.parser, but if lxml or html5lib is installed, that will be used in preference. To specify one explicitly, do the following:
soup = BeautifulSoup(html, "lxml")
If your Python version is old, html.parser may not be able to parse correctly. In my environment, parsing worked with Python 2.7.3 but failed with Python 2.6.
To be safe, install lxml or html5lib whenever possible so that parsing works properly. Note, however, that lxml depends on external C libraries, so you may need to install those as well depending on your environment.
In my case, I run a site of my own that collects articles from multiple blogs into a database. I usually fetch articles via RSS, but when a feed only carries a small number of entries, I read the HTML with Beautiful Soup instead and save its contents.
Also, when displaying the body of a saved article, I strip out unnecessary advertisements and set target on the a tags so that links open in a new tab.
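That post-processing might look roughly like the sketch below. The "ad" class name and the HTML are made-up assumptions for illustration; the real markers for advertisements depend on the blog in question.

```python
from bs4 import BeautifulSoup

html = '''
<div class="entry">
  <p>Body text with a <a href="/link">link</a>.</p>
  <div class="ad">an advertisement</div>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# strip advertisement blocks (the "ad" class is a made-up example)
for ad in soup.find_all("div", class_="ad"):
    ad.extract()

# make every remaining link open in a new tab
for a in soup.find_all("a"):
    a["target"] = "_blank"

body = str(soup.div)
```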
Reference: http://itkr.net
I think Beautiful Soup is excellent for such applications.