There is already plenty of material on scraping with Python, both on the web in general and on Qiita, and much of it says that pyquery is the easy choice. Personally, I'd like people to know how good Beautiful Soup is, so that is what I'll use here.
By the way, this entry is mostly a summary of the Beautiful Soup 4 documentation. See the documentation for more information.
English http://www.crummy.com/software/BeautifulSoup/bs4/doc/
Japanese http://kondou.com/BS4/
One argument for pyquery over Beautiful Soup is that it lets you work with HTML using CSS selectors, like jQuery, but **Beautiful Soup can do that too.** (I don't know about older versions.) I'll explain how below.
The current version is Beautiful Soup 4, so note that many commentary articles cover older versions. However, in most cases code written for Beautiful Soup 3 will still work if you simply switch it over to Beautiful Soup 4.
$ pip install beautifulsoup4
When dealing with HTML held as a plain string, it looks like this:
from bs4 import BeautifulSoup
html = """
<html>
...
</html>
"""
soup = BeautifulSoup(html)
Also, since Beautiful Soup cannot handle URLs directly, when scraping a website it is best to combine it with urllib2 or the like.
import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen("http://example.com")
soup = BeautifulSoup(html)
If you get a warning about the HTML parser here, follow the message and specify the parser explicitly. (For details, see the section about the HTML parser below.)
soup = BeautifulSoup(html, "html.parser")
To get all A tags from HTML
soup.find_all("a")
The object this returns, <class 'bs4.element.ResultSet'>, can be treated like a list.
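For example, the result can be indexed and iterated like an ordinary list (the HTML here is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<html><body><a href="/1">one</a><a href="/2">two</a></body></html>'
soup = BeautifulSoup(html, "html.parser")

# find_all() returns a ResultSet, which behaves like a list
links = soup.find_all("a")
print(len(links))  # 2
for a in links:
    print(a.get("href"))  # /1, then /2
```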
To get only the first one instead of all tags
soup.find("a")
Or
soup.a
Both soup.find("a") and soup.a return None if the tag does not exist in the HTML.
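Because of that, it is safer to check for None before chaining further calls (the HTML here is made up for illustration):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>no links here</p>", "html.parser")

# find() returns None when nothing matches, so guard before using the result
a = soup.find("a")
if a is None:
    print("no a tag in this document")
else:
    print(a.get("href"))
```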
To get the attributes of the obtained tag
soup.a.get("href")
To get the characters in the obtained tag
soup.a.string
Of course you can also get nested tags
soup.p.find_all("a")
You can also narrow the search down by attribute values. For example, to get all a tags whose class is "link" and whose href is "/link", such as <a href="/link" class="link">:
soup.find_all("a", class_="link", href="/link")
Or
soup.find_all("a", attrs={"class": "link", "href": "/link"})
Note that class is a reserved word in Python, so the keyword argument is written class_.
Also, you do not have to specify the tag name.
soup.find_all(class_="link", href="/link")
soup.find_all(attrs={"class": "link", "href": "/link"})
To get all tags that start with b, such as B tags and BODY tags
import re
soup.find_all(re.compile("^b"))
To get all tags whose href attribute contains "link":
import re
soup.find_all(href=re.compile("link"))
To get all A tags that contain "hello" in the string inside the tag
import re
soup.find_all("a", text=re.compile("hello"))
If you use select instead of find_all, you can get tags using CSS selectors.
soup.select("#link1")
soup.select('a[href^="http://"]')
To add an attribute to a tag:
a = soup.find("a")
a["target"] = "_blank"
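Put together (the HTML here is made up for illustration), this looks like:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="/link">spam</a>', "html.parser")

# setting an item on a Tag adds (or overwrites) an attribute
a = soup.find("a")
a["target"] = "_blank"
print(a)  # <a href="/link" target="_blank">spam</a>
```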
Use unwrap to remove a tag:
html = '''
<div>
<a href="/link">spam</a>
</div>
'''
soup = BeautifulSoup(html)
soup.div.a.unwrap()
soup.div
# <div>spam</div>
Conversely, if you want to add a new tag, create one with soup.new_tag and add it with wrap.
html = '''
<div>
<a href="/link">spam</a>
</div>
'''
soup = BeautifulSoup(html)
soup.div.a.wrap(soup.new_tag("p"))
There are also many other manipulation methods, such as insert and extract, so you can flexibly add and remove content and tags.
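For example, extract pulls a tag out of the tree and returns it (a minimal sketch with made-up HTML):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<div><a href="/link">spam</a>egg</div>', "html.parser")

# extract() removes the tag from the tree and returns it
removed = soup.div.a.extract()
print(soup.div)   # <div>egg</div>
print(removed)    # <a href="/link">spam</a>
```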
Calling prettify formats the document neatly and returns it as a string.
soup.prettify()
# <html>
# <head>
# <title>
# Hello
# </title>
# </head>
# <body>
# <div>
# <a href="/link">
# spam
# </a>
# </div>
# <div>
# ...
# </div>
# </body>
# </html>
soup.div.prettify()
# <div>
# <a href="/link">
# spam
# </a>
# </div>
The HTML parser defaults to Python's standard html.parser, but if lxml or html5lib is installed, that will be used in preference. To specify one explicitly, do the following:
soup = BeautifulSoup(html, "lxml")
If your Python version is old, html.parser may not be able to parse correctly. In my environment, parsing worked with Python 2.7.3 but failed with Python 2.6.
To be safe, install lxml or html5lib whenever possible so that parsing works properly. Note, however, that lxml depends on external C libraries, so you may need to install those as well depending on your environment.
In my case, I run a site of my own that collects articles from multiple blogs into a database. I usually fetch articles via RSS, but when a feed only carries a small number of entries, I read the HTML with Beautiful Soup instead and save its contents.
Also, when displaying the body of a saved article, I strip out unnecessary advertisements and set target on the a tags so that links open in a new tab.
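That post-processing might look roughly like the sketch below. The "ad" class name and the HTML are made-up assumptions for illustration; the real markers for advertisements depend on the blog in question.

```python
from bs4 import BeautifulSoup

html = '''
<div class="entry">
  <p>Body text with a <a href="/link">link</a>.</p>
  <div class="ad">an advertisement</div>
</div>
'''
soup = BeautifulSoup(html, "html.parser")

# strip advertisement blocks (the "ad" class is a made-up example)
for ad in soup.find_all("div", class_="ad"):
    ad.extract()

# make every remaining link open in a new tab
for a in soup.find_all("a"):
    a["target"] = "_blank"

body = str(soup.div)
```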
Reference: http://itkr.net
I think Beautiful Soup is excellent for such applications.