lxml
It is a convenient library when handling html (xml).
lxml - Processing XML and HTML with Python(http://lxml.de/)
lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.
Using lxml, you can easily write the process of rewriting all relative links in html to absolute links.
Use lxml.html.make_links_absolute (). For example, suppose you have the following html (a.html).
a.html
<html>
<head>
<style type="text/css">
.download {background-image:url(./images/download.png);}
</style>
<script src="./js/lib.js" type="text/javascript"/>
<script src="./js/app.js" type="text/javascript"/>
</head>
<body>
<img src="images/icon.png " alt="image"/>
<a class="download "href="./download">download</a>
</body>
</html>
Execute code similar to the following. Please give the base url to base_url.
from lxml import html
with open("./a.html", "r") as rf:
doc = html.parse(rf).getroot()
html.make_links_absolute(doc, base_url="http://example.net/foo/bar")
print html.tostring(doc, pretty_print=True)
It can be rewritten as an absolute link as follows.
<html>
<head>
<style type="text/css">
.download {background-image:url(http://example.net/foo/images/download.png);}
</style>
<script src="http://example.net/foo/js/lib.js" type="text/javascript"></script><script src="http://example.net/foo/js/app.js" type="text/javascript"></script>
</head>
<body>
<img src="http://example.net/foo/images/icon.png " alt="image"><a class="download " href="http://example.net/foo/download">download</a>
</body>
</html>
It is wonderful that the following three are also interpreted as links.
For example, you may want to rewrite all links to absolute links, but leave only the js files as relative links.
Let's take a look at the implementation of make_links_absolute ().
## lxml-3.2.1-py2.7-linux-x86_64.egg/lxml/html/__init__.py
class HtmlMixin(object):
#...
def make_links_absolute(self, base_url=None, resolve_base_href=True):
"""
Make all links in the document absolute, given the
``base_url`` for the document (the full URL where the document
came from), or if no ``base_url`` is given, then the ``.base_url`` of the document.
If ``resolve_base_href`` is true, then any ``<base href>``
tags in the document are used *and* removed from the document.
If it is false then any such tag is ignored.
"""
if base_url is None:
base_url = self.base_url
if base_url is None:
raise TypeError(
"No base_url given, and the document has no base_url")
if resolve_base_href:
self.resolve_base_href()
def link_repl(href):
return urljoin(base_url, href)
self.rewrite_links(link_repl)
rewrite_links () is used. In other words, you should use this.
Suppose you want to change a link other than the path that refers to js in the previous html to an absolute link.
from lxml import html
from urlparse import urljoin
import functools
def repl(base_url, href):
if href.endswith(".js"):
return href
else:
return urljoin(base_url, href)
with open("./a.html", "r") as rf:
doc = html.parse(rf).getroot()
base_url="http://example.net/foo/bar"
doc.rewrite_links(functools.partial(repl, base_url))
print html.tostring(doc, pretty_print=True)
The link to js (src attribute of script tag) remains a relative link.
<html>
<head>
<style type="text/css">
.download {background-image:url(http://example.net/foo/images/download.png);}
</style>
<script src="./js/lib.js" type="text/javascript"></script><script src="./js/app.js" type="text/javascript"></script>
</head>
<body>
<img src="http://example.net/foo/images/icon.png " alt="image"><a class="download " href="http://example.net/foo/download">download</a>
</body>
</html>
As mentioned above, using lxml is convenient because you can easily rewrite from a relative link to an absolute link. You need to be a little careful when trying to convert html that contains multibyte characters such as Japanese.
For example, suppose you have the following html (b.html).
b.html
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
Japanese string
<p>AIUEO</p>
</body>
</html>
Let's convert this with lxml.
# -*- coding:utf-8 -*-
from lxml import html
with open("./b.html", "r") as rf:
doc = html.parse(rf).getroot()
print html.tostring(doc, pretty_print=True)
The output result is as follows.
<html>
<head></head>
<body>
日本語の文字列
<p>あいうえお</p>
</body>
</html>
When the generated html is viewed with a browser, it will be displayed as the intended character string. However, when I open this html with an editor etc., it is converted into a list of multiple numbers and symbols. In html, there are two types of character representation, "character entity reference" and "numerical character reference". This is because the character string passed in the former has been converted to the latter when it is passed to lxml. (For details, see http://www.asahi-net.or.jp/~sd5a-ucd/rec-html401j/charset.html#h-5.3.1)
This is exactly the same format as when converting a unicode string to a str type value in python as follows.
print(u"AIUEO".encode("ascii", "xmlcharrefreplace"))
## あいうえお
In fact, the help for encode () also mentions that.
$ python
Python 2.7.3 (default, Sep 26 2012, 21:51:14)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> help("".encode)
Help on built-in function encode:
encode(...)
S.encode([encoding[,errors]]) -> object
Encodes S using the codec registered for encoding. encoding defaults
to the default encoding. errors may be given to set a different error
handling scheme. Default is 'strict' meaning that encoding errors raise
a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
'xmlcharrefreplace' as well as any other name registered with
codecs.register_error that is able to handle UnicodeEncodeErrors.
The story goes back. If you want to pass html containing multibyte characters to lxml, Please give encoding to tostring.
# -*- coding:utf-8 -*-
from lxml import html
with open("./b.html", "r") as rf:
doc = html.parse(rf).getroot()
print html.tostring(doc, pretty_print=True, encoding="utf-8")
It was output in a human-readable format.
<html>
<head></head>
<body>
Japanese string
<p>AIUEO</p>
</body>
</html>
It will continue for a little longer.
As far as I looked around on the net, there are many places where I ended up passing the encoding to lxml.html.tostring. There are a few more traps.
In the previous example, the encoding specification was added to html. But when passed html without it, lxml returns strange output. (Although it can be said that the existence of html without encoding is evil in itself. The principle claim is powerless before reality)
html (d.html) without encoding specified
<html>
<head>
</head>
<body>
Japanese string
<p>AIUEO</p>
</body>
</html>
The result is as follows.
<html>
<head></head>
<body>
æ¥æ¬èªã®æåå
<p>ããããã</p>
</body>
</html>
Let's investigate the cause. Let's take a look at the implementation of lxml.html.parse, lxml.html.tostring. Since I passed the encoding earlier, I can say that the tostring is already supported, so I will look at the parse.
## lxml-3.2.1-py2.7-linux-x86_64.egg/lxml/html/__init__.py
def parse(filename_or_url, parser=None, base_url=None, **kw):
"""
Parse a filename, URL, or file-like object into an HTML document
tree. Note: this returns a tree, not an element. Use
``parse(...).getroot()`` to get the document root.
You can override the base URL with the ``base_url`` keyword. This
is most useful when parsing from a file-like object.
"""
if parser is None:
parser = html_parser
return etree.parse(filename_or_url, parser, base_url=base_url, **kw)
It seems that it receives an argument called parser and is passed a parser called html_parser by default.
## lxml-3.2.1-py2.7-linux-x86_64.egg/lxml/html/__init__.py
from lxml import etree
# ..snip..
class HTMLParser(etree.HTMLParser):
"""An HTML parser that is configured to return lxml.html Element
objects.
"""
def __init__(self, **kwargs):
super(HTMLParser, self).__init__(**kwargs)
self.set_element_class_lookup(HtmlElementClassLookup())
# ..snip..
html_parser = HTMLParser()
xhtml_parser = XHTMLParser()
A class that inherits from lxml.etree.HTMLParser is defined as HTMLParser. The default parser seems to be this instance. By the way, etree.HTMLParser seems to be a class written in C.
## lxml-3.2.1-py2.7-linux-x86_64.egg/lxml/etree/__init__.py
def __bootstrap__():
global __bootstrap__, __loader__, __file__
import sys, pkg_resources, imp
__file__ = pkg_resources.resource_filename(__name__,'etree.so')
__loader__ = None; del __bootstrap__, __loader__
imp.load_dynamic(__name__,__file__)
__bootstrap__()
Let's take a look at the documentation (lxml.etree.HTMLParse: http://lxml.de/3.1/api/lxml.etree.HTMLParser-class.html)
Method Details [hide private]
__init__(self, encoding=None, remove_blank_text=False, remove_comments=False, remove_pis=False, strip_cdata=True, no_network=True, target=None, XMLSchema schema=None, recover=True, compact=True)
(Constructor)
x.__init__(...) initializes x; see help(type(x)) for signature
Overrides: object.__init__
HTMLParser also seems to be able to specify encoding. Probably the default is None, and the behavior at this time is like setting the encoding by looking at the meta tag of html. If you don't have that meta tag, it probably sets the default encoding for your system environment.
I will imitate html_parser and create an instance.
# -*- coding:utf-8 -*-
from lxml import html
html_parser = html.HTMLParser(encoding="utf-8")
with open("./d.html", "r") as rf:
doc = html.parse(rf, parser=html_parser).getroot()
print html.tostring(doc, pretty_print=True, encoding="utf-8")
It seems that the output was successful.
<html>
<head></head>
<body>
Japanese string
<p>AIUEO</p>
</body>
</html>
If you don't know the proper encoding, you can use chardet (cchardet).
# -*- coding:utf-8 -*-
import chardet #if not installed. pip install chardet
print chardet.detect(u"AIUEO".encode("utf-8"))
# {'confidence': 0.9690625, 'encoding': 'utf-8'}
## warning
print chardet.detect(u"abcdefg".encode("utf-8")[:3])
# {'confidence': 1.0, 'encoding': 'ascii'}
def detect(fname, size = 4096 << 2):
with open(fname, "rb") as rf:
return chardet.detect(rf.read(size)).get("encoding")
print detect("b.html") # utf-8
print detect("a.html") # ascii
That's it.
Recommended Posts