Rewrite relative links in html to absolute links in python (lxml)

lxml

It is a convenient library when handling html (xml).

lxml - Processing XML and HTML with Python(http://lxml.de/)

lxml is the most feature-rich and easy-to-use library for processing XML and HTML in the Python language.

Using lxml, you can easily write the process of rewriting all relative links in html to absolute links.

Make_links_absolute () is good when rewriting all links to absolute links

Use lxml.html.make_links_absolute (). For example, suppose you have the following html (a.html).

a.html

<html>
  <head>
    <style type="text/css">
      .download {background-image:url(./images/download.png);}
    </style>
    <script src="./js/lib.js" type="text/javascript"/>
    <script src="./js/app.js" type="text/javascript"/>
  </head>
  <body>
    <img src="images/icon.png " alt="image"/>
    <a class="download "href="./download">download</a>
  </body>
</html>

Execute code similar to the following. Please give the base url to base_url.

from lxml import html

with open("./a.html", "r") as rf:
    doc = html.parse(rf).getroot()
    html.make_links_absolute(doc, base_url="http://example.net/foo/bar")
    print html.tostring(doc, pretty_print=True)

It can be rewritten as an absolute link as follows.

<html>
<head>
<style type="text/css">
      .download {background-image:url(http://example.net/foo/images/download.png);}
    </style>
<script src="http://example.net/foo/js/lib.js" type="text/javascript"></script><script src="http://example.net/foo/js/app.js" type="text/javascript"></script>
</head>
<body>
    <img src="http://example.net/foo/images/icon.png " alt="image"><a class="download " href="http://example.net/foo/download">download</a>
  </body>
</html>

It is wonderful that the following three are also interpreted as links.

src attribute of script tag
src attribute of img tag
Css background-image: url written in the page

If you want to do something a little more complicated, use rewrite_links ()

For example, you may want to rewrite all links to absolute links, but leave only the js files as relative links.

Let's take a look at the implementation of make_links_absolute ().

## lxml-3.2.1-py2.7-linux-x86_64.egg/lxml/html/__init__.py

class HtmlMixin(object):
#...
    def make_links_absolute(self, base_url=None, resolve_base_href=True):
        """
        Make all links in the document absolute, given the
        ``base_url`` for the document (the full URL where the document
        came from), or if no ``base_url`` is given, then the ``.base_url`` of the document.

        If ``resolve_base_href`` is true, then any ``<base href>``
        tags in the document are used *and* removed from the document.
        If it is false then any such tag is ignored.
        """
        if base_url is None:
            base_url = self.base_url
            if base_url is None:
                raise TypeError(
                    "No base_url given, and the document has no base_url")
        if resolve_base_href:
            self.resolve_base_href()
        def link_repl(href):
            return urljoin(base_url, href)
        self.rewrite_links(link_repl)

rewrite_links () is used. In other words, you should use this.

Suppose you want to change a link other than the path that refers to js in the previous html to an absolute link.

from lxml import html
from urlparse import urljoin
import functools

def repl(base_url, href):
    if href.endswith(".js"):
       return href
    else:
        return urljoin(base_url, href)

with open("./a.html", "r") as rf:
    doc = html.parse(rf).getroot()   
    base_url="http://example.net/foo/bar"
    doc.rewrite_links(functools.partial(repl, base_url))
    print html.tostring(doc, pretty_print=True)

The link to js (src attribute of script tag) remains a relative link.

<html>
<head>
<style type="text/css">
      .download {background-image:url(http://example.net/foo/images/download.png);}
    </style>
<script src="./js/lib.js" type="text/javascript"></script><script src="./js/app.js" type="text/javascript"></script>
</head>
<body>
    <img src="http://example.net/foo/images/icon.png " alt="image"><a class="download " href="http://example.net/foo/download">download</a>
  </body>
</html>

Precautions when rewriting html containing multibyte characters

As mentioned above, using lxml is convenient because you can easily rewrite from a relative link to an absolute link. You need to be a little careful when trying to convert html that contains multibyte characters such as Japanese.

For example, suppose you have the following html (b.html).

b.html

<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
Japanese string
    <p>AIUEO</p>
  </body>
</html>

Let's convert this with lxml.

# -*- coding:utf-8 -*-
from lxml import html

with open("./b.html", "r") as rf:
    doc = html.parse(rf).getroot()
    print html.tostring(doc, pretty_print=True)

The output result is as follows.

<html>
<head></head>
<body>
    &#26085;&#26412;&#35486;&#12398;&#25991;&#23383;&#21015;
    <p>&#12354;&#12356;&#12358;&#12360;&#12362;</p>
  </body>
</html>

When the generated html is viewed with a browser, it will be displayed as the intended character string. However, when I open this html with an editor etc., it is converted into a list of multiple numbers and symbols. In html, there are two types of character representation, "character entity reference" and "numerical character reference". This is because the character string passed in the former has been converted to the latter when it is passed to lxml. (For details, see http://www.asahi-net.or.jp/~sd5a-ucd/rec-html401j/charset.html#h-5.3.1)

This is exactly the same format as when converting a unicode string to a str type value in python as follows.

print(u"AIUEO".encode("ascii", "xmlcharrefreplace"))
## &#12354;&#12356;&#12358;&#12360;&#12362;

In fact, the help for encode () also mentions that.

$ python
Python 2.7.3 (default, Sep 26 2012, 21:51:14) 
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> help("".encode)
Help on built-in function encode:

encode(...)
    S.encode([encoding[,errors]]) -> object
    
    Encodes S using the codec registered for encoding. encoding defaults
    to the default encoding. errors may be given to set a different error
    handling scheme. Default is 'strict' meaning that encoding errors raise
    a UnicodeEncodeError. Other possible values are 'ignore', 'replace' and
    'xmlcharrefreplace' as well as any other name registered with
    codecs.register_error that is able to handle UnicodeEncodeErrors.

The story goes back. If you want to pass html containing multibyte characters to lxml, Please give encoding to tostring.

# -*- coding:utf-8 -*-
from lxml import html

with open("./b.html", "r") as rf:
    doc = html.parse(rf).getroot()
    print html.tostring(doc, pretty_print=True, encoding="utf-8")

It was output in a human-readable format.

<html>
<head></head>
<body>
Japanese string
    <p>AIUEO</p>
  </body>
</html>

It will continue for a little longer.

Continued, Precautions when rewriting html containing multibyte characters

As far as I looked around on the net, there are many places where I ended up passing the encoding to lxml.html.tostring. There are a few more traps.

In the previous example, the encoding specification was added to html. But when passed html without it, lxml returns strange output. (Although it can be said that the existence of html without encoding is evil in itself. The principle claim is powerless before reality)

html (d.html) without encoding specified

<html>
  <head>
  </head>
  <body>
Japanese string
    <p>AIUEO</p>
  </body>
</html>

The result is as follows.

<html>
<head></head>
<body>
    æ¥æ¬èªã®æåå
    <p>ããããã</p>
  </body>
</html>

Let's investigate the cause. Let's take a look at the implementation of lxml.html.parse, lxml.html.tostring. Since I passed the encoding earlier, I can say that the tostring is already supported, so I will look at the parse.

## lxml-3.2.1-py2.7-linux-x86_64.egg/lxml/html/__init__.py

def parse(filename_or_url, parser=None, base_url=None, **kw):
    """
    Parse a filename, URL, or file-like object into an HTML document
    tree.  Note: this returns a tree, not an element.  Use
    ``parse(...).getroot()`` to get the document root.

    You can override the base URL with the ``base_url`` keyword.  This
    is most useful when parsing from a file-like object.
    """
    if parser is None:
        parser = html_parser
    return etree.parse(filename_or_url, parser, base_url=base_url, **kw)

It seems that it receives an argument called parser and is passed a parser called html_parser by default.

## lxml-3.2.1-py2.7-linux-x86_64.egg/lxml/html/__init__.py

from lxml import etree

# ..snip..

class HTMLParser(etree.HTMLParser):
    """An HTML parser that is configured to return lxml.html Element
    objects.
    """
    def __init__(self, **kwargs):
        super(HTMLParser, self).__init__(**kwargs)
        self.set_element_class_lookup(HtmlElementClassLookup())

# ..snip..

html_parser = HTMLParser()
xhtml_parser = XHTMLParser()

A class that inherits from lxml.etree.HTMLParser is defined as HTMLParser. The default parser seems to be this instance. By the way, etree.HTMLParser seems to be a class written in C.

## lxml-3.2.1-py2.7-linux-x86_64.egg/lxml/etree/__init__.py

def __bootstrap__():
   global __bootstrap__, __loader__, __file__
   import sys, pkg_resources, imp
   __file__ = pkg_resources.resource_filename(__name__,'etree.so')
   __loader__ = None; del __bootstrap__, __loader__
   imp.load_dynamic(__name__,__file__)
__bootstrap__()

Let's take a look at the documentation (lxml.etree.HTMLParse: http://lxml.de/3.1/api/lxml.etree.HTMLParser-class.html)

Method Details 	[hide private]
__init__(self, encoding=None, remove_blank_text=False, remove_comments=False, remove_pis=False, strip_cdata=True, no_network=True, target=None, XMLSchema schema=None, recover=True, compact=True)
(Constructor)
	 
x.__init__(...) initializes x; see help(type(x)) for signature

Overrides: object.__init__

HTMLParser also seems to be able to specify encoding. Probably the default is None, and the behavior at this time is like setting the encoding by looking at the meta tag of html. If you don't have that meta tag, it probably sets the default encoding for your system environment.

I will imitate html_parser and create an instance.

# -*- coding:utf-8 -*-
from lxml import html

html_parser = html.HTMLParser(encoding="utf-8")

with open("./d.html", "r") as rf:
    doc = html.parse(rf, parser=html_parser).getroot()
    print html.tostring(doc, pretty_print=True, encoding="utf-8")

It seems that the output was successful.

<html>
<head></head>
<body>
Japanese string
    <p>AIUEO</p>
  </body>
</html>

Chardet if you don't know the proper encoding

If you don't know the proper encoding, you can use chardet (cchardet).

# -*- coding:utf-8 -*-
import chardet #if not installed. pip install chardet

print chardet.detect(u"AIUEO".encode("utf-8"))
# {'confidence': 0.9690625, 'encoding': 'utf-8'}

## warning
print chardet.detect(u"abcdefg".encode("utf-8")[:3])
# {'confidence': 1.0, 'encoding': 'ascii'}

def detect(fname, size = 4096 << 2):
    with open(fname, "rb") as rf:
        return chardet.detect(rf.read(size)).get("encoding")

print detect("b.html") # utf-8
print detect("a.html") # ascii

That's it.