These are notes from scraping a website with Python, where I needed to parse the inline CSS written in a tag's style attribute.
Beautiful Soup doesn't seem to handle CSS itself, so I looked for a library that does.
I searched PyPI and went with cssutils, which came up at the top of the results. It is well documented and appears to be under active development, so it looked like a good choice.
I tried it on Python 3.3.3. Installation is a single pip command.
$ python -V
Python 3.3.3
$ pip install cssutils
Since we are parsing inline CSS here, we use cssutils.parseStyle. There are several other parsing interfaces; I haven't tried them, but it appears you can also parse a stylesheet by giving a file name or a URL. The character encoding can be specified with an optional argument.
>>> from cssutils import parseStyle
>>> style = parseStyle('width: 300px; margin: 0 20px 0 10px;')
>>> type(style)
<class 'cssutils.css.cssstyledeclaration.CSSStyleDeclaration'>
Parsing inline CSS gives you an object of class cssutils.css.CSSStyleDeclaration. What we want to do here is get the values specified for the width and margin properties from it.
It's easy to get the value of a property as a string.
>>> style.width
'300px'
>>> style.margin
'0 20px 0 10px'
When you need a more detailed analysis, for example when a value consists of multiple components or when you want to take units into account, use objects of the cssutils.css.Property and cssutils.css.PropertyValue classes.
>>> p = style.getProperty('margin')
>>> type(p)
<class 'cssutils.css.property.Property'>
>>> v = p.propertyValue
>>> type(v)
<class 'cssutils.css.value.PropertyValue'>
The cssutils.css.PropertyValue class lets you handle each component of a multi-part value individually.
>>> v.length
4
>>> v[0]
cssutils.css.DimensionValue('0')
>>> v[1]
cssutils.css.DimensionValue('20px')
Each component of the value can be accessed with list-like indexing. Here, objects of the cssutils.css.DimensionValue class are returned. This class handles units such as px and em.
>>> v[1].value
20
>>> v[1].dimension
'px'
>>> v[1].cssText
'20px'
There are other classes as well, such as cssutils.css.ColorValue and cssutils.css.URIValue, and it seems that the appropriate object is created depending on the format of the value.
- You can parse CSS in Python by using cssutils.
- You can easily handle values that consist of multiple components and values with units.