PyQuery

Python has a handy module called PyQuery that provides a jQuery-like API. Beautiful Soup seems to be the more popular choice, but I find PyQuery easier to use. It is built on lxml, so I expect performance and reliability to be solid.

If you pass a URL to the constructor, it fetches the page for you. You can also pass an HTML string or a file object. After that, call the resulting object with a jQuery-style selector string to get all the matching elements.

You can also manipulate each element by passing a lambda expression or a function. If you know jQuery, you can probably guess what is available. Please see the manual for details!

DOM operation example

Set attributes on the selected elements with the .each() method. class is a reserved word in Python, so write class_ instead; it becomes the HTML class attribute.

`sample.py`


from pyquery import PyQuery as pq


html = '''
<ul>
  <li> item 1 </li>
  <li> item 2 </li>
  <li> item 3 </li>
</ul>
'''

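dom = pq(html)
# Set class="red" and x="123" on every <li>; class_ maps to the HTML class attribute
dom('li').each(lambda index, node: pq(node).attr(class_='red', x='123'))

# print() with parentheses works on both Python 2 and Python 3
print(dom)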

When I ran it, class and the made-up attribute x were set on each li element.

<ul>
  <li x="123" class="red"> item 1 </li>
  <li x="123" class="red"> item 2 </li>
  <li x="123" class="red"> item 3 </li>
</ul>

For class alone, you can do the same with dom('li').addClass('red').

Image URL acquisition sample

I made a sample program that accesses a web page and extracts the image URLs. It selects the img tags and accesses each element with .items().

`img_scraper.py`


#!/usr/bin/env python
try:
    from urllib.parse import urljoin  # Python 3
except ImportError:
    from urlparse import urljoin  # Python 2
from pyquery import PyQuery as pq
from pprint import pprint

url = 'http://www.yahoo.co.jp'

dom = pq(url)
result = set()
for img in dom('img').items():
    img_url = img.attr['src']
    if not img_url:
        continue  # skip <img> tags that have no src attribute
    # urljoin leaves absolute URLs unchanged and resolves relative ones against url
    result.add(urljoin(url, img_url))

pprint(result)

The result was as follows. The page changes over time, so your output will differ.

set(['http://i.yimg.jp/images/sicons/box16.gif',
     'http://k.yimg.jp/images/clear.gif',
     'http://k.yimg.jp/images/common/tv.gif',
     'http://k.yimg.jp/images/icon/photo.gif',
     'http://k.yimg.jp/images/new2.gif',
     'http://k.yimg.jp/images/sicons/ybm161.gif',
     'http://k.yimg.jp/images/top/sp/cgrade/iconMail.gif',
     'http://k.yimg.jp/images/top/sp/cgrade/icon_point.gif',
     'http://k.yimg.jp/images/top/sp/cgrade/info_btn-140325.gif',
     'http://k.yimg.jp/images/top/sp/cgrade/logo7.gif',
     'http://lpt.c.yimg.jp/im_sigg6mIfJALB8FuA5LAzp6.HPA---x120-y120/amd/20150208-00010001-dtohoku-000-view.jpg'])

If you select a tags instead of img tags, follow the collected links, and combine this with gevent for concurrent fetching, you can create a crawler in no time.

Google Finance Scraper

This is a script that scrapes financial statements from Google Finance. It is long, so I will only post the link to the Gist.

https://gist.github.com/knoguchi/6952087
