PyQuery
Python has a handy module called PyQuery that provides a jQuery-like API. Beautiful Soup seems to be the more popular choice, but I find PyQuery much easier to use. Because it is built on lxml, performance and reliability should be solid.
If you pass a URL to the constructor, it fetches the page for you. You can also pass an HTML string or a file object. After that, calling the resulting object with a jQuery-style selector string returns all the matching elements.
You can also manipulate each element by passing a lambda expression or a function. If you know jQuery, you can probably guess what is possible. See the manual for details!
The following sets attributes on the selected elements with the .each() method. Because class is a reserved word in Python, write class_ instead; it becomes the HTML class attribute.
sample.py
from pyquery import PyQuery as pq
html = '''
<ul>
<li> item 1 </li>
<li> item 2 </li>
<li> item 3 </li>
</ul>
'''
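# Parse the HTML string into a PyQuery document
dom = pq(html)
# Set class="red" and x="123" on every <li>; class_ maps to the HTML class attribute
dom('li').each(lambda index, node: pq(node).attr(class_='red', x='123'))
print(dom)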
When I ran it, class and an arbitrary attribute x were set on every li element.
<ul>
<li x="123" class="red"> item 1 </li>
<li x="123" class="red"> item 2 </li>
<li x="123" class="red"> item 3 </li>
</ul>
For class, you can do the same with dom('li').addClass('red').
I made a sample program that accesses a web page and extracts the URLs of its images. It selects the img tags and accesses each element with .items().
img_scraper.py
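#!/usr/bin/env python
from urllib.parse import urljoin  # Python 3; on Python 2 use "from urlparse import urljoin"
from pprint import pprint
from pyquery import PyQuery as pq
url = 'http://www.yahoo.co.jp'
dom = pq(url)  # passing a URL makes PyQuery fetch the page
result = set()
for img in dom('img').items():
    img_url = img.attr['src']
    if not img_url:
        continue  # skip <img> tags that have no src attribute
    if img_url.startswith('http'):
        result.add(img_url)
    else:
        # resolve relative paths against the page URL
        result.add(urljoin(url, img_url))
pprint(result)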
The result was as follows (this is Python 2 output, hence the set([...]) notation):
set(['http://i.yimg.jp/images/sicons/box16.gif',
'http://k.yimg.jp/images/clear.gif',
'http://k.yimg.jp/images/common/tv.gif',
'http://k.yimg.jp/images/icon/photo.gif',
'http://k.yimg.jp/images/new2.gif',
'http://k.yimg.jp/images/sicons/ybm161.gif',
'http://k.yimg.jp/images/top/sp/cgrade/iconMail.gif',
'http://k.yimg.jp/images/top/sp/cgrade/icon_point.gif',
'http://k.yimg.jp/images/top/sp/cgrade/info_btn-140325.gif',
'http://k.yimg.jp/images/top/sp/cgrade/logo7.gif',
'http://lpt.c.yimg.jp/im_sigg6mIfJALB8FuA5LAzp6.HPA---x120-y120/amd/20150208-00010001-dtohoku-000-view.jpg'])
If you select a tags instead of img tags and follow the collected links in combination with gevent, you can build a crawler in no time.
Here is a script for scraping financial statements from Google Finance. Since it is long, I will post only the link to the Gist.
https://gist.github.com/knoguchi/6952087