This article is mainly a translation of the Scrapely README.
I tried the library out while working through the contents of the README.
If you just want the rough idea, read the Scrapely and Summary sections of this article.
A library for extracting structured data from HTML pages: given a sample web page and the data to be extracted, Scrapely builds a parser for all similar pages.
Data extraction uses an algorithm called Instance Based Learning. [^1] [^2]
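Scrapely's actual algorithm is more sophisticated than this, but the core idea of instance-based learning can be sketched in a few lines of plain Python: record the markup that surrounds an annotated value on the sample page, then look for the same surrounding markup on similar pages. The `train` and `scrape` functions below are a hypothetical toy illustration, not Scrapely's real API.

```python
import re

def train(page, example):
    """Record the markup immediately surrounding the annotated value."""
    i = page.index(example)
    before, after = page[:i], page[i + len(example):]
    # Keep only the innermost tag on each side as the learned context.
    prefix = '<' + before.rsplit('<', 1)[1]
    suffix = after.split('>', 1)[0] + '>'
    return prefix, suffix

def scrape(page, template):
    """Find whatever sits inside the same surrounding markup on a similar page."""
    prefix, suffix = template
    match = re.search(re.escape(prefix) + '(.*?)' + re.escape(suffix), page)
    return match.group(1) if match else None

tmpl = train('<html><h1>w3lib 1.1</h1></html>', 'w3lib 1.1')
print(tmpl)                                              # ('<h1>', '</h1>')
print(scrape('<html><h1>Django 1.3</h1></html>', tmpl))  # Django 1.3
```

Training on one page yields a template that generalizes to pages with the same structure, which is exactly the workflow the tool session below walks through.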
Installation
Scrapely works in Python 2.7 or 3.3+. It requires numpy and w3lib Python packages.
pip install scrapely
$ python -m scrapely.tool myscraper.json
scrapely> help
Documented commands (type help <topic>):
========================================
a annotate ls s ta
add_template del_template ls_annotations scrape td
al help ls_templates t tl
scrapely>
The usage of scrapely.tool is as follows:
python -m scrapely.tool <scraper_file> [command arg ...]
<scraper_file> is the file in which the template information is saved.
Short commands such as `a` and `ta` are aliases for commands such as `annotate` and `add_template`, respectively.
Command name | Description
---|---
add_template | `add_template {url} [--encoding ENCODING]` - add a template (alias: ta)
annotate | `annotate {template_id} {data} [-n number] [-f field]` - add or test an annotation (aliases: a, t)
del_template | `del_template {template_id}` - delete a template (alias: td)
ls_annotations | `ls_annotations {template}` - list annotations (alias: al)
ls_templates | list templates (aliases: ls, tl)
scrape | `scrape {url}` - scrape a url (alias: s)
scrapely> add_template http://pypi.python.org/pypi/w3lib/1.1
[0] http://pypi.python.org/pypi/w3lib/1.1
scrapely> ls_templates
[0] http://pypi.python.org/pypi/w3lib/1.1
scrapely> annotate 0 "w3lib 1.1"
[0] '<h1>w3lib 1.1</h1>'
[1] '<title>Python Package Index : w3lib 1.1</title>'
The command above matched two elements.
scrapely> annotate 0 "w3lib 1.1" -n 0
[0] '<h1>w3lib 1.1</h1>'
scrapely> annotate 0 "w3lib 1.1" -n 0 -f name
[new](name) '<h1>w3lib 1.1</h1>'
scrapely> annotate 0 "Scrapy project" -n 0 -f author
[new] '<span>Scrapy project</span>'
scrapely> ls_annotations 0
[0-0](name) '<h1>w3lib 1.1</h1>'
[0-1](author) '<span>Scrapy project</span>'
scrapely> scrape http://pypi.python.org/pypi/Django/1.3
[{'author': ['Django Software Foundation'], 'name': ['Django 1.3']}]
Scrapy is an application framework for building web crawlers, while Scrapely is a library for extracting structured data from HTML pages. In that sense, Scrapely is more like BeautifulSoup or lxml than Scrapy. [^3]
In ordinary site scraping you write selector specifications by hand; with Scrapely, I was able to scrape similar pages just by giving a sample URL and the sample data to extract.
Taking advantage of this characteristic, there was an open-source service that let people with no programming knowledge scrape sites. [^4]
That is my summary (and impression).
That was today's Friday I/O. At Wamuu Co., Ltd., every Friday is a day to work on something of interest and output the results in some way. Thank you very much.