This article is mainly a translation of the Scrapely README.
I tried the library out while working through the contents of the README.
If you just want the rough idea, read the Scrapely and Summary sections of this article.
A library for extracting structured data from HTML pages: given a sample web page and the data to be extracted, Scrapely builds a parser for all similar pages.
Data extraction uses an algorithm called Instance Based Learning. [^1] [^2]
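Scrapely's actual algorithm is more sophisticated than this, but the core idea of instance-based learning can be sketched in a few lines of plain Python: record the markup that surrounds an annotated value on the sample page, then look for the same surrounding markup on similar pages. The `train` and `scrape` functions below are a hypothetical toy illustration, not Scrapely's real API.

```python
import re

def train(page, example):
    """Record the markup immediately surrounding the annotated value."""
    i = page.index(example)
    before, after = page[:i], page[i + len(example):]
    # Keep only the innermost tag on each side as the learned context.
    prefix = '<' + before.rsplit('<', 1)[1]
    suffix = after.split('>', 1)[0] + '>'
    return prefix, suffix

def scrape(page, template):
    """Find whatever sits inside the same surrounding markup on a similar page."""
    prefix, suffix = template
    match = re.search(re.escape(prefix) + '(.*?)' + re.escape(suffix), page)
    return match.group(1) if match else None

tmpl = train('<html><h1>w3lib 1.1</h1></html>', 'w3lib 1.1')
print(tmpl)                                              # ('<h1>', '</h1>')
print(scrape('<html><h1>Django 1.3</h1></html>', tmpl))  # Django 1.3
```

Training on one page yields a template that generalizes to pages with the same structure, which is exactly the workflow the tool session below walks through.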
Installation
Scrapely works in Python 2.7 or 3.3+. It requires numpy and w3lib Python packages.
pip install scrapely
$ python -m scrapely.tool myscraper.json
scrapely> help
Documented commands (type help <topic>):
========================================
a annotate ls s ta
add_template del_template ls_annotations scrape td
al help ls_templates t tl
scrapely>
The usage of scrapely.tool is as follows:
python -m scrapely.tool <scraper_file> [command arg ...]
<scraper_file> is the file in which the template information is saved.
Short commands such as `a` and `ta` are aliases for commands such as `annotate` and `add_template`, respectively.
Command name | Description
---|---
add_template | `add_template {url} [--encoding ENCODING]` - add a template (alias: ta)
annotate | `annotate {template_id} {data} [-n number] [-f field]` - add or test an annotation (aliases: a, t)
del_template | `del_template {template_id}` - delete a template (alias: td)
ls_annotations | `ls_annotations {template}` - list annotations (alias: al)
ls_templates | list templates (aliases: ls, tl)
scrape | `scrape {url}` - scrape a url (alias: s)
scrapely> add_template http://pypi.python.org/pypi/w3lib/1.1
[0] http://pypi.python.org/pypi/w3lib/1.1
scrapely> ls_templates
[0] http://pypi.python.org/pypi/w3lib/1.1
scrapely> annotate 0 "w3lib 1.1"
[0] '<h1>w3lib 1.1</h1>'
[1] '<title>Python Package Index : w3lib 1.1</title>'
The command above matched two elements.
scrapely> annotate 0 "w3lib 1.1" -n 0
[0] '<h1>w3lib 1.1</h1>'
scrapely> annotate 0 "w3lib 1.1" -n 0 -f name
[new](name) '<h1>w3lib 1.1</h1>'
scrapely> annotate 0 "Scrapy project" -n 0 -f author
[new] '<span>Scrapy project</span>'
scrapely> ls_annotations 0
[0-0](name) '<h1>w3lib 1.1</h1>'
[0-1](author) '<span>Scrapy project</span>'
scrapely> scrape http://pypi.python.org/pypi/Django/1.3
[{'author': ['Django Software Foundation'], 'name': ['Django 1.3']}]
Scrapy is an application framework for building web crawlers, while Scrapely is a library for extracting structured data from HTML pages. In that sense, Scrapely is more like BeautifulSoup or lxml than Scrapy. [^3]
In ordinary site scraping you write selector specifications by hand; with Scrapely, I was able to scrape similar pages just by giving a sample URL and the sample data to extract.
Taking advantage of this characteristic, there was an open-source service that let people with no programming knowledge scrape sites. [^4]
That is my summary (and impression).
That was today's Friday I/O. At Wamuu Co., Ltd., every Friday is a day to work on something of interest and output the results in some way. Thank you very much.