This is a memo of tools that can be used when scraping with Python.
The easiest way to access the web in Python is to use requests. You can install it with pip.
For GET and POST, requests.get and requests.post are generally sufficient (a minimal sketch follows the installation notes below).
Installation
$ pip install requests
Please see here for details. http://requests-docs-ja.readthedocs.org/en/latest/
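As a minimal sketch of GET and POST with requests (the URLs and field names below are placeholders, not taken from any of the samples in this memo):

import requests

# GET: query parameters go in the params argument
res = requests.get('https://example.com/search', params={'q': 'python'})
print(res.status_code)
print(res.text)  # response body as text

# POST: form fields go in the data argument (use json= for a JSON body)
res = requests.post('https://example.com/login', data={'user': 'alice', 'password': 'secret'})
print(res.status_code)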
BeautifulSoup4 is a good way to parse HTML.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<div><h1 id="test">TEST</h1></div>', 'html')
>>> soup.select_one('div h1#test').text
'TEST'
The text inside a tag is available as soup.text, and attributes can be accessed with soup['id'] (where id is the attribute name).
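Continuing the example above, attribute access on the matched tag looks like this:

>>> tag = soup.select_one('div h1#test')
>>> tag.text
'TEST'
>>> tag['id']
'test'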
Frequently used methods of a BeautifulSoup object
- BeautifulSoup.find() -> searches for a tag and returns the first match
- BeautifulSoup.find_all() -> searches for tags and returns a list of matches
- BeautifulSoup.find_previous() -> returns the previous tag
- BeautifulSoup.find_next() -> returns the next tag
- BeautifulSoup.find_parent() -> returns the parent tag
- BeautifulSoup.select() -> searches with a CSS selector and returns a list of matching tags
- BeautifulSoup.select_one() -> searches with a CSS selector and returns the first match
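As a minimal sketch of find, find_all, and select (the HTML snippet below is made up for illustration):

>>> from bs4 import BeautifulSoup
>>> html = '<ul><li class="item">one</li><li class="item">two</li></ul>'
>>> soup = BeautifulSoup(html, 'html')
>>> soup.find('li').text
'one'
>>> [li.text for li in soup.find_all('li')]
['one', 'two']
>>> [li.text for li in soup.select('ul li.item')]
['one', 'two']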
Please see here for details. http://kondou.com/BS4/
CSV is a comma-separated values format. You can handle it with the csv module. Learn more about the csv module here. http://docs.python.jp/3.4/library/csv.html
import csv

# Writing: someiterable is any iterable of rows (e.g. a list of lists)
with open('some.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(someiterable)
import csv

# Reading: each row comes back as a list of strings
with open('some.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
The JSON format is also commonly used. Use the standard json module.
>>> import json
>>> json.dumps([1, 2, 3, 4])
'[1, 2, 3, 4]'
>>> json.loads('[1, 2, 3, 4]')
[1, 2, 3, 4]
>>> json.dumps({'aho': 1, 'ajo': 2})
'{"aho": 1, "aro": 2}'
>>> json.loads('{"aho": 1, "ajo": 2}')
{u'aho': 1, u'aro': 2}
- json.dumps() -> convert an object to a JSON string
- json.loads() -> convert a JSON string to an object
- json.dump() -> convert an object to JSON and write it to a file
- json.load() -> read JSON from a file and convert it to an object
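A minimal sketch of json.dump and json.load with a file (the file name and data are made up for illustration):

import json

data = {'name': 'spam', 'count': 3}

# Write the object to a file as JSON
with open('data.json', 'w') as f:
    json.dump(data, f)

# Read the JSON back into a Python object
with open('data.json') as f:
    loaded = json.load(f)

print(loaded == data)  # True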
Please see here for details. http://docs.python.jp/3.4/library/json.html
We have prepared some scraping samples; please take a look. Note, however, that these target ordinary public sites, so please do not bombard them with requests. Never fire off requests in a tight loop, even by accident.
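As a courtesy, put a pause between consecutive requests. A minimal sketch (the URLs are placeholders):

import time

import requests

urls = ['https://example.com/page1', 'https://example.com/page2']

for url in urls:
    res = requests.get(url)
    print(res.status_code)
    time.sleep(1)  # wait between requests so you do not hammer the site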
- Extract tutorial information from PyConJP https://github.com/TakesxiSximada/happy-scraping/tree/master/pycon.jp
- Extract new package information from PyPI https://github.com/TakesxiSximada/happy-scraping/tree/master/pypi.python.org
- Break through Django's admin site authentication https://github.com/TakesxiSximada/happy-scraping/tree/master/djangoadmin
- User-Agent spoofing (see the sketch after this list) https://github.com/TakesxiSximada/happy-scraping/tree/master/fake-useragent
- Extract data dynamically generated by JavaScript https://github.com/TakesxiSximada/happy-scraping/tree/master/dynamic-page
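For User-Agent spoofing with requests, the basic idea is simply to pass a custom headers dict; this sketch shows the general technique, not the code from the linked sample, and the UA string and URL are placeholders:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/0.1)'}
res = requests.get('https://example.com/', headers=headers)
print(res.request.headers['User-Agent'])  # confirm the header that was actually sent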
- https://teratail.com/ It might be fun to scrape the entries on the top page.
- http://isitchristmas.com/ Christmas judgment (timely).
- https://data.nasa.gov/developer NASA publishes data here, so it may be interesting to dig into it.
There are many other sites that look good ...