This is a memo of tools that can be used when scraping with Python.
The easiest way to access the web in Python is to use requests. You can install it with pip.
For GET and POST, requests.get and requests.post are generally sufficient (a minimal sketch follows the installation notes below).
Installation
$ pip install requests
Please see here for details. http://requests-docs-ja.readthedocs.org/en/latest/
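As a minimal sketch of GET and POST with requests (the URLs and field names below are placeholders, not taken from any of the samples in this memo):

import requests

# GET: query parameters go in the params argument
res = requests.get('https://example.com/search', params={'q': 'python'})
print(res.status_code)
print(res.text)  # response body as text

# POST: form fields go in the data argument (use json= for a JSON body)
res = requests.post('https://example.com/login', data={'user': 'alice', 'password': 'secret'})
print(res.status_code)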
BeautifulSoup4 is a good way to parse HTML.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<div><h1 id="test">TEST</h1></div>', 'html')
>>> soup.select_one('div h1#test').text
'TEST'
The text inside a tag is available as soup.text, and attributes can be accessed with soup['id'] (where id is the attribute name).
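Continuing the example above, attribute access on the matched tag looks like this:

>>> tag = soup.select_one('div h1#test')
>>> tag.text
'TEST'
>>> tag['id']
'test'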
Frequently used methods of a BeautifulSoup object
- BeautifulSoup.find() -> searches for a tag and returns the first match
- BeautifulSoup.find_all() -> searches for tags and returns a list of matches
- BeautifulSoup.find_previous() -> returns the previous tag
- BeautifulSoup.find_next() -> returns the next tag
- BeautifulSoup.find_parent() -> returns the parent tag
- BeautifulSoup.select() -> searches with a CSS selector and returns a list of matching tags
- BeautifulSoup.select_one() -> searches with a CSS selector and returns the first match
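As a minimal sketch of find, find_all, and select (the HTML snippet below is made up for illustration):

>>> from bs4 import BeautifulSoup
>>> html = '<ul><li class="item">one</li><li class="item">two</li></ul>'
>>> soup = BeautifulSoup(html, 'html')
>>> soup.find('li').text
'one'
>>> [li.text for li in soup.find_all('li')]
['one', 'two']
>>> [li.text for li in soup.select('ul li.item')]
['one', 'two']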
Please see here for details. http://kondou.com/BS4/
CSV is a comma-separated values format. You can handle it with the csv module. Learn more about the csv module here. http://docs.python.jp/3.4/library/csv.html
import csv

# Writing: someiterable is any iterable of rows (e.g. a list of lists)
with open('some.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(someiterable)
import csv

# Reading: each row comes back as a list of strings
with open('some.csv', 'r', newline='') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
The JSON format is also commonly used. Use the standard json module.
>>> import json
>>> json.dumps([1, 2, 3, 4])
'[1, 2, 3, 4]'
>>> json.loads('[1, 2, 3, 4]')
[1, 2, 3, 4]
>>> json.dumps({'aho': 1, 'ajo': 2})
'{"aho": 1, "aro": 2}'
>>> json.loads('{"aho": 1, "ajo": 2}')
{u'aho': 1, u'aro': 2}
- json.dumps() -> convert an object to a JSON string
- json.loads() -> convert a JSON string to an object
- json.dump() -> convert an object to JSON and write it to a file
- json.load() -> read JSON from a file and convert it to an object
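A minimal sketch of json.dump and json.load with a file (the file name and data are made up for illustration):

import json

data = {'name': 'spam', 'count': 3}

# Write the object to a file as JSON
with open('data.json', 'w') as f:
    json.dump(data, f)

# Read the JSON back into a Python object
with open('data.json') as f:
    loaded = json.load(f)

print(loaded == data)  # True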
Please see here for details. http://docs.python.jp/3.4/library/json.html
We have prepared some scraping samples; please take a look. Note, however, that these target ordinary public sites, so please do not bombard them with requests. Never fire off requests in a tight loop, even by accident.
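As a courtesy, put a pause between consecutive requests. A minimal sketch (the URLs are placeholders):

import time

import requests

urls = ['https://example.com/page1', 'https://example.com/page2']

for url in urls:
    res = requests.get(url)
    print(res.status_code)
    time.sleep(1)  # wait between requests so you do not hammer the site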
- Extract tutorial information from PyConJP https://github.com/TakesxiSximada/happy-scraping/tree/master/pycon.jp
- Extract new package information from PyPI https://github.com/TakesxiSximada/happy-scraping/tree/master/pypi.python.org
- Break through Django's admin site authentication https://github.com/TakesxiSximada/happy-scraping/tree/master/djangoadmin
- User-Agent spoofing (see the sketch after this list) https://github.com/TakesxiSximada/happy-scraping/tree/master/fake-useragent
- Extract data dynamically generated by JavaScript https://github.com/TakesxiSximada/happy-scraping/tree/master/dynamic-page
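For User-Agent spoofing with requests, the basic idea is simply to pass a custom headers dict; this sketch shows the general technique, not the code from the linked sample, and the UA string and URL are placeholders:

import requests

headers = {'User-Agent': 'Mozilla/5.0 (compatible; MyScraper/0.1)'}
res = requests.get('https://example.com/', headers=headers)
print(res.request.headers['User-Agent'])  # confirm the header that was actually sent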
- https://teratail.com/ It might be fun to scrape the entries on the top page.
- http://isitchristmas.com/ Christmas judgment (timely).
- https://data.nasa.gov/developer NASA publishes data here, so it may be interesting to dig into it.
There are many other sites that look good ...