I'm thinking of eventually using Scrapy, but as a first step in web scraping with Python, I tried fetching information from the Web with "Requests" and "lxml".
- Get information from the Web using "Requests"
- Extract the necessary information from the retrieved HTML using "lxml"
pip install requests
pip install lxml
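To confirm that both libraries installed correctly, a quick import check like this one-liner (just a sanity check of my own, not from the original steps) works:

python -c "import requests, lxml.html; print('OK')"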
I placed the following test HTML on EC2 and tested against it over the Internet.
test.html
<html>
<body>
<div id="test1">test1
<ul id="test1_ul">test1 ul</ul>
</div>
</body>
</html>
- The URL is passed as a command-line argument, and the HTML it returns is processed
- The User-Agent is changed to a Mac browser, just in case
(Error handling for a missing argument and the like is not implemented; see the sketch right after the script.)
scraping.py
import sys
import requests
import lxml.html

# Set a dummy User-Agent (Safari on macOS) so the request looks like a normal browser
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8'}

# The target URL is given as the first command-line argument
url = ''
if len(sys.argv) > 1:
    url = sys.argv[1]

# Fetch the page and parse the HTML
response = requests.get(url, headers=headers)
html = lxml.html.fromstring(response.content)

# Select every element whose id is "test1_ul" and print its text
for elem in html.xpath('//*[@id="test1_ul"]'):
    print(elem.text)
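As noted above, error handling is omitted. A rough sketch of what it might look like (the usage message and the 10-second timeout are my own choices, not from the original) is:

import sys
import requests
import lxml.html

# Same dummy User-Agent as above
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/602.4.8 (KHTML, like Gecko) Version/10.0.3 Safari/602.4.8'}

# Exit with a usage message when no URL argument is given
if len(sys.argv) < 2:
    sys.exit('Usage: python scraping.py <URL>')
url = sys.argv[1]

try:
    # Fail fast on network problems or non-2xx responses
    response = requests.get(url, headers=headers, timeout=10)
    response.raise_for_status()
except requests.RequestException as e:
    sys.exit('Request failed: {}'.format(e))

html = lxml.html.fromstring(response.content)
for elem in html.xpath('//*[@id="test1_ul"]'):
    print(elem.text)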
Run scraping.py as follows; any URL can be passed as the argument.
python scraping.py http://ec2******
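Against the test.html placed on EC2 above, this should print the text of the element with id "test1_ul":

test1 ul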
Chrome's developer tools make it easy to grab an element's XPath or CSS selector, which is convenient.
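Incidentally, lxml can also filter by CSS selector directly through the separate cssselect package (installed with pip install cssselect); reusing the parsed html object from scraping.py, a minimal sketch:

# CSS-selector equivalent of the XPath above (requires: pip install cssselect)
for elem in html.cssselect('#test1_ul'):
    print(elem.text)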