As I wrote on Qiita before, I once wrote code for scraping websites in Java. Looking back now, the code meets the requirements, but it is hard to call it clean. I was embarrassed to look at it, so I decided to rewrite it in Python and keep a note of the result.
There are many similar articles on Qiita, but this one is my own memorandum.
When scraping with Java, I used a library called jsoup. This time I will use Beautiful Soup.
Beautiful Soup is a Python library for scraping. Because it lets you extract elements from a page using CSS selectors, it is convenient for pulling out only the data you want. Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Since it is a Python library, install it with pip.
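To make the CSS selector point concrete, here is a minimal sketch. The inline HTML fragment and the selector strings are assumptions chosen for illustration; they are not taken from a real page.

```python
from bs4 import BeautifulSoup

# A small inline HTML fragment (illustrative only)
html = """
<div class="block">
  <dl>
    <dt>2019.08.04</dt>
    <dd><a href="http://www.example.com/notice/0003.html">Notice 3</a></dd>
  </dl>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns a list of every element matching the CSS selector
links = soup.select("div.block dd > a")
# select_one() returns the first match, or None if nothing matches
first_date = soup.select_one("div.block dt")

link_texts = [a.get_text(strip=True) for a in links]
link_urls = [a["href"] for a in links]
date_text = first_date.get_text(strip=True)

print(date_text, link_texts, link_urls)
```

select() and select_one() take the same selector syntax you would write in a stylesheet, so a selector tested in the browser's developer tools can usually be reused as-is.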
pip install beautifulsoup4
As in my earlier article, I want to extract the date, title, and URL of each "Notice" from the following page.
<body>
  <div class="section">
    <div class="block">
      <dl>
        <dt>2019.08.04</dt>
        <dd>
          <a href="http://www.example.com/notice/0003.html">Notice 3</a>
        </dd>
        <dt>2019.08.03</dt>
        <dd>
          <a href="http://www.example.com/notice/0002.html">Notice 2</a>
        </dd>
        <dt>2019.08.02</dt>
        <dd>
          <a href="http://www.example.com/notice/0001.html">Notice 1</a>
        </dd>
      </dl>
    </div>
  </div>
</body>
The following code extracts the notices and prints them.
scraping.py
# -*- coding: utf-8 -*-
import sys

import requests
from bs4 import BeautifulSoup


def main():
    print("Scraping Program Start")

    # Send a GET request to the specified URL to get the contents of the page
    res = requests.get('http://www.example.com/news.html')

    # Parse the retrieved HTML page into a BeautifulSoup object
    soup = BeautifulSoup(res.text, "html.parser")

    # Extract the first element with class "block" in the page
    block = soup.find(class_="block")
    if block is None:
        print("ERROR! Couldn't find the block element.")
        print("Scraping Program Abend")
        sys.exit(1)

    # Extract the dt elements (dates) and dd elements (links) inside the block
    dt = block.find_all("dt")
    dd = block.find_all("dd")

    # Each date must have exactly one matching link entry
    if len(dt) != len(dd):
        print("ERROR! The number of DTs and DDs didn't match up.")
        print("Scraping Program Abend")
        sys.exit(1)

    # Walk the dt/dd pairs in order and print each notice
    for date_tag, dd_tag in zip(dt, dd):
        try:
            date = date_tag.text
            link = dd_tag.find("a")
            title = link.string
            url = link.attrs['href']
            print("Got a news. Date:" + date + ", title:" + title + ", url:" + url)
        except (AttributeError, KeyError, TypeError):
            # No <a> in the dd, no href attribute, or no plain-text title
            print("ERROR! Couldn't get a news.")

    print("Scraping Program End")


if __name__ == "__main__":
    main()
Running the above code should produce the following output.
Scraping Program Start
Got a news. Date:2019.08.04, title:Notice 3, url:http://www.example.com/notice/0003.html
Got a news. Date:2019.08.03, title:Notice 2, url:http://www.example.com/notice/0002.html
Got a news. Date:2019.08.02, title:Notice 1, url:http://www.example.com/notice/0001.html
Scraping Program End
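If you want to keep the notices instead of only printing them, one possible variation is to collect them into a list of dictionaries and parse the date string into a date object. This is only a sketch: the function name extract_notices and the SAMPLE_HTML constant are hypothetical, the HTML is the sample page from above embedded as a string so that no network access is needed, and it assumes each dt is followed by its own dd.

```python
from datetime import datetime

from bs4 import BeautifulSoup

# The sample page from this article, embedded so the sketch runs offline
SAMPLE_HTML = """
<body>
<div class="section">
<div class="block">
<dl>
<dt>2019.08.04</dt>
<dd><a href="http://www.example.com/notice/0003.html">Notice 3</a></dd>
<dt>2019.08.03</dt>
<dd><a href="http://www.example.com/notice/0002.html">Notice 2</a></dd>
<dt>2019.08.02</dt>
<dd><a href="http://www.example.com/notice/0001.html">Notice 1</a></dd>
</dl>
</div>
</div>
</body>
"""


def extract_notices(html):
    """Return a list of dicts with 'date', 'title' and 'url' keys (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    notices = []
    for dt in soup.select("div.block dt"):
        # Assumption: the dd that belongs to this dt is its next dd sibling
        dd = dt.find_next_sibling("dd")
        link = dd.find("a") if dd is not None else None
        if link is None or not link.get("href"):
            continue  # skip entries without a usable link
        notices.append({
            # Dates on the page use the "YYYY.MM.DD" format
            "date": datetime.strptime(dt.get_text(strip=True), "%Y.%m.%d").date(),
            "title": link.get_text(strip=True),
            "url": link["href"],
        })
    return notices


notices = extract_notices(SAMPLE_HTML)
for n in notices:
    print(n["date"].isoformat(), n["title"], n["url"])
```

Keeping the parsing in a function that takes an HTML string makes it easy to check against a saved copy of the page before pointing it at the live site.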
Compared with the previous version, which I wrote in Java with Spring Boot, it is nice that Python needs far less code. Please point out any mistakes in the content.