I recently had a chance to scrape a website with Java's jsoup library. Plenty of articles already explain it clearly and in detail, but this is my own memo. Official site: https://jsoup.org/
Just grab the jar from the official site's download page and add it to your classpath. It is also available from the Maven central repository. With Gradle, add the following to build.gradle (the version was the latest at the time of writing):
build.gradle
dependencies {
    compile('org.jsoup:jsoup:1.12.1')
}
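If your project uses Maven instead, the same coordinates and version shown above go into pom.xml like this (a standard dependency declaration, just for reference):

```xml
<dependency>
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.12.1</version>
</dependency>
```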
As an example, consider extracting the date, title, and URL of each "Notice" from the following page.
<body>
  <div class="section">
    <div class="block">
      <dl>
        <dt>2019.08.04</dt>
        <dd>
          <a href="http://www.example.com/notice/0003.html">Notice 3</a>
        </dd>
        <dt>2019.08.03</dt>
        <dd>
          <a href="http://www.example.com/notice/0002.html">Notice 2</a>
        </dd>
        <dt>2019.08.02</dt>
        <dd>
          <a href="http://www.example.com/notice/0001.html">Notice 1</a>
        </dd>
      </dl>
    </div>
  </div>
</body>
Extract with the following code.
Example.java
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Example {
    public static void main(String[] args) throws IOException {
        // Send a GET request to the URL and parse the response into a Document
        // (use post() instead if the endpoint expects a POST).
        Document document = Jsoup.connect("http://www.example.com").get();
        // Extract elements with a CSS selector.
        // In the HTML above this matches only <div class="block">, so the list has size 1.
        Elements elements = document.select(".section .block");
        // child(0) is the first *element* child of <div class="block">, i.e. the <dl>;
        // children() then yields its element children: the alternating <dt>/<dd> pairs.
        // (childNode()/childNodes() also count whitespace text nodes, which makes
        // index-based access fragile, so the element-level accessors are used here.)
        Elements items = elements.get(0).child(0).children();
        // Walk the <dt>/<dd> pairs, extracting each notice's date, title, and URL.
        for (int i = 0; i < items.size() / 2; i++) {
            String newsDate = items.get(i * 2).text();
            Element link = items.get(i * 2 + 1).selectFirst("a");
            String newsTitle = link.text();
            String newsUrl = link.attr("href");
            System.out.println(newsDate);
            System.out.println(newsTitle);
            System.out.println(newsUrl);
        }
    }
}
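Incidentally, pairing children by index breaks as soon as the <dl> contains anything other than strict dt/dd pairs. A sturdier approach is to select each <dt> directly and step to its neighboring <dd>. Here is a sketch of that (the class name SelectorExample is mine, and the sample HTML is inlined so the snippet runs without a network connection):

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorExample {
    // Same markup as the sample page above, inlined so the snippet runs offline.
    static final String SAMPLE = "<div class=\"section\"><div class=\"block\"><dl>"
            + "<dt>2019.08.04</dt><dd><a href=\"http://www.example.com/notice/0003.html\">Notice 3</a></dd>"
            + "<dt>2019.08.03</dt><dd><a href=\"http://www.example.com/notice/0002.html\">Notice 2</a></dd>"
            + "</dl></div></div>";

    // Select each <dt> directly, then step to its adjacent <dd> for the link.
    static List<String> extract(String html) {
        Document document = Jsoup.parse(html);
        List<String> rows = new ArrayList<>();
        for (Element dt : document.select(".section .block dl dt")) {
            Element link = dt.nextElementSibling().selectFirst("a");
            rows.add(dt.text() + " | " + link.text() + " | " + link.attr("href"));
        }
        return rows;
    }

    public static void main(String[] args) {
        extract(SAMPLE).forEach(System.out::println);
    }
}
```

This way nothing depends on the exact child count or on whitespace nodes; the CSS selector does the pairing for you.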
The documentation for the classes used in this code is linked below; it is worth a read before use.
Document (jsoup Java HTML Parser 1.12.1 API)
Elements (jsoup Java HTML Parser 1.12.1 API)
Element (jsoup Java HTML Parser 1.12.1 API)
Node (jsoup Java HTML Parser 1.12.1 API)
I use this to scrape the notices page on the website of the graduate school a relative of mine attends. That "Notices from the University" page mostly carries notices that have nothing to do with you; an important one shows up only once every few months, yet you have to keep checking it, which is a hassle. So I decided to write a batch job that does this scraping and mails me the result, and to run it with cron once an hour.
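For reference, running such a job hourly with cron looks roughly like the crontab line below. The jar paths and the use of Example as the main class are hypothetical placeholders, not the actual setup:

```
# min hour dom mon dow  command  (paths and class name are hypothetical)
0 * * * * java -cp /path/to/myapp.jar:/path/to/jsoup-1.12.1.jar Example
```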
Sending email from Java also gave me some trouble; I will write about that separately.
jsoup usage memo: https://qiita.com/opengl-8080/items/d4864bbc335d1e99a2d7