Site renewals often involve data migration. Migrating by hand is possible but costly, so a tool that fetches the HTML in batch ⇒ parses it ⇒ loads the result into the new system can be useful.
For this kind of HTML parsing, jsoup is a well-known library in Java, and beautifulsoup4 is a well-known library in Python.
jsoup: https://jsoup.org/
beautifulsoup4: https://pypi.org/project/beautifulsoup4/
1. jsoup
jsoup is a Java library for parsing HTML. It lets you parse HTML easily with jQuery-like selectors, and it supports the WHATWG HTML5 specification.
Gradle example:
// https://mvnrepository.com/artifact/org.jsoup/jsoup
// The compile configuration was removed in Gradle 7, so implementation is used here.
implementation group: 'org.jsoup', name: 'jsoup', version: '1.12.1'
package com.test.jsoup;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupHtmlParser {
    public static void main(String[] args) throws IOException {
        // Fetch the page over HTTP and parse it into a Document.
        Document doc = Jsoup.connect("https://news.yahoo.co.jp").get();
        // Get the a tag of each article. The selector is written the same way as a jQuery selector.
        Elements newsHeadlines = doc.select(".topicsList li.topicsListItem a");
        for (Element headline : newsHeadlines) {
            System.out.println("title: " + headline.ownText() + ", href: " + headline.absUrl("href"));
        }
    }
}
Output:
title: Record storm caused by typhoon, killing two people, href: https://news.yahoo.co.jp/pickup/6336014
title: Narita Airport crowded with 10,000 people, href: https://news.yahoo.co.jp/pickup/6336017
title: Security company 3.Arrangements to steal 600 million yen, href: https://news.yahoo.co.jp/pickup/6336018
title: Planned suspension timetable normalization issues, href: https://news.yahoo.co.jp/pickup/6336013
title: Buzzing college girl tears 50 times on fire, href: https://news.yahoo.co.jp/pickup/6335993
title: Basketball World Cup 5th loss 3P Remove all, href: https://news.yahoo.co.jp/pickup/6336020
title: Withdraw from NPB Professional Sports Association, href: https://news.yahoo.co.jp/pickup/6336015
title: Ryo Yoshizawa "Unusual pressure", href: https://news.yahoo.co.jp/pickup/6336022
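The live page changes over time, so it can help to try the same selector offline first. The following is a minimal sketch under that assumption: the class name JsoupSelectorSketch, the sample markup, and the example.com base URI are all made up for illustration and are not taken from the actual Yahoo page. It parses an in-memory string with Jsoup.parse(html, baseUri) so that absUrl("href") can still resolve relative links.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSelectorSketch {

    // Hypothetical markup that imitates the structure the selector expects.
    public static final String SAMPLE_HTML =
            "<ul class=\"topicsList\">"
          + "<li class=\"topicsListItem\"><a href=\"/pickup/1\">First headline</a></li>"
          + "<li class=\"topicsListItem\"><a href=\"/pickup/2\">Second headline</a></li>"
          + "<li class=\"other\"><a href=\"/ad\">Not a headline</a></li>"
          + "</ul>";

    // Returns one "title: ..., href: ..." line per link matched by the selector.
    public static List<String> extractHeadlines(String html, String baseUri) {
        // The base URI is what lets absUrl() turn relative href values into absolute URLs.
        Document doc = Jsoup.parse(html, baseUri);
        List<String> lines = new ArrayList<>();
        for (Element link : doc.select(".topicsList li.topicsListItem a")) {
            lines.add("title: " + link.ownText() + ", href: " + link.absUrl("href"));
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : extractHeadlines(SAMPLE_HTML, "https://example.com/")) {
            System.out.println(line);
        }
    }
}
```

Only the two li.topicsListItem links should be matched here; the third link is skipped because its li has a different class.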
package com.test.jsoup;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupHtmlParser {
    public static void main(String[] args) throws IOException {
        // An HTML fragment (note that the div is left unclosed).
        String html = "<h1>HTML fragment parsing</h1><div><p>P1</p>";
        Document doc = Jsoup.parseBodyFragment(html);
        // Printing doc as it is shows that html, head, and body tags have been added,
        // so be careful when parsing fragments.
        System.out.println(doc.html());
        System.out.println("==========================");
        // Output only the contents of body.
        Element body = doc.body();
        System.out.println(body.html());
    }
}
Output:
<html>
 <head></head>
 <body>
  <h1>HTML fragment parsing</h1>
  <div>
   <p>P1</p>
  </div>
 </body>
</html>
==========================
<h1>HTML fragment parsing</h1>
<div>
 <p>P1</p>
</div>
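When migrating fragments, you usually want only the elements you passed in and not the wrapper that jsoup adds. One possible way to do that, sketched below, is to walk doc.body().children(). The class name FragmentChildrenSketch and the method topLevelTags are hypothetical names chosen for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FragmentChildrenSketch {

    // Returns the tag names of the fragment's top-level elements,
    // skipping the html/head/body wrapper that jsoup adds.
    public static List<String> topLevelTags(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        List<String> tags = new ArrayList<>();
        for (Element child : doc.body().children()) {
            tags.add(child.tagName());
        }
        return tags;
    }

    public static void main(String[] args) {
        String html = "<h1>HTML fragment parsing</h1><div><p>P1</p>";
        // Lists only the elements that came from the fragment itself.
        System.out.println(topLevelTags(html));
        // The parser closes the unclosed div, so the p element stays nested inside it.
        System.out.println(Jsoup.parseBodyFragment(html).select("div > p").text());
    }
}
```

For the fragment used above, the top-level tags are h1 and div, and the nested paragraph is still reachable with the child selector div > p.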
In addition, the jsoup site has easy-to-understand sample code for tasks such as parsing HTML from a file, extracting data, and modifying data. https://jsoup.org/cookbook/input/load-document-from-file
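As a rough illustration of how those pieces might fit a migration task, here is a minimal sketch that loads a saved HTML file, reads its title, and rewrites legacy link prefixes. It is not code from the jsoup cookbook. The class name FileMigrationSketch, the helper writeSampleFile, the sample markup, and the /old/ and /new/ prefixes are all assumptions made up for this example.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FileMigrationSketch {

    // Hypothetical helper: writes a small legacy page to a temp file for the demo.
    public static File writeSampleFile() {
        String html = "<html><head><title>Old page</title></head>"
                + "<body><a href=\"/old/news/1\">News</a> <a href=\"/about\">About</a></body></html>";
        try {
            File file = File.createTempFile("legacy", ".html");
            file.deleteOnExit();
            Files.write(file.toPath(), html.getBytes(StandardCharsets.UTF_8));
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Parses the file, rewrites href values that start with oldPrefix,
    // and returns the resulting body markup.
    public static String migrate(File htmlFile, String oldPrefix, String newPrefix) {
        try {
            Document doc = Jsoup.parse(htmlFile, "UTF-8");
            for (Element link : doc.select("a[href]")) {
                String href = link.attr("href");
                if (href.startsWith(oldPrefix)) {
                    // Data modification: replace the attribute value in place.
                    link.attr("href", newPrefix + href.substring(oldPrefix.length()));
                }
            }
            return doc.body().html();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        File file = writeSampleFile();
        // Data extraction: read the page title.
        System.out.println(Jsoup.parse(file, "UTF-8").title());
        System.out.println(migrate(file, "/old/", "/new/"));
    }
}
```

Filtering with startsWith in Java rather than in the selector is a design choice for this sketch; it keeps the prefix handling explicit and avoids any quoting concerns in the selector string.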
That's all.