Introduction

Data migration is a common task during site renewals. It can be done by hand, but that is costly, so it can be worth automating the flow: fetch the HTML in batch ⇒ parse it ⇒ load the results into the new system.

For Java, jsoup is a well-known library for this kind of work. For Python, beautifulsoup4 is the well-known choice.

jsoup: https://jsoup.org/
beautifulsoup4: https://pypi.org/project/beautifulsoup4/

1 jsoup

jsoup is a Java library for parsing HTML. Its jQuery-like selectors make it easy to pick elements out of a document, and it implements the WHATWG HTML5 specification.
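To give a feel for the jQuery-like selectors, here is a minimal sketch that runs a few selectors against an in-memory HTML string. The markup, the class name SelectorSketch, and the helper textsOf are made up for illustration and are not part of jsoup.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorSketch {

	// Hypothetical markup, used only for illustration
	static final String HTML =
			"<div id=\"main\">"
			+ "<h1 class=\"title\">Sample page</h1>"
			+ "<ul class=\"menu\">"
			+ "<li><a href=\"/a.html\">Page A</a></li>"
			+ "<li><a href=\"/b.html\">Page B</a></li>"
			+ "</ul>"
			+ "</div>";

	// Returns the text of every element matched by the CSS selector
	static List<String> textsOf(String cssSelector) {
		Document doc = Jsoup.parse(HTML);
		List<String> texts = new ArrayList<>();
		for (Element element : doc.select(cssSelector)) {
			texts.add(element.text());
		}
		return texts;
	}

	public static void main(String[] args) {
		System.out.println(textsOf("#main h1.title"));   // id and class selectors
		System.out.println(textsOf("ul.menu > li > a")); // child combinator
		System.out.println(textsOf("a[href$=.html]"));   // attribute value ending in .html
	}
}
```

The same select call works on a document fetched over the network, so selectors can be tried out on a small string like this before pointing them at a real page.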

1-1 Create a Java project and install the library

Gradle example:

// https://mvnrepository.com/artifact/org.jsoup/jsoup
// The old 'compile' configuration was removed in Gradle 7; use 'implementation' instead
implementation group: 'org.jsoup', name: 'jsoup', version: '1.12.1'

1-2 Example: getting Yahoo News titles

1-2-1 HTML structure

1-2-2 Simple parsing code to extract title and URL

package com.test.jsoup;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupHtmlParser {

	public static void main(String[] args) throws IOException {
		Document doc = Jsoup.connect("https://news.yahoo.co.jp").get();
		// Get the a tag of each article. The selector is written the same way as a jQuery selector
		Elements newsHeadlines = doc.select(".topicsList li.topicsListItem a");
		for (Element headline : newsHeadlines) {
			System.out.println("title: " + headline.ownText() + ",  href: " + headline.absUrl("href"));
		}
	}
}

1-2-3 Analysis result

title: Record storm caused by typhoon, killing two people,  href: https://news.yahoo.co.jp/pickup/6336014
title: Narita Airport crowded with 10,000 people,  href: https://news.yahoo.co.jp/pickup/6336017
title: Security company 3.Arrangements to steal 600 million yen,  href: https://news.yahoo.co.jp/pickup/6336018
title: Planned suspension timetable normalization issues,  href: https://news.yahoo.co.jp/pickup/6336013
title: Buzzing college girl tears 50 times on fire,  href: https://news.yahoo.co.jp/pickup/6335993
title: Basketball World Cup 5th loss 3P Remove all,  href: https://news.yahoo.co.jp/pickup/6336020
title: Withdraw from NPB Professional Sports Association,  href: https://news.yahoo.co.jp/pickup/6336015
title: Ryo Yoshizawa "Unusual pressure",  href: https://news.yahoo.co.jp/pickup/6336022
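Since the live page changes over time, it can help to try the same selector offline. The following is only a sketch: the markup is a guess at a structure that would match the selector, not the actual Yahoo News HTML, and HeadlineSketch, extract, and the example.com base URI are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HeadlineSketch {

	// Hypothetical markup that matches the selector; the real page may differ
	static final String HTML =
			"<ul class=\"topicsList\">"
			+ "<li class=\"topicsListItem\"><a href=\"/pickup/1\">First headline<span>NEW</span></a></li>"
			+ "<li class=\"topicsListItem\"><a href=\"/pickup/2\">Second headline</a></li>"
			+ "<li class=\"other\"><a href=\"/ad\">Not a headline</a></li>"
			+ "</ul>";

	// Parses the HTML against a base URI and returns "title -> absolute URL" lines
	static List<String> extract(String html, String baseUri) {
		// The base URI lets absUrl() resolve relative href values
		Document doc = Jsoup.parse(html, baseUri);
		List<String> lines = new ArrayList<>();
		for (Element a : doc.select(".topicsList li.topicsListItem a")) {
			// ownText() skips text inside child elements such as the span
			lines.add(a.ownText() + " -> " + a.absUrl("href"));
		}
		return lines;
	}

	public static void main(String[] args) {
		for (String line : extract(HTML, "https://example.com/")) {
			System.out.println(line);
		}
	}
}
```

Here ownText() is what keeps the label inside the span out of the title, and absUrl("href") is what turns the relative links into full URLs. With Jsoup.connect(...).get() the base URI is taken from the fetched URL, so it does not have to be passed in.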

1-3 Parsing HTML strings

1-3-1 Analysis sample

package com.test.jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupHtmlParser {

	// parseBodyFragment does no I/O, so main needs no throws clause
	public static void main(String[] args) {
		String html = "<h1>HTML fragment parsing</h1><div><p>P1</p>";
		Document doc = Jsoup.parseBodyFragment(html);
		
		// Printing the whole document shows that jsoup has wrapped the fragment
		// in html, head, and body tags, so take care when parsing fragments.
		System.out.println(doc.html());
		
		System.out.println("==========================");
		
		// Print only the contents of body, which is the parsed fragment itself
		Element body = doc.body();
		System.out.println(body.html());
	}
}

1-3-2 Analysis result

<html>
 <head></head>
 <body>
  <h1>HTML fragment parsing</h1>
  <div>
   <p>P1</p>
  </div>
 </body>
</html>
==========================
<h1>HTML fragment parsing</h1>
<div>
 <p>P1</p>
</div>
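When working with a fragment, you usually want its top-level elements rather than the generated html and body wrapper. As a minimal sketch, the code below walks the children of body for the same fragment as in the sample; FragmentSketch and topLevelElements are hypothetical names used only here.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FragmentSketch {

	// Returns "tagName: text" for each top-level element of the fragment
	static List<String> topLevelElements(String fragment) {
		Document doc = Jsoup.parseBodyFragment(fragment);
		List<String> result = new ArrayList<>();
		// children() returns only element children of body, in document order
		for (Element child : doc.body().children()) {
			result.add(child.tagName() + ": " + child.text());
		}
		return result;
	}

	public static void main(String[] args) {
		// Same fragment as in the sample; the div is left unclosed and jsoup closes it
		String html = "<h1>HTML fragment parsing</h1><div><p>P1</p>";
		for (String line : topLevelElements(html)) {
			System.out.println(line);
		}
	}
}
```

Iterating over doc.body().children() keeps the wrapper tags out of the result, so the code only ever sees the elements that were in the original fragment.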

In addition, the jsoup site has easy-to-follow sample code for tasks such as loading HTML from a file, extracting data, and modifying data: https://jsoup.org/cookbook/input/load-document-from-file
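As a rough illustration of loading from a file and correcting data, which is the kind of step a migration needs, the sketch below writes a small HTML file, parses it with Jsoup.parse(File, String), and rewrites links that point at an old domain. The class FileParseSketch, the methods migrateLinks and demo, and both example.com domains are assumptions made up for this example.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FileParseSketch {

	// Rewrites href values that start with oldPrefix and returns all hrefs after the change
	static List<String> migrateLinks(File htmlFile, String oldPrefix, String newPrefix) throws IOException {
		Document doc = Jsoup.parse(htmlFile, "UTF-8");
		List<String> hrefs = new ArrayList<>();
		for (Element link : doc.select("a[href]")) {
			String href = link.attr("href");
			if (href.startsWith(oldPrefix)) {
				// attr(name, value) updates the attribute in the parsed document
				link.attr("href", newPrefix + href.substring(oldPrefix.length()));
			}
			hrefs.add(link.attr("href"));
		}
		return hrefs;
	}

	// Writes a hypothetical sample page to a temporary file and migrates its links
	static List<String> demo() {
		try {
			File file = File.createTempFile("jsoup-sample", ".html");
			file.deleteOnExit();
			String html = "<p><a href=\"http://old.example.com/news/1\">News</a>"
					+ " <a href=\"/about\">About</a></p>";
			Files.write(file.toPath(), html.getBytes(StandardCharsets.UTF_8));
			return migrateLinks(file, "http://old.example.com/", "https://new.example.com/");
		} catch (IOException e) {
			throw new UncheckedIOException(e);
		}
	}

	public static void main(String[] args) {
		System.out.println(demo());
	}
}
```

In a real migration the corrected HTML would then be taken from doc.body().html() or doc.outerHtml() and passed to whatever import step the new system provides.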

That's all.

HTML parsing with Java (scraping)