HTML parsing with Java (scraping)

Introduction

Data migration is often part of a site renewal. It can be done by hand, but that is expensive, so a batch process that fetches the HTML, parses it, and loads the result into the new system can be useful.

In Java, the jsoup library is well known for this kind of work. In Python, the well-known equivalent is beautifulsoup4.

jsoup: https://jsoup.org/
beautifulsoup4: https://pypi.org/project/beautifulsoup4/

1 jsoup

jsoup is a Java library for HTML parsing. It lets you parse HTML easily with jQuery-like selectors, and it supports the WHATWG HTML5 specification.

1-1 Create a Java project and install the library

Gradle example:

// https://mvnrepository.com/artifact/org.jsoup/jsoup
// The compile configuration was removed in Gradle 7, so implementation is used here.
implementation 'org.jsoup:jsoup:1.12.1'

1-2 Yahoo News title acquisition example

1-2-1 HTML structure

[Image: HTML structure of the Yahoo News topics list]

1-2-2 Simple parsing code to extract title and URL

package com.test.jsoup;

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JsoupHtmlParser {

	public static void main(String[] args) throws IOException {
		Document doc = Jsoup.connect("https://news.yahoo.co.jp").get();
		// Select the <a> tag of each article. The selector syntax is the same as in jQuery.
		Elements newsHeadlines = doc.select(".topicsList li.topicsListItem a");
		for (Element headline : newsHeadlines) {
			System.out.println("title: " + headline.ownText() + ",  href: " + headline.absUrl("href"));
		}
	}
}

1-2-3 Analysis result

title: Record storm caused by typhoon, killing two people,  href: https://news.yahoo.co.jp/pickup/6336014
title: Narita Airport crowded with 10,000 people,  href: https://news.yahoo.co.jp/pickup/6336017
title: Security company 3.Arrangements to steal 600 million yen,  href: https://news.yahoo.co.jp/pickup/6336018
title: Planned suspension timetable normalization issues,  href: https://news.yahoo.co.jp/pickup/6336013
title: Buzzing college girl tears 50 times on fire,  href: https://news.yahoo.co.jp/pickup/6335993
title: Basketball World Cup 5th loss 3P Remove all,  href: https://news.yahoo.co.jp/pickup/6336020
title: Withdraw from NPB Professional Sports Association,  href: https://news.yahoo.co.jp/pickup/6336015
title: Ryo Yoshizawa "Unusual pressure",  href: https://news.yahoo.co.jp/pickup/6336022
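
Because the markup of a live site changes over time, the selector above may stop matching. As a minimal sketch, the same extraction can be tried offline against an inline HTML string. The sample markup, the class name OfflineHeadlineParser, and the news.example.com base URI are assumptions made for illustration; they are not the real Yahoo News markup.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class OfflineHeadlineParser {

	// Hypothetical markup that only mimics the class names used in the selector above.
	static final String SAMPLE_HTML =
			"<ul class=\"topicsList\">"
			+ "<li class=\"topicsListItem\"><a href=\"/pickup/1\">First headline<span>NEW</span></a></li>"
			+ "<li class=\"topicsListItem\"><a href=\"/pickup/2\">Second headline</a></li>"
			+ "<li class=\"other\"><a href=\"/ad\">Not a headline</a></li>"
			+ "</ul>";

	// Returns one "title: ...,  href: ..." line per matching link.
	static List<String> extractHeadlines(String html, String baseUri) {
		// The base URI lets absUrl() resolve relative href values.
		Document doc = Jsoup.parse(html, baseUri);
		List<String> lines = new ArrayList<>();
		for (Element headline : doc.select(".topicsList li.topicsListItem a")) {
			// ownText() skips text inside child elements such as the <span>.
			lines.add("title: " + headline.ownText() + ",  href: " + headline.absUrl("href"));
		}
		return lines;
	}

	public static void main(String[] args) {
		extractHeadlines(SAMPLE_HTML, "https://news.example.com/").forEach(System.out::println);
	}
}
```

Passing a base URI to Jsoup.parse is what allows absUrl("href") to turn the relative paths into absolute URLs. Without a usable base URI, absUrl returns an empty string for relative links.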

1-3 Parsing HTML strings

1-3-1 Analysis sample

package com.test.jsoup;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupHtmlParser {

	public static void main(String[] args) {
		// The <div> is left unclosed here; jsoup closes it while parsing.
		String html = "<h1>HTML fragment parsing</h1><div><p>P1</p>";
		Document doc = Jsoup.parseBodyFragment(html);
		
		// Printing doc as is shows that html, head, and body tags have been added, so be careful when analyzing fragments.
		System.out.println(doc.html());
		
		System.out.println("==========================");
		
		// Output only the contents of body
		Element body = doc.body();
		System.out.println(body.html());
	}
}

1-3-2 Analysis result

<html>
 <head></head>
 <body>
  <h1>HTML fragment parsing</h1>
  <div>
   <p>P1</p>
  </div>
 </body>
</html>
==========================
<h1>HTML fragment parsing</h1>
<div>
 <p>P1</p>
</div>
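
To work with the fragment itself rather than the wrapper that jsoup adds, one option is to start from doc.body(). The following is a minimal sketch under that assumption; the class name FragmentInspector and its two helper methods are hypothetical names introduced for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class FragmentInspector {

	// Returns the tag names of the fragment's top-level elements, joined by commas.
	static String topLevelTags(String fragment) {
		Document doc = Jsoup.parseBodyFragment(fragment);
		StringBuilder tags = new StringBuilder();
		// body().children() skips the html/head/body wrappers that jsoup added.
		for (Element child : doc.body().children()) {
			if (tags.length() > 0) {
				tags.append(',');
			}
			tags.append(child.tagName());
		}
		return tags.toString();
	}

	// Returns the text of the first element matching the selector, or "" if none matches.
	static String firstText(String fragment, String cssQuery) {
		Element match = Jsoup.parseBodyFragment(fragment).body().select(cssQuery).first();
		return match == null ? "" : match.text();
	}

	public static void main(String[] args) {
		String html = "<h1>HTML fragment parsing</h1><div><p>P1</p>";
		System.out.println(topLevelTags(html));          // prints "h1,div"
		System.out.println(firstText(html, "div > p"));  // prints "P1"
	}
}
```

Elements.first() returns null when nothing matches, so the null check in firstText keeps a missing element from causing a NullPointerException.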

The jsoup site also has easy-to-follow sample code for tasks such as parsing HTML from a file, extracting data, and modifying data: https://jsoup.org/cookbook/input/load-document-from-file
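
For the migration scenario from the introduction, loading a saved file and modifying data might look like the sketch below. It is an illustration, not code from the jsoup cookbook: the class name LinkRewriter, the old.example.com and new.example.com base URLs, and the sample page are all assumptions.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class LinkRewriter {

	// Rewrites every link that points at oldBase so that it points at newBase instead.
	// Both base URLs are hypothetical values chosen for this sketch.
	static String rewriteLinks(Document doc, String oldBase, String newBase) {
		for (Element link : doc.select("a[href]")) {
			// absUrl() resolves relative links against the document's base URI.
			String absolute = link.absUrl("href");
			if (absolute.startsWith(oldBase)) {
				link.attr("href", newBase + absolute.substring(oldBase.length()));
			}
		}
		return doc.body().html();
	}

	// Loads an HTML file from disk and returns its body with rewritten links.
	static String rewriteFile(File input, String oldBase, String newBase) throws IOException {
		Document doc = Jsoup.parse(input, "UTF-8", oldBase);
		return rewriteLinks(doc, oldBase, newBase);
	}

	public static void main(String[] args) throws IOException {
		// Write a small sample page to a temporary file so the sketch is self-contained.
		File sample = File.createTempFile("old-page", ".html");
		sample.deleteOnExit();
		Files.write(sample.toPath(),
				"<p><a href=\"/news/1.html\">News</a> <a href=\"https://other.example.org/\">External</a></p>"
						.getBytes(StandardCharsets.UTF_8));
		System.out.println(rewriteFile(sample, "https://old.example.com/", "https://new.example.com/"));
	}
}
```

Links that already point to another host are left unchanged, because only URLs starting with the old base are rewritten.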

That's all.
