Let's scrape with Java!!

Development environment

What is scraping?

Scraping is the process of extracting data, such as specific images or titles, from the HTML of a website!

Library required for scraping

To scrape, use a library called **"jsoup"**!

jsoup is a library for parsing HTML, and it provides classes for fetching, parsing, and querying HTML documents!

Now, add the following dependency to pom.xml.

<dependencies>

    <!-- other dependencies omitted -->

    <dependency>
        <groupId>org.jsoup</groupId>
        <artifactId>jsoup</artifactId>
        <version>1.10.2</version>
    </dependency>
</dependencies>

Steps to scrape

① Get HTML information from the website

② Find the elements with the specified tag in the HTML

③ Extract text and attribute values from the elements

① Get HTML information from the website

Use the **"Document class"** to work with HTML. Declare a variable of type Document and assign the fetched HTML to it, as follows.

Document document = Jsoup.connect("url").get();

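Passing a URL string to the connect method and then calling get() sends an HTTP GET request to that URL and returns the page's HTML, parsed into a Document. Assign the result to a variable of type Document. Note that get() throws an IOException if the request fails, so the calling code has to catch or declare it.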
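
In practice the one-liner above usually needs a little more care. The following is a minimal sketch, not the only way to do it: it sets a user agent and a timeout on the connection and handles the IOException that get() can throw. The class name FetchExample, the helper fetch, the scheme check, the user-agent string, the 10-second timeout, and the https://example.com/ URL are all assumptions made for illustration.

```java
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class FetchExample {

    // Hypothetical helper: fetches and parses the page at the given URL.
    public static Document fetch(String url) throws IOException {
        // Reject anything that is not http(s) before touching the network.
        if (!url.startsWith("http://") && !url.startsWith("https://")) {
            throw new IllegalArgumentException("Only http(s) URLs are supported: " + url);
        }
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (jsoup example)") // illustrative value
                .timeout(10_000)                          // milliseconds
                .get();                                   // sends an HTTP GET request
    }

    public static void main(String[] args) {
        // Placeholder URL; pass a real one as the first argument.
        String url = args.length > 0 ? args[0] : "https://example.com/";
        try {
            Document document = fetch(url);
            System.out.println(document.title());
        } catch (IOException e) {
            // Covers network failures, timeouts, and HTTP error statuses.
            System.err.println("Could not fetch " + url + ": " + e.getMessage());
        }
    }
}
```

Keeping the fetch in its own method makes it easy to reuse the same connection settings for every page you scrape.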


② Find the elements with the specified tag in the HTML

To find specific elements in the fetched HTML, use the **"select method"**.

Document document = Jsoup.connect("url").get();
Elements elements = document.select("h3");

The right-hand side of the second line calls the select method. Its argument is a CSS selector string. Because "h3" is passed here, every h3 element in the HTML fetched from the URL is collected and assigned to a variable of type Elements. The Elements class holds Element objects in the form of a list (it extends ArrayList<Element>), and each Element represents one HTML element.
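
Because Elements behaves like a list, the usual list operations work on the result of select. The sketch below parses a small inline HTML string with Jsoup.parse instead of fetching a page, so it runs without network access. The class name SelectExample, the helper countMatches, and the sample HTML are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class SelectExample {

    // Made-up sample page used instead of a real website.
    public static final String HTML =
            "<html><body>"
            + "<h3>First heading</h3>"
            + "<p>Some text</p>"
            + "<h3>Second heading</h3>"
            + "</body></html>";

    // Hypothetical helper: number of elements matching a CSS selector.
    public static int countMatches(String html, String cssSelector) {
        Document document = Jsoup.parse(html);
        return document.select(cssSelector).size();
    }

    public static void main(String[] args) {
        Document document = Jsoup.parse(HTML);
        Elements headings = document.select("h3");

        System.out.println(headings.size());      // number of matched h3 elements
        Element first = headings.first();         // null when nothing matched
        if (first != null) {
            System.out.println(first.tagName());  // tag name of the first match
        }
        // A selector that matches nothing yields an empty list, not null.
        System.out.println(document.select("h4").isEmpty());
    }
}
```

Since select returns an empty Elements rather than null when nothing matches, a for-each loop over the result is safe without a null check.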


③ Extract text and attribute values from the elements

Use the **"text method"** to get an element's text, and the **"attr method"** to get the value of an attribute.

Document document = Jsoup.connect("url").get();
Elements elements = document.select("h3");

for (Element element : elements) {
    System.out.println(element.text());
}

This extracts the text of each h3 element found by the select method and prints it to the console!

Document document = Jsoup.connect("url").get();
Elements elements = document.select("h3 a");

for (Element element : elements) {
    System.out.println(element.attr("href"));
}

This extracts the href attribute of each a element inside an h3 (the "h3 a" selector) found by the select method and prints it to the console!
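
Putting text and attr together, the sketch below collects both the link text and the link target for every "h3 a" element. It parses an inline HTML string (made up for this example) rather than a live page. One detail worth knowing: attr("href") returns the attribute exactly as written in the HTML, which may be a relative path, while attr("abs:href") resolves it against the document's base URI. The class name LinkExample, the helper extractLinks, and the https://example.com/ base URI are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkExample {

    // Made-up sample markup: one relative link, one absolute link, one h3 without a link.
    public static final String HTML =
            "<h3><a href=\"/posts/1\">First post</a></h3>"
            + "<h3><a href=\"https://example.org/2\">Second post</a></h3>"
            + "<h3>No link here</h3>";

    // Hypothetical helper: returns "link text -> absolute URL" for each "h3 a" element.
    public static List<String> extractLinks(String html, String baseUri) {
        // The base URI lets jsoup resolve relative href values.
        Document document = Jsoup.parse(html, baseUri);
        List<String> result = new ArrayList<>();
        for (Element link : document.select("h3 a")) {
            result.add(link.text() + " -> " + link.attr("abs:href"));
        }
        return result;
    }

    public static void main(String[] args) {
        for (String line : extractLinks(HTML, "https://example.com/")) {
            System.out.println(line);
        }
    }
}
```

With the sample HTML and base URI above, the relative path /posts/1 is printed as https://example.com/posts/1, and the h3 without a link is skipped because the selector does not match it. When the page is fetched with Jsoup.connect, the base URI is set from the page's URL automatically.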
