Scraping practice using Java ①

Reference sites: IT Toy Box (https://ittoybox.com/archives/385); Dried squid and glasses (http://tm8r.hateblo.jp/entry/2013/11/26/125937); Terry3 (https://qiita.com/Terry3/items/0c1829130111967773bf); takahiroSakamoto (https://qiita.com/takahiroSakamoto/items/c2b269c07e15a04f5861)

Background

I am a beginner who has only just learned Java syntax. This post is written as a personal memo, so copying it step by step is not guaranteed to work. If anything, I would be glad to receive corrections and advice from readers.

Scraping preparation

I heard that jsoup is the library to use for scraping in Java, so I set it up. I'm using Eclipse as my IDE.

Download the jar file from the jsoup download page (https://jsoup.org/download).

Create a project named "Scraping" and create a "lib" folder directly under it. Copy the jar file you just downloaded into the "lib" folder.

Then add the jar to the classpath (the build path in Eclipse). This step is explained with screenshots in "IT Toy Box", for which I am very grateful.
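As a quick sanity check that the classpath is set up, something like the following sketch can be run from the same project. The class name ClasspathCheck and the helper isOnClasspath are made up for this memo; only Class.forName is standard Java, and org.jsoup.Jsoup is the class imported later in this post.

```java
public class ClasspathCheck {

    // Hypothetical helper: true if the named class can be loaded from the classpath.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Prints true only when the jsoup jar has been added to the classpath.
        System.out.println("jsoup found: " + isOnClasspath("org.jsoup.Jsoup"));
    }
}
```

If this prints false, the jar is probably not on the build path yet.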

Writing the scraping code

First, import the jsoup classes that were just added:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

The jsoup call that fetches the page, Jsoup.connect(...).get(), can throw an IOException, so it has to be enclosed in a try-catch statement. That is why IOException is imported as well.

As for what to scrape, there is still a lot I do not understand well, so let's start with the top page of "Yahoo! JAPAN".

// The class name is arbitrary
public class Scraping {

	public static void main(String[] args) {

		// get() can throw IOException, so a try-catch statement is needed
		try {

			// Document A = Jsoup.connect("url").get(); the url is the page to scrape
			Document doc = Jsoup.connect("https://www.yahoo.co.jp/").get();

			// Elements B = A.select("tag"); extracts the parts of the source matched by the tag
			Elements elm = doc.select("title");

			// Enhanced for statement
			for (Element elms : elm) {
				String title = elms.text();
				System.out.println(title); // Result: Yahoo! JAPAN
			}

		// Exception handling
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
}

jsoup provides a connect method, which takes the URL, and a select method, which takes a tag name (more precisely, a CSS selector). With just a URL and a tag, you can easily scrape pages whose content does not depend on JavaScript.
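To try out select without hitting a real site, here is a minimal sketch that parses an HTML string with Jsoup.parse instead of connect. The class SelectSketch, the helper textsOf, and the sample HTML are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectSketch {

    // Hypothetical helper: text of every element that matches the CSS selector.
    static List<String> textsOf(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element e : doc.select(cssQuery)) {
            texts.add(e.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        // Made-up HTML standing in for a downloaded page.
        String html = "<html><head><title>Sample page</title></head>"
                + "<body><p class=\"news\">First</p><p>Second</p></body></html>";

        System.out.println(textsOf(html, "title"));  // [Sample page]
        System.out.println(textsOf(html, "p.news")); // [First]
        System.out.println(textsOf(html, "p"));      // [First, Second]
    }
}
```

The argument to select is a CSS selector, so a plain tag name like "title" is just the simplest case.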

Having more or less understood how it works, I experimented on another site.

To make the difference easy to see, I picked a page laid out as a table: the 2018 lecture list of the Japanese Archaeological Association. Target page (http://archaeology.jp/learning/university/2018kougiichiran/#)

The code is almost the same:

	public static void main(String[] args) {

		// A try-catch statement is needed
		try {

			// Document A = Jsoup.connect("url").get(); the url is the page to scrape
			Document doc = Jsoup.connect("http://archaeology.jp/learning/university/2018kougiichiran/#").get();

			// Elements B = A.select("tag"); extracts the parts of the source matched by the tag
			Elements elm = doc.select("tbody");

			// Enhanced for statement
			for (Element elms : elm) {
				String title = elms.text();
				System.out.println(title);
			}

		// Exception handling
		} catch (IOException e) {
			e.printStackTrace();
		}
	}

In this case, the result printed to the console is:

● Kokugakuin University Hokkaido Junior College Archeology A / B [Summer Concentration] Concurrent Lecturer Takashi Aoki ● Sapporo Gakuin University Archeology A (first half) Specially Appointed Lecturer Yoshiaki Otsuka Archeology B (second half) Part-time Lecturer Kenichiro Koshida Archeology Academic Research Method (Late) Specially Appointed Lecturer Yoshiaki Otsuka Archeology Training Professor Isao Usuki Specially Appointed Lecturer Yoshiaki Otsuka Introduction to Cultural Properties (Late) Professor Isao Usuki Northern History and Culture Part-time Lecturer Gen Sawai Hokkaido History Research B (Late) Professor Isao Usuki ・ ・ ・ ・

Everything is run together on a single line.

This is hard to read, so change the selector to "tbody tr", which also names the tag that separates the table rows:

			// Elements B = A.select("tag"); extracts the parts of the source matched by the tag
			Elements elm = doc.select("tbody tr");

With the row tag added, the output becomes:

● Kokugakuin University Hokkaido Junior College
Department of Japanese Literature
Archeology A / B [Summer Concentration] Concurrent Lecturer Takashi Aoki ● Sapporo Gakuin University Archeology A (first half) Specially Appointed Lecturer Yoshiaki Otsuka Archeology B (Late) Part-time Lecturer Kenichiro Koshida Archaeological Research Method (Late) Specially Appointed Lecturer Yoshiaki Otsuka Archeology Practice Professor Isao Usuki ...

The output now follows the layout of the web page almost exactly.
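Going one step further, each row can be split into its cells as well. The following is only a sketch on a made-up table (the university and course names are placeholders, not data from the real page), and TableRowsSketch and rowsAsLines are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TableRowsSketch {

    // Hypothetical helper: one output line per table row, cells joined by " | ".
    static List<String> rowsAsLines(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        for (Element row : doc.select("tbody tr")) {
            List<String> cells = new ArrayList<>();
            // Within a row, pick up both header cells and data cells.
            for (Element cell : row.select("th, td")) {
                cells.add(cell.text());
            }
            lines.add(String.join(" | ", cells));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up table standing in for the lecture list.
        String html = "<table><tbody>"
                + "<tr><td>University A</td><td>Archaeology I</td><td>Lecturer X</td></tr>"
                + "<tr><td>University B</td><td>Archaeology II</td><td>Lecturer Y</td></tr>"
                + "</tbody></table>";

        for (String line : rowsAsLines(html)) {
            System.out.println(line);
        }
        // University A | Archaeology I | Lecturer X
        // University B | Archaeology II | Lecturer Y
    }
}
```

Selecting the rows first and the cells second keeps the table structure, which a single text() call on "tbody" throws away.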

That concludes practice ①. The ultimate goal is to be able to scrape horse racing information. JavaScript generally gets in the way on such pages, so how to read them looks like the key point.
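Why JavaScript gets in the way can be seen in a small sketch: jsoup parses the HTML it is given but does not run scripts, so anything a script would insert later is simply not there. The HTML below is made up, and StaticHtmlSketch and countMatches are hypothetical names.

```java
import org.jsoup.Jsoup;

public class StaticHtmlSketch {

    // Hypothetical helper: number of elements matching the selector in the parsed HTML.
    static int countMatches(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Made-up page: the list is empty in the HTML and would be filled in by a script in a browser.
        String html = "<html><body><ul id=\"races\"></ul>"
                + "<script>document.getElementById('races').innerHTML = '<li>Race 1</li>';</script>"
                + "</body></html>";

        System.out.println(countMatches(html, "#races"));    // 1: the empty list is in the HTML
        System.out.println(countMatches(html, "#races li")); // 0: the script never runs
        System.out.println(countMatches(html, "script"));    // 1: the script itself is only seen as an element
    }
}
```

For pages like this, the data has to come from somewhere other than the initial HTML, which is the part I still need to figure out.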
