Reference sites: IT Toy Box (https://ittoybox.com/archives/385), Dried Squid and Glasses (http://tm8r.hateblo.jp/entry/2013/11/26/125937), Terry3 (https://qiita.com/Terry3/items/0c1829130111967773bf), takahiroSakamoto (https://qiita.com/takahiroSakamoto/items/c2b269c07e15a04f5861)
I am a beginner who has only just learned Java syntax. This post is written as a personal memo, so copying it will not necessarily work for you. If anything, I would be glad to receive advice and corrections from everyone.
I heard that jsoup is the library to use for scraping, so I set it up. I'm using Eclipse.
Download the jar file from the following site (https://jsoup.org/download)
Create the package "Scraping" and create a "lib" folder directly under it. Copy the downloaded jar file into the "lib" folder.
Then add the jar to the classpath (build path). This step is explained with images on "IT Toy Box", which I am very grateful for.
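As a quick sanity check after the classpath step, something like the following could be run. This is only a minimal sketch: the class name ClasspathCheck and the helper isOnClasspath are made up for illustration, and it relies only on the fact that org.jsoup.Jsoup is the library's entry-point class.

```java
class ClasspathCheck {
    // Hypothetical helper: returns true if the named class can be loaded
    // from the current classpath, false otherwise.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // org.jsoup.Jsoup is the entry-point class of the jsoup library
        boolean found = isOnClasspath("org.jsoup.Jsoup");
        System.out.println(found
                ? "jsoup is on the classpath"
                : "jsoup was NOT found; check the build path settings");
    }
}
```

If the second message is printed, the jar is probably not registered in the project's build path yet.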
First, import the jsoup classes that were just set up, together with IOException.
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
The jsoup call that fetches a page can throw an IOException, so it has to be enclosed in a try-catch statement; that is why IOException is imported as well.
As for what to scrape, there is still a lot I don't understand well, so let's start with the top page of "Yahoo! Japan".
// The class name is an assumption; the original listing omitted the declaration
public class Scraping {
    public static void main(String[] args) {
        // A try-catch statement is required
        try {
            // Document A = Jsoup.connect("url").get(); "url" is the page to scrape
            Document doc = Jsoup.connect("https://www.yahoo.co.jp/").get();
            // Elements B = A.select("tag"); extracts the elements in the source matched by "tag"
            Elements elm = doc.select("title");
            // Enhanced for statement
            for (Element elms : elm) {
                String title = elms.text();
                System.out.println(title); // Result: Yahoo!JAPAN
            }
        // Exception handling
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
It seems that jsoup has a connect method and a select method, which take the URL and the tag respectively. By specifying the URL and the tag, you can easily scrape pages that do not depend on JavaScript.
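To try out select without touching a real site, the HTML can also be given as a string through Jsoup.parse. The following is a minimal sketch under that assumption; the class name SelectSketch, the helper firstText, and the sample markup are all invented for illustration. It also shows that the argument to select is a CSS selector, so a tag name is just the simplest case.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class SelectSketch {
    // Hypothetical sample markup standing in for a downloaded page
    static final String SAMPLE_HTML =
            "<html><head><title>Sample page</title></head>"
            + "<body><h1>Hello</h1><p class=\"lead\">First paragraph</p></body></html>";

    // Returns the text of the first element matched by the CSS selector,
    // or an empty string when nothing matches.
    static String firstText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element first = doc.selectFirst(cssQuery);
        return first == null ? "" : first.text();
    }

    public static void main(String[] args) {
        System.out.println(firstText(SAMPLE_HTML, "title"));  // select by tag name
        System.out.println(firstText(SAMPLE_HTML, "p.lead")); // select by tag and class
    }
}
```

Replacing Jsoup.parse(html) with Jsoup.connect(url).get() gives the same Document type, so the select part stays unchanged when moving to a live page.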
Now that I roughly understood how it works, I experimented on another site.
To make the difference easy to see, I will scrape a page in tabular format: the 2018 lecture list of the Japanese Archaeological Association. Target page (http://archaeology.jp/learning/university/2018kougiichiran/#)
The code is almost the same:
public static void main(String[] args) {
    // A try-catch statement is required
    try {
        // Document A = Jsoup.connect("url").get(); "url" is the page to scrape
        Document doc = Jsoup.connect("http://archaeology.jp/learning/university/2018kougiichiran/#").get();
        // Elements B = A.select("tag"); extracts the elements in the source matched by "tag"
        Elements elm = doc.select("tbody");
        // Enhanced for statement
        for (Element elms : elm) {
            String title = elms.text();
            System.out.println(title);
        }
    // Exception handling
    } catch (IOException e) {
        e.printStackTrace();
    }
}
In this case, the result printed to the console has all of the table text run together side by side on one line.
This is hard to read, so add the tag that separates the rows and change the selector to "tbody tr":
// Elements B = A.select("tag"); extracts the elements in the source matched by "tag"
Elements elm = doc.select("tbody tr");
With this change, the output seems to come out almost the same as the layout on the web page.
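The difference between the two selectors can be reproduced offline on a small table. This is a sketch only: the class name TableRowsSketch, the helper rowsAsLines, and the sample rows are made up and are not taken from the real page. It additionally joins the cells of each row with a separator, which is one possible way to keep the columns apart.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TableRowsSketch {
    // Hypothetical sample markup standing in for the lecture list page
    static final String SAMPLE_HTML =
            "<table><tbody>"
            + "<tr><th>University</th><th>Lecture</th></tr>"
            + "<tr><td>Univ A</td><td>Archaeology I</td></tr>"
            + "<tr><td>Univ B</td><td>Archaeology II</td></tr>"
            + "</tbody></table>";

    // Builds one line per table row, with the cell texts joined by " | ".
    // Note: eachText() skips cells that contain no text.
    static List<String> rowsAsLines(String html) {
        Document doc = Jsoup.parse(html);
        List<String> lines = new ArrayList<>();
        for (Element row : doc.select("tbody tr")) {
            lines.add(String.join(" | ", row.select("th, td").eachText()));
        }
        return lines;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE_HTML);
        // "tbody" matches one element, so the whole table prints as a single line
        System.out.println(doc.select("tbody").text());
        // "tbody tr" matches each row, so every row gets its own line
        rowsAsLines(SAMPLE_HTML).forEach(System.out::println);
    }
}
```

Selecting down to the row level is what produces one line per row; going one level further to the cells is optional and only matters when the columns need to be told apart.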
That concludes practice ① for now. The ultimate goal is to be able to scrape horse racing information. On those pages JavaScript basically gets in the way, so working out how to read content that is generated by JavaScript seems to be the key point.
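Why JavaScript gets in the way can be seen in a small offline experiment: jsoup parses the HTML it is given but does not execute scripts, so an element that is only filled in by a script stays empty. The sketch below assumes invented markup; the "odds" id, the value in the script, and the class name StaticHtmlOnlySketch are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class StaticHtmlOnlySketch {
    // Hypothetical markup: the <div> is empty in the HTML itself and would
    // only be filled in by the script when the page runs in a browser.
    static final String SAMPLE_HTML =
            "<html><body>"
            + "<div id=\"odds\"></div>"
            + "<script>document.getElementById('odds').textContent = '3.5';</script>"
            + "</body></html>";

    // Returns the combined text of the elements matched by the CSS selector.
    static String textOf(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssQuery).text();
    }

    public static void main(String[] args) {
        // jsoup does not run the script, so the brackets print with nothing inside
        System.out.println("[" + textOf(SAMPLE_HTML, "#odds") + "]");
    }
}
```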