Scraping practice using Java ②

Click here for the previous article (https://qiita.com/suisen/items/a856c06accdab922153c)

Reference sites
Java a little reference (https://java-reference.com/java_string_tonumber.html)
TECH PROJIN (https://tech.pjin.jp/blog/2017/10/17/ [java] CSV output sample code /)
Samurai Yamashita (https://www.sejuku.net/blog/20746)
Let's Programming (https://www.javadrive.jp/start/stream/index6.html)

Recap of the previous article

① I wanted to try scraping. ② I downloaded jsoup.jar, set it up, and wrote some sample code. ③ I confirmed that text can be extracted from Yahoo! and other sites by specifying tags.

What I want to do this time

① Obtain horse racing results from netKeiba (http://www.netkeiba.com/) ② Export them to a csv file that can be used with Excel etc.

Main part

So I actually put it together. I'm still inexperienced, so I did not worry about execution time or efficiency; for now I think simply being able to get the data is meaningful. Also, scraping is often said to put load on the target site, so be careful not to overdo it.

First, the code. (It should really be split into classes, but since this is a sample it is all in one.)

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class Test6 {

	/**
	 * Prepare csv export ⇒ specify URL ⇒ scrape with jsoup ⇒ build a list of JavaBeans
	 * ⇒ write the list in csv format ⇒ handle exceptions with catch ⇒ close
	 */
	public static void main(String[] args) {

		//For execution time measurement
		long start = System.currentTimeMillis();

		//Initialize csv export
		BufferedWriter bw=null;


		//A try-catch statement is required
		try {
			//Specify the output file, then write the header row as the first line
			bw=new BufferedWriter(new FileWriter("D:\\sakura\\ScrapingHtml\\scraping.csv", true));
			bw.write("Horse name,date,Held,weather,Race name,Horse number,Popular,Order of arrival,Jockey,distance,Status,time");
			bw.newLine();

			//Loop over the numeric ID at the end of the url.
			for(int j = 2015100001; j<=2015100010; j++) {

				//Generate List to be stored in TestBeans prepared separately
				List<TestBeans> list = new ArrayList<TestBeans>();

				//Document A = Jsoup.connect("url").get(); fetches the page at url for scraping
				Document doc = Jsoup.connect("http://db.netkeiba.com/horse/"+j).get();

				//Elements B = A.select("tag"); extracts the parts of the source matched by the tag selector.
				Elements elm = doc.select("tbody tr");
				Elements title = doc.select("title");

				//Prepare the horse name for output
				//Take up to the first 10 characters of the title, then cut at the first space if there is one.
				String ttext = title.text();
				String tstr = ttext.substring(0, Math.min(10, ttext.length()));
				int i = tstr.indexOf(" ");

				if(i==-1) {
					i=tstr.length();
				}
				String tstrs = tstr.substring(0, i);

				//Declare the string used below
				String str=null;
				//Store each Element in TestBeans with an enhanced for statement
				for(Element a : elm) {
					str = a.text();
					//Skip rows for excluded, scratched, or stopped runs.
					if(str.indexOf("Exclusion")!=-1 || str.indexOf("Tori")!=-1 ||str.indexOf("During ~")!=-1) {
						continue;
					}
					//I only wanted race rows, but the tag selector above alone could not narrow them down, so rows are filtered by character count.
					if(str.length()>=70) {
						String hairetsu[] = str.split(" ");
						TestBeans bean = new TestBeans();
						bean.setDate(hairetsu[0]);
						bean.setPlace(hairetsu[1]);
						bean.setWeather(hairetsu[2]);
						bean.setRaceName(hairetsu[4]);
						bean.setHorseNo(Integer.parseInt(hairetsu[7]));
						bean.setFamous(Integer.parseInt(hairetsu[9]));
						bean.setScore(Integer.parseInt(hairetsu[10]));
						bean.setJockey(hairetsu[11]);
						bean.setCycle(hairetsu[13]);
						bean.setSituation(hairetsu[14]);
						bean.setTime(hairetsu[16]);

						//Store in list
						list.add(bean);
					}
				}

				//Write to the csv file with an enhanced for statement, separating the fields with commas
				for(TestBeans tb : list) {
					bw.write(tstrs);
					bw.write(",");
					bw.write(tb.getDate()+","+tb.getPlace()+","+tb.getWeather()+","+tb.getRaceName()+","+tb.getHorseNo()+","+tb.getFamous()+","+tb.getScore()+","+tb.getJockey()+","+tb.getCycle()+","+tb.getSituation()+","+tb.getTime());
					bw.newLine();
				}
			}
			//close processing.
			bw.close();
			System.out.println("Done");

		//Exception handling
		}catch(IOException e) {
			e.printStackTrace();
		}catch(NumberFormatException e) {
			e.printStackTrace();

		//Just in case, close in a finally block so the writer is always closed. Not sure whether it is needed.
		}finally {
			try {
				if(bw!=null) {
					bw.close();
				}
			}catch(IOException e) {
				e.printStackTrace();
			}
		}

		//For execution time measurement
		long end = System.currentTimeMillis();
		System.out.println((end - start)  + "ms");
		System.out.println((end-start)/1000 + "Seconds");
	}
}

The results were written out properly as a csv file. (Screenshots: 2.PNG, 3.PNG) I know it takes too long.

Supplementary explanation

Not much has been added since the previous code.

Execution time measurement

First, purely to measure the execution speed, I added the following at the start and end of main.

		//For execution time measurement
		long start = System.currentTimeMillis();

		//For execution time measurement
		long end = System.currentTimeMillis();
		System.out.println((end - start)  + "ms");
		System.out.println((end-start)/1000 + "Seconds");

This is only a measurement, so it has nothing to do with the actual goal.

Export to csv format

		//Initialize csv export
		BufferedWriter bw=null;

			//close processing.
			bw.close();
			System.out.println("Done");

		//Exception handling
		}catch(IOException e) {
			e.printStackTrace();
		}catch(NumberFormatException e) {
			e.printStackTrace();

		//Just in case, close in a finally block so the writer is always closed. Not sure whether it is needed.
		}finally {
			try {
				if(bw!=null) {
					bw.close();
				}
			}catch(IOException e) {
				e.printStackTrace();
			}
		}

The output is written with the BufferedWriter class. Declaring the variable as null before the try block is convenient, because it can then also be referenced in the finally block. The writer has to be closed, so I close it right after the big for statement ends. I also print "Done" to the console to show that processing has finished.

Exception handling is required, so IOException is caught. The NumberFormatException caught alongside it can occur because String values are converted to int with Integer.parseInt. I do call close, but I was worried about whether it would always be reached, so I also wrote it in the finally block. Whether it is really necessary here is debatable, and there seems to be a cleaner way to write it.
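
One cleaner way to write the close handling is Java's try-with-resources statement, which closes the writer automatically when the block exits, whether normally or through an exception, so neither the explicit close() nor the finally block is needed. The following is only a minimal sketch under assumptions: the class name CsvWriteSketch, the method writeRows, and the sample row are hypothetical and are not part of the original program.

```java
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class CsvWriteSketch {

    // Appends a header line and then one comma-separated line per row.
    // try-with-resources closes bw when the block exits,
    // whether normally or because of an exception.
    static void writeRows(String path, String header, List<String[]> rows) throws IOException {
        try (BufferedWriter bw = new BufferedWriter(new FileWriter(path, true))) {
            bw.write(header);
            bw.newLine();
            for (String[] row : rows) {
                bw.write(String.join(",", row));
                bw.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical sample data, written to a temporary file
        Path file = Files.createTempFile("scraping", ".csv");
        List<String[]> rows = new ArrayList<>();
        rows.add(new String[] {"SampleHorse", "2018/01/01"});
        writeRows(file.toString(), "Horse name,date", rows);
        System.out.println(Files.readAllLines(file));
    }
}
```

Like the original, this sketch opens the FileWriter in append mode, so running it twice against the same file writes the header twice.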

Iterative processing of url

On netkeiba, which I used this time, the horse pages appear to have sequential numbers at the end of the url, so I iterate over them with a for statement.

			//Loop over the numeric ID at the end of the url.
			for(int j = 2015100001; j<=2015100010; j++) {

				//Generate List to be stored in TestBeans prepared separately
				List<TestBeans> list = new ArrayList<TestBeans>();

				//Document A = Jsoup.connect("url").get(); fetches the page at url for scraping
				Document doc = Jsoup.connect("http://db.netkeiba.com/horse/"+j).get();
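
Because each iteration of this loop sends one request to the site, pausing between requests is a simple way to keep the load down. The following is a minimal sketch, not the article's code: the class name PoliteLoopSketch, the method buildUrls, and the one-second interval are assumptions, and the actual fetch is left as a comment at the point where Jsoup.connect(url).get() would be called.

```java
import java.util.ArrayList;
import java.util.List;

public class PoliteLoopSketch {

    // Base URL taken from the article's code
    static final String BASE_URL = "http://db.netkeiba.com/horse/";

    // Builds the page URLs for the IDs from firstId to lastId (inclusive).
    static List<String> buildUrls(int firstId, int lastId) {
        List<String> urls = new ArrayList<>();
        for (int id = firstId; id <= lastId; id++) {
            urls.add(BASE_URL + id);
        }
        return urls;
    }

    public static void main(String[] args) throws InterruptedException {
        long pauseMillis = 1000L; // assumed interval, not a value published by the site
        for (String url : buildUrls(2015100001, 2015100003)) {
            // In the real program, Jsoup.connect(url).get() would be called here.
            System.out.println(url);
            Thread.sleep(pauseMillis); // wait before sending the next request
        }
    }
}
```

Separating URL generation from fetching also makes it easy to check the ID range before any request is sent.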

Main processing

This is the main processing. The scraped data is stored in a list using JavaBeans. The JavaBeans class used this time is as follows.

public class TestBeans {

	private String date;
	private String place;
	private String weather;
	private int race;
	private String raceName;
	private int member;
	private int groupNo;
	private int horseNo;
	private float oz;
	private int famous;
	private int score;
	private String jockey;
	private int kinryo;
	private String cycle;
	private String situation;
	private String time;
	private int weight;
	public String getDate() {
		return date;
	}
	public void setDate(String date) {
		this.date = date;
	}
	public String getPlace() {
		return place;
	}
	public void setPlace(String place) {
		this.place = place;
	}
	public String getWeather() {
		return weather;
	}
	public void setWeather(String weather) {
		this.weather = weather;
	}
	public int getRace() {
		return race;
	}
	public void setRace(int race) {
		this.race = race;
	}
	public String getRaceName() {
		return raceName;
	}
	public void setRaceName(String raceName) {
		this.raceName = raceName;
	}
	public int getMember() {
		return member;
	}
	public void setMember(int member) {
		this.member = member;
	}
	public int getGroupNo() {
		return groupNo;
	}
	public void setGroupNo(int groupNo) {
		this.groupNo = groupNo;
	}
	public int getHorseNo() {
		return horseNo;
	}
	public void setHorseNo(int horseNo) {
		this.horseNo = horseNo;
	}
	public float getOz() {
		return oz;
	}
	public void setOz(float oz) {
		this.oz = oz;
	}
	public int getFamous() {
		return famous;
	}
	public void setFamous(int famous) {
		this.famous = famous;
	}
	public int getScore() {
		return score;
	}
	public void setScore(int score) {
		this.score = score;
	}
	public String getJockey() {
		return jockey;
	}
	public void setJockey(String jockey) {
		this.jockey = jockey;
	}
	public int getKinryo() {
		return kinryo;
	}
	public void setKinryo(int kinryo) {
		this.kinryo = kinryo;
	}
	public String getCycle() {
		return cycle;
	}
	public void setCycle(String cycle) {
		this.cycle = cycle;
	}
	public String getSituation() {
		return situation;
	}
	public void setSituation(String situation) {
		this.situation = situation;
	}
	public String getTime() {
		return time;
	}
	public void setTime(String time) {
		this.time = time;
	}
	public int getWeight() {
		return weight;
	}
	public void setWeight(int weight) {
		this.weight = weight;
	}



}

Eclipse is excellent, isn't it? Once you declare the private fields with their types and names, it can generate the getters and setters automatically. I'm grateful for that.

The beans are stored in the list by the following processing.

				//Declare the string used below
				String str=null;
				//Store each Element in TestBeans with an enhanced for statement
				for(Element a : elm) {   //㋐
					str = a.text();   //㋑
					//Skip rows for excluded, scratched, or stopped runs.
					if(str.indexOf("Exclusion")!=-1 || str.indexOf("Tori")!=-1 ||str.indexOf("During ~")!=-1) {
						continue;
					}   //㋒
					//I only wanted race rows, but the tag selector above alone could not narrow them down, so rows are filtered by character count.
					if(str.length()>=70) {
						String hairetsu[] = str.split(" ");  //㋓
						TestBeans bean = new TestBeans();
						bean.setDate(hairetsu[0]);
						bean.setPlace(hairetsu[1]);
						bean.setWeather(hairetsu[2]);
						bean.setRaceName(hairetsu[4]);
						bean.setHorseNo(Integer.parseInt(hairetsu[7]));
						bean.setFamous(Integer.parseInt(hairetsu[9]));
						bean.setScore(Integer.parseInt(hairetsu[10]));
						bean.setJockey(hairetsu[11]);
						bean.setCycle(hairetsu[13]);
						bean.setSituation(hairetsu[14]);
						bean.setTime(hairetsu[16]);  //㋔

						//Store in list
						list.add(bean);  //㋕
					}
				}

There is nothing special here.
㋐ With an enhanced for statement, take each a (Element type, one table row) out of elm (Elements type).
㋑ Assign the row text to str (String type), which was declared beforehand.
㋒ If the row contains text I want to exclude, continue skips it and moves on to the next element.
㋓ The row text is separated by spaces, so split it into a String array.
㋔ Store the needed array elements in TestBeans through its setters, converting from String to int where required.
㋕ Add the bean to the list of TestBeans.
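
Steps ㋓ and ㋔ assume that every row has at least 17 space-separated fields and that the numeric fields always parse, so a single unexpected row throws an exception and stops the whole run. A more defensive variant could look like the minimal sketch below. The class name RowParseSketch, the helpers parseIntOrDefault and fieldAt, and the sample row are all hypothetical and do not appear in the original code.

```java
public class RowParseSketch {

    // Returns the parsed int, or fallback when the text is not a plain integer.
    static int parseIntOrDefault(String text, int fallback) {
        try {
            return Integer.parseInt(text.trim());
        } catch (NumberFormatException e) {
            return fallback;
        }
    }

    // Returns the field at index, or an empty string when the index is out of range.
    static String fieldAt(String[] fields, int index) {
        return (index >= 0 && index < fields.length) ? fields[index] : "";
    }

    public static void main(String[] args) {
        // Hypothetical row text; a real row from the page has more fields
        String[] fields = "2018/01/01 Tokyo Sunny 11 SampleRace".split(" ");
        System.out.println(fieldAt(fields, 4));                        // SampleRace
        System.out.println(parseIntOrDefault(fieldAt(fields, 3), -1)); // 11
        System.out.println(parseIntOrDefault(fieldAt(fields, 7), -1)); // -1 (row too short)
    }
}
```

With helpers like these, a call such as bean.setHorseNo(parseIntOrDefault(fieldAt(hairetsu, 7), -1)) would record a placeholder value for a malformed row rather than ending the loop.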

At my current skill level, I could not write it any smarter than this.

That's about it for the details. Anyway, I was able to output the data in csv format. I have only tried 10 pages so far, so I'd like to see how long it takes with a somewhat larger number.

Please let me know in the comments if you notice anything odd or know a better way.

Postscript

Thank you, saka1029, for the advice and comments. It is certainly easier than filtering by the number of characters. I'll keep working at it.

Also, I realized that it would be easier to scrape the race result page instead, but I pretended not to notice.
