Click here for the previous article (https://qiita.com/suisen/items/a856c06accdab922153c)
Reference sites:
Java a little reference (https://java-reference.com/java_string_tonumber.html)
TECH PROJIN (https://tech.pjin.jp/blog/2017/10/17/ [java] CSV output sample code /)
Samurai Yamashita (https://www.sejuku.net/blog/20746)
Let's Programming (https://www.javadrive.jp/start/stream/index6.html)
Recap of the previous article: (1) I wanted to try scraping. (2) I downloaded jsoup.jar, added it to the project, and wrote sample code. (3) I confirmed that text can be extracted from Yahoo! and other sites by specifying tags.
Goals this time: (1) Get horse racing results from netKeiba (http://www.netkeiba.com/). (2) Export them to a CSV file that can be opened in Excel and similar tools.
So I actually built it. I'm still inexperienced, so I am not worrying about run time or efficiency here; I think just getting the data is worthwhile for now. Also, scraping is often said to put load on the target site, so be careful not to overdo it.
First, the code. (I think it should be split into classes, but since it is a sample, everything is in one place.)
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
public class Test6 {
/**
*Flow: prepare the CSV writer ⇒ build the URL ⇒ scrape with jsoup ⇒ store the rows in a list of JavaBeans
*⇒ write the list out as CSV ⇒ handle exceptions in the catch blocks ⇒ close the writer
*/
public static void main(String[] args) {
//For execution time measurement
long start = System.currentTimeMillis();
//Declare the CSV writer
BufferedWriter bw=null;
//try-catch is required
try {
//Open the output file (append mode) and write the header row
bw=new BufferedWriter(new FileWriter("D:\\sakura\\ScrapingHtml\\scraping.csv", true));
bw.write("Horse name,Date,Venue,Weather,Race name,Horse number,Popularity,Finishing position,Jockey,Distance,Track condition,Time");
bw.newLine();
//The last part of the URL is a number, so loop over it.
for(int j = 2015100001; j<=2015100010; j++) {
//Create the List that holds the separately defined TestBeans
List<TestBeans> list = new ArrayList<TestBeans>();
//Pattern: Document A = Jsoup.connect("url").get(); fetches the page at the URL to scrape
Document doc = Jsoup.connect("http://db.netkeiba.com/horse/"+j).get();
//Pattern: Elements B = A.select("tag"); extracts the parts of the source that match the given tag.
Elements elm = doc.select("tbody tr");
Elements title = doc.select("title");
//Prepare the horse name for output
//Take at most the first 10 characters of the title, then cut at the first space if there is one.
String fullTitle = title.text();
String tstr = fullTitle.substring(0, Math.min(10, fullTitle.length()));
int i = tstr.indexOf(" ");
if(i==-1) {
i=tstr.length();
}
String tstrs = tstr.substring(0, i);
//Initialize the string used below
String str=null;
//Store each Element in a TestBeans with an enhanced for loop
for(Element a : elm) {
str = a.text();
//Skip rows where the horse was excluded or scratched, or the race was cancelled.
if(str.indexOf("Exclusion")!=-1 || str.indexOf("Tori")!=-1 ||str.indexOf("During ~")!=-1) {
continue;
}
//I only want the race rows, but the tag selector above also matches other rows, so I filter by text length.
if(str.length()>=70) {
String hairetsu[] = str.split(" ");
TestBeans bean = new TestBeans();
bean.setDate(hairetsu[0]);
bean.setPlace(hairetsu[1]);
bean.setWeather(hairetsu[2]);
bean.setRaceName(hairetsu[4]);
bean.setHorseNo(Integer.parseInt(hairetsu[7]));
bean.setFamous(Integer.parseInt(hairetsu[9]));
bean.setScore(Integer.parseInt(hairetsu[10]));
bean.setJockey(hairetsu[11]);
bean.setCycle(hairetsu[13]);
bean.setSituation(hairetsu[14]);
bean.setTime(hairetsu[16]);
//Store in list
list.add(bean);
}
}
//Write each bean to the CSV file with an enhanced for loop, separating the fields with commas
for(TestBeans tb : list) {
bw.write(tstrs);
bw.write(",");
bw.write(tb.getDate()+","+tb.getPlace()+","+tb.getWeather()+","+tb.getRaceName()+","+tb.getHorseNo()+","+tb.getFamous()+","+tb.getScore()+","+tb.getJockey()+","+tb.getCycle()+","+tb.getSituation()+","+tb.getTime());
bw.newLine();
}
}
//close processing.
bw.close();
System.out.println("Done");
//Exception handling
}catch(IOException e) {
e.printStackTrace();
}catch(NumberFormatException e) {
e.printStackTrace();
//Just in case, close in the finally block as well. I'm not sure whether it is needed.
}finally {
try {
if(bw!=null) {
bw.close();
}
}catch(IOException e) {
e.printStackTrace();
}
}
//For execution time measurement
long end = System.currentTimeMillis();
System.out.println((end - start) + "ms");
System.out.println((end-start)/1000 + "Seconds");
}
}
The result is written out properly as a CSV file. I know it takes too long.
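One caveat about the export loop in the listing: it joins the fields with plain commas, so a value that itself contains a comma or a double quote would shift the columns when the file is opened in Excel. The following is a minimal sketch of quoting such values, assuming the common RFC 4180 style rules (wrap the field in double quotes and double any embedded quote). The class name `CsvEscapeSketch` and the sample values are made up for illustration and are not part of the original code.

```java
public class CsvEscapeSketch {

    /** Quotes a field only when it contains a comma, a double quote, or a line break. */
    public static String escape(String field) {
        if (field == null) {
            return "";
        }
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        // Embedded double quotes are doubled, then the whole field is wrapped in quotes
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    /** Joins already-extracted values into one CSV line. */
    public static String joinRow(String... fields) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(escape(fields[i]));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Hypothetical values; the second one contains a comma
        System.out.println(joinRow("SampleHorse", "Sample, Cup", "1"));
        // prints: SampleHorse,"Sample, Cup",1
    }
}
```

In the article's loop, something like `bw.write(joinRow(...))` could take the place of the long string concatenation, though whether netkeiba values ever contain commas is not something I have checked.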
Not much has been added since the previous code.
First, purely to measure the execution time, I added the following.
//For execution time measurement
long start = System.currentTimeMillis();
//For execution time measurement
long end = System.currentTimeMillis();
System.out.println((end - start) + "ms");
System.out.println((end-start)/1000 + "Seconds");
This is only a measurement, so it has nothing to do with the actual goal.
//Declare the CSV writer
BufferedWriter bw=null;
//close processing.
bw.close();
System.out.println("Done");
//Exception handling
}catch(IOException e) {
e.printStackTrace();
}catch(NumberFormatException e) {
e.printStackTrace();
//Just in case, close in the finally block as well. I'm not sure whether it is needed.
}finally {
try {
if(bw!=null) {
bw.close();
}
}catch(IOException e) {
e.printStackTrace();
}
}
The BufferedWriter class handles the export to an external file. It is convenient to declare it up front and initialize it to null. I will say more below, but since the writer has to be closed, I close it right after the outer for loop ends. I also print "Done" to the console to show that the run has finished.
Exception handling is required, so I catch IOException. The NumberFormatException caught alongside it covers the conversions from String to int done with Integer.parseInt. I do close the writer inside the try block, but I was worried about whether that line would always run, so I also close it in the finally block. Whether that is really necessary here is unclear, and there is probably a cleaner way to write it.
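One cleaner way to guarantee the close is Java's try-with-resources statement, which closes the writer automatically when the block exits, whether normally or through an exception. Below is a minimal sketch, not the article's code: the class name `CsvWriteSketch`, the temp file, and the sample row are assumptions for illustration. Note one difference: `Files.newBufferedWriter` with no options overwrites the file, whereas the article opens `FileWriter` in append mode.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

public class CsvWriteSketch {

    /** Writes a header line and the given rows; the writer is closed automatically. */
    public static void writeCsv(Path out, String header, List<String> rows) throws IOException {
        // The resource declared in the parentheses is closed when the block ends,
        // so no explicit close() call and no finally block are needed
        try (BufferedWriter bw = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            bw.write(header);
            bw.newLine();
            for (String row : rows) {
                bw.write(row);
                bw.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical output location: a temp file instead of the article's fixed path
        Path out = Files.createTempFile("scraping", ".csv");
        writeCsv(out, "Horse name,Date", Arrays.asList("SampleHorse,2018/01/01"));
        System.out.println(Files.readAllLines(out, StandardCharsets.UTF_8));
    }
}
```

With this shape, the `bw=null` initialization and the null check in `finally` are no longer needed, because the variable only exists inside the try block.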
On netkeiba, which I used this time, the horse pages appear to have sequential numbers at the end of the URL, so I simply loop over them with a for statement.
//The last part of the URL is a number, so loop over it.
for(int j = 2015100001; j<=2015100010; j++) {
//Create the List that holds the separately defined TestBeans
List<TestBeans> list = new ArrayList<TestBeans>();
//Pattern: Document A = Jsoup.connect("url").get(); fetches the page at the URL to scrape
Document doc = Jsoup.connect("http://db.netkeiba.com/horse/"+j).get();
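Since the loop requests one page per ID, it is a natural place to add a pause so the site is not hit in a tight burst. The sketch below only builds the URLs and sleeps between them; it does not fetch anything. The class name `HorseUrlSketch`, the three-ID range in `main`, and the one-second delay are all assumptions for illustration, and the right delay depends on the site.

```java
import java.util.ArrayList;
import java.util.List;

public class HorseUrlSketch {

    // Base URL as used in the article's code
    private static final String BASE = "http://db.netkeiba.com/horse/";

    /** Builds one URL per ID in the inclusive range [first, last]. */
    public static List<String> buildUrls(long first, long last) {
        List<String> urls = new ArrayList<>();
        for (long id = first; id <= last; id++) {
            urls.add(BASE + id);
        }
        return urls;
    }

    public static void main(String[] args) throws InterruptedException {
        long delayMillis = 1000; // assumed pause between requests
        for (String url : buildUrls(2015100001L, 2015100003L)) {
            // The article would call Jsoup.connect(url).get() at this point
            System.out.println("would fetch: " + url);
            Thread.sleep(delayMillis);
        }
    }
}
```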
Now for the main processing. The scraped data is stored in a list of JavaBeans. The bean used this time is as follows.
public class TestBeans {
private String date;
private String place;
private String weather;
private int race;
private String raceName;
private int member;
private int groupNo;
private int horseNo;
private float oz;
private int famous;
private int score;
private String jockey;
private int kinryo;
private String cycle;
private String situation;
private String time;
private int weight;
public String getDate() {
return date;
}
public void setDate(String date) {
this.date = date;
}
public String getPlace() {
return place;
}
public void setPlace(String place) {
this.place = place;
}
public String getWeather() {
return weather;
}
public void setWeather(String weather) {
this.weather = weather;
}
public int getRace() {
return race;
}
public void setRace(int race) {
this.race = race;
}
public String getRaceName() {
return raceName;
}
public void setRaceName(String raceName) {
this.raceName = raceName;
}
public int getMember() {
return member;
}
public void setMember(int member) {
this.member = member;
}
public int getGroupNo() {
return groupNo;
}
public void setGroupNo(int groupNo) {
this.groupNo = groupNo;
}
public int getHorseNo() {
return horseNo;
}
public void setHorseNo(int horseNo) {
this.horseNo = horseNo;
}
public float getOz() {
return oz;
}
public void setOz(float oz) {
this.oz = oz;
}
public int getFamous() {
return famous;
}
public void setFamous(int famous) {
this.famous = famous;
}
public int getScore() {
return score;
}
public void setScore(int score) {
this.score = score;
}
public String getJockey() {
return jockey;
}
public void setJockey(String jockey) {
this.jockey = jockey;
}
public int getKinryo() {
return kinryo;
}
public void setKinryo(int kinryo) {
this.kinryo = kinryo;
}
public String getCycle() {
return cycle;
}
public void setCycle(String cycle) {
this.cycle = cycle;
}
public String getSituation() {
return situation;
}
public void setSituation(String situation) {
this.situation = situation;
}
public String getTime() {
return time;
}
public void setTime(String time) {
this.time = time;
}
public int getWeight() {
return weight;
}
public void setWeight(int weight) {
this.weight = weight;
}
}
Eclipse is excellent, isn't it? Once you declare the private fields with their types and names, it can generate the getters and setters automatically. Thank you.
The data is stored in the list by the following processing.
//Initialize the string used below
String str=null;
//Store each Element in a TestBeans with an enhanced for loop
for(Element a : elm) { //㋐
str = a.text(); //㋑
//Skip rows where the horse was excluded or scratched, or the race was cancelled.
if(str.indexOf("Exclusion")!=-1 || str.indexOf("Tori")!=-1 ||str.indexOf("During ~")!=-1) {
continue;
} //㋒
//I only want the race rows, but the tag selector above also matches other rows, so I filter by text length.
if(str.length()>=70) {
String hairetsu[] = str.split(" "); //㋓
TestBeans bean = new TestBeans();
bean.setDate(hairetsu[0]);
bean.setPlace(hairetsu[1]);
bean.setWeather(hairetsu[2]);
bean.setRaceName(hairetsu[4]);
bean.setHorseNo(Integer.parseInt(hairetsu[7]));
bean.setFamous(Integer.parseInt(hairetsu[9]));
bean.setScore(Integer.parseInt(hairetsu[10]));
bean.setJockey(hairetsu[11]);
bean.setCycle(hairetsu[13]);
bean.setSituation(hairetsu[14]);
bean.setTime(hairetsu[16]); //㋔
//Store in list
list.add(bean); //㋕
}
}
Nothing special is going on here.
㋐ The enhanced for loop takes each a (Element type) out of elm (Elements type), which holds multiple elements.
㋑ Its text is assigned to str (String type), which was declared beforehand.
㋒ If the text contains something I want to exclude, continue skips to the next element.
㋓ The text of one row is separated by spaces, so split it into a String array.
㋔ Store the needed array entries in TestBeans through its setters, converting from String to int where required.
㋕ Add the bean to the list of TestBeans.
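Steps ㋓ and ㋔ rely on every kept row having at least 17 space-separated tokens and on the numeric columns always being digits. If a row is shorter, or a value such as the race name contains a space, the indexes shift and `Integer.parseInt` or the array access can throw. As a rough sketch of a more defensive version, assuming the same column positions as the article's code: the class name `RowParseSketch`, the `MIN_TOKENS` constant, and the sample row are invented for illustration and are not real netkeiba data.

```java
import java.util.Optional;

public class RowParseSketch {

    // The article reads up to index 16, so a usable row needs at least 17 tokens (assumption)
    private static final int MIN_TOKENS = 17;

    /**
     * Returns "date,place,horseNo,finish" for a row that looks like a race row,
     * or an empty Optional when the row is too short or a numeric column is not a number.
     */
    public static Optional<String> toCsv(String rowText) {
        String[] cells = rowText.trim().split("\\s+");
        if (cells.length < MIN_TOKENS) {
            return Optional.empty();
        }
        try {
            int horseNo = Integer.parseInt(cells[7]);  // same index as setHorseNo in the article
            int finish = Integer.parseInt(cells[10]);  // same index as setScore in the article
            return Optional.of(cells[0] + "," + cells[1] + "," + horseNo + "," + finish);
        } catch (NumberFormatException e) {
            // Skip this row instead of aborting the whole run
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Made-up row with 17 tokens, laid out to match the indexes used above
        String sample = "2018/01/01 1Tokyo1 sunny 11 SampleStakes - 16 3 5.0 2 1 "
                + "SampleJockey 55 turf1600 good - 1:33.5";
        System.out.println(toCsv(sample).orElse("(skipped)"));
        System.out.println(toCsv("too short").orElse("(skipped)"));
    }
}
```

Checking the token count takes over the role of the `str.length()>=70` test, and catching NumberFormatException per row means one odd row no longer stops the loop.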
I couldn't come up with a smarter way to write this at my current skill level.
That covers the details. Anyway, I was able to output the data in CSV format. I've only tried 10 horses so far, so I'd like to see how long it takes with a somewhat larger number.
Please let me know in the comments if you notice anything odd or know a better way.
Thank you, saka1029, for your advice and comment. It is certainly easier than filtering by the number of characters. I will keep working at it.
Also, I realized that it would be easier to scrape the race results page, but I pretended not to notice.