- Introduction to machine learning using **Java** and **Apache Spark**
- Hands-on, step-up approach: from how to use **Apache Spark** all the way to actual machine learning (training, regression), based on a "price estimation" example
- The machine learning algorithm is a **gradient boosting tree** [^1], a form of **supervised learning**
- Published as a multi-part, step-by-step series
- Click here for the full source code: https://github.com/riversun/spark-gradient-boosting-tree-regression-example
[^1]: The gradient boosting tree in Spark's **spark.ml** shares the same basic theory of **gradient boosting** as the now-popular **LightGBM** and **XGBoost** (when you choose the tree model), but the design (parallelism, etc.) and behavior (hyperparameter tuning to prevent overfitting, etc.) are quite different, so they are not the same. The Spark development team is also [watching these](https://issues.apache.org/jira/browse/SPARK-8547), and [xgboost4j-spark](https://github.com/dmlc/xgboost/tree/master/jvm-packages) from the XGBoost project itself is also available.
- For people who want to try machine learning first, without worrying about the difficult theory
There is a price list of accessories like the following:

| id | material | shape | weight (g) | brand | shop | price (yen) |
|----|----------|-------|------------|-------|------|-------------|
| 0 | Silver | bracelet | 56 | Famous overseas brands | Department store | 40864 |
| 1 | gold | ring | 48 | Domestic famous brand | Directly managed store | 63055 |
| 2 | Dia | Earrings | 37 | Domestic famous brand | Directly managed store | 112159 |
| 3 | Dia | necklace | 20 | Famous overseas brands | Directly managed store | 216053 |
| 4 | Dia | necklace | 33 | Famous overseas brands | Department store | 219666 |
| 5 | Silver | brooch | 55 | Domestic famous brand | Department store | 16482 |
| 6 | platinum | brooch | 58 | Overseas super famous brand | Directly managed store | 377919 |
| 7 | gold | Earrings | 49 | Domestic famous brand | Directly managed store | 60484 |
| 8 | Silver | necklace | 59 | No brand | Cheap shop | 6256 |
| 9 | gold | ring | 13 | Domestic famous brand | Department store | 37514 |
| ... | | | | | | |
| x | Silver | bracelet | 56 | Overseas super famous brand | Directly managed store | **How much?** |

"How much is the price of the **silver bracelet** in the bottom row?"

I would like to predict this value by machine learning using **Java** and **Apache Spark**.
This is so-called **price estimation**.
**Apache Spark** is a **distributed processing framework** in the **Hadoop** family that makes it easy to handle big data. It shows its strength in a distributed environment where multiple machines are used in a cluster configuration, but it can also easily be run standalone on a single local PC.
This time, we will run **Apache Spark** standalone and use its machine learning library, **spark.ml**.
As for the OS, anything with a **Java runtime** is fine: Windows [^2], Linux, or Mac.
**Apache Spark** itself is written in Scala and has APIs that can be used from Scala, Java, Python, and R.
This article uses **Java**.
Set up the dependent libraries to embed and use **Apache Spark** in your Java app.
Just prepare pom.xml (or build.gradle) and add the **Apache Spark** related libraries to the dependencies as shown below. **Jackson** is used inside **Apache Spark**, so add it as well.

pom.xml (excerpt)

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>2.4.3</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.12</artifactId>
    <version>2.4.3</version>
</dependency>
<dependency>
    <groupId>com.fasterxml.jackson.module</groupId>
    <artifactId>jackson-module-scala_2.12</artifactId>
    <version>2.9.9</version>
</dependency>
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.9.9</version>
</dependency>
```
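If you use Gradle instead, the equivalent dependencies might look like the following build.gradle excerpt (a sketch using the same coordinates and versions as the pom.xml above):

```groovy
dependencies {
    implementation 'org.apache.spark:spark-core_2.12:2.4.3'
    implementation 'org.apache.spark:spark-mllib_2.12:2.4.3'
    implementation 'com.fasterxml.jackson.module:jackson-module-scala_2.12:2.9.9'
    implementation 'com.fasterxml.jackson.core:jackson-databind:2.9.9'
}
```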
Next, let's prepare the dataset to be used for training.
The file is located [here](https://raw.githubusercontent.com/riversun/spark-gradient-boosting-tree-regression-example/master/dataset/gem_price_ja.csv).
This time, I would like to use machine learning to predict the prices of accessories, such as necklaces and rings, made from materials such as diamond and platinum.
The following CSV data is used for machine learning.
gem_price_ja.csv

```
id,material,shape,weight,brand,shop,price
0,Silver,bracelet,56,Famous overseas brands,Department store,40864
1,gold,ring,48,Domestic famous brand,Directly managed store,63055
2,Dia,Earrings,37,Domestic famous brand,Directly managed store,112159
3,Dia,necklace,20,Famous overseas brands,Directly managed store,216053
4,Dia,necklace,33,Famous overseas brands,Department store,219666
5,Silver,brooch,55,Domestic famous brand,Department store,16482
6,platinum,brooch,58,Overseas super famous brand,Directly managed store,377919
7,gold,Earrings,49,Domestic famous brand,Directly managed store,60484
8,Silver,necklace,59,No brand,Cheap shop,6256
...
```
The columns of this CSV data are **id, material, shape, weight, brand, shop, price**, in that order, and there are **500 records in total**.
In other words, excluding **id**, there are **6 variables** (**material, shape, weight, brand, shop, price**) across **500 records**.
Create a directory called **dataset** directly under the working directory and place the training data **gem_price_ja.csv** there.
The code that reads this **CSV** file so that Spark can handle it is as follows.
```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class GBTRegressionStep1 {

    public static void main(String[] args) {

        SparkSession spark = SparkSession
                .builder()
                .appName("GradientBoostingTreeRegression")
                .master("local[*]")// (1)
                .getOrCreate();

        spark.sparkContext().setLogLevel("OFF");// (2)

        Dataset<Row> dataset = spark
                .read()
                .format("csv")// (3)
                .option("header", "true")// (4)
                .option("inferSchema", "true")// (5)
                .load("dataset/gem_price_ja.csv");// (6)

        dataset.show();// (7)

        dataset.printSchema();// (8)
    }
}
```
**(1)** `.master("local[*]")` ... runs Spark in local mode. The `*` allocates one worker thread per logical CPU core. With `.master("local")`, the number of worker threads is fixed at **1**; with `.master("local[2]")`, it is **2**.
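As a quick sanity check, you can print the parallelism Spark actually got. This is a small sketch (not part of the original example) using the `defaultParallelism` value of the Spark context, which reflects the `local[...]` setting:

```java
// With local[*], this prints the number of logical CPU cores;
// with local[2], it prints 2.
System.out.println("default parallelism: "
        + spark.sparkContext().defaultParallelism());
```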
**(2)** `spark.sparkContext().setLogLevel("OFF")` ... turns off logging.
Alternatively, to configure it directly via **log4j**, do the following:

```java
org.apache.log4j.Logger.getLogger("org").setLevel(org.apache.log4j.Level.ERROR);
org.apache.log4j.Logger.getLogger("akka").setLevel(org.apache.log4j.Level.ERROR);
```
**(3)** `.format("csv")` ... reads the data file as CSV.
**(4)** `.option("header", "true")` ... when set to true, the first line of the CSV file is used for the column names. Like an RDBMS, Spark treats the data used for training as tabular data, so when preparing training data in CSV format it is convenient that the definitions in the first row are mapped to column names, as shown below.

First line of the CSV file:

```
id,material,shape,weight,brand,shop,price
```
**(5)** `.option("inferSchema", "true")` ... infers the schema of the input data. When set to true, Spark automatically estimates and sets the **type** of each column when reading the CSV file, which is convenient. For example, the following line is simple, so the column **types** are inferred as **integer, string, string, integer, string, string, integer**:

```
0,Silver,bracelet,56,Famous overseas brands,Department store,40864
```

Data this simple can be handled with inferSchema alone. If the inference gets a type wrong, you can define the schema explicitly yourself (see the sketch below).

**(6)** `.load("dataset/gem_price_ja.csv");` ... loads the data file.
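For reference, here is a minimal sketch of an explicit schema definition (the column names and types follow the CSV above; this snippet is an illustration, not part of the original example):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Define the column names and types by hand
StructType schema = new StructType()
        .add("id", DataTypes.IntegerType)
        .add("material", DataTypes.StringType)
        .add("shape", DataTypes.StringType)
        .add("weight", DataTypes.IntegerType)
        .add("brand", DataTypes.StringType)
        .add("shop", DataTypes.StringType)
        .add("price", DataTypes.IntegerType);

// Use the explicit schema instead of .option("inferSchema", "true")
Dataset<Row> dataset = spark
        .read()
        .format("csv")
        .option("header", "true")
        .schema(schema)
        .load("dataset/gem_price_ja.csv");
```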
**(7)** `dataset.show();` ... displays the loaded dataset.
Running this produces output like the following:
```
+---+--------+--------+------+---------------------------+----------------------+------+
| id|material|   shape|weight|                      brand|                  shop| price|
+---+--------+--------+------+---------------------------+----------------------+------+
|  0|  Silver|bracelet|    56|     Famous overseas brands|      Department store| 40864|
|  1|    gold|    ring|    48|      Domestic famous brand|Directly managed store| 63055|
|  2|     Dia|Earrings|    37|      Domestic famous brand|Directly managed store|112159|
|  3|     Dia|necklace|    20|     Famous overseas brands|Directly managed store|216053|
|  4|     Dia|necklace|    33|     Famous overseas brands|      Department store|219666|
|  5|  Silver|  brooch|    55|      Domestic famous brand|      Department store| 16482|
|  6|platinum|  brooch|    58|Overseas super famous brand|Directly managed store|377919|
|  7|    gold|Earrings|    49|      Domestic famous brand|Directly managed store| 60484|
|  8|  Silver|necklace|    59|                   No brand|            Cheap shop|  6256|
|  9|    gold|    ring|    13|      Domestic famous brand|      Department store| 37514|
| 10|platinum|necklace|    23|      Domestic famous brand|            Cheap shop| 48454|
| 11|     Dia|Earrings|    28|     Famous overseas brands|Directly managed store|233614|
| 12|  Silver|necklace|    54|      Domestic famous brand|            Cheap shop| 12235|
| 13|platinum|  brooch|    28|                   No brand|      Department store| 34285|
| 14|  Silver|necklace|    49|                   No brand|            Cheap shop|  5970|
| 15|platinum|bracelet|    40|      Domestic famous brand|      Department store| 82960|
| 16|  Silver|necklace|    21|     Famous overseas brands|      Department store| 28852|
| 17|    gold|    ring|    11|      Domestic famous brand|      Department store| 34980|
| 18|platinum|bracelet|    44|Overseas super famous brand|      Department store|340849|
| 19|  Silver|    ring|    11|Overseas super famous brand|Directly managed store| 47053|
+---+--------+--------+------+---------------------------+----------------------+------+
only showing top 20 rows
```
**(8)** `dataset.printSchema();` ... displays the schema.
Since we set `.option("inferSchema", "true")` earlier, Spark inferred the types properly, as shown below.

```
root
 |-- id: integer (nullable = true)
 |-- material: string (nullable = true)
 |-- shape: string (nullable = true)
 |-- weight: integer (nullable = true)
 |-- brand: string (nullable = true)
 |-- shop: string (nullable = true)
 |-- price: integer (nullable = true)
```
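Before moving on, it can help to sanity-check the loaded data with a few standard `Dataset` operations. The following is a minimal sketch (assuming the `dataset` variable loaded above; not part of the original example):

```java
// Number of records (should print 500 for this dataset)
System.out.println("rows: " + dataset.count());

// Summary statistics (count, mean, stddev, min, max) for the numeric columns
dataset.describe("weight", "price").show();

// Distinct values of a categorical column
dataset.select("material").distinct().show();

// Average price per material
dataset.groupBy("material").avg("price").show();
```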
**Continued in the next article, "#2 Data Preprocessing (Handling Categorical Variables)"**
[^2]: If you get an error message about **winutils.exe** when running Spark in a Windows environment, download **winutils.exe** and place it in an appropriate directory. This is explained in the supplement at the end of this article.
If you get the following error message when running Spark in a Windows environment, download **winutils.exe** and place it in an appropriate directory.

```
Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
```

**winutils.exe** is a utility used by Hadoop to emulate Unix commands on Windows.
Download it from http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe and place it so that the path becomes, for example, **c:/Temp/winutil/bin/winutils.exe**.
Then set the following at the beginning of the code:

```java
System.setProperty("hadoop.home.dir", "c:\\Temp\\winutil\\");
```
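If the same code is shared across platforms, you could guard this setting so it only applies on Windows. A small sketch (the path is just the example location above):

```java
// Only needed on Windows; assumes winutils.exe is at c:\Temp\winutil\bin\winutils.exe
if (System.getProperty("os.name").toLowerCase().contains("windows")) {
    System.setProperty("hadoop.home.dir", "c:\\Temp\\winutil\\");
}
```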