Getting Started with Machine Learning with Spark: "Price Estimation" #1 Loading Datasets with Apache Spark (Java)

Overview

- Introduction to machine learning using **Java** and **Apache Spark**
- A hands-on, step-by-step series based on "price estimation", covering everything from how to use **Apache Spark** to actual machine learning (training and regression)
- Uses **supervised learning** with a **gradient boosted tree**[^1] as the machine learning algorithm
- Builds up step by step over multiple posts
- Full source code is available here: https://github.com/riversun/spark-gradient-boosting-tree-regression-example

[^1]: The gradient boosted tree in Spark's **spark.ml** shares the same basic theory of **gradient boosting** as the now-popular **LightGBM** and **XGBoost** (when a tree model is chosen), but the design (parallelism, etc.) and performance characteristics (hyperparameter tuning to prevent overfitting, etc.) differ considerably; they are not the same thing. The Spark development team is [keeping an eye on these](https://issues.apache.org/jira/browse/SPARK-8547), and [xgboost4j-spark](https://github.com/dmlc/xgboost/tree/master/jvm-packages), provided by the XGBoost project itself, is also available.

Environment

Target

- People who want to try machine learning first, without worrying about the difficult details

What we want to do

Suppose we have a price list of accessories like the one below.

"How much is the ** silver bracelet price ** in the bottom row?"

We would like to predict this value with machine learning, using **Java** and **Apache Spark**.

This is the so-called **price estimation**.

| id | Material | Shape | Weight (g) | Brand | Shop | Price (yen) |
|----|----------|----------|------------|-----------------------------|------------------------|-------------|
| 0 | Silver | Bracelet | 56 | Famous overseas brand | Department store | 40864 |
| 1 | Gold | Ring | 48 | Domestic famous brand | Directly managed store | 63055 |
| 2 | Dia | Earrings | 37 | Domestic famous brand | Directly managed store | 112159 |
| 3 | Dia | Necklace | 20 | Famous overseas brand | Directly managed store | 216053 |
| 4 | Dia | Necklace | 33 | Famous overseas brand | Department store | 219666 |
| 5 | Silver | Brooch | 55 | Domestic famous brand | Department store | 16482 |
| 6 | Platinum | Brooch | 58 | Overseas super famous brand | Directly managed store | 377919 |
| 7 | Gold | Earrings | 49 | Domestic famous brand | Directly managed store | 60484 |
| 8 | Silver | Necklace | 59 | No brand | Cheap shop | 6256 |
| 9 | Gold | Ring | 13 | Domestic famous brand | Department store | 37514 |
| x | Silver | Bracelet | 56 | Overseas super famous brand | Directly managed store | ? |

Use Apache Spark on your local PC

Operating environment

**Apache Spark runs standalone**

**Apache Spark** is a **distributed processing framework** in the **Hadoop** family that makes it easy to work with big data. It shows its real strength in a distributed environment where multiple machines run in a cluster configuration, but it is also easy to run standalone on a single local PC.

This time, we will run **Apache Spark** standalone and use its machine learning module, **spark.ml**.

**OS**: As long as a **Java runtime** is available, Windows[^2], Linux, or Mac is fine.

The development language is Java

**Apache Spark** itself is written in Scala and provides APIs for Scala, Java, Python, and R.

This article uses **Java**.

Setting up library dependencies

Set up the dependent libraries so that **Apache Spark** can be embedded and used in your Java app.

Just prepare a pom.xml (or build.gradle) and add the **Apache Spark** related libraries to the dependencies as shown below. **Jackson** is used inside **Apache Spark**, so add it as well.

pom.xml (excerpt)


```xml
<dependency>
	<groupId>org.apache.spark</groupId>
	<artifactId>spark-core_2.12</artifactId>
	<version>2.4.3</version>
</dependency>
<dependency>
	<groupId>org.apache.spark</groupId>
	<artifactId>spark-mllib_2.12</artifactId>
	<version>2.4.3</version>
</dependency>
<dependency>
	<groupId>com.fasterxml.jackson.module</groupId>
	<artifactId>jackson-module-scala_2.12</artifactId>
	<version>2.9.9</version>
</dependency>
<dependency>
	<groupId>com.fasterxml.jackson.core</groupId>
	<artifactId>jackson-databind</artifactId>
	<version>2.9.9</version>
</dependency>
```
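For reference, since build.gradle was mentioned above: a minimal Gradle equivalent might look like the following. This is a sketch assuming the same artifact versions; the original article only shows the Maven configuration.

```groovy
dependencies {
    implementation 'org.apache.spark:spark-core_2.12:2.4.3'
    implementation 'org.apache.spark:spark-mllib_2.12:2.4.3'
    implementation 'com.fasterxml.jackson.module:jackson-module-scala_2.12:2.9.9'
    implementation 'com.fasterxml.jackson.core:jackson-databind:2.9.9'
}
```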

Prepare the dataset

Next, let's prepare the dataset to be used for training.

The file is located [here](https://raw.githubusercontent.com/riversun/spark-gradient-boosting-tree-regression-example/master/dataset/gem_price_ja.csv).

This time, we would like to use machine learning to predict the prices of accessories, such as necklaces and rings, made from materials such as diamond and platinum.

The following CSV format data is used for machine learning.

gem_price_ja.csv


```
id,material,shape,weight,brand,shop,price
0,Silver,bracelet,56,Famous overseas brands,Department store,40864
1,gold,ring,48,Domestic famous brand,Directly managed store,63055
2,Dia,Earrings,37,Domestic famous brand,Directly managed store,112159
3,Dia,necklace,20,Famous overseas brands,Directly managed store,216053
4,Dia,necklace,33,Famous overseas brands,Department store,219666
5,Silver,brooch,55,Domestic famous brand,Department store,16482
6,platinum,brooch,58,Overseas super famous brand,Directly managed store,377919
7,gold,Earrings,49,Domestic famous brand,Directly managed store,60484
8,Silver,necklace,59,No brand,Cheap shop,6256
...
```

The CSV data linked above has the columns **id, material, shape, weight, brand, shop, price**, in that order, and contains **500 records in total**.

In other words, there are **6 variables** (**material, shape, weight, brand, shop, price**, excluding id) across **500 records**.

Try loading data with Spark

Create a directory called **dataset** directly under the working directory and place the training data **gem_price_ja.csv** there.

The code that reads this **CSV** file so that Spark can handle it is as follows.


```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class GBTRegressionStep1 {

  public static void main(String[] args) {

    SparkSession spark = SparkSession
        .builder()
        .appName("GradientBoostingTreeRegression")
        .master("local[*]")// (1)
        .getOrCreate();

    spark.sparkContext().setLogLevel("OFF");// (2)

    Dataset<Row> dataset = spark
        .read()
        .format("csv")// (3)
        .option("header", "true")// (4)
        .option("inferSchema", "true")// (5)
        .load("dataset/gem_price_ja.csv");// (6)

    dataset.show();// (7)

    dataset.printSchema();// (8)

  }

}
```

Code commentary

**(1)** ***.master("local[\*]")*** ... Runs Spark in local mode. The "\*" allocates one worker thread per logical CPU core. With ***.master("local")***, the number of worker threads is fixed at **1**; with ***.master("local[2]")***, it is **2**.
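As a quick check of how many worker threads local mode actually gave you, you can print the default parallelism. This one-liner is an addition of mine, not part of the original code.

```java
// Sketch: in local mode, defaultParallelism typically corresponds to the number of worker threads
System.out.println("defaultParallelism = " + spark.sparkContext().defaultParallelism());
```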

**(2)** ***spark.sparkContext().setLogLevel("OFF")*** ... Turns off logging.

Alternatively, to configure this directly via **log4j**, do the following:

```java
org.apache.log4j.Logger.getLogger("org").setLevel(org.apache.log4j.Level.ERROR);
org.apache.log4j.Logger.getLogger("akka").setLevel(org.apache.log4j.Level.ERROR);
```

**(3)** ***.format("csv")*** ... Reads the data file as CSV.
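As an aside, ***DataFrameReader*** also has a ***.csv()*** shorthand that combines format("csv") and load(). A small sketch, equivalent to the code above but not how the original article writes it:

```java
// Sketch: shorthand equivalent of format("csv") + load(...)
Dataset<Row> ds = spark.read()
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("dataset/gem_price_ja.csv");
```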

**(4)** ***.option("header", "true")*** ... When set to true, the first line of the CSV file is used as the column names. Like an RDBMS, Spark treats the data used for training as tabular data, so when the training data is prepared in CSV format, it is convenient that the header in the first row maps directly to column names, as shown below.

The first line of the CSV file:

```
id,material,shape,weight,brand,shop,price
```
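Incidentally, if the header option is left at its default of false, Spark assigns generic column names instead. The snippet below is a sketch of that default behavior, not code from the original article:

```java
// Sketch: without header/inferSchema, every column is read as a string,
// the columns are named _c0 ... _c6, and the header line becomes a data row
Dataset<Row> raw = spark.read()
    .format("csv")
    .load("dataset/gem_price_ja.csv");
raw.printSchema(); // all columns: string, named _c0, _c1, ...
```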

**(5)** ***.option("inferSchema", "true")*** ... Infers the schema of the input data. When set to true, Spark automatically infers the **type** of each column while reading the CSV file, which is convenient. For example, a row like the following is simple, so the types are inferred as **integer, string, string, integer, string, string, integer**. If the inference gets something wrong, you can also define the schema explicitly yourself (a sketch of this follows after (6)).

Data this simple is handled fine by inferSchema:

```
0,Silver,bracelet,56,Famous overseas brands,Department store,40864
```

**(6)** ***.load("dataset/gem_price_ja.csv")*** ... Loads the data file.
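Regarding the note in **(5)** that you can define the schema yourself: a minimal sketch might look like the following. The explicit ***StructType*** here is my illustration (the column names match gem_price_ja.csv), not code from the original article.

```java
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Sketch: pass an explicit schema instead of relying on inferSchema
StructType schema = new StructType()
    .add("id", DataTypes.IntegerType)
    .add("material", DataTypes.StringType)
    .add("shape", DataTypes.StringType)
    .add("weight", DataTypes.IntegerType)
    .add("brand", DataTypes.StringType)
    .add("shop", DataTypes.StringType)
    .add("price", DataTypes.IntegerType);

Dataset<Row> dataset = spark
    .read()
    .format("csv")
    .option("header", "true")
    .schema(schema) // explicit schema; inferSchema is no longer needed
    .load("dataset/gem_price_ja.csv");
```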

**(7)** ***dataset.show()*** ... Displays the loaded dataset.

When this is done, it looks like this:

```
+---+--------+------------+------+------------------+------------+------+
| id|material|       shape|weight|             brand|        shop| price|
+---+--------+------------+------+------------------+------------+------+
|  0|Silver|bracelet|    56|Famous overseas brands|Department store| 40864|
|  1|gold|ring|    48|Domestic famous brand|Directly managed store| 63055|
|  2|Dia|Earrings|    37|Domestic famous brand|Directly managed store|112159|
|  3|Dia|necklace|    20|Famous overseas brands|Directly managed store|216053|
|  4|Dia|necklace|    33|Famous overseas brands|Department store|219666|
|  5|Silver|brooch|    55|Domestic famous brand|Department store| 16482|
|  6|platinum|brooch|    58|Overseas super famous brand|Directly managed store|377919|
|  7|gold|Earrings|    49|Domestic famous brand|Directly managed store| 60484|
|  8|Silver|necklace|    59|No brand|Cheap shop|  6256|
|  9|gold|ring|    13|Domestic famous brand|Department store| 37514|
| 10|platinum|necklace|    23|Domestic famous brand|Cheap shop| 48454|
| 11|Dia|Earrings|    28|Famous overseas brands|Directly managed store|233614|
| 12|Silver|necklace|    54|Domestic famous brand|Cheap shop| 12235|
| 13|platinum|brooch|    28|No brand|Department store| 34285|
| 14|Silver|necklace|    49|No brand|Cheap shop|  5970|
| 15|platinum|bracelet|    40|Domestic famous brand|Department store| 82960|
| 16|Silver|necklace|    21|Famous overseas brands|Department store| 28852|
| 17|gold|ring|    11|Domestic famous brand|Department store| 34980|
| 18|platinum|bracelet|    44|Overseas super famous brand|Department store|340849|
| 19|Silver|ring|    11|Overseas super famous brand|Directly managed store| 47053|
+---+--------+------------+------+------------------+------------+------+
only showing top 20 rows
```
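By the way, ***show()*** also accepts arguments; a couple of variants (an addition of mine, not from the original article):

```java
dataset.show(5);         // show only the first 5 rows
dataset.show(20, false); // show 20 rows without truncating long cell values
```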

**(8)** ***dataset.printSchema()*** ... Displays the schema.

We specified ***.option("inferSchema", "true")*** earlier, and Spark has inferred the types correctly, as shown below.

```
root
 |-- id: integer (nullable = true)
 |-- material: string (nullable = true)
 |-- shape: string (nullable = true)
 |-- weight: integer (nullable = true)
 |-- brand: string (nullable = true)
 |-- shop: string (nullable = true)
 |-- price: integer (nullable = true)
```
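To double-check the **500 records** mentioned earlier, one line is enough (again, an addition rather than the original code):

```java
System.out.println("rows = " + dataset.count()); // should print 500
```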

**Continues in "#2 Data Preprocessing (Handling Categorical Variables)"**

[^2]: If you get the error message below when running Spark on Windows, download **winutils.exe** and place it in an appropriate directory. This is explained in the note at the end of this article.

Note: When running Apache Spark on Windows

If you get the following error message when running Spark on Windows, download **winutils.exe** and place it in an appropriate directory.

```
Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
```

**winutils.exe** is a utility used by Hadoop to emulate Unix commands on Windows.

Download it from http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe and place it so that the path becomes, for example, **c:/Temp/winutil/bin/winutils.exe**.

Then, set the following at the beginning of your code:

```java
System.setProperty("hadoop.home.dir", "c:\\Temp\\winutil\\");
```
