- Introduction to machine learning using **Java** and **Apache Spark**
- Hands-on, step-up approach: from how to use **Apache Spark** all the way to actual machine learning (training, regression), based on a "price estimation" example
- The machine learning algorithm is a **gradient boosting tree** [^1], a form of **supervised learning**
- Published as a multi-part, step-by-step series
- Click here for the full source code: https://github.com/riversun/spark-gradient-boosting-tree-regression-example
[^1]: The gradient boosting tree in Spark's **spark.ml** shares the same basic theory of **gradient boosting** as the now-popular **LightGBM** and **XGBoost** (when you choose the tree model), but the design (parallelism, etc.) and behavior (hyperparameter tuning to prevent overfitting, etc.) are quite different, so they are not the same. The Spark development team is also [watching these](https://issues.apache.org/jira/browse/SPARK-8547), and [xgboost4j-spark](https://github.com/dmlc/xgboost/tree/master/jvm-packages) from the XGBoost project itself is also available.
- For people who want to try machine learning first, without worrying about the difficult theory
There is a price list of accessories like the following:

| id | material | shape | weight (g) | brand | shop | price (yen) |
|----|----------|-------|------------|-------|------|-------------|
| 0 | Silver | bracelet | 56 | Famous overseas brands | Department store | 40864 |
| 1 | gold | ring | 48 | Domestic famous brand | Directly managed store | 63055 |
| 2 | Dia | Earrings | 37 | Domestic famous brand | Directly managed store | 112159 |
| 3 | Dia | necklace | 20 | Famous overseas brands | Directly managed store | 216053 |
| 4 | Dia | necklace | 33 | Famous overseas brands | Department store | 219666 |
| 5 | Silver | brooch | 55 | Domestic famous brand | Department store | 16482 |
| 6 | platinum | brooch | 58 | Overseas super famous brand | Directly managed store | 377919 |
| 7 | gold | Earrings | 49 | Domestic famous brand | Directly managed store | 60484 |
| 8 | Silver | necklace | 59 | No brand | Cheap shop | 6256 |
| 9 | gold | ring | 13 | Domestic famous brand | Department store | 37514 |
| ... | | | | | | |
| x | Silver | bracelet | 56 | Overseas super famous brand | Directly managed store | **How much?** |

"How much is the price of the **silver bracelet** in the bottom row?"

I would like to predict this value by machine learning using **Java** and **Apache Spark**.
This is so-called **price estimation**.
**Apache Spark** is a **distributed processing framework** in the **Hadoop** family that makes it easy to handle big data. It shows its strength in a distributed environment where multiple machines are used in a cluster configuration, but it can also easily be run standalone on a single local PC.
This time, we will run **Apache Spark** standalone and use its machine learning library, **spark.ml**.
As for the OS, anything with a **Java runtime** is fine: Windows [^2], Linux, or Mac.
**Apache Spark** itself is written in Scala and has APIs that can be used from Scala, Java, Python, and R.
This article uses **Java**.
Set up the dependent libraries to embed and use **Apache Spark** in your Java app.
Just prepare pom.xml (or build.gradle) and add the **Apache Spark** related libraries to the dependencies as shown below. **Jackson** is used inside **Apache Spark**, so add it as well.

pom.xml (excerpt)

```xml
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.12</artifactId>
    <version>2.4.3</version>
</dependency>
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-mllib_2.12</artifactId>
    <version>2.4.3</version>
</dependency>
<dependency>
    <groupId>com.fasterxml.jackson.module</groupId>
    <artifactId>jackson-module-scala_2.12</artifactId>
    <version>2.9.9</version>
</dependency>
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-databind</artifactId>
    <version>2.9.9</version>
</dependency>
```
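If you use Gradle instead, the equivalent dependencies might look like the following build.gradle excerpt (a sketch using the same coordinates and versions as the pom.xml above):

```groovy
dependencies {
    implementation 'org.apache.spark:spark-core_2.12:2.4.3'
    implementation 'org.apache.spark:spark-mllib_2.12:2.4.3'
    implementation 'com.fasterxml.jackson.module:jackson-module-scala_2.12:2.9.9'
    implementation 'com.fasterxml.jackson.core:jackson-databind:2.9.9'
}
```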
Next, let's prepare the dataset to be used for training.
The file is located [here](https://raw.githubusercontent.com/riversun/spark-gradient-boosting-tree-regression-example/master/dataset/gem_price_ja.csv).
This time, I would like to use machine learning to predict the prices of accessories, such as necklaces and rings, made from materials such as diamond and platinum.
The following CSV data is used for machine learning.
gem_price_ja.csv

```
id,material,shape,weight,brand,shop,price
0,Silver,bracelet,56,Famous overseas brands,Department store,40864
1,gold,ring,48,Domestic famous brand,Directly managed store,63055
2,Dia,Earrings,37,Domestic famous brand,Directly managed store,112159
3,Dia,necklace,20,Famous overseas brands,Directly managed store,216053
4,Dia,necklace,33,Famous overseas brands,Department store,219666
5,Silver,brooch,55,Domestic famous brand,Department store,16482
6,platinum,brooch,58,Overseas super famous brand,Directly managed store,377919
7,gold,Earrings,49,Domestic famous brand,Directly managed store,60484
8,Silver,necklace,59,No brand,Cheap shop,6256
...
```
The columns of this CSV data are **id, material, shape, weight, brand, shop, price**, in that order, and there are **500 records in total**.
In other words, excluding **id**, there are **6 variables** (**material, shape, weight, brand, shop, price**) across **500 records**.
Create a directory called **dataset** directly under the working directory and place the training data **gem_price_ja.csv** there.
The code that reads this **CSV** file so that Spark can handle it is as follows.
```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class GBTRegressionStep1 {

    public static void main(String[] args) {

        SparkSession spark = SparkSession
                .builder()
                .appName("GradientBoostingTreeRegression")
                .master("local[*]")// (1)
                .getOrCreate();

        spark.sparkContext().setLogLevel("OFF");// (2)

        Dataset<Row> dataset = spark
                .read()
                .format("csv")// (3)
                .option("header", "true")// (4)
                .option("inferSchema", "true")// (5)
                .load("dataset/gem_price_ja.csv");// (6)

        dataset.show();// (7)

        dataset.printSchema();// (8)
    }
}
```
**(1)** `.master("local[*]")` ... runs Spark in local mode. The `*` allocates one worker thread per logical CPU core. With `.master("local")`, the number of worker threads is fixed at **1**; with `.master("local[2]")`, it is **2**.
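As a quick sanity check, you can print the parallelism Spark actually got. This is a small sketch (not part of the original example) using the `defaultParallelism` value of the Spark context, which reflects the `local[...]` setting:

```java
// With local[*], this prints the number of logical CPU cores;
// with local[2], it prints 2.
System.out.println("default parallelism: "
        + spark.sparkContext().defaultParallelism());
```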
**(2)** `spark.sparkContext().setLogLevel("OFF")` ... turns off logging.
Alternatively, to configure it directly via **log4j**, do the following:

```java
org.apache.log4j.Logger.getLogger("org").setLevel(org.apache.log4j.Level.ERROR);
org.apache.log4j.Logger.getLogger("akka").setLevel(org.apache.log4j.Level.ERROR);
```
**(3)** `.format("csv")` ... reads the data file as CSV.
**(4)** `.option("header", "true")` ... when set to true, the first line of the CSV file is used for the column names. Like an RDBMS, Spark treats the data used for training as tabular data, so when preparing training data in CSV format it is convenient that the definitions in the first row are mapped to column names, as shown below.

First line of the CSV file:

```
id,material,shape,weight,brand,shop,price
```
**(5)** `.option("inferSchema", "true")` ... infers the schema of the input data. When set to true, Spark automatically estimates and sets the **type** of each column when reading the CSV file, which is convenient. For example, the following line is simple, so the column **types** are inferred as **integer, string, string, integer, string, string, integer**:

```
0,Silver,bracelet,56,Famous overseas brands,Department store,40864
```

Data this simple can be handled with inferSchema alone. If the inference gets a type wrong, you can define the schema explicitly yourself (see the sketch below).

**(6)** `.load("dataset/gem_price_ja.csv");` ... loads the data file.
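For reference, here is a minimal sketch of an explicit schema definition (the column names and types follow the CSV above; this snippet is an illustration, not part of the original example):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

// Define the column names and types by hand
StructType schema = new StructType()
        .add("id", DataTypes.IntegerType)
        .add("material", DataTypes.StringType)
        .add("shape", DataTypes.StringType)
        .add("weight", DataTypes.IntegerType)
        .add("brand", DataTypes.StringType)
        .add("shop", DataTypes.StringType)
        .add("price", DataTypes.IntegerType);

// Use the explicit schema instead of .option("inferSchema", "true")
Dataset<Row> dataset = spark
        .read()
        .format("csv")
        .option("header", "true")
        .schema(schema)
        .load("dataset/gem_price_ja.csv");
```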
**(7)** `dataset.show();` ... displays the loaded dataset.
Running this produces output like the following:
```
+---+--------+--------+------+---------------------------+----------------------+------+
| id|material|   shape|weight|                      brand|                  shop| price|
+---+--------+--------+------+---------------------------+----------------------+------+
|  0|  Silver|bracelet|    56|     Famous overseas brands|      Department store| 40864|
|  1|    gold|    ring|    48|      Domestic famous brand|Directly managed store| 63055|
|  2|     Dia|Earrings|    37|      Domestic famous brand|Directly managed store|112159|
|  3|     Dia|necklace|    20|     Famous overseas brands|Directly managed store|216053|
|  4|     Dia|necklace|    33|     Famous overseas brands|      Department store|219666|
|  5|  Silver|  brooch|    55|      Domestic famous brand|      Department store| 16482|
|  6|platinum|  brooch|    58|Overseas super famous brand|Directly managed store|377919|
|  7|    gold|Earrings|    49|      Domestic famous brand|Directly managed store| 60484|
|  8|  Silver|necklace|    59|                   No brand|            Cheap shop|  6256|
|  9|    gold|    ring|    13|      Domestic famous brand|      Department store| 37514|
| 10|platinum|necklace|    23|      Domestic famous brand|            Cheap shop| 48454|
| 11|     Dia|Earrings|    28|     Famous overseas brands|Directly managed store|233614|
| 12|  Silver|necklace|    54|      Domestic famous brand|            Cheap shop| 12235|
| 13|platinum|  brooch|    28|                   No brand|      Department store| 34285|
| 14|  Silver|necklace|    49|                   No brand|            Cheap shop|  5970|
| 15|platinum|bracelet|    40|      Domestic famous brand|      Department store| 82960|
| 16|  Silver|necklace|    21|     Famous overseas brands|      Department store| 28852|
| 17|    gold|    ring|    11|      Domestic famous brand|      Department store| 34980|
| 18|platinum|bracelet|    44|Overseas super famous brand|      Department store|340849|
| 19|  Silver|    ring|    11|Overseas super famous brand|Directly managed store| 47053|
+---+--------+--------+------+---------------------------+----------------------+------+
only showing top 20 rows
```
**(8)** `dataset.printSchema();` ... displays the schema.
Since we set `.option("inferSchema", "true")` earlier, Spark inferred the types properly, as shown below.

```
root
 |-- id: integer (nullable = true)
 |-- material: string (nullable = true)
 |-- shape: string (nullable = true)
 |-- weight: integer (nullable = true)
 |-- brand: string (nullable = true)
 |-- shop: string (nullable = true)
 |-- price: integer (nullable = true)
```
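Before moving on, it can help to sanity-check the loaded data with a few standard `Dataset` operations. The following is a minimal sketch (assuming the `dataset` variable loaded above; not part of the original example):

```java
// Number of records (should print 500 for this dataset)
System.out.println("rows: " + dataset.count());

// Summary statistics (count, mean, stddev, min, max) for the numeric columns
dataset.describe("weight", "price").show();

// Distinct values of a categorical column
dataset.select("material").distinct().show();

// Average price per material
dataset.groupBy("material").avg("price").show();
```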
**Continued in the next article, "#2 Data Preprocessing (Handling Categorical Variables)"**
[^2]: If you get an error message about **winutils.exe** when running Spark in a Windows environment, download **winutils.exe** and place it in an appropriate directory. This is explained in the supplement at the end of this article.
If you get the following error message when running Spark in a Windows environment, download **winutils.exe** and place it in an appropriate directory.

```
Could not locate executable null\bin\winutils.exe in the Hadoop binaries.
```

**winutils.exe** is a utility used by Hadoop to emulate Unix commands on Windows.
Download it from http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe and place it so that the path becomes, for example, **c:/Temp/winutil/bin/winutils.exe**.
Then set the following at the beginning of the code:

```java
System.setProperty("hadoop.home.dir", "c:\\Temp\\winutil\\");
```
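If the same code is shared across platforms, you could guard this setting so it only applies on Windows. A small sketch (the path is just the example location above):

```java
// Only needed on Windows; assumes winutils.exe is at c:\Temp\winutil\bin\winutils.exe
if (System.getProperty("os.name").toLowerCase().contains("windows")) {
    System.setProperty("hadoop.home.dir", "c:\\Temp\\winutil\\");
}
```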