[JAVA] Introduction to Machine Learning with Spark "Price Estimate" #2: Data Preprocessing (Handling Categorical Variables)

Overview

- Introduction to machine learning using **Java** and **Apache Spark**
- A hands-on, step-up series based on "price estimation", from basic **Apache Spark** usage to actual machine learning (training and regression)
- Uses **Gradient Boosted Trees**, a **supervised learning** algorithm
- Published as a multi-part, step-by-step series
- Full source code is available here: https://github.com/riversun/spark-gradient-boosting-tree-regression-example

Environment

Preprocessing with data types in mind

Last time, we got as far as reading the CSV data into Spark as tabular data.

Now, let's take a look at the data we'll use for training this time.

The data can be found [here](https://raw.githubusercontent.com/riversun/spark-gradient-boosting-tree-regression-example/master/dataset/gem_price_ja.csv).

This is the data.

gem_price_ja.csv


id,material,shape,weight,brand,shop,price
0,Silver,bracelet,56,Famous overseas brands,Department store,40864
1,gold,ring,48,Domestic famous brand,Directly managed store,63055
2,Dia,Earrings,37,Domestic famous brand,Directly managed store,112159
3,Dia,necklace,20,Famous overseas brands,Directly managed store,216053
4,Dia,necklace,33,Famous overseas brands,Department store,219666
5,Silver,brooch,55,Domestic famous brand,Department store,16482
6,platinum,brooch,58,Overseas super famous brand,Directly managed store,377919
7,gold,Earrings,49,Domestic famous brand,Directly managed store,60484
8,Silver,necklace,59,No brand,Cheap shop,6256
...

Qualitative and quantitative data

If you look at what values each variable (material, shape, and so on) can take, you can organize them as follows.

(figure: the set of values each variable can take)

The data on the left (**material, brand, shop, shape**) are string values that indicate a category. Data like this, which cannot be measured directly as a number, is called **qualitative data**.

On the other hand, the data on the right (**weight, price**) are numbers that take continuous values. Data like this, which can be measured directly as a number and supports arithmetic operations, is called **quantitative data**.
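
As a quick sanity check of the quantitative columns, you can print the schema that Spark inferred when loading the CSV (the `inferSchema` option, used in the loading code shown later in this post). A minimal sketch, with the output I would expect abbreviated into comments; the exact types depend on what the inference sees:

    dataset.printSchema();
    // Expected shape (assuming inferSchema typed the columns as intended):
    // root
    //  |-- id: integer (nullable = true)
    //  |-- material: string (nullable = true)
    //  |-- shape: string (nullable = true)
    //  |-- weight: integer (nullable = true)
    //  |-- brand: string (nullable = true)
    //  |-- shop: string (nullable = true)
    //  |-- price: integer (nullable = true)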

Categorical variables and continuous variables

As mentioned above, variables such as **material, brand, shop, shape** that take **qualitative data** are called **categorical variables**.

Likewise, variables such as **weight, price** that take **quantitative data**, that is, continuous values, are called **continuous variables**.

(figure: categorical variables vs. continuous variables)

Handling of categorical variables

Since machine learning is, in the end, numerical computation, the qualitative data of categorical variables must be converted into numbers of some kind before processing.

Let's take a closer look at categorical variables.

The categorical variables **material** and **shape** describe the accessory's material and its finished shape, respectively, but there is a semantic difference between the two.

(figure: the semantic difference between material and shape)

The values of **material** are *diamond, platinum, gold, silver*. We know from experience that diamond is more expensive than silver, so there seems to be a rank order among the values of **material**.

On the other hand, the values of **shape** are *ring, necklace, earrings, brooch, bracelet*; a bracelet is not inherently ranked above a ring, so all the options seem to be on an equal footing.

Categorical variables are divided into **ordinal variables** and **nominal variables**.

A categorical variable whose values are ordered and can be compared by magnitude like this is called an **ordinal variable** (or ordinal scale variable). [^1]

[^1]: In this example the ranking is actually not clear-cut: at current prices, gold can be more expensive than platinum of the same weight. A clearer example is a variable whose values are **top, middle, bottom** or **high, medium, low**, which can unambiguously be treated as ordinal.

On the other hand, a categorical variable whose values have no ranking is called a **nominal variable** (or nominal scale variable).

This topic appears at the beginning of statistics textbooks, and it is worth remembering because it is important background knowledge for machine learning.

Classifying this time's data, it looks like the following.

(figure: which of this dataset's variables are ordinal vs. nominal)

This time, we will not strictly distinguish ordinal variables from nominal variables; we will simply handle the data as categorical variables and continuous variables.

Quantify (index) categorical variables

As mentioned above, categorical variables need to be quantified for machine learning.

There are several quantification techniques, but this time we will simply assign an index number to each value of each variable.

↓ The image is something like this: each string value is mapped to a numeric index (for example, **gold → 0.0, Dia → 1.0, ...**; the actual numbers depend on the ordering rule described later in this post).

(figure: mapping categorical values to index numbers)

Handle categorical variables in Spark

Now, let's write the code that quantifies the categorical variables. First, run the following code.


import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class GBTRegressionStep02 {

  public static void main(String[] args) {

    System.setProperty("hadoop.home.dir", "c:\\Temp\\winutil\\");

    org.apache.log4j.Logger.getLogger("org").setLevel(org.apache.log4j.Level.ERROR);
    org.apache.log4j.Logger.getLogger("akka").setLevel(org.apache.log4j.Level.ERROR);

    SparkSession spark = SparkSession
        .builder()
        .appName("GradientBoostingTreeGegression")
        .master("local[*]")
        .getOrCreate();

    Dataset<Row> dataset = spark
        .read()
        .format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load("dataset/gem_price_ja.csv");

    StringIndexer materialIndexer = new StringIndexer()// (1)
        .setInputCol("material")// (2)
        .setOutputCol("materialIndex");// (3)

    Dataset<Row> materialIndexAddedDataSet = materialIndexer.fit(dataset).transform(dataset);// (4)

    materialIndexAddedDataSet.show(10);// (5)

  }

}

Code commentary

**(1)**: Use **StringIndexer** to quantify (index) a categorical variable. Its role is to take one variable as input and output the encoded (converted) result as another variable. Encoding here simply means converting a string into a number.

**(2)** **setInputCol("material")**: Specifies the input variable name for this **StringIndexer**. When CSV data is loaded into Spark it becomes a tabular **Dataset**, so you can think of a **variable** as a **column** of the table ("col" is short for "column"). In other words, the input column name for this **StringIndexer** is **"material"**.

**(3)** **setOutputCol("materialIndex")**: Specifies the output column name.

**(4)**: The **fit()** method builds a model for numerically encoding the categorical variable. Giving a Dataset to the resulting model with the **transform()** method generates a new Dataset.

I just said "create a learning model with the **fit()** method", but the **fit()** method of the **StringIndexer** object **materialIndexer** does not actually learn anything here. It merely creates a **converter**[^2] that turns the strings of the categorical variable into numbers.

[^2]: **StringIndexer** extends a class called **Estimator**. **Estimator** is primarily a class for **learners**, but by giving preprocessing components such as **StringIndexer** the same interface as learners, the pipeline feature of **spark.ml** can express a flow like **preprocessing → preprocessing → learning** as a single pipeline.

The **transform()** method then does the actual work: it reads the **material** column of the input Dataset and creates a new indexed column called **materialIndex**.

This is probably the most confusing part, but its meaning will become clear later, so for now it is fine to treat it as an incantation.
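
To make the **fit()** / **transform()** split concrete, here is a minimal sketch. It assumes the `dataset` and `materialIndexer` from the code above, an import of **org.apache.spark.ml.feature.StringIndexerModel**, and the Spark 2.x **labels()** API (newer versions offer **labelsArray()**):

    // fit() builds the converter; no "learning" in the ML sense happens here
    StringIndexerModel materialIndexerModel = materialIndexer.fit(dataset);

    // The mapping it captured, in index order (the first label becomes index 0.0)
    System.out.println(String.join(", ", materialIndexerModel.labels()));

    // transform() actually applies the mapping and appends the materialIndex column
    Dataset<Row> indexed = materialIndexerModel.transform(dataset);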

**(5)**: Displays the Dataset after the **StringIndexer** processing.

When you run the code, you get something like this:

(execution result: the Dataset with a materialIndex column appended on the right)

A column called **materialIndex** has been added on the right!

As we have seen, categorical variables cannot be used for machine learning unless they are quantified in some way, so the other categorical variables need to be quantified too.

So let's quantify (index) the remaining categorical variables **shape, brand, shop** with **StringIndexer** in the same way.

The code then looks like this:

import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class GBTRegressionStep02_part02 {

  public static void main(String[] args) {
    SparkSession spark = SparkSession
        .builder()
        .appName("GradientBoostingTreeGegression")
        .master("local[*]")
        .getOrCreate();

    Dataset<Row> dataset = spark
        .read()
        .format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load("dataset/gem_price_ja.csv");

    StringIndexer materialIndexer = new StringIndexer()// (1)
        .setInputCol("material")
        .setOutputCol("materialIndex");

    StringIndexer shapeIndexer = new StringIndexer()// (2)
        .setInputCol("shape")
        .setOutputCol("shapeIndex");

    StringIndexer brandIndexer = new StringIndexer()// (3)
        .setInputCol("brand")
        .setOutputCol("brandIndex");

    StringIndexer shopIndexer = new StringIndexer()// (4)
        .setInputCol("shop")
        .setOutputCol("shopIndex");

    Dataset<Row> dataset1 = materialIndexer.fit(dataset).transform(dataset);// (5)
    Dataset<Row> dataset2 = shapeIndexer.fit(dataset).transform(dataset1);// (6)
    Dataset<Row> dataset3 = brandIndexer.fit(dataset).transform(dataset2);// (7)
    Dataset<Row> dataset4 = shopIndexer.fit(dataset).transform(dataset3);// (8)

    dataset4.show(10);
  }
}

In **(1) to (4)**, a **StringIndexer** is prepared for each variable. In **(5) to (8)**, the indexers are applied one after another in the order **materialIndexer → shapeIndexer → brandIndexer → shopIndexer**, each producing a new Dataset.

When you do this, you get:

(execution result: materialIndex, shapeIndex, brandIndex, shopIndex columns appended)

You can see that **materialIndex, shapeIndex, brandIndex, shopIndex** have been added on the right.

By the way, take another look at this part:

    Dataset<Row> dataset1 = materialIndexer.fit(dataset).transform(dataset);// (5)
    Dataset<Row> dataset2 = shapeIndexer.fit(dataset).transform(dataset1);// (6)
    Dataset<Row> dataset3 = brandIndexer.fit(dataset).transform(dataset2);// (7)
    Dataset<Row> dataset4 = shopIndexer.fit(dataset).transform(dataset3);// (8)

Not very pretty, is it?

When one process finishes and the next process takes its output as input, as in **materialIndexer → shapeIndexer → brandIndexer → shopIndexer**, there is a smarter way to write it.

That smarter way is a mechanism called **Pipeline**, which we will look at next.

Preprocessing data with a Pipeline

Consider the processing below.

Before


    Dataset<Row> dataset1 = materialIndexer.fit(dataset).transform(dataset);
    Dataset<Row> dataset2 = shapeIndexer.fit(dataset).transform(dataset1);
    Dataset<Row> dataset3 = brandIndexer.fit(dataset).transform(dataset2);
    Dataset<Row> dataset4 = shopIndexer.fit(dataset).transform(dataset3);

Using **Pipeline**, you can rewrite it as follows.

After


    Pipeline pipeline = new Pipeline()
        .setStages(new PipelineStage[] { materialIndexer, shapeIndexer, brandIndexer, shopIndexer });// (1)

    PipelineModel pipelineModel = pipeline.fit(dataset);// (2)

    Dataset<Row> indexedDataset = pipelineModel.transform(dataset);// (3)

**(1)**: Create a **Pipeline** object and, with **setStages**, list the **StringIndexer**s in the order you want them executed. Just by lining them up like this, the whole series of steps can be run together as a single **Pipeline**.

**(2)**: Obtain a **PipelineModel** with **pipeline#fit()**. At this stage we are only using **StringIndexer**s, so no learning step is involved yet; once a learning step is added to the pipeline, the **PipelineModel** returned by **fit()** will be a learning model trained on the given Dataset.

**(3)**: Executing **PipelineModel#transform()** runs the series of **StringIndexer** steps in one go and generates a new Dataset. Likewise, if the pipeline includes a training step, this is where the trained model is applied to the given Dataset.

The source code and execution result are posted below.


import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class GBTRegressionStep02_part03 {

  public static void main(String[] args) {

    System.setProperty("hadoop.home.dir", "c:\\Temp\\winutil\\");

    org.apache.log4j.Logger.getLogger("org").setLevel(org.apache.log4j.Level.ERROR);
    org.apache.log4j.Logger.getLogger("akka").setLevel(org.apache.log4j.Level.ERROR);

    SparkSession spark = SparkSession
        .builder()
        .appName("GradientBoostingTreeGegression")
        .master("local[*]")
        .getOrCreate();

    Dataset<Row> dataset = spark
        .read()
        .format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load("dataset/gem_price_ja.csv");

    StringIndexer materialIndexer = new StringIndexer()
        .setInputCol("material")
        .setOutputCol("materialIndex");

    StringIndexer shapeIndexer = new StringIndexer()
        .setInputCol("shape")
        .setOutputCol("shapeIndex");

    StringIndexer brandIndexer = new StringIndexer()
        .setInputCol("brand")
        .setOutputCol("brandIndex");

    StringIndexer shopIndexer = new StringIndexer()
        .setInputCol("shop")
        .setOutputCol("shopIndex");

    Pipeline pipeline = new Pipeline()
        .setStages(new PipelineStage[] { materialIndexer, shapeIndexer, brandIndexer, shopIndexer });// (1)

    PipelineModel pipelineModel = pipeline.fit(dataset);// (2)

    Dataset<Row> indexedDataset = pipelineModel.transform(dataset);// (3)
    indexedDataset.show(10);

  }

}

Execution result

(execution result: the same indexed Dataset as before)

The result is the same, since the code does the same thing semantically.
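
As an aside, if you want to see what each **StringIndexer** learned, the fitted **PipelineModel** exposes its stages. A minimal sketch, assuming the `pipelineModel` from the code above, imports of **org.apache.spark.ml.Transformer** and **org.apache.spark.ml.feature.StringIndexerModel**, and the Spark 2.x **labels()** API:

    for (Transformer stage : pipelineModel.stages()) {
      // Every stage in this pipeline is a fitted StringIndexerModel
      StringIndexerModel indexerModel = (StringIndexerModel) stage;
      // labels() lists the values in index order: the first element is index 0.0
      System.out.println(indexerModel.getOutputCol() + " <= " + String.join(", ", indexerModel.labels()));
    }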

Write the indexing process a little smarter

So far we have cleaned up the execution part with **Pipeline**, but the following part is still a bit redundant.

    StringIndexer materialIndexer = new StringIndexer()
        .setInputCol("material")
        .setOutputCol("materialIndex");

    StringIndexer shapeIndexer = new StringIndexer()
        .setInputCol("shape")
        .setOutputCol("shapeIndex");

    StringIndexer brandIndexer = new StringIndexer()
        .setInputCol("brand")
        .setOutputCol("brandIndex");

    StringIndexer shopIndexer = new StringIndexer()
        .setInputCol("shop")
        .setOutputCol("shopIndex");

    Pipeline pipeline = new Pipeline()
        .setStages(new PipelineStage[] { materialIndexer, shapeIndexer, brandIndexer, shopIndexer });

This is more a matter of Java style than of Spark, but let's clean it up a little.

I rewrote it as follows.

    List<String> categoricalColNames = Arrays.asList("material", "shape", "brand", "shop");// (1)

    List<StringIndexer> stringIndexers = categoricalColNames.stream()
        .map(col -> new StringIndexer()
            .setInputCol(col)
            .setOutputCol(col + "Index"))// (2)
        .collect(Collectors.toList());

    Pipeline pipeline = new Pipeline()
        .setStages(stringIndexers.toArray(new PipelineStage[0]));// (3)

**(1)**: Put the categorical variable names into a List.

**(2)**: Use **List#stream()** to generate a **StringIndexer** for each name. In **setOutputCol**, the output column name is the **categorical variable name + "Index"** (that is, **materialIndex, shapeIndex, brandIndex, shopIndex**). The result is a **List** of **StringIndexer**s.

**(3)**: Convert the list into a **PipelineStage** array and set it as the **Pipeline** stages.

Putting it all together, this installment's source code becomes the following. Much cleaner.


import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class GBTRegressionStep02_part04 {

  public static void main(String[] args) {

    // System.setProperty("hadoop.home.dir", "c:\\Temp\\winutil\\");//for windows

    org.apache.log4j.Logger.getLogger("org").setLevel(org.apache.log4j.Level.ERROR);
    org.apache.log4j.Logger.getLogger("akka").setLevel(org.apache.log4j.Level.ERROR);

    SparkSession spark = SparkSession
        .builder()
        .appName("GradientBoostingTreeGegression")
        .master("local[*]")
        .getOrCreate();

    Dataset<Row> dataset = spark
        .read()
        .format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load("dataset/gem_price_ja.csv");

    List<String> categoricalColNames = Arrays.asList("material", "shape", "brand", "shop");

    List<StringIndexer> stringIndexers = categoricalColNames.stream()
        .map(col -> new StringIndexer()
            .setInputCol(col)
            .setOutputCol(col + "Index"))
        .collect(Collectors.toList());

    Pipeline pipeline = new Pipeline()
        .setStages(stringIndexers.toArray(new PipelineStage[0]));

    PipelineModel pipelineModel = pipeline.fit(dataset);

    Dataset<Row> indexedDataset = pipelineModel.transform(dataset);
    indexedDataset.show(10);

  }

}

If you do this, you'll get a dataset with all the categorical variables indexed, as before.

StringIndexer's indexing policy

I mentioned earlier that categorical variables are divided into ordinal and nominal variables, so by what policy does Spark's **StringIndexer** assign its index numbers?

**StringIndexer** assigns index numbers in a fixed order, and the default ordering rule is **frequency of occurrence**: the most frequent value gets index 0.0, the next most frequent 1.0, and so on.

In addition to **frequency of occurrence**, you can also specify **alphabetical order**.

A setting example is shown below.

Frequency of occurrence - descending order (default)


    StringIndexer materialIndexer1 = new StringIndexer()
        .setStringOrderType("frequencyDesc")
        .setInputCol("material")
        .setOutputCol("materialIndex");

Alphabetical order - descending order


    StringIndexer materialIndexer2 = new StringIndexer()
        .setStringOrderType("alphabetDesc")
        .setInputCol("material")
        .setOutputCol("materialIndex");

Alphabetical order - ascending order


    StringIndexer materialIndexer3 = new StringIndexer()
        .setStringOrderType("alphabetAsc")
        .setInputCol("material")
        .setOutputCol("materialIndex");

We won't use this here, but if you want the index order to carry meaning, you can adjust the value names and control the ordering with **setStringOrderType**.
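
For example, one hypothetical way to give **material** a meaningful order would be to prefix each value with its rank and then index with ***alphabetAsc***. This is only a sketch under that assumption; the `materialRanked` column and the rank prefixes are illustrative, not part of the original dataset, and it needs a static import of **org.apache.spark.sql.functions.***:

    // Hypothetical: encode the intended ranking into the value names so that
    // alphabetical order matches the desired index order (Dia = highest rank)
    Dataset<Row> rankedDataset = dataset.withColumn("materialRanked",
        when(col("material").equalTo("Dia"), "1_dia")
            .when(col("material").equalTo("platinum"), "2_platinum")
            .when(col("material").equalTo("gold"), "3_gold")
            .otherwise("4_silver"));

    StringIndexer ordinalMaterialIndexer = new StringIndexer()
        .setStringOrderType("alphabetAsc")
        .setInputCol("materialRanked")
        .setOutputCol("materialIndex"); // 1_dia -> 0.0, 2_platinum -> 1.0, 3_gold -> 2.0, 4_silver -> 3.0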

Summary of pipeline processing with Apache Spark **spark.ml**

(figure: summary of spark.ml pipeline processing)

**Continued next time: "#3 Training on the data to build the price estimation engine"**

Next time, we will actually train on the data and create the "price estimation engine".
