[JAVA] [Machine learning with Apache Spark] Sparse Vector (sparse vector) and Dense Vector (dense vector)

Overview

- A summary of Sparse Vector and Dense Vector in Apache Spark

Environment

What is Sparse Vector?

"Sparse" here means that most of a vector's elements are zero.

Take, for example, the following vector:

[0.1,0.0,0.0,0.0,0.5]

To describe this vector, it is enough to say:

"**The first element is 0.1**, **the last element is 0.5**, and **the vector has 5 elements**."

Every element that is not mentioned is taken to be **0.0**.

Because only the non-zero elements are recorded, less information has to be stored, and the benefit is that **memory can be saved**.

Most machine learning libraries implement some form of sparse vector.
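To make the idea concrete before turning to Spark, here is a minimal plain-Java sketch of that representation. The class `SparseSketch` and its method `toDense` are hypothetical names made up for this illustration and are not part of Spark; the sketch only assumes that a sparse vector is described by its size, the positions of the non-zero elements, and their values.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Spark.
final class SparseSketch {

    // Expands a (size, indices, values) description into a full array.
    static double[] toDense(int size, int[] indices, double[] values) {
        double[] dense = new double[size]; // every slot starts at 0.0
        for (int k = 0; k < indices.length; k++) {
            dense[indices[k]] = values[k]; // place each recorded value
        }
        return dense;
    }

    public static void main(String[] args) {
        // "first element is 0.1, last element is 0.5, 5 elements in total"
        double[] dense = toDense(5, new int[] { 0, 4 }, new double[] { 0.1, 0.5 });
        System.out.println(Arrays.toString(dense));
    }
}
```

Running `main` prints `[0.1, 0.0, 0.0, 0.0, 0.5]`: two recorded values plus the size are enough to recover all five elements.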

Sparse Vector in Spark

Creating a sparse vector in Spark is straightforward.

Use **org.apache.spark.ml.linalg.SparseVector** from the **spark.ml** package.

A **SparseVector** is initialized with the vector size, an array of indices, and an array of values.

To create [0.1,0.0,0.0,0.0,0.5], pass the size 5, the index array new int[] { 0, 4 }, and the value array new double[] { 0.1, 0.5 }.


SparseVector


// SparseVector(Sparse vector)
int size = 5; // Vector element size (must be larger than the highest index, 4)
int[] indices = new int[] { 0, 4 };
double[] svalues = new double[] { 0.1, 0.5 };
Vector svec = new SparseVector(size, indices, svalues);
System.out.println("SparseVector=" + Arrays.toString(svec.toArray()));

Execution result


SparseVector=[0.1, 0.0, 0.0, 0.0, 0.5]

Vector#toArray expands the vector into a full array, but the SparseVector itself holds only indices (the array of subscripts) and values (the array of values), so memory is saved as long as the expanded array is not needed.
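The following plain-Java sketch illustrates why the expanded array is often unnecessary: a single element can be read directly from the two arrays, and the stored payload depends on the number of recorded elements rather than on the vector size. `SparseLookupSketch` and its methods are hypothetical names for this illustration and do not claim to be how Spark implements it. The byte counts are rough estimates that assume 4 bytes per `int` and 8 bytes per `double` and ignore object and array overhead.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Spark's API.
final class SparseLookupSketch {

    // Returns element i of the vector described by (indices, values).
    // Assumes indices is sorted in ascending order.
    static double get(int[] indices, double[] values, int i) {
        int pos = Arrays.binarySearch(indices, i);
        return pos >= 0 ? values[pos] : 0.0; // not recorded means 0.0
    }

    // Rough payload of the sparse form: one int plus one double per recorded element.
    static long sparsePayloadBytes(int storedElements) {
        return 12L * storedElements;
    }

    // Rough payload of the dense form: one double per element.
    static long densePayloadBytes(int size) {
        return 8L * size;
    }

    public static void main(String[] args) {
        int[] indices = { 0, 4 };
        double[] values = { 0.1, 0.5 };
        System.out.println(get(indices, values, 4)); // recorded element
        System.out.println(get(indices, values, 2)); // implicit zero
        System.out.println(sparsePayloadBytes(2) + " vs " + densePayloadBytes(1_000_000));
    }
}
```

With these inputs `main` prints `0.5`, `0.0`, and `24 vs 8000000`. The last line shows that for a vector of one million elements with only two non-zero values, the sparse form stores a tiny fraction of what the dense form does.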

What is Dense Vector?

A dense vector is the counterpart of a sparse vector. It holds the values of all elements, like an ordinary array.

[0.1,0.0,0.0,0.0,0.5]

Dense Vector in Spark

A **DenseVector** is initialized with an array of values.

To create [0.1,0.0,0.0,0.0,0.5], pass an array that contains every element, new double[] { 0.1, 0.0, 0.0, 0.0, 0.5 }. The result is a dense vector.


DenseVector


// DenseVector(Dense vector)
double[] dvalues = new double[] { 0.1, 0.0, 0.0, 0.0, 0.5 };
Vector dvec = new DenseVector(dvalues);
System.out.println("DenseVector=" + Arrays.toString(dvec.toArray()));

Execution result


DenseVector=[0.1, 0.0, 0.0, 0.0, 0.5]
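To show how the two forms relate, the plain-Java sketch below goes in the opposite direction and derives the index and value arrays from a dense array. `DenseToSparseSketch` and its methods are hypothetical names for this illustration only and are not Spark classes; the sketch assumes that any element equal to 0.0 can be dropped.

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of Spark.
final class DenseToSparseSketch {

    // Returns the positions of the non-zero elements, in ascending order.
    static int[] nonZeroIndices(double[] dense) {
        int count = 0;
        for (double v : dense) {
            if (v != 0.0) {
                count++;
            }
        }
        int[] indices = new int[count];
        int k = 0;
        for (int i = 0; i < dense.length; i++) {
            if (dense[i] != 0.0) {
                indices[k++] = i;
            }
        }
        return indices;
    }

    // Returns the non-zero values, in the same order as nonZeroIndices.
    static double[] nonZeroValues(double[] dense) {
        return Arrays.stream(dense).filter(v -> v != 0.0).toArray();
    }

    public static void main(String[] args) {
        double[] dense = { 0.1, 0.0, 0.0, 0.0, 0.5 };
        System.out.println("indices=" + Arrays.toString(nonZeroIndices(dense)));
        System.out.println("values=" + Arrays.toString(nonZeroValues(dense)));
    }
}
```

For the dense array used in this article, `main` prints `indices=[0, 4]` and `values=[0.1, 0.5]`, which are the arrays passed to the SparseVector constructor together with the size 5.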

Full source code (Java)

The complete example below uses **Apache Spark** from Java.

SparkVectorExamples.java


package org.riversun.spark;

import java.util.Arrays;

import org.apache.spark.ml.linalg.DenseVector;
import org.apache.spark.ml.linalg.SparseVector;
import org.apache.spark.ml.linalg.Vector;

public class SparkVectorExamples {

	public static void main(String[] args) {

		// DenseVector(Dense vector)
		double[] dvalues = new double[] { 0.1, 0.0, 0.0, 0.0, 0.5 };
		Vector dvec = new DenseVector(dvalues);
		System.out.println("DenseVector=" + Arrays.toString(dvec.toArray()));

		// SparseVector(Sparse vector)
		int size = 5; // Vector element size
		int[] indices = new int[] { 0, 4 };
		double[] svalues = new double[] { 0.1, 0.5 };
		Vector svec = new SparseVector(size, indices, svalues);
		System.out.println("SparseVector=" + Arrays.toString(svec.toArray()));
	}
}

Execution result


DenseVector=[0.1, 0.0, 0.0, 0.0, 0.5]
SparseVector=[0.1, 0.0, 0.0, 0.0, 0.5]
