[Introduction to Computer Science Part 1: Let's try machine learning] Let's implement k-means clustering in Java-About the concept of coordinates-

Introduction

Hello, this is Sumiyama water.

This is the continuation of the previous article, [Introduction to Computer Science No. 0: Let's try machine learning] Let's implement k-means clustering in Java.

In the last article, I introduced what clustering ultimately lets you do.

image.png

Today, in Part 1, I will explain and implement the concept of "coordinates", which express the position of a point like these as numerical data [^ 1].

Environment

I described how to set up this environment in a separate article: Notes when starting new development with IntelliJ + Gradle + SpringBoot + JUnit5 (jupiter)

Representing the analysis target numerically

As I mentioned in No. 0, the targets of clustering are numerical data that exist in real life, such as "site access time and staying time" or "credit card usage date and usage amount".

However, when learning the technique of clustering, the concept is easier to grasp if you abstract the data once and treat it as a position on a plane or in a solid, so I will proceed on that basis.

Position of points on a plane (2D)

image.png

Leaving the difficult theory aside, please look at the figure. As shown there, a point placed 0.8 in the horizontal direction and 0.9 in the vertical direction is expressed as [0.8, 0.9]. We will call this the "position".

With this, an arbitrarily placed point can be treated as data with the numerical value [0.8, 0.9].

Position of points in three dimensions (3D)

image.png

Let's extend this a little further.

As shown in the figure, suppose you place a point arbitrarily in a three-dimensional solid and its position is 0.8 horizontally, 0.9 vertically, and 1.2 in height. Then this point can be expressed as [0.8, 0.9, 1.2].

If you want, you can express a position numerically in the same way, for example [0.8, 0.9, 1.2, 1.7, 0.1], whether it is 4D or 10D. However, 3D is the limit of what can be drawn as a figure the human eye can understand, so I won't draw figures for higher dimensions.

Let's implement it

I've been working only with Java lately, so I'll implement it in Java.

Each time an axis is added (2D, 3D, 4D, and so on), the position gains one more number. Since the data is numerical and one point holds several numbers, I think a List of Double is a good fit for the position of one point.
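To make this concrete, here is a minimal sketch of one position held as a List of Double. The class name PointSketch and the helper dimension are hypothetical names used only for illustration; the values are the ones used in the figures above.

```java
import java.util.Arrays;
import java.util.List;

// Sketch: the position of one point as a List<Double>.
// PointSketch and dimension() are hypothetical names for illustration.
class PointSketch {

    // The number of axes (the dimension) is simply the number of values held.
    static int dimension(List<Double> position) {
        return position.size();
    }

    public static void main(String[] args) {
        List<Double> onPlane = Arrays.asList(0.8, 0.9);                // 2D
        List<Double> inSolid = Arrays.asList(0.8, 0.9, 1.2);           // 3D
        List<Double> fiveDim = Arrays.asList(0.8, 0.9, 1.2, 1.7, 0.1); // 5D

        System.out.println(dimension(onPlane)); // 2
        System.out.println(dimension(inSolid)); // 3
        System.out.println(dimension(fiveDim)); // 5
    }
}
```

The same type covers every dimension, which is why no separate class per dimension is needed.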

And, as I wrote in No. 0, in the end we process many points, so there will be multiple Lists of Double. So:

    private final List<List<Double>> points;

If you create a class with a field like this, the class can represent the "data to be analyzed".
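As a rough illustration of what such a field would hold, the sketch below builds a small list of 2D points and reads them back. The names PointsSketch and samplePoints, and the sample values, are assumptions made up for this example.

```java
import java.util.Arrays;
import java.util.List;

// Sketch: several points held together as a List of List<Double>.
// PointsSketch and samplePoints() are hypothetical names for illustration.
class PointsSketch {

    // Three arbitrary points on a plane; each inner list is one position.
    static List<List<Double>> samplePoints() {
        return Arrays.asList(
                Arrays.asList(0.8, 0.9),
                Arrays.asList(0.2, 0.4),
                Arrays.asList(1.5, -2.1));
    }

    public static void main(String[] args) {
        List<List<Double>> points = samplePoints();
        System.out.println("number of points: " + points.size());
        for (List<Double> point : points) {
            // Index 0 is the horizontal value, index 1 the vertical value.
            System.out.println("horizontal=" + point.get(0) + ", vertical=" + point.get(1));
        }
    }
}
```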

Next, let's make it possible to store the data through the constructor.

The requirements for storing data would be roughly as follows:

- A null value cannot be used in calculations, so if a Double value is null, throw an Exception.
- A position itself cannot be null, so if the List of points contains null, throw an Exception.
- If a NullPointerException occurs in the constructor, it is left to the caller to handle.
- Points of different dimensions, such as points on a plane and points in a solid, must not coexist. If 2D data and 3D data are mixed, that is, if Lists of different sizes are mixed, throw an Exception.

Implementing this gives the following:

package net.tan3sugarless.clusteringsample.lib.data;

import lombok.EqualsAndHashCode;
import lombok.Getter;
import lombok.ToString;
import lombok.Value;
import net.tan3sugarless.clusteringsample.exception.DimensionNotUnifiedException;
import net.tan3sugarless.clusteringsample.exception.NullCoordinateException;

import java.util.List;

/**
 * A set of coordinates in Euclidean space
 */
@Getter
@ToString
@EqualsAndHashCode
@Value
public class EuclideanSpace {

    private final List<List<Double>> points;

    /**
     * Sets a list of n-dimensional coordinates.
     *
     * @param points list of n-dimensional coordinates
     * @throws DimensionNotUnifiedException if the coordinates do not all have the same dimension
     * @throws NullCoordinateException if a coordinate value is null
     * @throws NullPointerException if points is null or contains a null element
     */
    public EuclideanSpace(List<List<Double>> points){
        if(points.stream().mapToInt(List::size).distinct().count()>1){
            throw new DimensionNotUnifiedException();
        }
        if(points.stream().anyMatch(point -> point.stream().anyMatch(x -> x == null))){
            throw new NullCoordinateException();
        }

        this.points = points;
    }
}

The exception classes are ones I defined myself.
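The exception classes are not listed in this article, so the following is only a sketch of how the checks behave. It assumes both exceptions are plain unchecked exceptions with no extra fields, and it uses a Lombok-free stand-in named EuclideanSpaceSketch (a hypothetical name) so that everything fits in one file. The helper tryCreate is also made up for this example.

```java
import java.util.Arrays;
import java.util.List;

// Lombok-free stand-in for EuclideanSpace with the same two checks.
class EuclideanSpaceSketch {
    private final List<List<Double>> points;

    EuclideanSpaceSketch(List<List<Double>> points) {
        // A null list or a null point causes a NullPointerException here.
        if (points.stream().mapToInt(List::size).distinct().count() > 1) {
            throw new DimensionNotUnifiedException();
        }
        if (points.stream().anyMatch(p -> p.stream().anyMatch(x -> x == null))) {
            throw new NullCoordinateException();
        }
        this.points = points;
    }

    List<List<Double>> getPoints() {
        return points;
    }

    // Returns "ok", or the simple class name of the exception that was thrown.
    static String tryCreate(List<List<Double>> points) {
        try {
            new EuclideanSpaceSketch(points);
            return "ok";
        } catch (RuntimeException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryCreate(Arrays.asList(
                Arrays.asList(1.2, 0.1), Arrays.asList(0.0, 1.5))));       // ok
        System.out.println(tryCreate(Arrays.asList(
                Arrays.asList(1.5, 2.1, -1.9), Arrays.asList(0.0, 0.0)))); // DimensionNotUnifiedException
        System.out.println(tryCreate(Arrays.asList(
                Arrays.asList(1.5, null), Arrays.asList(0.0, 0.0))));      // NullCoordinateException
        System.out.println(tryCreate(null));                               // NullPointerException
    }
}

// Assumed shape of the custom exceptions: simple unchecked exceptions.
class DimensionNotUnifiedException extends RuntimeException {}

class NullCoordinateException extends RuntimeException {}
```

Note that the dimension check runs first, so data that has both mismatched sizes and a null coordinate is reported as a dimension problem.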

The term "Euclidean" appears out of nowhere, but explaining what kind of space that is would derail the story, so please let it pass for now.

I'll write a test for the constructor as well.

package net.tan3sugarless.clusteringsample.lib.data;

import net.tan3sugarless.clusteringsample.exception.DimensionNotUnifiedException;
import net.tan3sugarless.clusteringsample.exception.NullCoordinateException;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

public class EuclideanSpaceTest {

    // Whole list: null, empty, 1 element, multiple elements
    // Each element (point): contains null, contains an empty list, all empty (0 dimensions), 1 dimension, n dimensions
    // Coordinates within each point: contains null, contains 0, contains no null
    // Dimension check: all the same dimension, different dimensions
    static Stream<Arguments> testConstructorProvider(){
        return Stream.of(
            Arguments.of(null,new NullPointerException()),
            Arguments.of(Collections.emptyList(),null),
            Arguments.of(Arrays.asList(Arrays.asList(1.5,-2.1)),null),
            Arguments.of(Arrays.asList(Arrays.asList(1.2,0.1),Arrays.asList(0.0,1.5)),null),
            Arguments.of(Arrays.asList(null,Arrays.asList(0.0,1.5),Arrays.asList(-0.9,0.1)),new NullPointerException()),
            Arguments.of(Arrays.asList(Arrays.asList(-0.9,0.1),Arrays.asList(0.0,1.5),Collections.emptyList()),new DimensionNotUnifiedException()),
            Arguments.of(Arrays.asList(Collections.emptyList(),Collections.emptyList(),Collections.emptyList()),null),
            Arguments.of(Arrays.asList(Arrays.asList(1.5),Arrays.asList(0.0),Arrays.asList(-2.2)),null),
            Arguments.of(Arrays.asList(Arrays.asList(1.5,2.2,-1.9),Arrays.asList(0.0,0.0,0.0),Arrays.asList(0.9,5.0,2.2)),null),
            Arguments.of(Arrays.asList(Arrays.asList(1.5,null,-1.9),Arrays.asList(0.0,0.0,0.0),Arrays.asList(0.9,5.0,2.2)),new NullCoordinateException()),
            Arguments.of(Arrays.asList(Arrays.asList(1.5,2.1,-1.9),Arrays.asList(0.0,0.0),Arrays.asList(0.9,5.0,2.2)),new DimensionNotUnifiedException()),
            Arguments.of(Arrays.asList(Arrays.asList(2.1,-1.9),Arrays.asList(0.0,0.0,0.0),Arrays.asList(0.9,5.0,2.2)),new DimensionNotUnifiedException())

        );
    }

    @ParameterizedTest
    @MethodSource("testConstructorProvider")
    @DisplayName("Constructor testing")
    void testConstructor(List<List<Double>> points, RuntimeException e){
        if(e==null){
            Assertions.assertDoesNotThrow(()->new EuclideanSpace(points));
        }else{
            Assertions.assertThrows(e.getClass(),()->new EuclideanSpace(points));
        }
    }
}

Now you can represent "a set of points in any dimension" in a Java class.

The code I wrote here can be found on GitHub.

https://github.com/tan3nonsugar/clusteringsample/tree/v0.0.1

Next time, I will explain how to express "how far apart the points are" in Java.

See you next time! Thank you for reading this far.

Next: [Introduction to Computer Science Part 2: Let's try machine learning] Let's implement k-means clustering in Java-distance between data-

[^ 1]: Strictly speaking, the order is the reverse: the numerical data comes first, and plotting it in two-dimensional space is just a way to make it easier to grasp. Please forgive the lack of rigor, as I am prioritizing ease of understanding.
