[Introduction to Computer Science Part 2: Let's try machine learning] Let's implement k-means clustering in Java-Distance between data-

Introduction

Hello, this is Sumiyama water.

Continuing from the previous session. [Introduction to Computer Science Part 1: Let's try machine learning] Let's implement k-means clustering in Java-About the concept of coordinates-

Last time, I tried the concept of coordinates for arranging data on a plane or space, and even expressing it in Java.

This time, I will talk about the concept of distance between data.

environment

I wrote about the preparation of this environment in this article. Notes when starting new development with IntelliJ + Gradle + SpringBoot + JUnit5 (jupiter)

When you think about it again, what is the distance?

As I mentioned in Part 0, what I want to do with clustering is to ** collect close data **, but then I will define that ** data are close to each other **. is needed.

For example, in the case shown in the figure

image.png

It's obvious to the human eye, but since it's a computer that calculates, it's possible to express ** near ** / ** far ** numerically.

Therefore, in the computer, by using the value ** distance **,

――The larger the distance value, the farther --The smaller the distance value, the closer

You will be able to compare numerically like this [^ 1].

Use the Euclidean distance

Actually, there are so many ways to calculate the distance that research can be done by itself, but this time I will use the ** Euclidean distance **, which is easy to calculate and the mechanism is visually easy to understand.

It has a big name, but in short, it is a normal distance ** that can be seen when measuring between two points with a ruler. However, it is not possible to let the computer use a ruler for each data, so it is calculated by calculation.

image.png

Please see the figure. Do you remember the ** Three Square Theorem **? The point is that. The points placed at the coordinates [1,4] and the points placed at the coordinates [5,1] are the horizontal direction (the first element of each coordinate) and the vertical direction (the second element of each coordinate). ) Is the square of the difference, the sum is the square root, and the distance is 5.

By the way, it can be expanded as much as you want because it can be calculated in the same way in 3D and higher dimensions.

The three-dimensional distances [3,1,2] and [1,0,6] are image.png It will be. As the number of dimensions increases, it is OK if you increase the number to be added.

Let's implement

package net.tan3sugarless.clusteringsample.lib.data;

import lombok.EqualsAndHashCode;
import lombok.Getter;
import lombok.ToString;
import lombok.Value;
import net.tan3sugarless.clusteringsample.exception.DimensionNotUnifiedException;
import net.tan3sugarless.clusteringsample.exception.NullCoordinateException;

import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;

/**
 *Coordinate set on Euclidean metric space
 */
@Getter
@ToString
@EqualsAndHashCode
@Value
public class EuclideanSpace {

    private final List<List<Double>> points;
    private final int dimension;

    /**
     *Set a list of n-dimensional coordinates and dimensions
     *
     * @param points :List of n-dimensional coordinates
     * @throws DimensionNotUnifiedException :I set a list where the dimensions of the coordinates are not unified
     * @throws NullCoordinateException      :Null was included in the numerical value of the coordinates
     * @throws NullPointerException         :Passed null data or data containing null elements
     */
    public EuclideanSpace(List<List<Double>> points) {
        if (points.stream().mapToInt(List::size).distinct().count() > 1) {
            throw new DimensionNotUnifiedException();
        }
        if (points.stream().anyMatch(point -> point.stream().anyMatch(Objects::isNull))) {
            throw new NullCoordinateException();
        }

        this.points = points;
        this.dimension = points.stream().findFirst().map(List::size).orElse(0);
    }

    /**
     *Calculate the distance of each coordinate stored in this instance from any coordinate
     *
     * @param target The coordinates of the reference point where you want to get the distance from each coordinate
     * @A list of Euclidean distances from the return target
     * @throws DimensionNotUnifiedException :Target and instance dimensions are different
     * @throws NullCoordinateException      :target contains null
     */
    public List<Double> distanceFromTarget(List<Double> target) {
        if (target.size() != dimension) {
            throw new DimensionNotUnifiedException();
        }

        if (target.stream().anyMatch(Objects::isNull)) {
            throw new NullCoordinateException();
        }

        return points.stream().map(point -> {
            double squareOfDistance = 0.0;
            for (int i = 0; i < target.size(); i++) {
                squareOfDistance += Math.pow(point.get(i) - target.get(i), 2);
            }

            return Math.sqrt(squareOfDistance);
        }).collect(Collectors.toList());
    }
}

We will make it in the same class as Last time.

First, we've modified the constructors and fields a bit to make it easier to check dimensions.

    private final int dimension;

I made it possible to store dimensions in the int field,

        this.dimension = points.stream().findFirst().map(List::size).orElse(0);

It is an addition of logic to calculate the dimension of the acquired data.

And here is the method to actually calculate the distance. In future implementation, we want to calculate "distance to a certain point and all coordinates", so it looks like this.

    /**
     *Calculate the distance of each coordinate stored in this instance from any coordinate
     *
     * @param target The coordinates of the reference point where you want to get the distance from each coordinate
     * @A list of Euclidean distances from the return target
     * @throws DimensionNotUnifiedException :Target and instance dimensions are different
     * @throws NullCoordinateException      :target contains null
     */
    public List<Double> distanceFromTarget(List<Double> target) {
        if (target.size() != dimension) {
            throw new DimensionNotUnifiedException();
        }

        if (target.stream().anyMatch(Objects::isNull)) {
            throw new NullCoordinateException();
        }

        return points.stream().map(point -> {
            double squareOfDistance = 0.0;
            for (int i = 0; i < target.size(); i++) {
                squareOfDistance += Math.pow(point.get(i) - target.get(i), 2);
            }

            return Math.sqrt(squareOfDistance);
        }).collect(Collectors.toList());
    }

Suddenly, Math :: pow or Math :: sqrt appears, but pow is a power and sqrt is a method that takes a square root. Math is a class that you are not familiar with in Web services and business applications, but remember that it is often used in arithmetic calculations.

I want to do something like this when illustrated.

image.png

And test

package net.tan3sugarless.clusteringsample.lib.data;

import net.tan3sugarless.clusteringsample.exception.DimensionNotUnifiedException;
import net.tan3sugarless.clusteringsample.exception.NullCoordinateException;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;

import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

class EuclideanSpaceTest {

    //Whole null,Sky,1 element,Multiple elements
    //Each element contains null,Including the sky,All empty(0 dimension),1D,n dimensions
    //Contains null coordinates within each element,Including 0,Does not include null
    //Dimension check All the same dimension,Different dimensions
    static Stream<Arguments> testConstructorProvider() {
        //@formatter:off
        return Stream.of(
                Arguments.of(null, new NullPointerException(), 0),
                Arguments.of(Collections.emptyList(), null, 0),
                Arguments.of(Arrays.asList(Arrays.asList(1.5, -2.1)), null, 2),
                Arguments.of(Arrays.asList(Arrays.asList(1.2, 0.1), Arrays.asList(0.0, 1.5)), null, 2),
                Arguments.of(Arrays.asList(null, Arrays.asList(0, 1.5), Arrays.asList(-0.9, 0.1)), new NullPointerException(), 0),
                Arguments.of(Arrays.asList(Arrays.asList(-0.9, 0.1), Arrays.asList(0.0, 1.5), Collections.emptyList()), new DimensionNotUnifiedException(), 0),
                Arguments.of(Arrays.asList(Collections.emptyList(), Collections.emptyList(), Collections.emptyList()), null, 0),
                Arguments.of(Arrays.asList(Arrays.asList(1.5), Arrays.asList(0.0), Arrays.asList(-2.2)), null, 1),
                Arguments.of(Arrays.asList(Arrays.asList(1.5, 2.2, -1.9), Arrays.asList(0.0, 0.0, 0.0), Arrays.asList(0.9, 5.0, 2.2)), null, 3),
                Arguments.of(Arrays.asList(Arrays.asList(1.5, null, -1.9), Arrays.asList(0.0, 0.0, 0.0), Arrays.asList(0.9, 5.0, 2.2)), new NullCoordinateException(), 0),
                Arguments.of(Arrays.asList(Arrays.asList(1.5, 2.1, -1.9), Arrays.asList(0.0, 0.0), Arrays.asList(0.9, 5.0, 2.2)), new DimensionNotUnifiedException(), 0),
                Arguments.of(Arrays.asList(Arrays.asList(2.1, -1.9), Arrays.asList(0, 0, 0), Arrays.asList(0.9, 5.0, 2.2)), new DimensionNotUnifiedException(), 0)
        );
        //@formatter:on
    }

    @ParameterizedTest
    @MethodSource("testConstructorProvider")
    @DisplayName("Constructor testing")
    void testConstructor(List<List<Double>> points, RuntimeException e, int dimension) {
        if (e == null) {
            Assertions.assertDoesNotThrow(() -> new EuclideanSpace(points));

            EuclideanSpace actual = new EuclideanSpace(points);
            Assertions.assertEquals(dimension, actual.getDimension());
        } else {
            Assertions.assertThrows(e.getClass(), () -> new EuclideanSpace(points));
        }
    }

    // points :0 cases/1/2 cases,0 dimension/1D/2D, 0/Positive/negative
    // target : null/Sky/1D/2D,Including null/Not included, 0/Positive/negative/Same coordinates
    static Stream<Arguments> testDistanceFromTargetProvider() {
        return Stream.of(
                //@formatter:off
                Arguments.of(Collections.emptyList(), Collections.emptyList(), null, Collections.emptyList()),
                Arguments.of(Collections.emptyList(), Arrays.asList(0.1), new DimensionNotUnifiedException(), Collections.emptyList()),
                Arguments.of(Arrays.asList(Collections.emptyList()), Collections.emptyList(), null, Arrays.asList(0.0)),
                Arguments.of(Arrays.asList(Collections.emptyList()), Arrays.asList(0.1), new DimensionNotUnifiedException(), Collections.emptyList()),
                Arguments.of(Arrays.asList(Arrays.asList(3.0)), Arrays.asList(1.0), null, Arrays.asList(2.0)),
                Arguments.of(Arrays.asList(Arrays.asList(3.0)), Arrays.asList(1.0, 2.0), new DimensionNotUnifiedException(), Collections.emptyList()),
                Arguments.of(Arrays.asList(Arrays.asList(-1.0, 0.0)), Arrays.asList(2.0, -4.0), null, Arrays.asList(5.0)),
                Arguments.of(Arrays.asList(Arrays.asList(-1.0, 0.0)), Arrays.asList(null, -4.0), new NullCoordinateException(), Collections.emptyList()),
                Arguments.of(Arrays.asList(Arrays.asList(-3.0, 0.0), Arrays.asList(0.0, -4.0)), Arrays.asList(0.0, -4.0), null, Arrays.asList(5.0, 0.0))
                //@formatter:on
        );
    }

    @ParameterizedTest
    @MethodSource("testDistanceFromTargetProvider")
    @DisplayName("Distance calculation test")
    void testDistanceFromTarget(List<List<Double>> points, List<Double> target, RuntimeException e, List<Double> distances) {
        EuclideanSpace space = new EuclideanSpace(points);

        if (e == null) {
            Assertions.assertEquals(distances, space.distanceFromTarget(target));
        } else {
            Assertions.assertThrows(e.getClass(), () -> space.distanceFromTarget(target));
        }
    }
}

The version implemented this time is released in this version of GitHub. https://github.com/tan3nonsugar/clusteringsample/releases/tag/v0.0.2

Thank you for reading this far. Next time, I will talk about the idea of "the center of gravity of data". Then ~ Noshi

[^ 1]: Strictly speaking, we should explain dissimilarity / similarity first, or talk about dissatisfaction that meets the mathematical definition of distance, but prioritize getting a sense. I will omit it here.

Recommended Posts

[Introduction to Computer Science Part 2: Let's try machine learning] Let's implement k-means clustering in Java-Distance between data-
[Introduction to Computer Science Part 3: Let's try machine learning] Let's implement k-means clustering in Java-Center of data set-
[Introduction to Computer Science No. 0: Try Machine Learning] Let's implement k-means clustering in Java
[Introduction to Computer Science Part 1: Let's try machine learning] Let's implement k-means clustering in Java-About the concept of coordinates-
Try to implement Yubaba in Ruby
Try to implement Yubaba in Java
Try to implement n-ary addition in Java