Hello, this is Sumiyama water.
Continuing from the previous session. [Introduction to Computer Science Part 1: Let's try machine learning] Let's implement k-means clustering in Java-About the concept of coordinates-
Last time, I tried the concept of coordinates for arranging data on a plane or space, and even expressing it in Java.
This time, I will talk about the concept of distance between data.
I wrote about the preparation of this environment in this article. Notes when starting new development with IntelliJ + Gradle + SpringBoot + JUnit5 (jupiter)
As I mentioned in Part 0, what I want to do with clustering is to ** collect close data **, but then I will define that ** data are close to each other **. is needed.
For example, in the case shown in the figure
It's obvious to the human eye, but since it's a computer that calculates, it's possible to express ** near ** / ** far ** numerically.
Therefore, in the computer, by using the value ** distance **,
――The larger the distance value, the farther --The smaller the distance value, the closer
You will be able to compare numerically like this [^ 1].
Actually, there are so many ways to calculate the distance that research can be done by itself, but this time I will use the ** Euclidean distance **, which is easy to calculate and the mechanism is visually easy to understand.
It has a big name, but in short, it is a normal distance ** that can be seen when measuring between two points with a ruler. However, it is not possible to let the computer use a ruler for each data, so it is calculated by calculation.
Please see the figure. Do you remember the ** Three Square Theorem **? The point is that. The points placed at the coordinates [1,4] and the points placed at the coordinates [5,1] are the horizontal direction (the first element of each coordinate) and the vertical direction (the second element of each coordinate). ) Is the square of the difference, the sum is the square root, and the distance is 5.
By the way, it can be expanded as much as you want because it can be calculated in the same way in 3D and higher dimensions.
The three-dimensional distances [3,1,2] and [1,0,6] are It will be. As the number of dimensions increases, it is OK if you increase the number to be added.
package net.tan3sugarless.clusteringsample.lib.data;
import lombok.EqualsAndHashCode;
import lombok.Getter;
import lombok.ToString;
import lombok.Value;
import net.tan3sugarless.clusteringsample.exception.DimensionNotUnifiedException;
import net.tan3sugarless.clusteringsample.exception.NullCoordinateException;
import java.util.List;
import java.util.Objects;
import java.util.stream.Collectors;
/**
*Coordinate set on Euclidean metric space
*/
@Getter
@ToString
@EqualsAndHashCode
@Value
public class EuclideanSpace {
private final List<List<Double>> points;
private final int dimension;
/**
*Set a list of n-dimensional coordinates and dimensions
*
* @param points :List of n-dimensional coordinates
* @throws DimensionNotUnifiedException :I set a list where the dimensions of the coordinates are not unified
* @throws NullCoordinateException :Null was included in the numerical value of the coordinates
* @throws NullPointerException :Passed null data or data containing null elements
*/
public EuclideanSpace(List<List<Double>> points) {
if (points.stream().mapToInt(List::size).distinct().count() > 1) {
throw new DimensionNotUnifiedException();
}
if (points.stream().anyMatch(point -> point.stream().anyMatch(Objects::isNull))) {
throw new NullCoordinateException();
}
this.points = points;
this.dimension = points.stream().findFirst().map(List::size).orElse(0);
}
/**
*Calculate the distance of each coordinate stored in this instance from any coordinate
*
* @param target The coordinates of the reference point where you want to get the distance from each coordinate
* @A list of Euclidean distances from the return target
* @throws DimensionNotUnifiedException :Target and instance dimensions are different
* @throws NullCoordinateException :target contains null
*/
public List<Double> distanceFromTarget(List<Double> target) {
if (target.size() != dimension) {
throw new DimensionNotUnifiedException();
}
if (target.stream().anyMatch(Objects::isNull)) {
throw new NullCoordinateException();
}
return points.stream().map(point -> {
double squareOfDistance = 0.0;
for (int i = 0; i < target.size(); i++) {
squareOfDistance += Math.pow(point.get(i) - target.get(i), 2);
}
return Math.sqrt(squareOfDistance);
}).collect(Collectors.toList());
}
}
We will make it in the same class as Last time.
First, we've modified the constructors and fields a bit to make it easier to check dimensions.
private final int dimension;
I made it possible to store dimensions in the int field,
this.dimension = points.stream().findFirst().map(List::size).orElse(0);
It is an addition of logic to calculate the dimension of the acquired data.
And here is the method to actually calculate the distance. In future implementation, we want to calculate "distance to a certain point and all coordinates", so it looks like this.
/**
*Calculate the distance of each coordinate stored in this instance from any coordinate
*
* @param target The coordinates of the reference point where you want to get the distance from each coordinate
* @A list of Euclidean distances from the return target
* @throws DimensionNotUnifiedException :Target and instance dimensions are different
* @throws NullCoordinateException :target contains null
*/
public List<Double> distanceFromTarget(List<Double> target) {
if (target.size() != dimension) {
throw new DimensionNotUnifiedException();
}
if (target.stream().anyMatch(Objects::isNull)) {
throw new NullCoordinateException();
}
return points.stream().map(point -> {
double squareOfDistance = 0.0;
for (int i = 0; i < target.size(); i++) {
squareOfDistance += Math.pow(point.get(i) - target.get(i), 2);
}
return Math.sqrt(squareOfDistance);
}).collect(Collectors.toList());
}
Suddenly, Math :: pow
or Math :: sqrt
appears, but pow is a power and sqrt is a method that takes a square root. Math is a class that you are not familiar with in Web services and business applications, but remember that it is often used in arithmetic calculations.
I want to do something like this when illustrated.
And test
package net.tan3sugarless.clusteringsample.lib.data;
import net.tan3sugarless.clusteringsample.exception.DimensionNotUnifiedException;
import net.tan3sugarless.clusteringsample.exception.NullCoordinateException;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;
class EuclideanSpaceTest {
//Whole null,Sky,1 element,Multiple elements
//Each element contains null,Including the sky,All empty(0 dimension),1D,n dimensions
//Contains null coordinates within each element,Including 0,Does not include null
//Dimension check All the same dimension,Different dimensions
static Stream<Arguments> testConstructorProvider() {
//@formatter:off
return Stream.of(
Arguments.of(null, new NullPointerException(), 0),
Arguments.of(Collections.emptyList(), null, 0),
Arguments.of(Arrays.asList(Arrays.asList(1.5, -2.1)), null, 2),
Arguments.of(Arrays.asList(Arrays.asList(1.2, 0.1), Arrays.asList(0.0, 1.5)), null, 2),
Arguments.of(Arrays.asList(null, Arrays.asList(0, 1.5), Arrays.asList(-0.9, 0.1)), new NullPointerException(), 0),
Arguments.of(Arrays.asList(Arrays.asList(-0.9, 0.1), Arrays.asList(0.0, 1.5), Collections.emptyList()), new DimensionNotUnifiedException(), 0),
Arguments.of(Arrays.asList(Collections.emptyList(), Collections.emptyList(), Collections.emptyList()), null, 0),
Arguments.of(Arrays.asList(Arrays.asList(1.5), Arrays.asList(0.0), Arrays.asList(-2.2)), null, 1),
Arguments.of(Arrays.asList(Arrays.asList(1.5, 2.2, -1.9), Arrays.asList(0.0, 0.0, 0.0), Arrays.asList(0.9, 5.0, 2.2)), null, 3),
Arguments.of(Arrays.asList(Arrays.asList(1.5, null, -1.9), Arrays.asList(0.0, 0.0, 0.0), Arrays.asList(0.9, 5.0, 2.2)), new NullCoordinateException(), 0),
Arguments.of(Arrays.asList(Arrays.asList(1.5, 2.1, -1.9), Arrays.asList(0.0, 0.0), Arrays.asList(0.9, 5.0, 2.2)), new DimensionNotUnifiedException(), 0),
Arguments.of(Arrays.asList(Arrays.asList(2.1, -1.9), Arrays.asList(0, 0, 0), Arrays.asList(0.9, 5.0, 2.2)), new DimensionNotUnifiedException(), 0)
);
//@formatter:on
}
@ParameterizedTest
@MethodSource("testConstructorProvider")
@DisplayName("Constructor testing")
void testConstructor(List<List<Double>> points, RuntimeException e, int dimension) {
if (e == null) {
Assertions.assertDoesNotThrow(() -> new EuclideanSpace(points));
EuclideanSpace actual = new EuclideanSpace(points);
Assertions.assertEquals(dimension, actual.getDimension());
} else {
Assertions.assertThrows(e.getClass(), () -> new EuclideanSpace(points));
}
}
// points :0 cases/1/2 cases,0 dimension/1D/2D, 0/Positive/negative
// target : null/Sky/1D/2D,Including null/Not included, 0/Positive/negative/Same coordinates
static Stream<Arguments> testDistanceFromTargetProvider() {
return Stream.of(
//@formatter:off
Arguments.of(Collections.emptyList(), Collections.emptyList(), null, Collections.emptyList()),
Arguments.of(Collections.emptyList(), Arrays.asList(0.1), new DimensionNotUnifiedException(), Collections.emptyList()),
Arguments.of(Arrays.asList(Collections.emptyList()), Collections.emptyList(), null, Arrays.asList(0.0)),
Arguments.of(Arrays.asList(Collections.emptyList()), Arrays.asList(0.1), new DimensionNotUnifiedException(), Collections.emptyList()),
Arguments.of(Arrays.asList(Arrays.asList(3.0)), Arrays.asList(1.0), null, Arrays.asList(2.0)),
Arguments.of(Arrays.asList(Arrays.asList(3.0)), Arrays.asList(1.0, 2.0), new DimensionNotUnifiedException(), Collections.emptyList()),
Arguments.of(Arrays.asList(Arrays.asList(-1.0, 0.0)), Arrays.asList(2.0, -4.0), null, Arrays.asList(5.0)),
Arguments.of(Arrays.asList(Arrays.asList(-1.0, 0.0)), Arrays.asList(null, -4.0), new NullCoordinateException(), Collections.emptyList()),
Arguments.of(Arrays.asList(Arrays.asList(-3.0, 0.0), Arrays.asList(0.0, -4.0)), Arrays.asList(0.0, -4.0), null, Arrays.asList(5.0, 0.0))
//@formatter:on
);
}
@ParameterizedTest
@MethodSource("testDistanceFromTargetProvider")
@DisplayName("Distance calculation test")
void testDistanceFromTarget(List<List<Double>> points, List<Double> target, RuntimeException e, List<Double> distances) {
EuclideanSpace space = new EuclideanSpace(points);
if (e == null) {
Assertions.assertEquals(distances, space.distanceFromTarget(target));
} else {
Assertions.assertThrows(e.getClass(), () -> space.distanceFromTarget(target));
}
}
}
The version implemented this time is released in this version of GitHub. https://github.com/tan3nonsugar/clusteringsample/releases/tag/v0.0.2
Thank you for reading this far. Next time, I will talk about the idea of "the center of gravity of data". Then ~ Noshi
[^ 1]: Strictly speaking, we should explain dissimilarity / similarity first, or talk about dissatisfaction that meets the mathematical definition of distance, but prioritize getting a sense. I will omit it here.
Recommended Posts