Hello, this is Sumiyama water.
This is the continuation of the previous article, [Introduction to Computer Science No. 0: Let's try machine learning] Let's implement k-means clustering in Java.
In the last article, I introduced what you can ultimately do with clustering.
Today, in Part 1, I will explain and implement the concept of "coordinates", which express the position of a point as numerical data [^1].
I covered the environment setup in a separate article: Notes when starting new development with IntelliJ + Gradle + SpringBoot + JUnit5 (jupiter)
As I mentioned in No. 0, the targets of clustering are numerical data that exist in real life, such as "site access time and staying time" or "credit card usage date and usage amount".
However, when learning the technique of clustering, the concept is easier to grasp if we abstract the data once and treat it as a position on a plane or in a solid, so I will proceed on that basis.
Setting the difficult talk aside, please look at the figure. As shown there, a point placed 0.8 in the horizontal direction and 0.9 in the vertical direction is expressed as [0.8, 0.9]. We will call this the **position**.
Now this arbitrarily placed point can be treated as data with the numerical value [0.8, 0.9].
Let's extend this a little further.
As shown in the figure, suppose you place a point arbitrarily in a three-dimensional solid and its position is 0.8 horizontally, 0.9 vertically, and 1.2 in height. Then this point can be expressed as [0.8, 0.9, 1.2].
If you feel like it, you can express a position numerically in the same way whether it is 4D or 10D, for example [0.8, 0.9, 1.2, 1.7, 0.1] in 5D. However, 3D is the limit of what can be drawn as a figure the human eye can understand, so I won't draw figures beyond that.
I've mostly been working with Java lately, so I'll implement this in Java.
The number of axes grows each time we move to 2D, 3D, 4D, and so on.
Since each value is a number and a position holds several of them, I think a List of Double is a good fit for the position of one point.
And, as I wrote in No. 0, we ultimately process multiple points, so there will be multiple Double Lists. So:
```java
private final List<List<Double>> points;
```
A class with a field like this can represent the "data to be analyzed".
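To make the shape of this data concrete, here is a minimal sketch of what such a nested list might look like for three 2D points. The class name `PointsExample` and the second and third points are made up for illustration; only [0.8, 0.9] comes from the figure above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class, not part of the article's repository.
class PointsExample {

    // Builds three 2D positions; the first is the [0.8, 0.9] point from the figure,
    // the other two are arbitrary sample values.
    static List<List<Double>> samplePoints() {
        return Arrays.asList(
                Arrays.asList(0.8, 0.9),
                Arrays.asList(0.2, 0.4),
                Arrays.asList(1.1, 0.3));
    }

    public static void main(String[] args) {
        List<List<Double>> points = samplePoints();
        System.out.println(points.size());        // number of points
        System.out.println(points.get(0).size()); // number of axes (the dimension)
        System.out.println(points.get(0));        // position of the first point
    }
}
```

The outer list is the set of points and each inner list is one position, so the dimension is simply the length of an inner list.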
Next, let's make the constructor store the data.
The requirements for storing data would be something like this:
- A null coordinate value cannot be processed, so throw an exception if any Double is null
- A position itself cannot be null, so throw an exception if the List contains null
Implementing this gives:
```java
package net.tan3sugarless.clusteringsample.lib.data;

import lombok.EqualsAndHashCode;
import lombok.Getter;
import lombok.ToString;
import lombok.Value;
import net.tan3sugarless.clusteringsample.exception.DimensionNotUnifiedException;
import net.tan3sugarless.clusteringsample.exception.NullCoordinateException;

import java.util.List;

/**
 * Coordinate set on Euclidean metric space
 */
@Getter
@ToString
@EqualsAndHashCode
@Value
public class EuclideanSpace {

    private final List<List<Double>> points;

    /**
     * Set a list of n-dimensional coordinates
     *
     * @param points list of n-dimensional coordinates
     * @throws DimensionNotUnifiedException if the coordinates in the list do not all have the same dimension
     * @throws NullCoordinateException      if a coordinate value is null
     * @throws NullPointerException         if the list itself is null or contains a null element
     */
    public EuclideanSpace(List<List<Double>> points) {
        // All points must have the same number of coordinates.
        if (points.stream().mapToInt(List::size).distinct().count() > 1) {
            throw new DimensionNotUnifiedException();
        }
        // No coordinate value may be null.
        if (points.stream().anyMatch(point -> point.stream().anyMatch(x -> x == null))) {
            throw new NullCoordinateException();
        }
        this.points = points;
    }
}
```
The exceptions are ones I defined myself.
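The exception classes themselves are not shown in this article. As a rough sketch, and assuming they are plain unchecked exceptions with no-argument constructors (which is how the constructor and the test use them), they might look like the following. The messages are my own placeholders, and the package declaration and `public` modifiers are left out so the block stands alone.

```java
// Assumption: an unchecked exception with a no-argument constructor.
// The message text is a placeholder, not taken from the repository.
class DimensionNotUnifiedException extends RuntimeException {
    DimensionNotUnifiedException() {
        super("All coordinates must have the same number of dimensions");
    }
}

// Assumption: same shape as above, used when a coordinate value is null.
class NullCoordinateException extends RuntimeException {
    NullCoordinateException() {
        super("Coordinate values must not be null");
    }
}
```

Extending `RuntimeException` keeps the constructor signature free of a `throws` clause.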
The term "Euclidean" appears out of nowhere, but explaining what kind of space this is would derail the story, so please let it pass for now.
So I'll write a constructor test as well.
```java
package net.tan3sugarless.clusteringsample.lib.data;

import net.tan3sugarless.clusteringsample.exception.DimensionNotUnifiedException;
import net.tan3sugarless.clusteringsample.exception.NullCoordinateException;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.params.ParameterizedTest;
import org.junit.jupiter.params.provider.Arguments;
import org.junit.jupiter.params.provider.MethodSource;

import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

public class EuclideanSpaceTest {

    // Whole list: null, empty, 1 element, multiple elements
    // Each element: contains null, contains an empty list, all empty (0 dimensions), 1D, n dimensions
    // Coordinates within each element: contains null, contains 0, does not contain null
    // Dimension check: all the same dimension, different dimensions
    static Stream<Arguments> testConstructorProvider() {
        return Stream.of(
                Arguments.of(null, new NullPointerException()),
                Arguments.of(Collections.emptyList(), null),
                Arguments.of(Arrays.asList(Arrays.asList(1.5, -2.1)), null),
                Arguments.of(Arrays.asList(Arrays.asList(1.2, 0.1), Arrays.asList(0.0, 1.5)), null),
                Arguments.of(Arrays.asList(null, Arrays.asList(0.0, 1.5), Arrays.asList(-0.9, 0.1)), new NullPointerException()),
                Arguments.of(Arrays.asList(Arrays.asList(-0.9, 0.1), Arrays.asList(0.0, 1.5), Collections.emptyList()), new DimensionNotUnifiedException()),
                Arguments.of(Arrays.asList(Collections.emptyList(), Collections.emptyList(), Collections.emptyList()), null),
                Arguments.of(Arrays.asList(Arrays.asList(1.5), Arrays.asList(0.0), Arrays.asList(-2.2)), null),
                Arguments.of(Arrays.asList(Arrays.asList(1.5, 2.2, -1.9), Arrays.asList(0.0, 0.0, 0.0), Arrays.asList(0.9, 5.0, 2.2)), null),
                Arguments.of(Arrays.asList(Arrays.asList(1.5, null, -1.9), Arrays.asList(0.0, 0.0, 0.0), Arrays.asList(0.9, 5.0, 2.2)), new NullCoordinateException()),
                Arguments.of(Arrays.asList(Arrays.asList(1.5, 2.1, -1.9), Arrays.asList(0.0, 0.0), Arrays.asList(0.9, 5.0, 2.2)), new DimensionNotUnifiedException()),
                Arguments.of(Arrays.asList(Arrays.asList(2.1, -1.9), Arrays.asList(0.0, 0.0, 0.0), Arrays.asList(0.9, 5.0, 2.2)), new DimensionNotUnifiedException())
        );
    }

    @ParameterizedTest
    @MethodSource("testConstructorProvider")
    @DisplayName("Constructor testing")
    void testConstructor(List<List<Double>> points, RuntimeException e) {
        if (e == null) {
            // No exception expected: construction must succeed.
            Assertions.assertDoesNotThrow(() -> new EuclideanSpace(points));
        } else {
            // The constructor must throw the same type as the expected exception.
            Assertions.assertThrows(e.getClass(), () -> new EuclideanSpace(points));
        }
    }
}
```
Now you can represent "a set of points in any dimension" in a Java class.
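If you want to see the validation rules in action without Lombok or JUnit, a small self-contained sketch like the one below may help. Everything here (`SpaceDemo`, the nested `Space` class, the `tryCreate` helper, and the nested exception stand-ins) is hypothetical and only mirrors the two checks in the constructor above; it is not the code in the repository.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

// Hypothetical demo, not part of the article's repository.
class SpaceDemo {

    // Stand-ins for the article's exceptions so this block compiles on its own.
    static class DimensionNotUnifiedException extends RuntimeException {}
    static class NullCoordinateException extends RuntimeException {}

    // Lombok-free stand-in that applies the same two checks as EuclideanSpace.
    static class Space {
        private final List<List<Double>> points;

        Space(List<List<Double>> points) {
            // More than one distinct size means the dimensions are not unified.
            if (points.stream().mapToInt(List::size).distinct().count() > 1) {
                throw new DimensionNotUnifiedException();
            }
            // Any null coordinate value is rejected.
            if (points.stream().anyMatch(p -> p.stream().anyMatch(Objects::isNull))) {
                throw new NullCoordinateException();
            }
            this.points = points;
        }

        List<List<Double>> getPoints() {
            return points;
        }
    }

    // Returns "ok" on success, otherwise the simple name of the thrown exception.
    static String tryCreate(List<List<Double>> points) {
        try {
            new Space(points);
            return "ok";
        } catch (RuntimeException e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        // prints "ok"
        System.out.println(tryCreate(Arrays.asList(Arrays.asList(0.8, 0.9), Arrays.asList(0.2, 0.4))));
        // prints "DimensionNotUnifiedException"
        System.out.println(tryCreate(Arrays.asList(Arrays.asList(0.8, 0.9), Arrays.asList(0.2))));
        // prints "NullCoordinateException"
        System.out.println(tryCreate(Arrays.asList(Arrays.asList(0.8, null), Arrays.asList(0.2, 0.4))));
        // prints "NullPointerException"
        System.out.println(tryCreate(null));
    }
}
```

Note that the dimension check runs first, so a list that has both mismatched dimensions and a null coordinate reports the dimension problem.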
The code I wrote here can be found on GitHub.
https://github.com/tan3nonsugar/clusteringsample/tree/v0.0.1
Next time, I will explain how to express "how far apart the points are" in Java.
See you next time! Thank you for reading this far.
[^1]: Strictly speaking, the order is reversed (the numerical data comes first and is merely plotted in a two-dimensional space to make it easy to understand), but please forgive the lack of rigor, as I am prioritizing comprehensibility.