Use DataFrame in Java

Are you familiar with DataFrame?

DataFrame is a very useful data structure, provided by libraries, that is often used when working with data in Python or R, or when doing machine learning. It offers a wide range of functions for handling tabular data and 2D array data. https://amalog.hateblo.jp/entry/kaggle-pandas-tips

As for how convenient it is: it lets you read data from Excel or CSV in one go, extract arbitrary rows and columns from a 2D array, and join tables much as you would in SQL. It is so indispensable for working with 2D data that it is hard to get by without it.

In Java, however, there is no standard equivalent of DataFrame, so I cannot enjoy those benefits, data processing takes longer, and I end up grumbling (?) that this is why everyone uses Python.

Morpheus data science framework

However, there is a ray of hope. The Morpheus data science framework provides (probably) functionality equivalent to DataFrame. https://github.com/zavtech/morpheus-core

Let's try it by following "A Simple Example" in the project's README, which describes the example as follows.

Consider a dataset of motor vehicle characteristics accessible here. The code below loads this CSV data into a Morpheus DataFrame, filters the rows to only include those vehicles that have a power to weight ratio > 0.1 (where weight is converted into kilograms), then adds a column to record the relative efficiency between highway and city mileage (MPG), sorts the rows by this newly added column in descending order, and finally records this transformed result to a CSV file.

In short, the sample uses car characteristic data. We load the CSV into a Morpheus DataFrame, keep only the rows whose power-to-weight ratio (horsepower per kilogram) is greater than 0.1, and add a column for the highway / city MPG ratio. We then sort the rows by the added column in descending order and write the result to a CSV file.
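To make the filter condition and the added column concrete, here is a minimal plain-Java sketch of the two calculations for a single row. It does not use Morpheus at all. The class name CarMetrics, its method names, and the sample values are hypothetical and are not taken from cars93.csv; only the 0.453592 pounds-to-kilograms factor and the 0.1 threshold come from the example.

```java
public class CarMetrics {

    // Pounds-to-kilograms factor, the same constant the Morpheus example uses.
    static final double LB_TO_KG = 0.453592d;

    // Power-to-weight ratio: horsepower divided by weight in kilograms.
    static double powerToWeight(double horsepower, double weightLb) {
        return horsepower / (weightLb * LB_TO_KG);
    }

    // Relative efficiency: highway MPG divided by city MPG.
    static double mpgRatio(double highwayMpg, double cityMpg) {
        return highwayMpg / cityMpg;
    }

    public static void main(String[] args) {
        // Hypothetical car: 200 hp, 3000 lb, 20 MPG city, 30 MPG highway.
        double ratio = powerToWeight(200, 3000);
        System.out.println("keep row? " + (ratio > 0.1d)); // prints "keep row? true"
        System.out.println("MPG ratio: " + mpgRatio(30, 20)); // prints "MPG ratio: 1.5"
    }
}
```

A 3000 lb car weighs about 1361 kg, so 200 hp gives a ratio of roughly 0.147 and the row passes the filter.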

Sample code and execution result

import com.zavtech.morpheus.frame.*;

public class MorpheusTester {

    public static void main(String[] args) {

        // Load the CSV, skipping the first column of the file
        DataFrame.read().csv(options -> {
            options.setResource("http://zavtech.com/data/samples/cars93.csv");
            options.setExcludeColumnIndexes(0);
        // Keep only rows whose power-to-weight ratio (hp per kg) exceeds 0.1
        }).rows().select(row -> {
            double weightKG = row.getDouble("Weight") * 0.453592d;
            double horsepower = row.getDouble("Horsepower");
            return horsepower / weightKG > 0.1d;
        // Add a column holding the highway / city MPG ratio
        }).cols().add("MPG(Highway/City)", Double.class, v -> {
            double cityMpg = v.row().getDouble("MPG.city");
            double highwayMpg = v.row().getDouble("MPG.highway");
            return highwayMpg / cityMpg;
        // Sort by the new column in descending order, then write the result to CSV
        }).rows().sort(false, "MPG(Highway/City)").write().csv(options -> {
            options.setFile("./cars93m.csv");
            options.setTitle("DataFrame");
        });

    }
}

Since each operation hands back a DataFrame, you can call csv(), select(), add(), and sort() one after another in a single method chain. The csv() reading and writing in particular feels very DataFrame-like.
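For comparison, the same filter, derive, and sort steps can be sketched with the standard Java Stream API on an in-memory list. This is only an illustration of what the chain does, not how Morpheus works internally. The Car class, the transform method, and the sample rows are all hypothetical, and there is no CSV reading or writing here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class StreamPipelineSketch {

    // Hypothetical row type; the fields mirror the cars93.csv columns used above.
    static final class Car {
        final String model;
        final double weightLb;
        final double horsepower;
        final double mpgCity;
        final double mpgHighway;

        Car(String model, double weightLb, double horsepower,
            double mpgCity, double mpgHighway) {
            this.model = model;
            this.weightLb = weightLb;
            this.horsepower = horsepower;
            this.mpgCity = mpgCity;
            this.mpgHighway = mpgHighway;
        }

        // Derived value, analogous to the added "MPG(Highway/City)" column.
        double mpgRatio() {
            return mpgHighway / mpgCity;
        }
    }

    // Filter by power-to-weight ratio, then sort by the MPG ratio, highest first.
    static List<Car> transform(List<Car> cars) {
        return cars.stream()
                .filter(c -> c.horsepower / (c.weightLb * 0.453592d) > 0.1d)
                .sorted((a, b) -> Double.compare(b.mpgRatio(), a.mpgRatio()))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample rows, not taken from the real dataset.
        List<Car> cars = new ArrayList<>();
        cars.add(new Car("A", 3000, 200, 20, 30)); // passes the filter, MPG ratio 1.5
        cars.add(new Car("B", 4000, 100, 18, 25)); // about 0.055 hp/kg, filtered out
        cars.add(new Car("C", 2500, 150, 20, 36)); // passes the filter, MPG ratio 1.8
        for (Car c : transform(cars)) {
            System.out.println(c.model + " " + c.mpgRatio());
        }
        // prints "C 1.8" then "A 1.5"
    }
}
```

The stream version needs a hand-written row class and leaves file handling to you, whereas the Morpheus chain works directly on named columns and handles CSV input and output itself.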

Data before processing
Data after processing

DataFrame is feature-rich, so it is not easy to verify that Morpheus is fully equivalent, but I was able to confirm operations such as row filtering, column addition, and sorting.

If you need to process data in Java, it is worth considering.