Use DataFrame in Java

Are you familiar with DataFrame?

DataFrame is a very useful data structure that is widely used when working with data in Python or R, and in machine learning work. It offers a wide range of functions for handling tabular and 2D array data. https://amalog.hateblo.jp/entry/kaggle-pandas-tips

It is so convenient that it is hard to do without when handling 2D data: reading data from Excel or CSV in one call, extracting arbitrary rows and columns from a 2D array, joining tables the way SQL does, and so on.

Java, however, has no built-in equivalent of DataFrame. So I get none of those benefits, data processing takes longer, and I end up grumbling (?) about why it always has to be Python.

Morpheus data science framework

However, there is a ray of light. The Morpheus data science framework provides functionality that is (probably) equivalent to DataFrame. https://github.com/zavtech/morpheus-core

Let's try it by following the project's "A Simple Example".

Consider a dataset of motor vehicle characteristics accessible here. The code below loads this CSV data into a Morpheus DataFrame, filters the rows to only include those vehicles that have a power to weight ratio > 0.1 (where weight is converted into kilograms), then adds a column to record the relative efficiency between highway and city mileage (MPG), sorts the rows by this newly added column in descending order, and finally records this transformed result to a CSV file.

In short, the sample uses a dataset of car characteristics. It loads the CSV into a Morpheus DataFrame, keeps only the rows whose power-to-weight ratio (horsepower per kilogram) is greater than 0.1, and adds a column holding the ratio of highway MPG to city MPG. It then sorts the rows by the added column in descending order and writes the result to a CSV file.
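To make the filter condition and the added column concrete, here is a minimal sketch in plain Java with no Morpheus dependency. The class name, method names, and the example car (200 hp, 3000 lb, 20/30 MPG) are hypothetical and chosen only for illustration; only the 0.453592 conversion factor and the 0.1 threshold come from the sample.

```java
public class PowerToWeight {

    // Pounds-to-kilograms factor, the same constant the sample code uses.
    static final double KG_PER_POUND = 0.453592d;

    // Horsepower per kilogram for a car whose weight is given in pounds.
    static double powerToWeight(double horsepower, double weightPounds) {
        return horsepower / (weightPounds * KG_PER_POUND);
    }

    // Highway MPG relative to city MPG, i.e. the value of the added column.
    static double mpgRatio(double cityMpg, double highwayMpg) {
        return highwayMpg / cityMpg;
    }

    public static void main(String[] args) {
        // Hypothetical car, not taken from the cars93 dataset.
        double ratio = powerToWeight(200.0, 3000.0);
        System.out.printf("power-to-weight = %.4f hp/kg%n", ratio);
        System.out.println("kept by filter = " + (ratio > 0.1d));
        System.out.printf("MPG(Highway/City) = %.2f%n", mpgRatio(20.0, 30.0));
    }
}
```

A car like this one passes the filter because 3000 lb is roughly 1361 kg, which puts its ratio well above 0.1.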

Sample code and execution result

import com.zavtech.morpheus.frame.DataFrame;

public class MorpheusTester {

    public static void main(String[] args) {

        // Load the CSV, skipping its first column
        DataFrame.read().csv(options -> {
            options.setResource("http://zavtech.com/data/samples/cars93.csv");
            options.setExcludeColumnIndexes(0);
        // Keep only rows with a power-to-weight ratio above 0.1 (weight converted to kg)
        }).rows().select(row -> {
            double weightKG = row.getDouble("Weight") * 0.453592d;
            double horsepower = row.getDouble("Horsepower");
            return horsepower / weightKG > 0.1d;
        // Add a column holding highway MPG divided by city MPG
        }).cols().add("MPG(Highway/City)", Double.class, v -> {
            double cityMpg = v.row().getDouble("MPG.city");
            double highwayMpg = v.row().getDouble("MPG.highway");
            return highwayMpg / cityMpg;
        // Sort by the new column in descending order, then write the result to CSV
        }).rows().sort(false, "MPG(Highway/City)").write().csv(options -> {
            options.setFile("./cars93m.csv");
            options.setTitle("DataFrame");
        });

    }
}

Because each operation hands back a DataFrame, you can call csv(), select(), add(), and sort() one after another in a single method chain. The csv() part in particular feels very DataFrame-like.

DataFrame is feature-rich, so it is not easy to verify that Morpheus is truly equivalent, but I was able to confirm operations such as row extraction, column addition, and sorting.

If you need to process data in Java, it may be worth considering.
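The chaining pattern itself can be sketched without Morpheus. The `MiniFrame` class below is a hypothetical stand-in, not part of the library: each operation returns a new `MiniFrame`, which is what lets the next call follow directly. It is only a rough illustration of the idea, not how Morpheus is implemented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.function.DoublePredicate;

public class ChainSketch {

    // Hypothetical immutable wrapper around a list of numbers (not a Morpheus type).
    static final class MiniFrame {
        private final List<Double> values;

        MiniFrame(List<Double> values) {
            this.values = Collections.unmodifiableList(new ArrayList<>(values));
        }

        // Keeps matching values and returns a new MiniFrame so calls can be chained.
        MiniFrame select(DoublePredicate keep) {
            List<Double> out = new ArrayList<>();
            for (double v : values) {
                if (keep.test(v)) {
                    out.add(v);
                }
            }
            return new MiniFrame(out);
        }

        // Returns a sorted copy; false means descending, mirroring sort(false, ...) in the sample.
        MiniFrame sort(boolean ascending) {
            Comparator<Double> order = Comparator.naturalOrder();
            if (!ascending) {
                order = order.reversed();
            }
            List<Double> out = new ArrayList<>(values);
            out.sort(order);
            return new MiniFrame(out);
        }

        List<Double> values() {
            return values;
        }
    }

    // Filters and sorts a few made-up ratios in one chain.
    static List<Double> demo() {
        return new MiniFrame(Arrays.asList(1.2, 1.6, 1.1, 1.4))
                .select(v -> v > 1.15)
                .sort(false)
                .values();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [1.6, 1.4, 1.2]
    }
}
```

Returning a fresh object from every step keeps each intermediate result unchanged, at the cost of some extra copying.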
