I work as an outsourced data analyst for client companies. The other day, a customer asked me to evaluate Paxata ahead of a possible adoption, so I had the chance to try it out. Paxata is a data preparation tool acquired by DataRobot in 2019 [^1]. There are two ways to use it: as a subscription, or by having it installed in your own Azure/AWS VM; this time it was the latter.
These are just my impressions. Whether each point counts as an advantage or a disadvantage depends on the time and the use case.
- Even though it is non-coding, some ability to think like a programmer is still required
  - Non-coding tools are not magic
  - The developer still has to design how the processing parts combine and how the data flows through them, so the hurdle is high for people who cannot program at all
- Visibility is high, and it is easy to see what kind of processing is being done
  - If you have been writing detailed design documents, they become unnecessary
  - There is a preview of processing results and automatic name matching for strings
    - For a replacement step, the preview lets you compare the data before and after the replacement
    - Strings that differ only in notation (for example, variant spellings of "Co., Ltd.") are matched automatically
- You cannot do very complicated processing
  - Steps run serially and cannot be nested, branched, or looped
- No matter who builds it, the result comes out at the same level (~~SIers seem to like that~~)
- The created process cannot be exported to Python
  - Vendor lock-in
  - Currently, only DataRobot is supported for machine-learning integration
  - To use the processed data with scikit-learn, you first have to export it to a database or a file
- It is hard to build review and deployment processes around it
  - There is no concept of separate development and production environments to deploy between, so during maintenance you touch the live production definitions directly
  - Review is difficult because you cannot open pull requests or view diffs as you can with git
  - There is no testing feature like pytest or JUnit
  - There is a version control feature, so you can at least revert to a previous version
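As noted above, data prepared in Paxata has to be exported (to a file or a database) before scikit-learn can use it. A minimal sketch of that hand-off is below; the file name, column names, and model choice are all invented for illustration, and a tiny inline DataFrame stands in for the exported file so the snippet runs on its own.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# In practice this would be pd.read_csv("paxata_output.csv") on a file
# exported from the Paxata library; a toy stand-in is used here instead.
df = pd.DataFrame({
    "feature_a": [0.1, 0.4, 0.9, 0.2, 0.8, 0.7],
    "feature_b": [1, 0, 1, 0, 1, 1],
    "target":    [0, 0, 1, 0, 1, 1],
})

X = df.drop(columns=["target"])  # features prepared in the tool
y = df["target"]                 # label column (name is an assumption)

model = LogisticRegression().fit(X, y)
print(model.score(X, y))  # training accuracy on the toy data
```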
Paxata consists of three components:
| # | Component | Description |
|---|---|---|
| 1 | Library | Manages datasets (project output is also managed here) |
| 2 | Project | Defines the data processing |
| 3 | Project Flow | Defines the processing order of projects and the execution schedule |
When developing, you work through these components in order: import data into the library, build the processing in a project, then define a project flow. That is the general flow.
Importing a CSV file as a test looks like this. The data was borrowed from here.
A feature called "Profile" gives you basic statistics and category information for each column.
Profile results are also managed in the library.
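Outside a dedicated tool, a similar per-column summary (basic statistics plus category counts) can be produced with pandas; the sample data below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [120, 300, 150, 300, 90],
    "category": ["A", "B", "A", "A", "C"],
})

print(df["price"].describe())         # basic statistics for a numeric column
print(df["category"].value_counts())  # category frequencies, like Profile's category view
```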
Let's create a project with the imported data.
If you change a column's data type or perform a string replacement, you get a preview of the result like this.
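For comparison, the same kind of type conversion and replacement done in code (with no built-in preview) might look like this in pandas; the column name and data are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"amount": ["1,000", "2,500", "300"]})

# Replace the thousands separator, then convert the column's data type.
df["amount"] = df["amount"].str.replace(",", "", regex=False).astype(int)
print(df["amount"].tolist())  # [1000, 2500, 300]
```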
You can also create a new column with Excel-like functions using a tool called "Calculation".
The syntax checking was fairly strict.
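As a point of reference, the row-by-row computed column that "Calculation" produces corresponds to a vectorized expression in pandas; the columns here are invented.

```python
import pandas as pd

df = pd.DataFrame({"unit_price": [100, 250], "quantity": [3, 2]})

# New column derived from existing ones, like an Excel formula applied per row.
df["total"] = df["unit_price"] * df["quantity"]
print(df["total"].tolist())  # [300, 500]
```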
You can also aggregate with a tool called "Aggregate". However, this kind of aggregation adds the result as a new column on every row, as in count encoding.
For ordinary (?) aggregation that collapses rows, use a tool called "Shape".
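The distinction between the two tools maps cleanly onto pandas: `groupby(...).transform` attaches the aggregate to every row (the "Aggregate"/count-encoding style), while `groupby(...).sum` collapses to one row per group (the "Shape" style). A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "shop":  ["A", "B", "A", "A", "B"],
    "sales": [10, 20, 30, 40, 50],
})

# "Aggregate"-style: attach the group count to every row (count encoding).
df["shop_count"] = df.groupby("shop")["shop"].transform("count")

# "Shape"-style: collapse to one row per group.
summary = df.groupby("shop", as_index=False)["sales"].sum()

print(df["shop_count"].tolist())  # [3, 2, 3, 3, 2]
print(summary["sales"].tolist())  # [80, 70]
```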
Let's schedule the created project. Besides fixed time intervals, you can also specify the schedule in crontab format.
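For reference, a crontab-format schedule is five space-separated fields; the expression below is a made-up example, not one taken from the Paxata screens.

```
# ┌ minute (0-59)
# │ ┌ hour (0-23)
# │ │ ┌ day of month (1-31)
# │ │ │ ┌ month (1-12)
# │ │ │ │ ┌ day of week (0-6, Sunday = 0)
  0 2 * * 1   # run every Monday at 02:00
```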
Displayed as a graph, it looks like this. (Apologies that there is only one project...)
When executed, it looks like this.
The processing results are managed in the library as Answer Sets.
This article was written with permission from the client company and the Paxata distributor.