I work as an outsourced data analyst for client companies. The other day, a customer asked me to evaluate Paxata ahead of a possible adoption, so I had the chance to try it out. Paxata is a data preparation tool acquired by DataRobot in 2019 [^1]. There are two ways to use it: as a subscription, or by having it installed in your own Azure/AWS VM; this time it was the latter.
These are just my impressions. Whether each point counts as an advantage or a disadvantage depends on the time and the use case.
- Even though it is non-coding, some ability to think like a programmer is still required
  - Non-coding tools are not magic
  - The developer still has to design how the processing parts combine and how the data flows through them, so the hurdle is high for people who cannot program at all
- Visibility is high, and it is easy to see what kind of processing is being done
  - If you have been writing detailed design documents, they become unnecessary
  - There is a preview of processing results and automatic name matching for strings
    - For a replacement step, the preview lets you compare the data before and after the replacement
    - Strings that differ only in notation (for example, variant spellings of "Co., Ltd.") are matched automatically
- You cannot do very complicated processing
  - Steps run serially and cannot be nested, branched, or looped
- No matter who builds it, the result comes out at the same level (~~SIers seem to like that~~)
- The created process cannot be exported to Python
  - Vendor lock-in
  - Currently, only DataRobot is supported for machine-learning integration
  - To use the processed data with scikit-learn, you first have to export it to a database or a file
- It is hard to build review and deployment processes around it
  - There is no concept of separate development and production environments to deploy between, so during maintenance you touch the live production definitions directly
  - Review is difficult because you cannot open pull requests or view diffs as you can with git
  - There is no testing feature like pytest or JUnit
  - There is a version control feature, so you can at least revert to a previous version
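As noted above, data prepared in Paxata has to be exported (to a file or a database) before scikit-learn can use it. A minimal sketch of that hand-off is below; the file name, column names, and model choice are all invented for illustration, and a tiny inline DataFrame stands in for the exported file so the snippet runs on its own.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# In practice this would be pd.read_csv("paxata_output.csv") on a file
# exported from the Paxata library; a toy stand-in is used here instead.
df = pd.DataFrame({
    "feature_a": [0.1, 0.4, 0.9, 0.2, 0.8, 0.7],
    "feature_b": [1, 0, 1, 0, 1, 1],
    "target":    [0, 0, 1, 0, 1, 1],
})

X = df.drop(columns=["target"])  # features prepared in the tool
y = df["target"]                 # label column (name is an assumption)

model = LogisticRegression().fit(X, y)
print(model.score(X, y))  # training accuracy on the toy data
```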
Paxata consists of three components:
| # | Component | Description |
|---|---|---|
| 1 | Library | Manages datasets (project output is also managed here) |
| 2 | Project | Defines the data processing |
| 3 | Project Flow | Defines the processing order of projects and the execution schedule |
When developing, you work through these components in order: import data into the library, build the processing in a project, then define a project flow. That is the general flow.
Importing a CSV file as a test looks like this. The data was borrowed from here.
A feature called "Profile" gives you basic statistics and category information for each column.
Profile results are also managed in the library.
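Outside a dedicated tool, a similar per-column summary (basic statistics plus category counts) can be produced with pandas; the sample data below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "price": [120, 300, 150, 300, 90],
    "category": ["A", "B", "A", "A", "C"],
})

print(df["price"].describe())         # basic statistics for a numeric column
print(df["category"].value_counts())  # category frequencies, like Profile's category view
```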
Let's create a project with the imported data.
If you change a column's data type or perform a string replacement, you get a preview of the result like this.
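For comparison, the same kind of type conversion and replacement done in code (with no built-in preview) might look like this in pandas; the column name and data are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"amount": ["1,000", "2,500", "300"]})

# Replace the thousands separator, then convert the column's data type.
df["amount"] = df["amount"].str.replace(",", "", regex=False).astype(int)
print(df["amount"].tolist())  # [1000, 2500, 300]
```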
You can also create a new column with Excel-like functions using a tool called "Calculation".
The syntax checking was fairly strict.
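As a point of reference, the row-by-row computed column that "Calculation" produces corresponds to a vectorized expression in pandas; the columns here are invented.

```python
import pandas as pd

df = pd.DataFrame({"unit_price": [100, 250], "quantity": [3, 2]})

# New column derived from existing ones, like an Excel formula applied per row.
df["total"] = df["unit_price"] * df["quantity"]
print(df["total"].tolist())  # [300, 500]
```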
You can also aggregate with a tool called "Aggregate". However, this kind of aggregation adds the result as a new column on every row, as in count encoding.
For ordinary (?) aggregation that collapses rows, use a tool called "Shape".
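The distinction between the two tools maps cleanly onto pandas: `groupby(...).transform` attaches the aggregate to every row (the "Aggregate"/count-encoding style), while `groupby(...).sum` collapses to one row per group (the "Shape" style). A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({
    "shop":  ["A", "B", "A", "A", "B"],
    "sales": [10, 20, 30, 40, 50],
})

# "Aggregate"-style: attach the group count to every row (count encoding).
df["shop_count"] = df.groupby("shop")["shop"].transform("count")

# "Shape"-style: collapse to one row per group.
summary = df.groupby("shop", as_index=False)["sales"].sum()

print(df["shop_count"].tolist())  # [3, 2, 3, 3, 2]
print(summary["sales"].tolist())  # [80, 70]
```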
Let's schedule the created project. Besides fixed time intervals, you can also specify the schedule in crontab format.
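For reference, a crontab-format schedule is five space-separated fields; the expression below is a made-up example, not one taken from the Paxata screens.

```
# ┌ minute (0-59)
# │ ┌ hour (0-23)
# │ │ ┌ day of month (1-31)
# │ │ │ ┌ month (1-12)
# │ │ │ │ ┌ day of week (0-6, Sunday = 0)
  0 2 * * 1   # run every Monday at 02:00
```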
Displayed as a graph, it looks like this. (Apologies that there is only one project...)
When executed, it looks like this.
The processing results are managed in the library as Answer Sets.
This article was written with permission from the client company and the Paxata distributor.