For some reason, when I was looking at the Tableau SDK API Reference, I found the following description. I was surprised.
Insert Data Directly from CSV Files
You can use the PostgreSQL-like COPY command to insert records into a hyper file, which is Tableau's unique data format, at high speed **.
A few years ago, I implemented a process to create a tde file from a CSV file in order to distribute Tableau Reader regularly, but due to the SDK specifications, it is an insert for each record, so processing a large amount of data I have the impression that it took a long time. However, if you can insert it directly with the COPY command, it seems that you can quickly create Tableau's original file (hyper in this case) even with a large amount of data.
This time, I will insert the training data (CSV file) of Kaggle's Titanic dataset into the hyper file using the COPY command.
Click here for the code https://github.com/yolo-kiyoshi/csv2hyper
The complete _API can be installed with pip install tableau hyperapi
. _
It is running in the following directory.
The .hyper
file is created underdata /
.
Directory structure
.
├── data
│ └── train.csv # Titanic train
└── notebook
└── insert_test.ipynb
First, define the I / O file.
#Insert source CSV file path
src_path = '../data/train.csv'
#Hyper file path you want to create
dist_path = '../data/train.hyper'
Then use the COPY command to insert from CSV to hyper. It's a rough explanation, but it's like "(1) create a session to process hyper files", "(2) create a virtual PostgreSQL-like table", "(3) insert all at once from a CSV file", and "(4) spit out the entire table into a hyper file".
from tableauhyperapi import HyperProcess, Telemetry, Connection, CreateMode, NOT_NULLABLE, NULLABLE, SqlType, TableDefinition, escape_string_literal
#Create a session to work with Hyper files
with HyperProcess(telemetry=Telemetry.SEND_USAGE_DATA_TO_TABLEAU) as hyper:
with Connection(
endpoint=hyper.endpoint,
database=dist_path,
create_mode=CreateMode.CREATE_AND_REPLACE
) as connection:
#Define table
table_def = TableDefinition(
table_name='train',
columns=[
TableDefinition.Column("PassengerId", SqlType.big_int(), NOT_NULLABLE),
TableDefinition.Column("Survived", SqlType.big_int(), NOT_NULLABLE),
TableDefinition.Column("Pclass", SqlType.big_int(), NOT_NULLABLE),
TableDefinition.Column("Name", SqlType.text(), NOT_NULLABLE),
TableDefinition.Column("Sex", SqlType.text(), NOT_NULLABLE),
TableDefinition.Column("Age", SqlType.double(), NULLABLE),
TableDefinition.Column("SibSp", SqlType.big_int(), NOT_NULLABLE),
TableDefinition.Column("Parch", SqlType.big_int(), NOT_NULLABLE),
TableDefinition.Column("Ticket", SqlType.text(), NOT_NULLABLE),
TableDefinition.Column("Fare", SqlType.double(), NULLABLE),
TableDefinition.Column("Cabin", SqlType.text(), NULLABLE),
TableDefinition.Column("Embarked", SqlType.text(), NULLABLE)
]
)
#Virtually create a table based on the table definition(default schema is public)
connection.catalog.create_table(table_def)
#Execute COPY command like PostgreSQL
record_count = connection.execute_command(
command=f'''
COPY {table_def.table_name} from {escape_string_literal(src_path)} with (format csv, delimiter ',', header)
'''
)
print(f"The number of rows in table {table_def.table_name} is {record_count}.")
Execution result
The number of rows in table "train" is 891.
After processing, if you look at data / train.hyper
in Tableau ...
I was able to confirm that the hyper file was created successfully!
I was impressed that the Tableau SDK has evolved so far. If you mainly use Tableau Desktop or Tableau Server, you will rarely have the opportunity to use hyper files, but if for some reason you have to distribute reports using Tableau Reader etc., it is quite useful. Will do.
https://help.tableau.com/current/api/hyper_api/en-us/docs/hyper_api_insert_csv.html
Recommended Posts