I tried the OSS visualization tool, superset

Introduction

This is the article for day 11 of the "WACUL Advent Calendar 2016". In this post I will introduce superset, a visualization tool made by airbnb, and share my impressions from trying it out.

What is superset

The official documentation is at http://airbnb.io/superset/index.html. Personally, I think of it as a tool with the following characteristics:

  1. Flexible data input / output mechanism
  2. Flask-based web application
  3. Rich built-in management functions

Also, when I tried it on Ubuntu 16.04, installation was easy (the procedure is as simple as on the official website: essentially pip install superset followed by a few initialization commands, which appear later in this post).

Flexible data input / output mechanism

For output, superset can expose the accumulated data in various formats, such as rendered graphs, JSON, and CSV. For input, it likewise supports a variety of sources, such as MySQL and PostgreSQL.
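
Input sources are registered with a SQLAlchemy connection URI (superset talks to databases through SQLAlchemy). A few illustrative examples; the credentials and database names here are placeholders:

mysql://user:password@localhost/mydb
postgresql://user:password@localhost/mydb
sqlite:////absolute/path/to/mydb.db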

Dashboard, Slice

superset has components called "Dashboard" and "Slice". The mental model is that a Dashboard is a collection of Slices; I understand a Slice to be simply one meaningful display unit within a Dashboard.

[Screenshot: an example dashboard]

In this image, the whole screen is a Dashboard, while "Region Filter", "% Rural", "Growth Rate", and so on are Slices.

Which Slices to place on a dashboard screen, and where, can be configured from the management screen.

[Screenshot: the dashboard edit form in the management screen]

It's a little hard to see, but you list which Slices go on the Dashboard in the "Slices" field, and describe where to put them in the "Position JSON" field.
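
For reference, "Position JSON" in this version held a JSON array with one entry per Slice, roughly like the following (the exact key names are from my recollection of this version, so treat them as an assumption):

[
  {"slice_id": "1", "col": 1, "row": 1, "size_x": 4, "size_y": 4},
  {"slice_id": "2", "col": 5, "row": 1, "size_x": 4, "size_y": 4}
]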

Slice, Visualization Type, Datasource

A Slice is made up of two parts: the "data itself" and the "way it is displayed". In superset, these roles are played by components called the "Datasource" and the "Visualization Type", respectively. Many Visualization Types are available out of the box.

[Screenshots: a sampling of the available Visualization Types]

Flask-based web application

superset is a web application built on Flask. If you're used to developing with Flask, it's easy to customize.
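
As a minimal sketch of what that could look like, the following adds a custom endpoint, assuming the package exposes its Flask app object as superset.app (which it did in versions around this time; the route itself is hypothetical):

from superset import app  # assumption: the Flask app object is importable here

@app.route('/hello')  # hypothetical custom endpoint
def hello():
    return 'Hello from a customized superset!'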

Rich built-in management functions

A variety of authentication and access-control mechanisms are built in.

[Screenshot: the security / role management screen]

You can assign Roles to users as attributes, and configure which actions each Role is allowed to perform. Update-type actions also leave an audit log.

Building all of this yourself would take a lot of work, so I'm happy that superset gives it to you without your writing a single line of code.

By the way, the default is password authentication backed by superset's own DB, but as mentioned above it is a Flask-based application, so it is possible to switch to another authentication type such as OAuth. (For OAuth, I think some development around the views is needed.)
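
Since the authentication machinery comes from Flask-AppBuilder, switching the auth type should mostly be configuration in superset_config.py. A sketch along those lines; the provider block follows Flask-AppBuilder's OAuth conventions, the credentials are placeholders, and I have not verified this exact setup myself:

from flask_appbuilder.security.manager import AUTH_OAUTH

AUTH_TYPE = AUTH_OAUTH
OAUTH_PROVIDERS = [{
    'name': 'google',
    'token_key': 'access_token',
    'remote_app': {
        'consumer_key': 'YOUR_CLIENT_ID',         # placeholder
        'consumer_secret': 'YOUR_CLIENT_SECRET',  # placeholder
        'base_url': 'https://www.googleapis.com/oauth2/v2/',
        'request_token_url': None,
        'access_token_url': 'https://accounts.google.com/o/oauth2/token',
        'authorize_url': 'https://accounts.google.com/o/oauth2/auth',
        'request_token_params': {'scope': 'email profile'},
    },
}]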

Try using superset with your own data

While I'm at it, let me try using superset with my own data. This time I did the following:

  1. Switch the data source to MySQL
  2. Visualize your data with several Visualization Types

Switch data source to MySQL

First, write the following in superset_config.py (superset picks this file up from your PYTHONPATH):

SQLALCHEMY_DATABASE_URI = 'mysql://root:testtest@localhost/tdb1'

With this, superset connects as user root with password testtest to the database tdb1 on localhost. Next, initialize superset, including creating an admin user, as follows:

fabmanager create-admin --app superset
superset db upgrade
superset init

That's it. Confirm that the following tables have been created:

mysql> show tables;
+-------------------------+
| Tables_in_tdb1          |
+-------------------------+
| ab_permission           |
| ab_permission_view      |
| ab_permission_view_role |
| ab_register_user        |
| ab_role                 |
| ab_user                 |
| ab_user_role            |
| ab_view_menu            |
| access_request          |
| alembic_version         |
| clusters                |
| columns                 |
| css_templates           |
| dashboard_slices        |
| dashboard_user          |
| dashboards              |
| datasources             |
| dbs                     |
| favstar                 |
| logs                    |
| metrics                 |
| query                   |
| slice_user              |
| slices                  |
| sql_metrics             |
| table_columns           |
| tables                  |
| url                     |
+-------------------------+
28 rows in set (0.00 sec)

Visualize data with several Visualization Types

First, prepare the data. This time I used Google Trends data: daily scores for the keywords "恋人" (koibito, "lover") and "リア充" (riajuu, roughly "someone with a fulfilling real life") from 2015/12/01 to 2015/12/31, which I fetched and then visualized.

First, prepare a table like this:

mysql> desc trend201512;
+---------------+---------+------+-----+---------+----------------+
| Field         | Type    | Null | Key | Default | Extra          |
+---------------+---------+------+-----+---------+----------------+
| id            | int(11) | NO   | PRI | NULL    | auto_increment |
| date          | date    | NO   |     | NULL    |                |
| score_koibito | int(11) | NO   |     | NULL    |                |
| score_riajuu  | int(11) | NO   |     | NULL    |                |
+---------------+---------+------+-----+---------+----------------+
4 rows in set (0.00 sec)

After that, import the data and register the table as a datasource on the superset side.
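
The import step isn't shown above, but one simple way to do it is with pandas (a sketch, assuming the Trends data has been saved as a hypothetical trend201512.csv with columns matching the table):

import pandas as pd
from sqlalchemy import create_engine

# Append the downloaded Google Trends rows into the MySQL table.
engine = create_engine('mysql://root:testtest@localhost/tdb1')
df = pd.read_csv('trend201512.csv', parse_dates=['date'])
df.to_sql('trend201512', engine, if_exists='append', index=False)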

After some further configuration, I created Slices with the Visualization Types "table view" and "line chart", and assembled them into the following dashboard.

[Screenshot: a dashboard with a table view and a line chart of the trend data]

From it you can read that, for some reason, the scores on the 24th and 25th were high even though those were weekdays.

Incidentally, in the process of making this graph, I also ran into superset's (personally) painful points.

First of all, one of superset's selling points is that data can be extracted (displayed as a graph or exported) without writing SQL.

However, this is exactly what I found painful.

If you were building your own system to produce this graph, you would just issue a query like select date, score_koibito, score_riajuu from trend201512; and quickly render the returned data with something like highcharts. There is nothing difficult about it.
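
For illustration, the DIY version really is about this much (a sketch reusing the connection settings above; the charting hand-off is left to the front end):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql://root:testtest@localhost/tdb1')
df = pd.read_sql('select date, score_koibito, score_riajuu from trend201512', engine)
# hand df off to your charting library (e.g. highcharts) as JSON
print(df.to_json(orient='records'))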

On the other hand, in superset, where in SQL you would only have to name the fields, you have to do separate work called "metrics definition". What is this? It is a mechanism that wraps a field of a table in an aggregate function and gives it an alias; count, avg, sum, and so on are provided by default (for example, a metric might be the expression sum(score_koibito) exposed under an alias). You also have to set metrics appropriate to each Visualization Type.

In this example, the table view doesn't need an aggregate function in the first place, so I defined metrics of type "identity". As a result, the SQL that superset issues to build the table view looks like this:

SELECT score_koibito AS score_koibito,
       score_riajuu AS score_riajuu
FROM trend201512
WHERE date >= STR_TO_DATE('1916-12-11 08:52:13', '%%Y-%%m-%%d %%H:%%i:%%s')
  AND date <= STR_TO_DATE('2016-12-11 08:52:13', '%%Y-%%m-%%d %%H:%%i:%%s') LIMIT 50

This SQL is generated automatically from operations in the UI; you cannot write and execute SQL by hand. Also, a date range is mandatory, and dates are specified in superset's own DSL-like notation, which makes them hard to write.

... That was a long-winded way of saying that, in short, I wanted it to also support extracting data with raw SQL (though I understand the merit that UI-driven operation makes routine work more efficient). Also, since the items you can specify in the UI differ by Visualization Type (which is natural, given the data structures), I feel the learning cost is not as low as it looks.

Something I tried but couldn't get working

superset can apparently execute heavy queries asynchronously in cooperation with celery. There are clues on this page: http://airbnb.io/superset/sqllab.html

  Support for long-running queries
    - uses the Celery distributed queue to dispatch query handling to workers
    - supports defining a "results backend" to persist query results
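
Concretely, this appears to be driven by pointing superset at a Celery configuration in superset_config.py. A sketch along the lines of the docs (the broker and result-backend URLs are placeholders, and I did not get this running myself):

class CeleryConfig(object):
    BROKER_URL = 'redis://localhost:6379/0'             # placeholder broker
    CELERY_IMPORTS = ('superset.sql_lab',)
    CELERY_RESULT_BACKEND = 'redis://localhost:6379/0'  # placeholder backend

CELERY_CONFIG = CeleryConfig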

In closing

I wrote about superset, a visualization tool made by airbnb. I aired some complaints along the way, but I think it's a great tool, packed with features. It is especially recommended for people whose main job is data analysis and who can't spare time for visualization. I also thought it could work for monitoring if set up well.
