# Purpose of this article

What if you want to see the relationships between many variables at once in your data analysis?

A pair plot is the typical answer, but I had been wondering whether everything could be put together in a single figure. I recently learned about the Sankey diagram, so I drew one.



**Addendum:**
<font color="red">Please read the postscript at the end of the article first.</font>


# How to use Plotly Sankey Diagram


Plotly can draw Sankey diagrams, so first I copied the sample code from the [official site](https://plot.ly/python/sankey-diagram/) to check that it works.

```python
import plotly.graph_objects as go

fig = go.Figure(data=[go.Sankey(
    node = dict(
      pad = 15,
      thickness = 20,
      line = dict(color = "black", width = 0.5),
      label = ["A1", "A2", "B1", "B2", "C1", "C2"],
      color = "blue"
    ),
    link = dict(
      source = [0, 1, 0, 2, 3, 3], # indices correspond to labels, eg A1, A2, A1, B1, ...
      target = [2, 3, 3, 4, 4, 5],
      value = [8, 4, 2, 8, 4, 2]
  ))])

fig.update_layout(title_text="Basic Sankey Diagram", font_size=10)
fig.show()
```

It worked! It's nice that hovering the mouse over a part shows its details.

The code is longer and more involved than with matplotlib or seaborn, but the important parts are the following.



```python
label = ["A1", "A2", "B1", "B2", "C1", "C2"]

source = [0, 1, 0, 2, 3, 3]
target = [2, 3, 3, 4, 4, 5]
value = [8, 4, 2, 8, 4, 2]
```

For example, in the Sankey diagram above, the link shown as `source: A1, target: B2, 2.00` corresponds to the third entry of each of the three link lists (`source` 0, `target` 3, `value` 2). It means that a flow of `2` runs from `label[0]` to `label[3]`.

In other words, if you can create lists that specify the start node and end node of each link and the amount that flows through it, you can draw a Sankey diagram!
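To double-check this reading of the three lists, the indices can be decoded back into labeled flows. This is only a minimal sketch. `describe_links` is a hypothetical helper name introduced here and is not part of Plotly.

```python
# Hypothetical helper (not a Plotly function): turn index-based link lists
# back into readable "source -> target: value" strings.
def describe_links(label, source, target, value):
    return [
        "{0} -> {1}: {2}".format(label[s], label[t], v)
        for s, t, v in zip(source, target, value)
    ]

label = ["A1", "A2", "B1", "B2", "C1", "C2"]
source = [0, 1, 0, 2, 3, 3]
target = [2, 3, 3, 4, 4, 5]
value = [8, 4, 2, 8, 4, 2]

for line in describe_links(label, source, target, value):
    print(line)
```

The third printed line is `A1 -> B2: 2`, which matches the hover text of the sample figure.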

# Draw a Sankey Diagram from a data frame

Now for the main subject: creating a Sankey diagram from a data frame.

To show the result first, I created the following figure from the Titanic data.

## Commentary

Load the libraries.

```python
import numpy as np
import pandas as pd
import plotly.graph_objects as go
```

Download the data. The leading `!` runs `wget` as a shell command from a notebook cell.

```python
!wget https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv
```

Read the data. This time only the categorical variables and the integer-valued variables are displayed, so narrow down the column names.

```python
filename = "/content/titanic.csv"
df = pd.read_csv(filename, encoding='utf-8')

# Columns to show, in the order they appear in the diagram.
cate_list = ["Survived", "Pclass", "Sex", "Siblings/Spouses Aboard", "Parents/Children Aboard"]

n = len(cate_list)
```

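Then create `label_list`. Each label has the form `column=value`, and the labels are sorted within each column.

```python
label_list = []
source_list = []
target_list = []
value_list = []

for cate in cate_list:
    tmp_label_list = []
    for v in df[cate].unique():
        lab = "{0}={1}".format(cate, v)
        tmp_label_list.append(lab)
    # Sort once per column, after all of its labels are collected.
    tmp_label_list.sort()
    label_list.extend(tmp_label_list)
```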

Create the three lists of link information. For each pair of adjacent columns, count the rows for every combination of values.

```python
for i in range(n-1):
    source_cate = cate_list[i]
    target_cate = cate_list[i+1]

    for sc in df[source_cate].unique():
        for tc in df[target_cate].unique():
            # Number of rows with this (source, target) combination.
            v = int(((df[source_cate]==sc) & (df[target_cate]==tc)).sum())
            source_lab = "{0}={1}".format(source_cate, sc)
            target_lab = "{0}={1}".format(target_cate, tc)

            source_list.append(source_lab)
            target_list.append(target_lab)
            value_list.append(v)
```
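The nested loops above scan the data frame once per combination of values. As an alternative sketch (not the method used in this article), the same counts can be collected in a single pass with `collections.Counter`. The two short lists below are made-up stand-ins for two adjacent columns, and `count_links` is a name invented for illustration. With the data frame, one would presumably pass `df[source_cate]` and `df[target_cate]` instead. Unlike the loops above, this only yields pairs that actually occur, so no zero-valued links are produced.

```python
from collections import Counter

# Made-up stand-ins for two adjacent columns (not the real Titanic data).
survived = [0, 1, 1, 0, 1]
pclass = [3, 1, 1, 3, 2]

def count_links(source_cate, source_values, target_cate, target_values):
    """Count co-occurring (source, target) pairs and return labeled link lists."""
    counts = Counter(zip(source_values, target_values))
    sources, targets, values = [], [], []
    for (sc, tc), v in sorted(counts.items()):
        sources.append("{0}={1}".format(source_cate, sc))
        targets.append("{0}={1}".format(target_cate, tc))
        values.append(v)
    return sources, targets, values

sources, targets, values = count_links("Survived", survived, "Pclass", pclass)
print(list(zip(sources, targets, values)))
```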

Finally, `source_list` and `target_list` must be specified as indices, so convert them by looking up each label in `label_list`.

```python
source_list = [label_list.index(si) for si in source_list]
target_list = [label_list.index(ti) for ti in target_list]
```
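`list.index` searches the list from the start on every call. For a long `label_list`, a dictionary from label to position does the same conversion with one lookup per label. The following is a minimal sketch with made-up labels. The names `label_to_index`, `source_labels`, and `target_labels` are introduced here for illustration, and the approach assumes every label is unique, which holds here because each label includes its column name.

```python
# Made-up labels and links, for illustration only.
label_list = ["Survived=0", "Survived=1", "Pclass=1", "Pclass=2", "Pclass=3"]
source_labels = ["Survived=0", "Survived=1", "Survived=1"]
target_labels = ["Pclass=3", "Pclass=1", "Pclass=2"]

# Build the label -> index mapping once.
label_to_index = {lab: i for i, lab in enumerate(label_list)}

source_idx = [label_to_index[lab] for lab in source_labels]
target_idx = [label_to_index[lab] for lab in target_labels]

print(source_idx)  # [0, 1, 1]
print(target_idx)  # [4, 2, 3]
```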

All you have to do now is run the same code as the sample, passing in the lists created above.

```python
fig = go.Figure(data=[go.Sankey(
    node = dict(
      pad = 15,
      thickness = 20,
      line = dict(color = "black", width = 0.5),
      label = label_list,
      color = "blue"
    ),
    link = dict(
      source = source_list,
      target = target_list,
      value = value_list
  ))])

fig.update_layout(title_text="Basic Sankey Diagram", font_size=10)
fig.show()
```

That's all!

## Full code


```python
import numpy as np
import pandas as pd
import plotly.graph_objects as go

!wget https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv

filename = "/content/titanic.csv"
df = pd.read_csv(filename, encoding='utf-8')

# Columns to show, in the order they appear in the diagram.
cate_list = ["Survived", "Pclass", "Sex", "Siblings/Spouses Aboard", "Parents/Children Aboard"]

n = len(cate_list)

label_list = []
source_list = []
target_list = []
value_list = []

# Node labels of the form "column=value", sorted within each column.
for cate in cate_list:
    tmp_label_list = []
    for v in df[cate].unique():
        lab = "{0}={1}".format(cate, v)
        tmp_label_list.append(lab)
    tmp_label_list.sort()
    label_list.extend(tmp_label_list)

# Links between each pair of adjacent columns.
for i in range(n-1):
    source_cate = cate_list[i]
    target_cate = cate_list[i+1]

    for sc in df[source_cate].unique():
        for tc in df[target_cate].unique():
            # Number of rows with this (source, target) combination.
            v = int(((df[source_cate]==sc) & (df[target_cate]==tc)).sum())
            source_lab = "{0}={1}".format(source_cate, sc)
            target_lab = "{0}={1}".format(target_cate, tc)

            source_list.append(source_lab)
            target_list.append(target_lab)
            value_list.append(v)

# Plotly expects node indices, not labels.
source_list = [label_list.index(si) for si in source_list]
target_list = [label_list.index(ti) for ti in target_list]

fig = go.Figure(data=[go.Sankey(
    node = dict(
      pad = 15,
      thickness = 20,
      line = dict(color = "black", width = 0.5),
      label = label_list,
      color = "blue"
    ),
    link = dict(
      source = source_list,
      target = target_list,
      value = value_list
  ))])

fig.update_layout(title_text="Basic Sankey Diagram", font_size=10)
fig.show()
```

# Afterword

Actually, I only noticed this when I wrote the code myself, but this figure shows only the relationships between adjacent variables. In the example above, you can see the relationship between `Survived` and `Pclass`, and between `Pclass` and `Sex`, but not between `Survived` and `Sex`. Up to about three variables could probably be expressed with color and so on, but beyond that it seems impossible.

(Hmm, so this isn't really a visualization of four or more dimensions...?)
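The limitation can be checked with a tiny made-up example: two data sets whose adjacent pair counts (A with B, B with C) are identical, so their Sankey diagrams would look the same, even though the relationship between A and C is reversed. The data and the helper name `pair_counts` are invented for illustration.

```python
from collections import Counter

# Each row is (A, B, C). Both data sets are made up.
data1 = [(0, 0, 0), (1, 0, 1)]
data2 = [(0, 0, 1), (1, 0, 0)]

def pair_counts(rows, i, j):
    """Count how often each (column i, column j) value pair occurs."""
    return Counter((row[i], row[j]) for row in rows)

# Adjacent pairs, which are all the Sankey links encode, are identical...
print(pair_counts(data1, 0, 1) == pair_counts(data2, 0, 1))  # True
print(pair_counts(data1, 1, 2) == pair_counts(data2, 1, 2))  # True
# ...but the non-adjacent A-C relationship differs.
print(pair_counts(data1, 0, 2) == pair_counts(data2, 0, 2))  # False
```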

If you know a better way, please let us know in the comments.

# Postscript

I did a lot of work above, but when I read the docs, there turned out to be an easier way.

```python
import plotly.express as px

fig = px.parallel_categories(df, dimensions=cate_list, color='Survived')
fig.show()
```

It's amazing that you can change both the order of the variables and the order of the categories!

That's all!

# References

- Plotly: Sankey Diagram in Python
- Plotly: Basic Parallel Category Diagram with plotly.express
- CS109: A Titanic Probability

[Python] What do you do with visualization of 4 or more variables?