What if you want to see the relationships between many variables at once in your data analysis?
A pair plot is the typical choice, but I wanted something more compact. While wondering whether everything could be put into a single figure, I recently learned about the **Sankey Diagram**, so I tried drawing one.
**Addendum:**
<font color="red">Please read the addendum at the end of the article first.</font>
# How to use Plotly Sankey Diagram
It seems that Plotly can draw one. First, I will copy the sample code from the [official site](https://plot.ly/python/sankey-diagram/) and check that it works.
```python
import plotly.graph_objects as go

fig = go.Figure(data=[go.Sankey(
    node = dict(
        pad = 15,
        thickness = 20,
        line = dict(color = "black", width = 0.5),
        label = ["A1", "A2", "B1", "B2", "C1", "C2"],
        color = "blue"
    ),
    link = dict(
        source = [0, 1, 0, 2, 3, 3],  # indices correspond to labels, e.g. A1, A2, A1, B1, ...
        target = [2, 3, 3, 4, 4, 5],
        value = [8, 4, 2, 8, 4, 2]
    ))])
fig.update_layout(title_text="Basic Sankey Diagram", font_size=10)
fig.show()
```
It worked! It's nice that hovering the mouse over a part of the diagram shows its details.
The code is longer and harder to read than the matplotlib or seaborn equivalent, but the important parts are these:
```python
label = ["A1", "A2", "B1", "B2", "C1", "C2"],
source = [0, 1, 0, 2, 3, 3],
target = [2, 3, 3, 4, 4, 5],
value = [8, 4, 2, 8, 4, 2]
```
For example, in the Sankey diagram above, the hover text `source: A1, target: B2, 2.00` corresponds to the third entry of each of the three `link` lists (`source[2] = 0`, `target[2] = 3`, `value[2] = 2`). It means that a flow of `2` goes from `label[0]` to `label[3]`.
In other words, if you can create lists that specify the start node and end node of each link and the amount that flows through it, you can draw a Sankey diagram!
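As a minimal sketch of that idea, the helper below turns a list of `(source label, target label, flow)` triples into the index-based lists used in the sample above. The function name `links_to_index_lists` and the triple format are assumptions made for illustration; they are not part of Plotly.

```python
def links_to_index_lists(labels, links):
    """Convert (source_label, target_label, value) triples to index lists."""
    # Map each label to its position once, instead of searching the list every time.
    position = {label: i for i, label in enumerate(labels)}
    source = [position[s] for s, _, _ in links]
    target = [position[t] for _, t, _ in links]
    value = [v for _, _, v in links]
    return source, target, value


labels = ["A1", "A2", "B1", "B2", "C1", "C2"]
links = [
    ("A1", "B1", 8),
    ("A2", "B2", 4),
    ("A1", "B2", 2),
    ("B1", "C1", 8),
    ("B2", "C1", 4),
    ("B2", "C2", 2),
]
source, target, value = links_to_index_lists(labels, links)
print(source)  # [0, 1, 0, 2, 3, 3]
print(target)  # [2, 3, 3, 4, 4, 5]
print(value)   # [8, 4, 2, 8, 4, 2]
```

The three printed lists are the same ones that appear in the official sample, so writing links by label first and converting afterwards gives the same diagram.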
Now for the main subject: creating a Sankey diagram from a DataFrame.
To show the result first: this time I drew a Sankey diagram of the Titanic data.
Load the libraries.
```python
import numpy as np
import pandas as pd
import plotly.graph_objects as go
```
Download the data.
```python
!wget https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv
```
Read the data. This time only the categorical variables and the integer-valued variables are displayed, so narrow the variable names down to those.
```python
filename = "/content/titanic.csv"
df = pd.read_csv(filename, encoding='utf-8')
cate_list = ["Survived", "Pclass", "Sex", "Siblings/Spouses Aboard", "Parents/Children Aboard"]
n = len(cate_list)
```
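If you would rather not pick the column names by hand, one possible approach is to keep only the columns with few distinct values. This is just a sketch: the helper name `low_cardinality_columns`, the `max_unique` threshold, and the toy data are assumptions for illustration and are not used in the rest of the article.

```python
import pandas as pd


def low_cardinality_columns(df, max_unique=10):
    """Return the columns that have at most `max_unique` distinct values."""
    return [col for col in df.columns if df[col].nunique() <= max_unique]


# Toy data standing in for the real CSV (hypothetical values).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Name": ["A", "B", "C", "D"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
    "Sex": ["male", "female", "female", "male"],
})
selected = low_cardinality_columns(toy, max_unique=2)
print(selected)  # ['Survived', 'Sex']
```

A threshold like this is a rough heuristic, so it is still worth checking the selected columns by eye before plotting them.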
Then create `label_list`.
```python
label_list = []
source_list = []
target_list = []
value_list = []
for cate in cate_list:
    # One label per distinct value, e.g. "Pclass=1", sorted within each variable.
    tmp_label_list = []
    for v in df[cate].unique():
        lab = "{0}={1}".format(cate, v)
        tmp_label_list.append(lab)
    tmp_label_list.sort()
    label_list.extend(tmp_label_list)
```
Create the three lists of link information.
```python
for i in range(n-1):
    # Links only connect each variable to the next one in cate_list.
    source_cate = cate_list[i]
    target_cate = cate_list[i+1]
    for sc in df[source_cate].unique():
        for tc in df[target_cate].unique():
            # Number of rows that take both values.
            v = int(((df[source_cate] == sc) & (df[target_cate] == tc)).sum())
            source_lab = "{0}={1}".format(source_cate, sc)
            target_lab = "{0}={1}".format(target_cate, tc)
            source_list.append(source_lab)
            target_list.append(target_lab)
            value_list.append(v)
```
Finally, `source_list` and `target_list` must be specified by index, so convert them by looking each label up in `label_list`.
```python
source_list = [label_list.index(si) for si in source_list]
target_list = [label_list.index(ti) for ti in target_list]
```
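The list-building steps above can also be sketched as a single function that counts rows with `groupby(...).size()`. This is an alternative sketch, not the code used in this article: the name `build_sankey_lists` and the toy DataFrame are assumptions for illustration. One difference is that `groupby` only returns combinations that actually occur, so zero-valued links are left out.

```python
import pandas as pd


def build_sankey_lists(df, columns):
    """Return (labels, source, target, value) for adjacent column pairs."""
    labels = []
    for col in columns:
        # One node per distinct value, sorted within each column.
        labels.extend(sorted("{0}={1}".format(col, v) for v in df[col].unique()))
    position = {label: i for i, label in enumerate(labels)}

    source, target, value = [], [], []
    for src_col, tgt_col in zip(columns, columns[1:]):
        # size() counts the rows for each observed (source value, target value) pair.
        counts = df.groupby([src_col, tgt_col]).size()
        for (s, t), n in counts.items():
            source.append(position["{0}={1}".format(src_col, s)])
            target.append(position["{0}={1}".format(tgt_col, t)])
            value.append(int(n))
    return labels, source, target, value


# Toy data (hypothetical values) in place of the Titanic CSV.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 3],
    "Sex": ["male", "female", "female", "male"],
})
labels, source, target, value = build_sankey_lists(toy, ["Survived", "Pclass", "Sex"])
print(labels)
print(source, target, value)
```

With the real data, calling `build_sankey_lists(df, cate_list)` should give lists that can be passed to `go.Sankey` in the same way as `label_list`, `source_list`, `target_list`, and `value_list`.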
All you have to do now is run the same code as the sample.
```python
fig = go.Figure(data=[go.Sankey(
    node = dict(
        pad = 15,
        thickness = 20,
        line = dict(color = "black", width = 0.5),
        label = label_list,
        color = "blue"
    ),
    link = dict(
        source = source_list,
        target = target_list,
        value = value_list
    ))])
fig.update_layout(title_text="Basic Sankey Diagram", font_size=10)
fig.show()
```
That's all!
Here is the complete code in one place:
```python
import numpy as np
import pandas as pd
import plotly.graph_objects as go

# Notebook shell command (e.g. Google Colab) to download the data.
!wget https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv

filename = "/content/titanic.csv"
df = pd.read_csv(filename, encoding='utf-8')
cate_list = ["Survived", "Pclass", "Sex", "Siblings/Spouses Aboard", "Parents/Children Aboard"]
n = len(cate_list)

label_list = []
source_list = []
target_list = []
value_list = []
for cate in cate_list:
    tmp_label_list = []
    for v in df[cate].unique():
        lab = "{0}={1}".format(cate, v)
        tmp_label_list.append(lab)
    tmp_label_list.sort()
    label_list.extend(tmp_label_list)

for i in range(n-1):
    source_cate = cate_list[i]
    target_cate = cate_list[i+1]
    for sc in df[source_cate].unique():
        for tc in df[target_cate].unique():
            v = int(((df[source_cate] == sc) & (df[target_cate] == tc)).sum())
            source_lab = "{0}={1}".format(source_cate, sc)
            target_lab = "{0}={1}".format(target_cate, tc)
            source_list.append(source_lab)
            target_list.append(target_lab)
            value_list.append(v)

# Convert labels to the node indices that go.Sankey expects.
source_list = [label_list.index(si) for si in source_list]
target_list = [label_list.index(ti) for ti in target_list]

fig = go.Figure(data=[go.Sankey(
    node = dict(
        pad = 15,
        thickness = 20,
        line = dict(color = "black", width = 0.5),
        label = label_list,
        color = "blue"
    ),
    link = dict(
        source = source_list,
        target = target_list,
        value = value_list
    ))])
fig.update_layout(title_text="Basic Sankey Diagram", font_size=10)
fig.show()
```
Actually, I only noticed this once I had written the code myself: this figure shows the relationship between adjacent variables only.
In the example above, you can see the relationship between `Survived` and `Pclass`, and between `Pclass` and `Sex`, but not the relationship between `Survived` and `Sex`. It seems that up to three variables could be expressed by using color and so on, but anything beyond that looks impossible.
(Oh, so this isn't really a visualization of four or more dimensions...?)
If you know a better way, please let me know in the comments.
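One way to try the color idea is to split every link by a third variable and give each part its own color, so that, for example, the `Pclass` to `Sex` links still show which passengers survived. The following is only a sketch under assumptions: the helper name `build_colored_links`, the `palette` mapping, and the toy data are made up for illustration. It relies on the `color` attribute of the Sankey `link`, which takes one color per link.

```python
import pandas as pd


def build_colored_links(df, columns, color_col, palette):
    """Split each adjacent-pair link by `color_col` and attach a color to it."""
    labels = []
    for col in columns:
        labels.extend(sorted("{0}={1}".format(col, v) for v in df[col].unique()))
    position = {label: i for i, label in enumerate(labels)}

    link = {"source": [], "target": [], "value": [], "color": []}
    for color_value, color in palette.items():
        # Count links separately for the rows that share this color value.
        sub = df[df[color_col] == color_value]
        for src_col, tgt_col in zip(columns, columns[1:]):
            counts = sub.groupby([src_col, tgt_col]).size()
            for (s, t), n in counts.items():
                link["source"].append(position["{0}={1}".format(src_col, s)])
                link["target"].append(position["{0}={1}".format(tgt_col, t)])
                link["value"].append(int(n))
                link["color"].append(color)
    return labels, link


# Toy data (hypothetical values) in place of the Titanic CSV.
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0],
    "Pclass": [3, 1, 3, 3],
    "Sex": ["male", "female", "female", "male"],
})
palette = {0: "rgba(200,0,0,0.4)", 1: "rgba(0,0,200,0.4)"}
labels, link = build_colored_links(toy, ["Survived", "Pclass", "Sex"], "Survived", palette)
print(link["source"], link["target"], link["value"])
```

Passing `node=dict(label=labels)` and `link=link` to `go.Sankey` should then draw each flow in the color of its `Survived` value. This still encodes only one extra variable, so it does not remove the limitation described above.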
**Addendum:** I did a lot of work above, but when I read the docs, it turned out there was an easier way.
```python
import plotly.express as px

fig = px.parallel_categories(df, dimensions=cate_list, color='Survived')
fig.show()
```
It's amazing that you can drag to change both the order of the variables and the order of the categories!
That's all!
- Plotly: Sankey Diagram in Python
- Plotly: Basic Parallel Category Diagram with plotly.express
- CS109: A Titanic Probability