Introduction

I've started working on Kaggle's Titanic competition, but before imputing missing values or tuning hyperparameters, I wanted to take a closer look at the data. I tried to quickly group the loaded data by the value of `Survived` and draw a graph, but it didn't go well, because I didn't understand pandas' `groupby`.

There are already many graph-drawing examples from our predecessors online, but I wrote this article thinking that describing the path of my own understanding might be useful for beginners.

Goal

Draw a graph like the one below.

Figure: number of survivors, deaths, and unknowns per ticket symbol

In this graph, the horizontal axis is the `Ticket` symbol and the vertical axis is the stacked number of people who survived (`s`), died (`d`), or are unknown (`nan`), sorted in descending order by the total number of people. For example, the ticket symbol `CA. 2343` on the far left has 11 people in total: 4 are unknown and the remaining 7 died.

I want to draw such a graph quickly.

Read data

Read the data and count how many times each symbol appears in the `Ticket` column.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

train_data = pd.read_csv("../train.csv")
test_data = pd.read_csv("../test.csv")
total_data = pd.concat([train_data, test_data])  # concatenate train_data and test_data

ticket_freq = total_data["Ticket"].value_counts()
ticket_freq

#output
CA. 2343        11
CA 2144          8
1601             8
S.O.C. 14879     7
3101295          7
                ..
350404           1
248706           1
367655           1
W./C. 14260      1
350047           1
Name: Ticket, Length: 929, dtype: int64

`CA. 2343` has 11 people, `CA 2144` has 8, and so on.
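As a small sketch of what `value_counts()` does here, the following toy example counts repeated values in a Series. The ticket strings are made up for illustration and are not taken from the Titanic data.

```python
import pandas as pd

# Hypothetical miniature of the Ticket column (not the real Titanic data).
tickets = pd.Series(["A 1", "B 2", "A 1", "C 3", "A 1", "B 2"], name="Ticket")

# value_counts() counts the occurrences of each distinct value and
# sorts the result by count in descending order.
ticket_freq = tickets.value_counts()
print(ticket_freq)
```

The most frequent value comes first, which is why `CA. 2343` appears at the top of the real output.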



# Create data for graphs
## Group by groupby

First, group `total_data` by ticket symbol.

```python
total_data_ticket = total_data.groupby("Ticket")
total_data_ticket

#output
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001F5A14327C8>
```

The downside of `groupby` is that it doesn't show the contents of the data. For now, just keep in mind that the data has been grouped, and move on.
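If you do want to peek inside a grouped object, a few standard methods can help. The sketch below uses a made-up toy DataFrame standing in for `total_data`; the variable names `toy`, `grouped`, `group_sizes`, `a1_rows`, and `group_keys` are my own.

```python
import pandas as pd

# Hypothetical toy data standing in for total_data.
toy = pd.DataFrame({
    "Ticket": ["A 1", "A 1", "B 2", "C 3", "A 1"],
    "Survived": [1.0, 0.0, float("nan"), 1.0, 0.0],
})

grouped = toy.groupby("Ticket")

# size() returns the number of rows in each group.
group_sizes = grouped.size()
print(group_sizes)

# get_group() returns the rows that belong to one group as a DataFrame.
a1_rows = grouped.get_group("A 1")
print(a1_rows)

# Iterating over the grouped object yields (key, sub-DataFrame) pairs.
group_keys = [key for key, _ in grouped]
print(group_keys)
```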



## Extract only survival information
Next, extract the survival information (`Survived`).

```python
total_data_ticket = total_data.groupby("Ticket")["Survived"]
total_data_ticket

#output
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001F5A1437B48>
```

The data is not displayed here either.

Count by survival, death, unknown

Then use `value_counts()` to count the number of rows for each value of `Survived`. Setting `dropna=False` makes it count N/A values as well.

total_data_ticket = total_data.groupby("Ticket")["Survived"].value_counts(dropna=False)
total_data_ticket

#output
Ticket       Survived
110152       1.0         3
110413       1.0         2
             0.0         1
110465       0.0         2
110469       NaN         1
                        ..
W.E.P. 5734  NaN         1
             0.0         1
W/C 14208    0.0         1
WE/P 5735    0.0         1
             1.0         1
Name: Survived, Length: 1093, dtype: int64
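To see the effect of `dropna=False` in isolation, here is a minimal sketch on made-up data (the tickets and values are hypothetical). It compares the counts with and without the N/A rows.

```python
import pandas as pd

# Hypothetical toy data: one row of ticket "B 2" has an unknown Survived value.
toy = pd.DataFrame({
    "Ticket": ["A 1", "A 1", "A 1", "B 2", "B 2"],
    "Survived": [1.0, 0.0, 0.0, float("nan"), 1.0],
})

# With dropna=False, NaN is counted as its own category.
with_na = toy.groupby("Ticket")["Survived"].value_counts(dropna=False)

# With the default (dropna=True), rows whose Survived is NaN are ignored.
without_na = toy.groupby("Ticket")["Survived"].value_counts()

print(with_na)
print(without_na)
```

Since the test data has no `Survived` column, leaving out `dropna=False` would silently drop every test passenger from the counts.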

Change the shape of data

To draw a graph, reshape the data so that the survived, died, and unknown counts each become a column. Use `unstack()`.

total_data_ticket = total_data.groupby("Ticket")["Survived"].value_counts(dropna=False).unstack()
total_data_ticket

#output
Survived	NaN	0.0	1.0
Ticket			
110152	NaN	NaN	3.0
110413	NaN	1.0	2.0
110465	NaN	2.0	NaN
110469	1.0	NaN	NaN
110489	1.0	NaN	NaN
...	...	...	...
W./C. 6608	1.0	4.0	NaN
W./C. 6609	NaN	1.0	NaN
W.E.P. 5734	1.0	1.0	NaN
W/C 14208	NaN	1.0	NaN
WE/P 5735	NaN	1.0	1.0
929 rows × 3 columns
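The reshaping done by `unstack()` can be sketched on a tiny hand-built Series. To keep the example simple I use the string labels `"d"`, `"s"`, and `"na"` instead of the real `0.0`, `1.0`, and `NaN`; these labels and the ticket names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical long-format counts indexed by (Ticket, Survived label).
idx = pd.MultiIndex.from_tuples(
    [("A 1", "d"), ("A 1", "s"), ("B 2", "s"), ("B 2", "na")],
    names=["Ticket", "Survived"],
)
long_counts = pd.Series([2, 1, 1, 1], index=idx)

# unstack() moves the innermost index level (Survived) into the columns.
# Combinations that do not exist in the long format become NaN.
wide = long_counts.unstack()
print(wide)
```

This is why `NaN` shows up in the cells of the table above: it means "no such row existed", not "unknown survival".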

Prepare to draw a graph

Change N/A to numbers

Looking at the output above, some values are still `NaN`, so replace `NaN` with 0.

total_data_ticket.fillna(0, inplace=True)
total_data_ticket

#output
Survived	NaN	0.0	1.0
Ticket			
110152	0.0	0.0	3.0
110413	0.0	1.0	2.0
110465	0.0	2.0	0.0
110469	1.0	0.0	0.0
110489	1.0	0.0	0.0
...	...	...	...
W./C. 6608	1.0	4.0	0.0
W./C. 6609	0.0	1.0	0.0
W.E.P. 5734	1.0	1.0	0.0
W/C 14208	0.0	1.0	0.0
WE/P 5735	0.0	1.0	1.0
929 rows × 3 columns
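As a variation, `fillna(0)` can also be used without `inplace=True`, in which case it returns a new DataFrame and leaves the original untouched. The sketch below, on made-up data, additionally converts the counts to integers so that they display as `3` rather than `3.0`; that conversion is my own addition, not part of the steps in this article.

```python
import pandas as pd

# Hypothetical wide table with one missing combination.
wide = pd.DataFrame(
    {"d": [2.0, float("nan")], "s": [1.0, 1.0]},
    index=pd.Index(["A 1", "B 2"], name="Ticket"),
)

# fillna(0) returns a new DataFrame; astype(int) turns 2.0 into 2.
filled = wide.fillna(0).astype(int)
print(filled)
```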

Change column name

The column names are `NaN`, `0.0`, and `1.0`, which is awkward, so rename the columns.

total_data_ticket.columns = ["nan", "d", "s"]
total_data_ticket

#output
	nan	d	s
Ticket			
110152	0.0	0.0	3.0
110413	0.0	1.0	2.0
110465	0.0	2.0	0.0
110469	1.0	0.0	0.0
110489	1.0	0.0	0.0
...	...	...	...
W./C. 6608	1.0	4.0	0.0
W./C. 6609	0.0	1.0	0.0
W.E.P. 5734	1.0	1.0	0.0
W/C 14208	0.0	1.0	0.0
WE/P 5735	0.0	1.0	1.0
929 rows × 3 columns
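One caveat: assigning a list to `columns` relies on the columns being in exactly the order `NaN`, `0.0`, `1.0`, as in the output above. If the order were different in your environment (I have not verified this across pandas versions, so treat it as an assumption), the labels would be silently swapped. A sketch of a rename that looks at each column's value instead of its position follows; `label_for` is a hypothetical helper name.

```python
import pandas as pd

# Hypothetical wide table; the NaN column is deliberately placed last.
wide = pd.DataFrame(
    [[0, 3, 0], [1, 2, 0], [0, 0, 1]],
    index=pd.Index(["110152", "110413", "110469"], name="Ticket"),
    columns=[0.0, 1.0, float("nan")],
)

def label_for(value):
    """Map a Survived value to a short column name."""
    if pd.isna(value):
        return "nan"
    return {0.0: "d", 1.0: "s"}[value]

# Build the new names from the existing column values, whatever their order.
wide.columns = [label_for(c) for c in wide.columns]
print(wide)
```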

Calculate the total number of people per row

I want to sort by the total number of people in descending order, so I calculate the total and store it in a new column. I use `sum()` to compute the total; since I want to add up the columns within each row, I call `sum(axis=1)`.

total_data_ticket["count"] = total_data_ticket.sum(axis=1)
total_data_ticket

#output
	nan	d	s	count
Ticket				
110152	0.0	0.0	3.0	3.0
110413	0.0	1.0	2.0	3.0
110465	0.0	2.0	0.0	2.0
110469	1.0	0.0	0.0	1.0
110489	1.0	0.0	0.0	1.0
...	...	...	...	...
W./C. 6608	1.0	4.0	0.0	5.0
W./C. 6609	0.0	1.0	0.0	1.0
W.E.P. 5734	1.0	1.0	0.0	2.0
W/C 14208	0.0	1.0	0.0	1.0
WE/P 5735	0.0	1.0	1.0	2.0
929 rows × 4 columns
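The difference between the default `sum()` and `sum(axis=1)` is easy to mix up, so here is a minimal sketch using two rows with the same values as in the table above.

```python
import pandas as pd

# Two rows with the same values as in the table above.
counts = pd.DataFrame(
    {"nan": [0, 1], "d": [1, 4], "s": [2, 0]},
    index=pd.Index(["110413", "W./C. 6608"], name="Ticket"),
)

col_totals = counts.sum()        # default axis=0: one total per column
row_totals = counts.sum(axis=1)  # axis=1: one total per row (per ticket)

print(col_totals)
print(row_totals)
```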

Now you are ready to draw the graph.

Draw a graph

Filter by number of people and sort in descending order

I'll show the code first and then explain it step by step.

total_data_ticket[total_data_ticket["count"] > 3].sort_values("count", ascending=False)[["nan", "d", "s"]].plot.bar(figsize=(15,10),stacked=True)

Code	Description
`total_data_ticket[total_data_ticket["count"] > 3]`	Keep only the rows where `"count"` is greater than 3
`.sort_values("count", ascending=False)`	Sort by `"count"` in descending order
`[["nan", "d", "s"]]`	Extract only the three left columns (`"count"` is not needed in the plot)
`.plot.bar(figsize=(15,10),stacked=True)`	Draw a bar graph with the given size, stacked
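The same chain can be written as separate steps, which I find easier to debug. The sketch below uses a small made-up table in the same layout as `total_data_ticket` (the ticket `X 1` is hypothetical) and the non-interactive `Agg` backend so that it also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import pandas as pd

# Hypothetical counts in the same layout as total_data_ticket.
table = pd.DataFrame(
    {"nan": [4, 0, 1, 0], "d": [7, 1, 4, 2], "s": [0, 2, 0, 1], "count": [11, 3, 5, 3]},
    index=pd.Index(["CA. 2343", "110413", "W./C. 6608", "X 1"], name="Ticket"),
)

large_groups = table[table["count"] > 3]                      # 1. keep groups larger than 3
ordered = large_groups.sort_values("count", ascending=False)  # 2. largest group first
plot_data = ordered[["nan", "d", "s"]]                        # 3. drop the helper column
ax = plot_data.plot.bar(figsize=(15, 10), stacked=True)       # 4. stacked bar chart
ax.set_ylabel("Number of people")
```

In a notebook the figure is displayed automatically; in a script you would typically save it with `ax.figure.savefig(...)`.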

Now you can draw the graph shown at the beginning.

Figure: number of survivors, deaths, and unknowns per ticket symbol (repeated)

Looking at this, one can imagine that the unknown passengers with the `CA. 2343` and `CA 2144` tickets have `Survived = 0`...

Whole code

Finally, the whole code is shown.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

train_data = pd.read_csv("../train.csv")
test_data = pd.read_csv("../test.csv")
total_data = pd.concat([train_data, test_data])

ticket_freq = total_data["Ticket"].value_counts()
ticket_freq

total_data_ticket = total_data.groupby("Ticket")["Survived"].value_counts(dropna=False).unstack()

total_data_ticket.fillna(0, inplace=True)
total_data_ticket.columns = ["nan", "d", "s"]
total_data_ticket["count"] = total_data_ticket.sum(axis=1)
total_data_ticket[total_data_ticket["count"] > 3].sort_values("count", ascending=False)[["nan", "d", "s"]].plot.bar(figsize=(15,10),stacked=True)

In conclusion

Using this technique, I will also examine other non-numeric data such as `Embarked`, `Cabin`, and the surnames and titles in `Name`.
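To reuse the steps for another column, they could be wrapped in a small function. This is only a sketch under my own assumptions: `survival_counts` is a hypothetical helper, and it differs slightly from the steps above in that it converts `Survived` to the labels `d`, `s`, and `nan` before grouping, so the column names no longer depend on column order. Note that `groupby` drops rows whose key is missing by default, which matters for columns such as `Cabin`.

```python
import pandas as pd

def survival_counts(df, key):
    """Count unknown / died / survived rows per value of `key` (hypothetical helper)."""
    # Turn 0.0 / 1.0 / NaN into the labels used in this article.
    status = df["Survived"].map({0.0: "d", 1.0: "s"}).fillna("nan")
    # Group the labels by the chosen column, count them, and pivot to columns.
    table = status.groupby(df[key]).value_counts().unstack(fill_value=0)
    # Make sure all three columns exist, in a fixed order.
    table = table.reindex(columns=["nan", "d", "s"], fill_value=0)
    table["count"] = table.sum(axis=1)
    return table.sort_values("count", ascending=False)

# Hypothetical toy data with an Embarked column.
toy = pd.DataFrame({
    "Embarked": ["S", "S", "C", "S", "Q", "C"],
    "Survived": [0.0, 1.0, 1.0, float("nan"), 0.0, float("nan")],
})
result = survival_counts(toy, "Embarked")
print(result)
```

The result can then be plotted with `result[["nan", "d", "s"]].plot.bar(stacked=True)` in the same way as before.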

