I've started playing with Kaggle's Titanic competition, but before imputing missing values or tuning hyperparameters, I'd like to take a closer look at the data. I wanted to quickly group the loaded data by the value of `Survived` and draw a graph, but it didn't go well, because I didn't understand Pandas's `GroupBy`.
There are many graph-drawing examples by others on the net, but I wrote this article thinking that describing the path of my own understanding might be useful for beginners.
The goal is to draw a graph like the one below.
In this graph, the horizontal axis is the ticket symbol (`Ticket`), and the vertical axis stacks the number of people who survived (`s`), died (`d`), or whose outcome is unknown (`nan`), sorted in descending order by the total number of people. For example, the ticket `CA. 2343` on the far left has 11 people in total: 4 are unknown and the remaining 7 died.
I want to draw such a graph quickly.
Read the data and count how many rows share each symbol in `Ticket`.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

train_data = pd.read_csv("../train.csv")
test_data = pd.read_csv("../test.csv")
total_data = pd.concat([train_data, test_data])  # concatenate train_data and test_data
ticket_freq = total_data["Ticket"].value_counts()
ticket_freq
#output
CA. 2343 11
CA 2144 8
1601 8
S.O.C. 14879 7
3101295 7
..
350404 1
248706 1
367655 1
W./C. 14260 1
350047 1
Name: Ticket, Length: 929, dtype: int64
```
Ticket `CA. 2343` is shared by 11 people, `CA 2144` by 8, and so on.
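As a small illustration of what `value_counts()` returns, here is a sketch on made-up toy data (the ticket names `T1` to `T3` are placeholders, not rows from the Titanic files). The result is a Series indexed by ticket and sorted by frequency, so it can be filtered like any other Series.

```python
import pandas as pd

# Toy data: placeholder ticket symbols, not taken from train.csv or test.csv
tickets = pd.Series(["T1", "T1", "T2", "T1", "T2", "T3"], name="Ticket")

# Count how many rows share each ticket symbol (most frequent first)
ticket_freq = tickets.value_counts()
print(ticket_freq)

# Keep only the tickets that appear more than once
shared = ticket_freq[ticket_freq > 1]
print(shared.index.tolist())
```

Filtering on the counts like this is one way to focus on tickets shared by several passengers.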
# Create data for graphs
## Group by groupby
First, group `total_data` by ticket symbol.
```python
total_data_ticket = total_data.groupby("Ticket")
total_data_ticket
#output
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001F5A14327C8>
```
The disadvantage of `groupby` is that it doesn't show the contents of the data. For now, just keep in mind that the data has been ***grouped*** and move on.
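Although the GroupBy object does not print its contents, there are a few ways to peek inside it. The sketch below uses made-up toy data (placeholder tickets `T1` to `T3`) and the standard pandas calls `size()`, `get_group()`, `ngroups`, and plain iteration.

```python
import pandas as pd

# Toy data: placeholder values, not rows from the Titanic files
df = pd.DataFrame({
    "Ticket": ["T1", "T1", "T2", "T3", "T3", "T3"],
    "Survived": [1.0, 0.0, 0.0, 1.0, 1.0, 0.0],
})
grouped = df.groupby("Ticket")

sizes = grouped.size()        # number of rows in each group
t3 = grouped.get_group("T3")  # the rows of a single group as a DataFrame
n_groups = grouped.ngroups    # how many distinct tickets there are

# Iterating yields (key, sub-DataFrame) pairs
for ticket, rows in grouped:
    print(ticket, rows["Survived"].tolist())
```

Checking one or two groups this way can make it easier to trust what the later aggregation steps are doing.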
## Extract only survival information
Next, select only the survival information (`Survived`).
```python
total_data_ticket = total_data.groupby("Ticket")["Survived"]
total_data_ticket
#output
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001F5A1437B48>
```
The data is not displayed here either.
Then use `value_counts()` to count the rows for each value of `Survived` within each ticket. Setting `dropna=False` counts N/A as well.
```python
total_data_ticket = total_data.groupby("Ticket")["Survived"].value_counts(dropna=False)
total_data_ticket
#output
Ticket Survived
110152 1.0 3
110413 1.0 2
0.0 1
110465 0.0 2
110469 NaN 1
..
W.E.P. 5734 NaN 1
0.0 1
W/C 14208 0.0 1
WE/P 5735 0.0 1
1.0 1
Name: Survived, Length: 1093, dtype: int64
```
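To see what `dropna=False` changes, here is a minimal sketch on made-up toy data (placeholder tickets, with `NaN` standing in for the rows that come from the test set). With the default setting, rows whose `Survived` is missing are left out of the counts.

```python
import numpy as np
import pandas as pd

# Toy data: NaN plays the role of test-set rows with no Survived value
df = pd.DataFrame({
    "Ticket": ["T1", "T1", "T1", "T2", "T2"],
    "Survived": [0.0, 0.0, np.nan, 1.0, np.nan],
})
by_ticket = df.groupby("Ticket")["Survived"]

without_nan = by_ticket.value_counts()           # default: NaN rows are ignored
with_nan = by_ticket.value_counts(dropna=False)  # NaN is counted as its own value

print(without_nan)
print(with_nan)
```

The counts with `dropna=False` add up to the number of rows in the frame, which is a handy sanity check.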
To draw the graph, the survived, died, and unknown counts need to become separate columns. Use `unstack()` for this.
```python
total_data_ticket = total_data.groupby("Ticket")["Survived"].value_counts(dropna=False).unstack()
total_data_ticket
#output
Survived NaN 0.0 1.0
Ticket
110152 NaN NaN 3.0
110413 NaN 1.0 2.0
110465 NaN 2.0 NaN
110469 1.0 NaN NaN
110489 1.0 NaN NaN
... ... ... ...
W./C. 6608 1.0 4.0 NaN
W./C. 6609 NaN 1.0 NaN
W.E.P. 5734 1.0 1.0 NaN
W/C 14208 NaN 1.0 NaN
WE/P 5735 NaN 1.0 1.0
929 rows × 3 columns
```
Looking at the output above, the values still contain `NaN` wherever a ticket has no rows for that outcome, so replace `NaN` with 0.
```python
total_data_ticket.fillna(0, inplace=True)
total_data_ticket
#output
Survived NaN 0.0 1.0
Ticket
110152 0.0 0.0 3.0
110413 0.0 1.0 2.0
110465 0.0 2.0 0.0
110469 1.0 0.0 0.0
110489 1.0 0.0 0.0
... ... ... ...
W./C. 6608 1.0 4.0 0.0
W./C. 6609 0.0 1.0 0.0
W.E.P. 5734 1.0 1.0 0.0
W/C 14208 0.0 1.0 0.0
WE/P 5735 0.0 1.0 1.0
929 rows × 3 columns
```
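As a possible shortcut, `unstack()` accepts a `fill_value` argument, so the missing combinations can be filled with 0 in the same step. The sketch below uses made-up toy data (placeholder tickets) and is an alternative to the `unstack()` plus `fillna(0)` pair, not what the code in this article does.

```python
import numpy as np
import pandas as pd

# Toy data: placeholder values, NaN marks an unknown outcome
df = pd.DataFrame({
    "Ticket": ["T1", "T1", "T1", "T2", "T2"],
    "Survived": [0.0, 0.0, np.nan, 1.0, np.nan],
})
long = df.groupby("Ticket")["Survived"].value_counts(dropna=False)

wide_nan = long.unstack()               # missing combinations become NaN
wide_zero = long.unstack(fill_value=0)  # missing combinations become 0

print(wide_nan)
print(wide_zero)
```

Because no `NaN` is introduced, the counts should also stay integers instead of being converted to floats.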
The column names are `NaN`, `0.0`, and `1.0`, which is awkward, so rename the columns.
```python
# Positional rename: assumes the column order NaN, 0.0, 1.0 shown in the output above
total_data_ticket.columns = ["nan", "d", "s"]
total_data_ticket
#output
nan d s
Ticket
110152 0.0 0.0 3.0
110413 0.0 1.0 2.0
110465 0.0 2.0 0.0
110469 1.0 0.0 0.0
110489 1.0 0.0 0.0
... ... ... ...
W./C. 6608 1.0 4.0 0.0
W./C. 6609 0.0 1.0 0.0
W.E.P. 5734 1.0 1.0 0.0
W/C 14208 0.0 1.0 0.0
WE/P 5735 0.0 1.0 1.0
929 rows × 3 columns
```
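Assigning a list to `columns` renames by position, so it silently mislabels the data if the column order ever differs from NaN, 0.0, 1.0. One way to avoid depending on the order, sketched here on made-up toy data, is to turn `Survived` into text labels before grouping. This is an alternative approach under that assumption, not the method used in the article.

```python
import numpy as np
import pandas as pd

# Toy data: placeholder tickets, NaN marks an unknown outcome
df = pd.DataFrame({
    "Ticket": ["A", "A", "A", "B", "B", "C"],
    "Survived": [1.0, 0.0, np.nan, 0.0, 0.0, np.nan],
})

# Map 0/1 to the labels used in this article; unmapped NaN becomes "nan"
labels = df["Survived"].map({0.0: "d", 1.0: "s"}).fillna("nan")

counts = (
    labels.groupby(df["Ticket"])
    .value_counts()
    .unstack(fill_value=0)
    # Fix the column order explicitly; a label that never occurs gets a 0 column
    .reindex(columns=["nan", "d", "s"], fill_value=0)
)
print(counts)
```

Since the columns are already named `nan`, `d`, and `s`, no rename step is needed afterwards.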
I want to sort by the total number of people in descending order, so I compute the total and store it in a new column. `sum()` does the totalling; because the sum has to run across the columns of each row, use `sum(axis=1)`.
```python
total_data_ticket["count"] = total_data_ticket.sum(axis=1)
total_data_ticket
#output
nan d s count
Ticket
110152 0.0 0.0 3.0 3.0
110413 0.0 1.0 2.0 3.0
110465 0.0 2.0 0.0 2.0
110469 1.0 0.0 0.0 1.0
110489 1.0 0.0 0.0 1.0
... ... ... ... ...
W./C. 6608 1.0 4.0 0.0 5.0
W./C. 6609 0.0 1.0 0.0 1.0
W.E.P. 5734 1.0 1.0 0.0 2.0
W/C 14208 0.0 1.0 0.0 1.0
WE/P 5735 0.0 1.0 1.0 2.0
929 rows × 4 columns
```
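The difference between the two axes is easy to mix up, so here is a small sketch on made-up toy counts. It also shows one thing to watch for: once a `count` column exists, calling `sum(axis=1)` on the whole frame again would include that column, so selecting the three outcome columns explicitly keeps the step safe to re-run.

```python
import pandas as pd

# Toy counts: placeholder tickets and numbers
counts = pd.DataFrame(
    {"nan": [0, 1], "d": [2, 0], "s": [1, 0]},
    index=["T1", "T2"],
)

per_ticket = counts.sum(axis=1)   # across columns: one total per row (ticket)
per_outcome = counts.sum(axis=0)  # down rows: one total per column (outcome)

# Selecting the outcome columns makes the assignment repeatable
counts["count"] = counts[["nan", "d", "s"]].sum(axis=1)
counts["count"] = counts[["nan", "d", "s"]].sum(axis=1)  # same result the second time
print(counts)
```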
Now you are ready to draw the graph.
The code is shown first and explained in order.
```python
(
    total_data_ticket[total_data_ticket["count"] > 3]
    .sort_values("count", ascending=False)[["nan", "d", "s"]]
    .plot.bar(figsize=(15, 10), stacked=True)
)
```
| Code | Description |
|---|---|
| `total_data_ticket[total_data_ticket["count"] > 3]` | Keep only the rows whose `count` is greater than 3 |
| `.sort_values("count", ascending=False)` | Sort by `count` in descending order |
| `[["nan", "d", "s"]]` | Select only the three columns on the left (`count` is not needed in the plot) |
| `.plot.bar(figsize=(15,10),stacked=True)` | Draw a bar graph with the given size, stacking the bars |
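The same chain can be written as separate steps, which makes each intermediate result easy to inspect. The sketch below uses made-up toy counts and a smaller figure size; the `Agg` backend line is an assumption so that the example also runs where no display is available.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no window needed
import pandas as pd

# Toy counts: placeholder tickets and numbers
counts = pd.DataFrame(
    {"nan": [1, 0, 2, 0], "d": [3, 1, 3, 0], "s": [1, 1, 1, 2]},
    index=pd.Index(["T1", "T2", "T3", "T4"], name="Ticket"),
)
counts["count"] = counts[["nan", "d", "s"]].sum(axis=1)  # T1=5, T2=2, T3=6, T4=2

big = counts[counts["count"] > 3]                    # 1. keep tickets with more than 3 people
ordered = big.sort_values("count", ascending=False)  # 2. largest group first
stacked = ordered[["nan", "d", "s"]]                 # 3. drop the helper column
ax = stacked.plot.bar(figsize=(6, 4), stacked=True)  # 4. stacked bar chart
ax.set_ylabel("number of people")
```

Keeping the returned `ax` around is convenient for adding labels or saving the figure later.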
Now you can draw the graph shown at the beginning.
Looking at this, one can imagine that the people with tickets `CA. 2343` and `CA 2144` have `Survived = 0` ...
Finally, the whole code is shown.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

train_data = pd.read_csv("../train.csv")
test_data = pd.read_csv("../test.csv")
total_data = pd.concat([train_data, test_data])
ticket_freq = total_data["Ticket"].value_counts()
ticket_freq

total_data_ticket = total_data.groupby("Ticket")["Survived"].value_counts(dropna=False).unstack()
total_data_ticket.fillna(0, inplace=True)
# Positional rename: assumes the column order NaN, 0.0, 1.0
total_data_ticket.columns = ["nan", "d", "s"]
total_data_ticket["count"] = total_data_ticket.sum(axis=1)
(
    total_data_ticket[total_data_ticket["count"] > 3]
    .sort_values("count", ascending=False)[["nan", "d", "s"]]
    .plot.bar(figsize=(15, 10), stacked=True)
)
```
Using this technique, I will also check other non-numeric data such as `Embarked`, `Cabin`, and the surnames and titles in `Name`.
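To reuse the steps on other columns, they could be wrapped in a small function. The helper name `survival_counts_by`, the toy rows, the `"missing"` placeholder, and the title-extracting regular expression below are all assumptions made for illustration; the function labels `Survived` before grouping so that it does not depend on column order.

```python
import numpy as np
import pandas as pd

def survival_counts_by(df, key):
    """Count survived/died/unknown rows per value of `key` (hypothetical helper)."""
    labels = df["Survived"].map({0.0: "d", 1.0: "s"}).fillna("nan")
    counts = (
        labels.groupby(key)
        .value_counts()
        .unstack(fill_value=0)
        .reindex(columns=["nan", "d", "s"], fill_value=0)
    )
    counts["count"] = counts[["nan", "d", "s"]].sum(axis=1)
    return counts.sort_values("count", ascending=False)

# Toy data: made-up names and values, not rows from the Titanic files
toy = pd.DataFrame({
    "Name": ["Doe, Mr. John", "Doe, Mrs. Jane", "Roe, Miss. Ann", "Poe, Mr. Sam"],
    "Embarked": ["S", "S", "C", np.nan],
    "Survived": [0.0, 1.0, np.nan, 0.0],
})

# Group by port, treating a missing port as its own category
by_port = survival_counts_by(toy, toy["Embarked"].fillna("missing"))

# Group by title, assuming names look like "Surname, Title. Given names"
title = toy["Name"].str.extract(r",\s*([^.]+)\.", expand=False).rename("Title")
by_title = survival_counts_by(toy, title)

print(by_port)
print(by_title)
```

The returned frame has the same `nan`, `d`, `s`, and `count` columns as above, so the filtering and `plot.bar(stacked=True)` steps should apply unchanged.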
- How to use Pandas groupby
- Create a graph with the pandas plot method and visualize the data
- Data overview with Pandas