14th
Handling NaN in a DataFrame
import numpy as np
import pandas as pd
df = pd.read_csv('train.csv')
df.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Drop all records containing NaN. (The index stays as it was: it is not reassigned unless you call .reset_index().)
df.dropna().head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
| 11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S |
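As noted above, .dropna() keeps the original index. If you want consecutive row numbers after dropping, you can chain .reset_index(drop=True); a minimal sketch:

```python
# dropna() keeps the original index; reset_index(drop=True) renumbers
# the rows from 0 (drop=True discards the old index instead of keeping
# it as a new column)
df_clean = df.dropna().reset_index(drop=True)
df_clean.head()
```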
You can drop columns containing NaN by passing axis=1 as an argument (the default, axis=0, drops rows).
This is rarely used in practice. When building a model, it corresponds to the strategy of reducing the explanatory variables rather than the number of records, but you would almost never drop an explanatory variable just because it contains a single NaN. Which explanatory variables matter for model building is an important question that deserves careful consideration.
df.dropna(axis=1) .head()
|   | PassengerId | Survived | Pclass | Name | Sex | SibSp | Parch | Ticket | Fare |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 1 | 0 | A/5 21171 | 7.2500 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 1 | 0 | PC 17599 | 71.2833 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 0 | 0 | STON/O2. 3101282 | 7.9250 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 1 | 0 | 113803 | 53.1000 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 0 | 0 | 373450 | 8.0500 |
By passing a list of column names to the subset argument, only rows containing NaN in those columns are dropped.
Often you only need to drop the NaN rows of one particular column, so this is very convenient to remember.
Of course, .dropna() does not overwrite the original df. If you want to update the original df, reassign it with df = df.dropna() or use the familiar inplace=True.
df.dropna(subset=['Age']).head() # rows whose Age is NaN (e.g. index 888) are dropped
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
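As a quick sketch, here are the two equivalent ways to make the drop stick (shown with a new name so df itself stays intact):

```python
# Either reassign the result (to a new name here, keeping df intact)...
df_age_known = df.dropna(subset=['Age'])

# ...or mutate the DataFrame in place:
# df.dropna(subset=['Age'], inplace=True)
```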
.fillna(value)
Assigns a specified value to every NaN.
df.fillna('THIS IS IT').head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | THIS IS IT | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | THIS IS IT | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | THIS IS IT | S |
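In practice you rarely want the same filler for every column. .fillna() also accepts a dict mapping column names to fill values; a small sketch (the choice of fillers here is just for illustration):

```python
# Fill each column with its own value: the mean age for Age,
# a placeholder string for Cabin
df_filled = df.fillna({'Age': df['Age'].mean(), 'Cabin': 'Unknown'})
df_filled.head()
```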
Fill the NaN in a particular column with that column's mean
df['Age'].mean()
29.69911764705882
df['Age'].fillna(df['Age'].mean()).head()
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: Age, dtype: float64
df['Age'] = df['Age'].fillna(df['Age'].mean())
df.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
pd.isna(df).head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | True | False |
| 1 | False | False | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | True | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | True | False |
Store the result of the NaN check on Cabin in a new Cabin_nan column
df['Cabin_nan'] = pd.isna(df['Cabin'])
df
15th
Grouping with the .groupby() function
df = pd.read_csv('train.csv')
df.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Group by Pclass
df.groupby('Pclass') alone does not say what to do with the groups, so chain an aggregation such as .mean() or .count() after it.
df.groupby('Pclass').mean()
| Pclass | PassengerId | Survived | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| 1 | 461.597222 | 0.629630 | 38.233441 | 0.416667 | 0.356481 | 84.154687 |
| 2 | 445.956522 | 0.472826 | 29.877630 | 0.402174 | 0.380435 | 20.662183 |
| 3 | 439.154786 | 0.242363 | 25.140620 | 0.615071 | 0.393075 | 13.675550 |
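If you only need a few statistics of one column, .agg() computes several aggregations in one call; a sketch:

```python
# count, mean and max of Fare for each Pclass
df.groupby('Pclass')['Fare'].agg(['count', 'mean', 'max'])
```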
df = df[df['Pclass']==1]
df.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S |
| 23 | 24 | 1 | 1 | Sloper, Mr. William Thompson | male | 28.0 | 0 | 0 | 113788 | 35.5000 | A6 | S |
Compute the statistics of each column for 'Pclass' == 1
df[df['Pclass']==1].describe()
|   | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 216.000000 | 216.000000 | 216.0 | 186.000000 | 216.000000 | 216.000000 | 216.000000 |
| mean | 461.597222 | 0.629630 | 1.0 | 38.233441 | 0.416667 | 0.356481 | 84.154687 |
| std | 246.737616 | 0.484026 | 0.0 | 14.802856 | 0.611898 | 0.693997 | 78.380373 |
| min | 2.000000 | 0.000000 | 1.0 | 0.920000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 270.750000 | 0.000000 | 1.0 | 27.000000 | 0.000000 | 0.000000 | 30.923950 |
| 50% | 472.000000 | 1.000000 | 1.0 | 37.000000 | 0.000000 | 0.000000 | 60.287500 |
| 75% | 670.500000 | 1.000000 | 1.0 | 49.000000 | 1.000000 | 0.000000 | 93.500000 |
| max | 890.000000 | 1.000000 | 1.0 | 80.000000 | 3.000000 | 4.000000 | 512.329200 |
Extract only the mean
Pulling the statistics out one group at a time like this is tedious, so it is a good idea to use .groupby() to see them all at once.
df[df['Pclass']==1].describe().loc['mean']
PassengerId 461.597222
Survived 0.629630
Pclass 1.000000
Age 38.233441
SibSp 0.416667
Parch 0.356481
Fare 84.154687
Name: mean, dtype: float64
After groupby, the values of the column passed as the first argument (by) become the index; in the example above these are the Pclass values (1, 2, 3). The result of a groupby aggregation is of course also a DataFrame, so you can fetch a specific group as a Series with .loc[].
df.groupby('Pclass').mean().loc[1]
PassengerId 461.597222
Survived 0.629630
Age 38.233441
SibSp 0.416667
Parch 0.356481
Fare 84.154687
Name: 1, dtype: float64
df.groupby('Pclass').count().loc[1] # count(), sum(), etc. also work
PassengerId 216
Survived 216
Name 216
Sex 216
Age 186
SibSp 216
Parch 216
Ticket 216
Fare 216
Cabin 176
Embarked 214
Name: 1, dtype: int64
df.groupby('Pclass').describe()
(The output has a two-level column index: eight statistics for each of the six numeric columns. It is shown here in long form, one row per Pclass and column.)

| Pclass | Column | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| 1 | PassengerId | 216.0 | 461.597222 | 246.737616 | 2.0 | 270.75 | 472.0 | 670.5 | 890.0 |
| 1 | Survived | 216.0 | 0.629630 | 0.484026 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 1 | Age | 186.0 | 38.233441 | 14.802856 | 0.92 | 27.0 | 37.0 | 49.0 | 80.0 |
| 1 | SibSp | 216.0 | 0.416667 | 0.611898 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 |
| 1 | Parch | 216.0 | 0.356481 | 0.693997 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 |
| 1 | Fare | 216.0 | 84.154687 | 78.380373 | 0.0 | 30.92395 | 60.2875 | 93.5 | 512.3292 |
| 2 | PassengerId | 184.0 | 445.956522 | 250.852161 | 10.0 | 234.50 | 435.5 | 668.0 | 887.0 |
| 2 | Survived | 184.0 | 0.472826 | 0.500623 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 2 | Age | 173.0 | 29.877630 | 14.001077 | 0.67 | 23.0 | 29.0 | 36.0 | 70.0 |
| 2 | SibSp | 184.0 | 0.402174 | 0.601633 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 |
| 2 | Parch | 184.0 | 0.380435 | 0.690963 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 |
| 2 | Fare | 184.0 | 20.662183 | 13.417399 | 0.0 | 13.00000 | 14.2500 | 26.0 | 73.5000 |
| 3 | PassengerId | 491.0 | 439.154786 | 264.441453 | 1.0 | 200.00 | 432.0 | 666.5 | 891.0 |
| 3 | Survived | 491.0 | 0.242363 | 0.428949 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | Age | 355.0 | 25.140620 | 12.495398 | 0.42 | 18.0 | 24.0 | 32.0 | 74.0 |
| 3 | SibSp | 491.0 | 0.615071 | 1.374883 | 0.0 | 0.0 | 0.0 | 1.0 | 8.0 |
| 3 | Parch | 491.0 | 0.393075 | 0.888861 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 |
| 3 | Fare | 491.0 | 13.675550 | 11.778142 | 0.0 | 7.75000 | 8.0500 | 15.5 | 69.5500 |
df.groupby('Pclass').describe()['Age'] # extract only the Age statistics
| Pclass | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 1 | 186.0 | 38.233441 | 14.802856 | 0.92 | 27.0 | 37.0 | 49.0 | 80.0 |
| 2 | 173.0 | 29.877630 | 14.001077 | 0.67 | 23.0 | 29.0 | 36.0 | 70.0 |
| 3 | 355.0 | 25.140620 | 12.495398 | 0.42 | 18.0 | 24.0 | 32.0 | 74.0 |
In Jupyter, if a DataFrame has too many columns or rows to display, the output is truncated.
If you want to display all columns (or all rows), run the following to prevent the truncation.
# Display all columns
pd.set_option('display.max_columns', None)
# Display all rows
pd.set_option('display.max_rows', None)
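To return to the default display limits afterwards, pandas provides pd.reset_option:

```python
# Restore the default truncation behavior
pd.reset_option('display.max_columns')
pd.reset_option('display.max_rows')
```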
Loop over the result of groupby with a for statement
for i, group_df in df.groupby('Pclass'):
    print("{}: group_df's type is {} and has {}".format(i, type(group_df), len(group_df)))
1: group_df's type is <class 'pandas.core.frame.DataFrame'> and has 216
2: group_df's type is <class 'pandas.core.frame.DataFrame'> and has 184
3: group_df's type is <class 'pandas.core.frame.DataFrame'> and has 491
Rank the records within each Pclass group by Fare (0 = the lowest Fare in the group)
df = pd.read_csv('train.csv')
results = []
for i, group_df in df.groupby('Pclass'):
    sorted_group_df = group_df.sort_values('Fare')
    sorted_group_df['RankInClass'] = np.arange(len(sorted_group_df))
    results.append(sorted_group_df)
results_df = pd.concat(results)
results_df.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | RankInClass |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 633 | 634 | 0 | 1 | Parr, Mr. William Henry Marsh | male | NaN | 0 | 0 | 112052 | 0.0 | NaN | S | 0 |
| 822 | 823 | 0 | 1 | Reuchlin, Jonkheer. John George | male | 38.0 | 0 | 0 | 19972 | 0.0 | NaN | S | 1 |
| 815 | 816 | 0 | 1 | Fry, Mr. Richard | male | NaN | 0 | 0 | 112058 | 0.0 | B102 | S | 2 |
| 806 | 807 | 0 | 1 | Andrews, Mr. Thomas Jr | male | 39.0 | 0 | 0 | 112050 | 0.0 | A36 | S | 3 |
| 263 | 264 | 0 | 1 | Harrison, Mr. William | male | 40.0 | 0 | 0 | 112059 | 0.0 | B94 | S | 4 |
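The same ranking can also be computed without an explicit loop. A sketch using groupby().rank(), where method='first' breaks ties by order of appearance, subtracting 1 so the numbering starts at 0 like the loop above:

```python
# Vectorized alternative: rank Fare within each Pclass group
df['RankInClass'] = df.groupby('Pclass')['Fare'].rank(method='first').astype(int) - 1
```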
Joining tables
There are two main types of table joins:
Joining on a specific column or index as a key
Simply concatenating DataFrames horizontally (or vertically)
import pandas as pd
df1 = pd.DataFrame({'Key':['k0','k','k2'],
'A':['a0','a1','a2'],
'B':['b0','b1','b2']})
df2 = pd.DataFrame({'Key':['k0','k1','k2'],
'C':['c0','c2','c3'],
'D':['d0','d1','d2']})
df1
|   | Key | A | B |
|---|---|---|---|
| 0 | k0 | a0 | b0 |
| 1 | k | a1 | b1 |
| 2 | k2 | a2 | b2 |
df2
|   | Key | C | D |
|---|---|---|---|
| 0 | k0 | c0 | d0 |
| 1 | k1 | c2 | d1 |
| 2 | k2 | c3 | d2 |
Both have a column called 'Key', and its values largely match (note the deliberate mismatch: df1 has 'k' where df2 has 'k1'); the other columns hold different values. This 'Key' column is used as the key to join the two DataFrames horizontally. Use .merge() for merging.
df1.merge(df2)
|   | Key | A | B | C | D |
|---|---|---|---|---|---|
| 0 | k0 | a0 | b0 | c0 | d0 |
| 1 | k2 | a2 | b2 | c3 | d2 |
Simply concatenate DataFrames vertically (or horizontally)
Use pd.concat() (concat = concatenate)
#Vertical (often used)
pd.concat([df1,df2], axis=0) #The default is axis=0
/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
|   | A | B | C | D | Key |
|---|---|---|---|---|---|
| 0 | a0 | b0 | NaN | NaN | k0 |
| 1 | a1 | b1 | NaN | NaN | k |
| 2 | a2 | b2 | NaN | NaN | k2 |
| 0 | NaN | NaN | c0 | d0 | k0 |
| 1 | NaN | NaN | c2 | d1 | k1 |
| 2 | NaN | NaN | c3 | d2 | k2 |
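As the warning suggests, passing sort=False keeps the current column order and silences it; ignore_index=True additionally renumbers the duplicated 0, 1, 2 index. A sketch:

```python
# Stack df1 on top of df2 without sorting columns, renumbering rows 0..5
pd.concat([df1, df2], axis=0, sort=False, ignore_index=True)
```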
# Horizontal
pd.concat([df1,df2], axis=1)
|   | Key | A | B | Key | C | D |
|---|---|---|---|---|---|---|
| 0 | k0 | a0 | b0 | k0 | c0 | d0 |
| 1 | k | a1 | b1 | k1 | c2 | d1 |
| 2 | k2 | a2 | b2 | k2 | c3 | d2 |
16th
How to use .merge()
- how: how to join → {'left', 'right', 'outer', 'inner'}; the default is 'inner'
- on: the column to use as the key (a column that exists in both DataFrames); if not specified, the join is made on the common column(s)
- left_on: the key column of the left DataFrame
- right_on: the key column of the right DataFrame
- left_index: set to True to use the left DataFrame's index as its key
- right_index: set to True to use the right DataFrame's index as its key
how
df1
|   | Key | A | B |
|---|---|---|---|
| 0 | k0 | a0 | b0 |
| 1 | k | a1 | b1 |
| 2 | k2 | a2 | b2 |
df2
|   | Key | C | D |
|---|---|---|---|
| 0 | k0 | c0 | d0 |
| 1 | k1 | c2 | d1 |
| 2 | k2 | c3 | d2 |
df1.merge(df2, how='left')
|   | Key | A | B | C | D |
|---|---|---|---|---|---|
| 0 | k0 | a0 | b0 | c0 | d0 |
| 1 | k | a1 | b1 | NaN | NaN |
| 2 | k2 | a2 | b2 | c3 | d2 |
df1.merge(df2, how='outer')
|   | Key | A | B | C | D |
|---|---|---|---|---|---|
| 0 | k0 | a0 | b0 | c0 | d0 |
| 1 | k | a1 | b1 | NaN | NaN |
| 2 | k2 | a2 | b2 | c3 | d2 |
| 3 | k1 | NaN | NaN | c2 | d1 |
df1.merge(df2, how='inner')
|   | Key | A | B | C | D |
|---|---|---|---|---|---|
| 0 | k0 | a0 | b0 | c0 | d0 |
| 1 | k2 | a2 | b2 | c3 | d2 |
on
The on argument specifies which column to use as the key when joining. You can only specify columns that exist in both the left and right tables.
If there is a common column it will be used as the key even without specifying anything, but make it a habit to specify on explicitly: it is easier to understand and safer. "The tables got joined on an unexpected column" is a common complaint. If there are multiple common columns you cannot tell which will be used, and even with a single common column it is better to be explicit. (The examples so far came before this explanation of on, so they intentionally omitted the on argument.)
To use the index as the key, set right_index and left_index (described below) to True. If the key column names differ between the two tables, specify left_on and right_on (described below).
df1 = pd.DataFrame({'Key':['k0','k1','k2'],
'ID':['aa','bb','cc'],
'A':['a0','a1','a2'],
'B':['b0','b1','b2']})
df2 = pd.DataFrame({'Key':['k0','k1','k3'],
'ID':['aa','bb','cc'],
'C':['c0','c1','c3'],
'D':['d0','d1','d3']})
df1.merge(df2, on='Key')
|   | Key | ID_x | A | B | ID_y | C | D |
|---|---|---|---|---|---|---|---|
| 0 | k0 | aa | a0 | b0 | aa | c0 | d0 |
| 1 | k1 | bb | a1 | b1 | bb | c1 | d1 |
df1.merge(df2, on='ID')
|   | Key_x | ID | A | B | Key_y | C | D |
|---|---|---|---|---|---|---|---|
| 0 | k0 | aa | a0 | b0 | k0 | c0 | d0 |
| 1 | k1 | bb | a1 | b1 | k1 | c1 | d1 |
| 2 | k2 | cc | a2 | b2 | k3 | c3 | d3 |
Changing the suffixes
df1.merge(df2, on='ID', suffixes=('_left', '_right'))
|   | Key_left | ID | A | B | Key_right | C | D |
|---|---|---|---|---|---|---|---|
| 0 | k0 | aa | a0 | b0 | k0 | c0 | d0 |
| 1 | k1 | bb | a1 | b1 | k1 | c1 | d1 |
| 2 | k2 | cc | a2 | b2 | k3 | c3 | d3 |
left_on, right_on
Specify these arguments when the key column name differs between the left and right DataFrames.
df1 = pd.DataFrame({'Key1':['k0','k1','k2'],
'A':['a0','a1','a2'],
'B':['b0','b1','b2']})
df2 = pd.DataFrame({'Key2':['k0','k1','k3'],
'C':['c0','c1','c3'],
'D':['d0','d1','d3']})
df1.merge(df2, left_on='Key1', right_on='Key2')
|   | Key1 | A | B | Key2 | C | D |
|---|---|---|---|---|---|---|
| 0 | k0 | a0 | b0 | k0 | c0 | d0 |
| 1 | k1 | a1 | b1 | k1 | c1 | d1 |
left_index, right_index
To use the index as the key instead of a column, set left_index and right_index to True.
df1.merge(df2, left_index=True, right_index=True)
|   | Key1 | A | B | Key2 | C | D |
|---|---|---|---|---|---|---|
| 0 | k0 | a0 | b0 | k0 | c0 | d0 |
| 1 | k1 | a1 | b1 | k1 | c1 | d1 |
| 2 | k2 | a2 | b2 | k3 | c3 | d3 |
join
The join function joins on the index, but you hardly need to remember it because you can do almost the same thing with merge. (One convenience: .join() accepts a list of DataFrames, as in the last example below.)
df1 = pd.DataFrame({'Key1':['k0','k1','k2'],
'A':['a0','a1','a2'],
'B':['b0','b1','b2']})
df2 = pd.DataFrame({'Key2':['k0','k1','k3'],
'C':['c0','c1','c3'],
'D':['d0','d1','d3']})
df1.join(df2)
|   | Key1 | A | B | Key2 | C | D |
|---|---|---|---|---|---|---|
| 0 | k0 | a0 | b0 | k0 | c0 | d0 |
| 1 | k1 | a1 | b1 | k1 | c1 | d1 |
| 2 | k2 | a2 | b2 | k3 | c3 | d3 |
df1.merge(df2, left_index=True, right_index=True)
|   | Key1 | A | B | Key2 | C | D |
|---|---|---|---|---|---|---|
| 0 | k0 | a0 | b0 | k0 | c0 | d0 |
| 1 | k1 | a1 | b1 | k1 | c1 | d1 |
| 2 | k2 | a2 | b2 | k3 | c3 | d3 |
df1 = pd.DataFrame({'Key1':['k0','k1','k2'],
'A':['a0','a1','a2'],
'B':['b0','b1','b2']})
df2 = pd.DataFrame({'Key2':['k0','k1','k3'],
'C':['c0','c1','c3'],
'D':['d0','d1','d3']})
df3 = pd.DataFrame({'Key3':['k0','k1','k4'],
'E':['c0','c1','c3'],
'F':['d0','d1','d3']})
df1.join([df2, df3])
|   | Key1 | A | B | Key2 | C | D | Key3 | E | F |
|---|---|---|---|---|---|---|---|---|---|
| 0 | k0 | a0 | b0 | k0 | c0 | d0 | k0 | c0 | d0 |
| 1 | k1 | a1 | b1 | k1 | c1 | d1 | k1 | c1 | d1 |
| 2 | k2 | a2 | b2 | k3 | c3 | d3 | k4 | c3 | d3 |
17th
.unique() .nunique()
import pandas as pd
df = pd.read_csv('train.csv')
df.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
df['Pclass'].unique()
array([3, 1, 2])
df['Pclass'].nunique()
3
.value_counts()
df['Pclass'].value_counts()
3 491
1 216
2 184
Name: Pclass, dtype: int64
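To see proportions instead of raw counts, value_counts() takes a normalize flag; a sketch:

```python
# Relative frequency of each Pclass (the values sum to 1.0)
df['Pclass'].value_counts(normalize=True)
```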
(Super important) .apply()
With the .apply() function you can process every record in a DataFrame and store the result in a separate column. Think of it as applying a function to each row.
def get_age_group(age):
    return str(age)[0] + '0s'
get_age_group(45)
'40s'
df = pd.DataFrame({'name':['John','Mike','Emily'],
'age':['23','36','42']})
df
|   | name | age |
|---|---|---|
| 0 | John | 23 |
| 1 | Mike | 36 |
| 2 | Emily | 42 |
df['age'].apply(get_age_group)
0 20s
1 30s
2 40s
Name: age, dtype: object
How to use .apply() with a lambda function
# Assign the lambda function to the variable f
f = lambda x: str(x)[0] + '0s'
# Try it with 43
f(43)
'40s'
df['age_group'] = df['age'].apply(lambda x: str(x)[0] + '0s')
df
|   | name | age | age_group |
|---|---|---|---|
| 0 | John | 23 | 20s |
| 1 | Mike | 36 | 30s |
| 2 | Emily | 42 | 40s |
How to use .apply() on entire records (rows)
df = pd.DataFrame({'name':['John','Mike','Emily'],
'age':['23','36','42']})
df['description'] = df.apply(lambda row:'{} is {} years old'.format(row['name'], row['age']), axis=1)
df
|   | name | age | description |
|---|---|---|---|
| 0 | John | 23 | John is 23 years old |
| 1 | Mike | 36 | Mike is 36 years old |
| 2 | Emily | 42 | Emily is 42 years old |
18th
Save a DataFrame in csv format with .to_csv()
df = pd.read_csv('train.csv')
df.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
df['Adult'] = df['Age'].apply(lambda x: x>20)
df.tail()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Adult |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.00 | NaN | S | True |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.00 | B42 | S | False |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S | False |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.00 | C148 | C | True |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q | True |
df.to_csv('train_w_adult.csv')
df = pd.read_csv('train_w_adult.csv')
df.head(3)
|   | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Adult |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | True |
| 1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | True |
| 2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | True |
Looking at the result of df.head(3), the index from the previous save has been stored in a mysterious 'Unnamed: 0' column.
If you specify index=False in .to_csv(), the index is not saved. Basically, always pass index=False when saving in csv format.
Note that if a file with the same name already exists at the destination, it will be overwritten.
df = pd.read_csv('train.csv')
df['Adult'] = df['Age'].apply(lambda x: x>20)
df.to_csv('train_w_adult.csv', index=False)
df = pd.read_csv('train_w_adult.csv')
df.head(3)
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Adult |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | True |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | True |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | True |
df = pd.DataFrame({'A':[['a', 'b'], 2, 3], 'B':[['c', 'd'], 5, 6]})
df
|   | A | B |
|---|---|---|
| 0 | [a, b] | [c, d] |
| 1 | 2 | 5 |
| 2 | 3 | 6 |
#Make sure that the stored values are a list.
type(df['A'].iloc[0])
list
#Save as csv
df.to_csv('temp.csv', index=False)
#Load the saved csv
df = pd.read_csv('temp.csv')
df
|   | A | B |
|---|---|---|
| 0 | ['a', 'b'] | ['c', 'd'] |
| 1 | 2 | 5 |
| 2 | 3 | 6 |
type(df['A'].iloc[0])
str
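The list survived the save only as its string representation, which is what read_csv gives back. If you need the actual list again, one common workaround is the standard library's ast.literal_eval; a sketch:

```python
import ast

# Parse strings like "['a', 'b']" back into Python lists;
# the isinstance guard lets non-string cells pass through unchanged
df['A'] = df['A'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
type(df['A'].iloc[0])  # list again
```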
Iterate over a DataFrame with .iterrows()
Use this when looping over a DataFrame with a for statement. The name is hard to remember, but since it iterates over rows, think of it as iter + row + s (plural s, because it loops over many rows).
"Iteration" means repeating a process, i.e. a loop. A list, for example, can be iterated with for i in list: (see Part 4).
You cannot iterate a DataFrame directly with for i in df: the way you can with a list. Instead, write the following using the .iterrows() function.
df = pd.read_csv('train.csv')
for idx, row in df.iterrows():
    if row['Age'] > 40 and row['Pclass'] == 3 and row['Sex'] == 'male' and row['Survived'] == 1:
        print('{} is a very lucky guy...!'.format(row['Name']))
Dahl, Mr. Karl Edwart is a very lucky guy...!
Sundman, Mr. Johan Julian is a very lucky guy...!
.apply() is often used when you want to store the result of processing each record in another column; .iterrows() is often used when you just want to perform some per-row processing rather than return a value.
For example, when file paths are stored in a DataFrame, .iterrows() can be used to move or read each file.
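When speed matters, .itertuples() is generally faster than .iterrows() and yields namedtuples whose fields are the column names; a sketch of the same loop:

```python
# Same filter as above, with row fields accessed as attributes
for row in df.itertuples():
    if row.Age > 40 and row.Pclass == 3 and row.Sex == 'male' and row.Survived == 1:
        print('{} is a very lucky guy...!'.format(row.Name))
```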
Sort by a specific column with .sort_values()
Note that .sort_values() returns a sorted copy and does not modify df in place, so df.head() below still shows the original order; reassign (df = df.sort_values('Age')) if you want to keep the sorted result.
# Sort by age
df.sort_values('Age')
df.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Create a pivot table with .pivot_table()
data = {'Data':['Jan-1','Jan-1','Jan-1','Jan-2','Jan-2','Jan-2'],
'User':['Emily', 'John','Nick','Kevin','Emily','John'],
'Method':['Card','Card','Cash','Card','Cash','Cash'],
'Price':[100,250,200,460,200,130]}
df = pd.DataFrame(data)
df
|   | Data | User | Method | Price |
|---|---|---|---|---|
| 0 | Jan-1 | Emily | Card | 100 |
| 1 | Jan-1 | John | Card | 250 |
| 2 | Jan-1 | Nick | Cash | 200 |
| 3 | Jan-2 | Kevin | Card | 460 |
| 4 | Jan-2 | Emily | Cash | 200 |
| 5 | Jan-2 | John | Cash | 130 |
Pass the column you want to aggregate to values; here it is Price. The value specified in values fills each cell of the resulting table.
Then simply pass the columns you want as the index and columns, each as a list.
df.pivot_table(values='Price', index=['Data', 'User'], columns=['Method'])
| Data | User | Card | Cash |
|---|---|---|---|
| Jan-1 | Emily | 100.0 | NaN |
| Jan-1 | John | 250.0 | NaN |
| Jan-1 | Nick | NaN | 200.0 |
| Jan-2 | Emily | NaN | 200.0 |
| Jan-2 | John | NaN | 130.0 |
| Jan-2 | Kevin | 460.0 | NaN |
df.pivot_table(values='Price', index=['Data', 'Method'], columns=['User'])
| Data | Method | Emily | John | Kevin | Nick |
|---|---|---|---|---|---|
| Jan-1 | Card | 100.0 | 250.0 | NaN | NaN |
| Jan-1 | Cash | NaN | NaN | NaN | 200.0 |
| Jan-2 | Card | NaN | NaN | 460.0 | NaN |
| Jan-2 | Cash | 200.0 | 130.0 | NaN | NaN |
First, be clear about which column you want to aggregate and put it in values; then simply put the information you want to see in index and columns.
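By default .pivot_table() aggregates duplicate index/column combinations with the mean; pass aggfunc to change that. A sketch:

```python
# Total (rather than mean) Price per day and payment method
df.pivot_table(values='Price', index=['Data'], columns=['Method'], aggfunc='sum')
```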
Cross-section operation with .xs()
.xs() is an abbreviation of "cross section". It is not used that often either, but it comes in handy when working with DataFrames that have a MultiIndex, such as pivot tables, so it is worth remembering together with .pivot_table().
What .xs() does, for the pivot table above for example, is extract only the 'Card' rows (exactly a cross section).
pivot = df.pivot_table(values='Price', index=['Data', 'Method'], columns=['User'])
pivot.xs('Card', level='Method')