14th
Handling NaN in a DataFrame
import numpy as np
import pandas as pd
df = pd.read_csv('train.csv')
df.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Drop all records containing NaN. (The index stays as it was: it is not reassigned unless you call .reset_index().)
df.dropna().head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 10 | 11 | 1 | 3 | Sandstrom, Miss. Marguerite Rut | female | 4.0 | 1 | 1 | PP 9549 | 16.7000 | G6 | S |
| 11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S |
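As noted above, .dropna() keeps the original index. If you want consecutive row numbers after dropping, you can chain .reset_index(drop=True); a minimal sketch:

```python
# dropna() keeps the original index; reset_index(drop=True) renumbers
# the rows from 0 (drop=True discards the old index instead of keeping
# it as a new column)
df_clean = df.dropna().reset_index(drop=True)
df_clean.head()
```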
You can drop columns containing NaN by passing axis=1 as an argument (the default, axis=0, drops rows).
This is rarely used in practice. When building a model, it corresponds to the strategy of reducing the explanatory variables rather than the number of records, but you would almost never drop an explanatory variable just because it contains a single NaN. Which explanatory variables matter for model building is an important question that deserves careful consideration.
df.dropna(axis=1) .head()
|   | PassengerId | Survived | Pclass | Name | Sex | SibSp | Parch | Ticket | Fare |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 1 | 0 | A/5 21171 | 7.2500 |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 1 | 0 | PC 17599 | 71.2833 |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 0 | 0 | STON/O2. 3101282 | 7.9250 |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 1 | 0 | 113803 | 53.1000 |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 0 | 0 | 373450 | 8.0500 |
By passing a list of column names to the subset argument, only rows containing NaN in those columns are dropped.
Often you only need to drop the NaN rows of one particular column, so this is very convenient to remember.
Of course, .dropna() does not overwrite the original df. If you want to update the original df, reassign it with df = df.dropna() or use the familiar inplace=True.
df.dropna(subset=['Age']).head() # rows whose Age is NaN (e.g. index 888) are dropped
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
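As a quick sketch, here are the two equivalent ways to make the drop stick (shown with a new name so df itself stays intact):

```python
# Either reassign the result (to a new name here, keeping df intact)...
df_age_known = df.dropna(subset=['Age'])

# ...or mutate the DataFrame in place:
# df.dropna(subset=['Age'], inplace=True)
```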
.fillna(value)
Assigns a specified value to every NaN.
df.fillna('THIS IS IT').head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22 | 1 | 0 | A/5 21171 | 7.2500 | THIS IS IT | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26 | 0 | 0 | STON/O2. 3101282 | 7.9250 | THIS IS IT | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35 | 0 | 0 | 373450 | 8.0500 | THIS IS IT | S |
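In practice you rarely want the same filler for every column. .fillna() also accepts a dict mapping column names to fill values; a small sketch (the choice of fillers here is just for illustration):

```python
# Fill each column with its own value: the mean age for Age,
# a placeholder string for Cabin
df_filled = df.fillna({'Age': df['Age'].mean(), 'Cabin': 'Unknown'})
df_filled.head()
```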
Fill the NaN in a particular column with that column's mean
df['Age'].mean()
29.69911764705882
df['Age'].fillna(df['Age'].mean()).head()
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: Age, dtype: float64
df['Age'] = df['Age'].fillna(df['Age'].mean())
df.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
pd.isna(df).head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | True | False |
| 1 | False | False | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | True | False |
| 3 | False | False | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | True | False |
Store the result of the NaN check on Cabin in a new Cabin_nan column
df['Cabin_nan'] = pd.isna(df['Cabin'])
df
15th
Grouping with the .groupby() function
df = pd.read_csv('train.csv')
df.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Group by Pclass
df.groupby('Pclass') alone does not say what to do with the groups, so chain an aggregation such as .mean() or .count() after it.
df.groupby('Pclass').mean()
| Pclass | PassengerId | Survived | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|
| 1 | 461.597222 | 0.629630 | 38.233441 | 0.416667 | 0.356481 | 84.154687 |
| 2 | 445.956522 | 0.472826 | 29.877630 | 0.402174 | 0.380435 | 20.662183 |
| 3 | 439.154786 | 0.242363 | 25.140620 | 0.615071 | 0.393075 | 13.675550 |
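If you only need a few statistics of one column, .agg() computes several aggregations in one call; a sketch:

```python
# count, mean and max of Fare for each Pclass
df.groupby('Pclass')['Fare'].agg(['count', 'mean', 'max'])
```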
df = df[df['Pclass']==1]
df.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 11 | 12 | 1 | 1 | Bonnell, Miss. Elizabeth | female | 58.0 | 0 | 0 | 113783 | 26.5500 | C103 | S |
| 23 | 24 | 1 | 1 | Sloper, Mr. William Thompson | male | 28.0 | 0 | 0 | 113788 | 35.5000 | A6 | S |
Compute the statistics of each column for 'Pclass' == 1
df[df['Pclass']==1].describe()
|   | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 216.000000 | 216.000000 | 216.0 | 186.000000 | 216.000000 | 216.000000 | 216.000000 |
| mean | 461.597222 | 0.629630 | 1.0 | 38.233441 | 0.416667 | 0.356481 | 84.154687 |
| std | 246.737616 | 0.484026 | 0.0 | 14.802856 | 0.611898 | 0.693997 | 78.380373 |
| min | 2.000000 | 0.000000 | 1.0 | 0.920000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 270.750000 | 0.000000 | 1.0 | 27.000000 | 0.000000 | 0.000000 | 30.923950 |
| 50% | 472.000000 | 1.000000 | 1.0 | 37.000000 | 0.000000 | 0.000000 | 60.287500 |
| 75% | 670.500000 | 1.000000 | 1.0 | 49.000000 | 1.000000 | 0.000000 | 93.500000 |
| max | 890.000000 | 1.000000 | 1.0 | 80.000000 | 3.000000 | 4.000000 | 512.329200 |
Extract only the mean
Pulling the statistics out one group at a time like this is tedious, so it is a good idea to use .groupby() to see them all at once.
df[df['Pclass']==1].describe().loc['mean']
PassengerId 461.597222
Survived 0.629630
Pclass 1.000000
Age 38.233441
SibSp 0.416667
Parch 0.356481
Fare 84.154687
Name: mean, dtype: float64
After groupby, the values of the column passed as the first argument (by) become the index; in the example above these are the Pclass values (1, 2, 3). The result of a groupby aggregation is of course also a DataFrame, so you can fetch a specific group as a Series with .loc[].
df.groupby('Pclass').mean().loc[1]
PassengerId 461.597222
Survived 0.629630
Age 38.233441
SibSp 0.416667
Parch 0.356481
Fare 84.154687
Name: 1, dtype: float64
df.groupby('Pclass').count().loc[1] # count(), sum(), etc. also work
PassengerId 216
Survived 216
Name 216
Sex 216
Age 186
SibSp 216
Parch 216
Ticket 216
Fare 216
Cabin 176
Embarked 214
Name: 1, dtype: int64
df.groupby('Pclass').describe()
(The output has a two-level column index: eight statistics for each of the six numeric columns. It is shown here in long form, one row per Pclass and column.)

| Pclass | Column | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| 1 | PassengerId | 216.0 | 461.597222 | 246.737616 | 2.0 | 270.75 | 472.0 | 670.5 | 890.0 |
| 1 | Survived | 216.0 | 0.629630 | 0.484026 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 |
| 1 | Age | 186.0 | 38.233441 | 14.802856 | 0.92 | 27.0 | 37.0 | 49.0 | 80.0 |
| 1 | SibSp | 216.0 | 0.416667 | 0.611898 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 |
| 1 | Parch | 216.0 | 0.356481 | 0.693997 | 0.0 | 0.0 | 0.0 | 0.0 | 4.0 |
| 1 | Fare | 216.0 | 84.154687 | 78.380373 | 0.0 | 30.92395 | 60.2875 | 93.5 | 512.3292 |
| 2 | PassengerId | 184.0 | 445.956522 | 250.852161 | 10.0 | 234.50 | 435.5 | 668.0 | 887.0 |
| 2 | Survived | 184.0 | 0.472826 | 0.500623 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| 2 | Age | 173.0 | 29.877630 | 14.001077 | 0.67 | 23.0 | 29.0 | 36.0 | 70.0 |
| 2 | SibSp | 184.0 | 0.402174 | 0.601633 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 |
| 2 | Parch | 184.0 | 0.380435 | 0.690963 | 0.0 | 0.0 | 0.0 | 1.0 | 3.0 |
| 2 | Fare | 184.0 | 20.662183 | 13.417399 | 0.0 | 13.00000 | 14.2500 | 26.0 | 73.5000 |
| 3 | PassengerId | 491.0 | 439.154786 | 264.441453 | 1.0 | 200.00 | 432.0 | 666.5 | 891.0 |
| 3 | Survived | 491.0 | 0.242363 | 0.428949 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | Age | 355.0 | 25.140620 | 12.495398 | 0.42 | 18.0 | 24.0 | 32.0 | 74.0 |
| 3 | SibSp | 491.0 | 0.615071 | 1.374883 | 0.0 | 0.0 | 0.0 | 1.0 | 8.0 |
| 3 | Parch | 491.0 | 0.393075 | 0.888861 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 |
| 3 | Fare | 491.0 | 13.675550 | 11.778142 | 0.0 | 7.75000 | 8.0500 | 15.5 | 69.5500 |
df.groupby('Pclass').describe()['Age'] # extract only the Age statistics
| Pclass | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| 1 | 186.0 | 38.233441 | 14.802856 | 0.92 | 27.0 | 37.0 | 49.0 | 80.0 |
| 2 | 173.0 | 29.877630 | 14.001077 | 0.67 | 23.0 | 29.0 | 36.0 | 70.0 |
| 3 | 355.0 | 25.140620 | 12.495398 | 0.42 | 18.0 | 24.0 | 32.0 | 74.0 |
In Jupyter, if a DataFrame has too many columns or rows to display, the output is truncated.
If you want to display all columns (or all rows), run the following to prevent the truncation.
# Display all columns
pd.set_option('display.max_columns', None)
# Display all rows
pd.set_option('display.max_rows', None)
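To return to the default display limits afterwards, pandas provides pd.reset_option:

```python
# Restore the default truncation behavior
pd.reset_option('display.max_columns')
pd.reset_option('display.max_rows')
```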
Loop over the result of groupby with a for statement
for i, group_df in df.groupby('Pclass'):
    print("{}: group_df's type is {} and has {}".format(i, type(group_df), len(group_df)))
1: group_df's type is <class 'pandas.core.frame.DataFrame'> and has 216
2: group_df's type is <class 'pandas.core.frame.DataFrame'> and has 184
3: group_df's type is <class 'pandas.core.frame.DataFrame'> and has 491
Rank the records within each Pclass group by Fare (0 = the lowest Fare in the group)
df = pd.read_csv('train.csv')
results = []
for i, group_df in df.groupby('Pclass'):
    sorted_group_df = group_df.sort_values('Fare')
    sorted_group_df['RankInClass'] = np.arange(len(sorted_group_df))
    results.append(sorted_group_df)
results_df = pd.concat(results)
results_df.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | RankInClass |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 633 | 634 | 0 | 1 | Parr, Mr. William Henry Marsh | male | NaN | 0 | 0 | 112052 | 0.0 | NaN | S | 0 |
| 822 | 823 | 0 | 1 | Reuchlin, Jonkheer. John George | male | 38.0 | 0 | 0 | 19972 | 0.0 | NaN | S | 1 |
| 815 | 816 | 0 | 1 | Fry, Mr. Richard | male | NaN | 0 | 0 | 112058 | 0.0 | B102 | S | 2 |
| 806 | 807 | 0 | 1 | Andrews, Mr. Thomas Jr | male | 39.0 | 0 | 0 | 112050 | 0.0 | A36 | S | 3 |
| 263 | 264 | 0 | 1 | Harrison, Mr. William | male | 40.0 | 0 | 0 | 112059 | 0.0 | B94 | S | 4 |
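The same ranking can also be computed without an explicit loop. A sketch using groupby().rank(), where method='first' breaks ties by order of appearance, subtracting 1 so the numbering starts at 0 like the loop above:

```python
# Vectorized alternative: rank Fare within each Pclass group
df['RankInClass'] = df.groupby('Pclass')['Fare'].rank(method='first').astype(int) - 1
```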
Joining tables
There are two main types of table joins:
Joining on a specific column or index as a key
Simply concatenating DataFrames horizontally (or vertically)
import pandas as pd
df1 = pd.DataFrame({'Key':['k0','k','k2'],
'A':['a0','a1','a2'],
'B':['b0','b1','b2']})
df2 = pd.DataFrame({'Key':['k0','k1','k2'],
'C':['c0','c2','c3'],
'D':['d0','d1','d2']})
df1
|   | Key | A | B |
|---|---|---|---|
| 0 | k0 | a0 | b0 |
| 1 | k | a1 | b1 |
| 2 | k2 | a2 | b2 |
df2
|   | Key | C | D |
|---|---|---|---|
| 0 | k0 | c0 | d0 |
| 1 | k1 | c2 | d1 |
| 2 | k2 | c3 | d2 |
Both have a column called 'Key', and its values largely match (note the deliberate mismatch: df1 has 'k' where df2 has 'k1'); the other columns hold different values. This 'Key' column is used as the key to join the two DataFrames horizontally. Use .merge() for merging.
df1.merge(df2)
|   | Key | A | B | C | D |
|---|---|---|---|---|---|
| 0 | k0 | a0 | b0 | c0 | d0 |
| 1 | k2 | a2 | b2 | c3 | d2 |
Simply concatenate DataFrames vertically (or horizontally)
Use pd.concat() (concat = concatenate)
#Vertical (often used)
pd.concat([df1,df2], axis=0) #The default is axis=0
/opt/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.
To accept the future behavior, pass 'sort=False'.
To retain the current behavior and silence the warning, pass 'sort=True'.
|   | A | B | C | D | Key |
|---|---|---|---|---|---|
| 0 | a0 | b0 | NaN | NaN | k0 |
| 1 | a1 | b1 | NaN | NaN | k |
| 2 | a2 | b2 | NaN | NaN | k2 |
| 0 | NaN | NaN | c0 | d0 | k0 |
| 1 | NaN | NaN | c2 | d1 | k1 |
| 2 | NaN | NaN | c3 | d2 | k2 |
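As the warning suggests, passing sort=False keeps the current column order and silences it; ignore_index=True additionally renumbers the duplicated 0, 1, 2 index. A sketch:

```python
# Stack df1 on top of df2 without sorting columns, renumbering rows 0..5
pd.concat([df1, df2], axis=0, sort=False, ignore_index=True)
```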
# Horizontal
pd.concat([df1,df2], axis=1)
|   | Key | A | B | Key | C | D |
|---|---|---|---|---|---|---|
| 0 | k0 | a0 | b0 | k0 | c0 | d0 |
| 1 | k | a1 | b1 | k1 | c2 | d1 |
| 2 | k2 | a2 | b2 | k2 | c3 | d2 |
16th
How to use .merge()
- how: how to join → {'left', 'right', 'outer', 'inner'}; the default is 'inner'
- on: the column to use as the key (a column that exists in both DataFrames); if not specified, the join is made on the common column(s)
- left_on: the key column of the left DataFrame
- right_on: the key column of the right DataFrame
- left_index: set to True to use the left DataFrame's index as its key
- right_index: set to True to use the right DataFrame's index as its key
how
df1
|   | Key | A | B |
|---|---|---|---|
| 0 | k0 | a0 | b0 |
| 1 | k | a1 | b1 |
| 2 | k2 | a2 | b2 |
df2
|   | Key | C | D |
|---|---|---|---|
| 0 | k0 | c0 | d0 |
| 1 | k1 | c2 | d1 |
| 2 | k2 | c3 | d2 |
df1.merge(df2, how='left')
|   | Key | A | B | C | D |
|---|---|---|---|---|---|
| 0 | k0 | a0 | b0 | c0 | d0 |
| 1 | k | a1 | b1 | NaN | NaN |
| 2 | k2 | a2 | b2 | c3 | d2 |
df1.merge(df2, how='outer')
|   | Key | A | B | C | D |
|---|---|---|---|---|---|
| 0 | k0 | a0 | b0 | c0 | d0 |
| 1 | k | a1 | b1 | NaN | NaN |
| 2 | k2 | a2 | b2 | c3 | d2 |
| 3 | k1 | NaN | NaN | c2 | d1 |
df1.merge(df2, how='inner')
|   | Key | A | B | C | D |
|---|---|---|---|---|---|
| 0 | k0 | a0 | b0 | c0 | d0 |
| 1 | k2 | a2 | b2 | c3 | d2 |
on
The on argument specifies which column to use as the key when joining. You can only specify columns that exist in both the left and right tables.
If there is a common column it will be used as the key even without specifying anything, but make it a habit to specify on explicitly: it is easier to understand and safer. "The tables got joined on an unexpected column" is a common complaint. If there are multiple common columns you cannot tell which will be used, and even with a single common column it is better to be explicit. (The examples so far came before this explanation of on, so they intentionally omitted the on argument.)
To use the index as the key, set right_index and left_index (described below) to True. If the key column names differ between the two tables, specify left_on and right_on (described below).
df1 = pd.DataFrame({'Key':['k0','k1','k2'],
'ID':['aa','bb','cc'],
'A':['a0','a1','a2'],
'B':['b0','b1','b2']})
df2 = pd.DataFrame({'Key':['k0','k1','k3'],
'ID':['aa','bb','cc'],
'C':['c0','c1','c3'],
'D':['d0','d1','d3']})
df1.merge(df2, on='Key')
|   | Key | ID_x | A | B | ID_y | C | D |
|---|---|---|---|---|---|---|---|
| 0 | k0 | aa | a0 | b0 | aa | c0 | d0 |
| 1 | k1 | bb | a1 | b1 | bb | c1 | d1 |
df1.merge(df2, on='ID')
|   | Key_x | ID | A | B | Key_y | C | D |
|---|---|---|---|---|---|---|---|
| 0 | k0 | aa | a0 | b0 | k0 | c0 | d0 |
| 1 | k1 | bb | a1 | b1 | k1 | c1 | d1 |
| 2 | k2 | cc | a2 | b2 | k3 | c3 | d3 |
Changing the suffixes
df1.merge(df2, on='ID', suffixes=('_left', '_right'))
|   | Key_left | ID | A | B | Key_right | C | D |
|---|---|---|---|---|---|---|---|
| 0 | k0 | aa | a0 | b0 | k0 | c0 | d0 |
| 1 | k1 | bb | a1 | b1 | k1 | c1 | d1 |
| 2 | k2 | cc | a2 | b2 | k3 | c3 | d3 |
left_on, right_on
Specify these arguments when the key column name differs between the left and right DataFrames.
df1 = pd.DataFrame({'Key1':['k0','k1','k2'],
'A':['a0','a1','a2'],
'B':['b0','b1','b2']})
df2 = pd.DataFrame({'Key2':['k0','k1','k3'],
'C':['c0','c1','c3'],
'D':['d0','d1','d3']})
df1.merge(df2, left_on='Key1', right_on='Key2')
|   | Key1 | A | B | Key2 | C | D |
|---|---|---|---|---|---|---|
| 0 | k0 | a0 | b0 | k0 | c0 | d0 |
| 1 | k1 | a1 | b1 | k1 | c1 | d1 |
left_index, right_index
To use the index as the key instead of a column, set left_index and right_index to True.
df1.merge(df2, left_index=True, right_index=True)
|   | Key1 | A | B | Key2 | C | D |
|---|---|---|---|---|---|---|
| 0 | k0 | a0 | b0 | k0 | c0 | d0 |
| 1 | k1 | a1 | b1 | k1 | c1 | d1 |
| 2 | k2 | a2 | b2 | k3 | c3 | d3 |
join
The join function joins on the index, but you hardly need to remember it because you can do almost the same thing with merge. (One convenience: .join() accepts a list of DataFrames, as in the last example below.)
df1 = pd.DataFrame({'Key1':['k0','k1','k2'],
'A':['a0','a1','a2'],
'B':['b0','b1','b2']})
df2 = pd.DataFrame({'Key2':['k0','k1','k3'],
'C':['c0','c1','c3'],
'D':['d0','d1','d3']})
df1.join(df2)
|   | Key1 | A | B | Key2 | C | D |
|---|---|---|---|---|---|---|
| 0 | k0 | a0 | b0 | k0 | c0 | d0 |
| 1 | k1 | a1 | b1 | k1 | c1 | d1 |
| 2 | k2 | a2 | b2 | k3 | c3 | d3 |
df1.merge(df2, left_index=True, right_index=True)
|   | Key1 | A | B | Key2 | C | D |
|---|---|---|---|---|---|---|
| 0 | k0 | a0 | b0 | k0 | c0 | d0 |
| 1 | k1 | a1 | b1 | k1 | c1 | d1 |
| 2 | k2 | a2 | b2 | k3 | c3 | d3 |
df1 = pd.DataFrame({'Key1':['k0','k1','k2'],
'A':['a0','a1','a2'],
'B':['b0','b1','b2']})
df2 = pd.DataFrame({'Key2':['k0','k1','k3'],
'C':['c0','c1','c3'],
'D':['d0','d1','d3']})
df3 = pd.DataFrame({'Key3':['k0','k1','k4'],
'E':['c0','c1','c3'],
'F':['d0','d1','d3']})
df1.join([df2, df3])
|   | Key1 | A | B | Key2 | C | D | Key3 | E | F |
|---|---|---|---|---|---|---|---|---|---|
| 0 | k0 | a0 | b0 | k0 | c0 | d0 | k0 | c0 | d0 |
| 1 | k1 | a1 | b1 | k1 | c1 | d1 | k1 | c1 | d1 |
| 2 | k2 | a2 | b2 | k3 | c3 | d3 | k4 | c3 | d3 |
17th
.unique() .nunique()
import pandas as pd
df = pd.read_csv('train.csv')
df.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
df['Pclass'].unique()
array([3, 1, 2])
df['Pclass'].nunique()
3
.value_counts()
df['Pclass'].value_counts()
3 491
1 216
2 184
Name: Pclass, dtype: int64
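To see proportions instead of raw counts, value_counts() takes a normalize flag; a sketch:

```python
# Relative frequency of each Pclass (the values sum to 1.0)
df['Pclass'].value_counts(normalize=True)
```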
(Super important) .apply()
With the .apply() function you can process every record in a DataFrame and store the result in a separate column. Think of it as applying a function to each row.
def get_age_group(age):
    return str(age)[0] + '0s'
get_age_group(45)
'40s'
df = pd.DataFrame({'name':['John','Mike','Emily'],
'age':['23','36','42']})
df
|   | name | age |
|---|---|---|
| 0 | John | 23 |
| 1 | Mike | 36 |
| 2 | Emily | 42 |
df['age'].apply(get_age_group)
0 20s
1 30s
2 40s
Name: age, dtype: object
How to use .apply() with a lambda function
# Assign the lambda function to the variable f
f = lambda x: str(x)[0] + '0s'
# Try it with 43
f(43)
'40s'
df['age_group'] = df['age'].apply(lambda x: str(x)[0] + '0s')
df
|   | name | age | age_group |
|---|---|---|---|
| 0 | John | 23 | 20s |
| 1 | Mike | 36 | 30s |
| 2 | Emily | 42 | 40s |
How to use .apply() on entire records (rows)
df = pd.DataFrame({'name':['John','Mike','Emily'],
'age':['23','36','42']})
df['description'] = df.apply(lambda row:'{} is {} years old'.format(row['name'], row['age']), axis=1)
df
|   | name | age | description |
|---|---|---|---|
| 0 | John | 23 | John is 23 years old |
| 1 | Mike | 36 | Mike is 36 years old |
| 2 | Emily | 42 | Emily is 42 years old |
18th
Save a DataFrame in csv format with .to_csv()
df = pd.read_csv('train.csv')
df.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
df['Adult'] = df['Age'].apply(lambda x: x>20)
df.tail()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Adult |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 886 | 887 | 0 | 2 | Montvila, Rev. Juozas | male | 27.0 | 0 | 0 | 211536 | 13.00 | NaN | S | True |
| 887 | 888 | 1 | 1 | Graham, Miss. Margaret Edith | female | 19.0 | 0 | 0 | 112053 | 30.00 | B42 | S | False |
| 888 | 889 | 0 | 3 | Johnston, Miss. Catherine Helen "Carrie" | female | NaN | 1 | 2 | W./C. 6607 | 23.45 | NaN | S | False |
| 889 | 890 | 1 | 1 | Behr, Mr. Karl Howell | male | 26.0 | 0 | 0 | 111369 | 30.00 | C148 | C | True |
| 890 | 891 | 0 | 3 | Dooley, Mr. Patrick | male | 32.0 | 0 | 0 | 370376 | 7.75 | NaN | Q | True |
df.to_csv('train_w_adult.csv')
df = pd.read_csv('train_w_adult.csv')
df.head(3)
|   | Unnamed: 0 | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Adult |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | True |
| 1 | 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | True |
| 2 | 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | True |
Looking at the result of df.head(3), the index from the previous save has been stored in a mysterious 'Unnamed: 0' column.
If you specify index=False in .to_csv(), the index is not saved. Basically, always pass index=False when saving in csv format.
Note that if a file with the same name already exists at the destination, it will be overwritten.
df = pd.read_csv('train.csv')
df['Adult'] = df['Age'].apply(lambda x: x>20)
df.to_csv('train_w_adult.csv', index=False)
df = pd.read_csv('train_w_adult.csv')
df.head(3)
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Adult |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | True |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | True |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | True |
df = pd.DataFrame({'A':[['a', 'b'], 2, 3], 'B':[['c', 'd'], 5, 6]})
df
|   | A | B |
|---|---|---|
| 0 | [a, b] | [c, d] |
| 1 | 2 | 5 |
| 2 | 3 | 6 |
#Make sure that the stored values are a list.
type(df['A'].iloc[0])
list
#Save as csv
df.to_csv('temp.csv', index=False)
#Load the saved csv
df = pd.read_csv('temp.csv')
df
|   | A | B |
|---|---|---|
| 0 | ['a', 'b'] | ['c', 'd'] |
| 1 | 2 | 5 |
| 2 | 3 | 6 |
type(df['A'].iloc[0])
str
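The list survived the save only as its string representation, which is what read_csv gives back. If you need the actual list again, one common workaround is the standard library's ast.literal_eval; a sketch:

```python
import ast

# Parse strings like "['a', 'b']" back into Python lists;
# the isinstance guard lets non-string cells pass through unchanged
df['A'] = df['A'].apply(lambda x: ast.literal_eval(x) if isinstance(x, str) else x)
type(df['A'].iloc[0])  # list again
```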
Iterate over a DataFrame with .iterrows()
Use this when looping over a DataFrame with a for statement. The name is hard to remember, but since it iterates over rows, think of it as iter + row + s (plural s, because it loops over many rows).
"Iteration" means repeating a process, i.e. a loop. A list, for example, can be iterated with for i in list: (see Part 4).
You cannot iterate a DataFrame directly with for i in df: the way you can with a list. Instead, write the following using the .iterrows() function.
df = pd.read_csv('train.csv')
for idx, row in df.iterrows():
    if row['Age'] > 40 and row['Pclass'] == 3 and row['Sex'] == 'male' and row['Survived'] == 1:
        print('{} is a very lucky guy...!'.format(row['Name']))
Dahl, Mr. Karl Edwart is a very lucky guy...!
Sundman, Mr. Johan Julian is a very lucky guy...!
.apply() is often used when you want to store the result of processing each record in another column; .iterrows() is often used when you just want to perform some per-row processing rather than return a value.
For example, when file paths are stored in a DataFrame, .iterrows() can be used to move or read each file.
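When speed matters, .itertuples() is generally faster than .iterrows() and yields namedtuples whose fields are the column names; a sketch of the same loop:

```python
# Same filter as above, with row fields accessed as attributes
for row in df.itertuples():
    if row.Age > 40 and row.Pclass == 3 and row.Sex == 'male' and row.Survived == 1:
        print('{} is a very lucky guy...!'.format(row.Name))
```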
Sort by a specific column with .sort_values()
Note that .sort_values() returns a sorted copy and does not modify df in place, so df.head() below still shows the original order; reassign (df = df.sort_values('Age')) if you want to keep the sorted result.
# Sort by age
df.sort_values('Age')
df.head()
|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
Create a pivot table with .pivot_table()
data = {'Data':['Jan-1','Jan-1','Jan-1','Jan-2','Jan-2','Jan-2'],
'User':['Emily', 'John','Nick','Kevin','Emily','John'],
'Method':['Card','Card','Cash','Card','Cash','Cash'],
'Price':[100,250,200,460,200,130]}
df = pd.DataFrame(data)
df
|   | Data | User | Method | Price |
|---|---|---|---|---|
| 0 | Jan-1 | Emily | Card | 100 |
| 1 | Jan-1 | John | Card | 250 |
| 2 | Jan-1 | Nick | Cash | 200 |
| 3 | Jan-2 | Kevin | Card | 460 |
| 4 | Jan-2 | Emily | Cash | 200 |
| 5 | Jan-2 | John | Cash | 130 |
Pass the column you want to aggregate to values; here it is Price. The value specified in values fills each cell of the resulting table.
Then simply pass the columns you want as the index and columns, each as a list.
df.pivot_table(values='Price', index=['Data', 'User'], columns=['Method'])
| Data | User | Card | Cash |
|---|---|---|---|
| Jan-1 | Emily | 100.0 | NaN |
| Jan-1 | John | 250.0 | NaN |
| Jan-1 | Nick | NaN | 200.0 |
| Jan-2 | Emily | NaN | 200.0 |
| Jan-2 | John | NaN | 130.0 |
| Jan-2 | Kevin | 460.0 | NaN |
df.pivot_table(values='Price', index=['Data', 'Method'], columns=['User'])
| Data | Method | Emily | John | Kevin | Nick |
|---|---|---|---|---|---|
| Jan-1 | Card | 100.0 | 250.0 | NaN | NaN |
| Jan-1 | Cash | NaN | NaN | NaN | 200.0 |
| Jan-2 | Card | NaN | NaN | 460.0 | NaN |
| Jan-2 | Cash | 200.0 | 130.0 | NaN | NaN |
First, be clear about which column you want to aggregate and put it in values; then simply put the information you want to see in index and columns.
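By default .pivot_table() aggregates duplicate index/column combinations with the mean; pass aggfunc to change that. A sketch:

```python
# Total (rather than mean) Price per day and payment method
df.pivot_table(values='Price', index=['Data'], columns=['Method'], aggfunc='sum')
```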
Cross-section operation with .xs()
.xs() is an abbreviation of "cross section". It is not used that often either, but it comes in handy when working with DataFrames that have a MultiIndex, such as pivot tables, so it is worth remembering together with .pivot_table().
What .xs() does, for the pivot table above for example, is extract only the 'Card' rows (exactly a cross section).
pivot = df.pivot_table(values='Price', index=['Data', 'Method'], columns=['User'])
pivot.xs('Card', level='Method')