I want to do ○○ with Pandas

Change log

2020/07/08
Changed the data display format from CSV to Markdown Table. I used the CSV to Markdown Table Generator (https://donatstudios.com/CsvToMarkdownTable) to convert from CSV to Markdown Table.

In Pandas, I often ask "What should I do when I want to do it?", So I will summarize it according to the purpose.

Premise

In this sample code, A list of Titanic survivors (train.csv) provided by Kaggle Read and use with pandas.read_csv ().

Titanic: Machine Learning from Disaster | Kaggle

import pandas as pd
df = pd.read_csv('train.csv')

pandas.read_csv — pandas 1.0.5 documentation

I want to output statistical information

df.describe()

	PassengerId	Survived	Pclass	Age	SibSp	Parch	Fare
count	891.000000	891.000000	891.000000	714.000000	891.000000	891.000000	891.000000
mean	446.000000	0.383838	2.308642	29.699118	0.523008	0.381594	32.204208
std	257.353842	0.486592	0.836071	14.526497	1.102743	0.806057	49.693429
min	1.000000	0.000000	1.000000	0.420000	0.000000	0.000000	0.000000
25%	223.500000	0.000000	2.000000	20.125000	0.000000	0.000000	7.910400
50%	446.000000	0.000000	3.000000	28.000000	0.000000	0.000000	14.454200
75%	668.500000	1.000000	3.000000	38.000000	1.000000	0.000000	31.000000
max	891.000000	1.000000	3.000000	80.000000	8.000000	6.000000	512.329200

#Narrow down the output columns
df['Age'].describe()

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

pandas.DataFrame.describe — pandas 1.0.5 documentation

I want to check the number of data

df['Age'].count()

You can check the number of rows / columns that contain values other than None, NaN, and NaT.

pandas.DataFrame.count — pandas 1.0.5 documentation

I want to narrow down the data

# 20 < Age <Extract 40 rows
df[(20 < df['Age']) & (df['Age'] < 40)].head()

Index	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

If you want to narrow down by multiple AND / OR conditions, specify the conditions by enclosing them in (), as in df [(A) & (B)].

Python Pandas: Boolean indexing on multiple columns - Stack Overflow

I want to convert categorized data to numbers

# Embarked(C, Q, S)Numerical value(1, 2, 3)Conversion to
df['Embarked'] = df['Embarked'].map({'C': 1, 'Q': 2, 'S': 3})

Index	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	3.0
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	1.0
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	3.0
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	3.0
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	3.0

pandas.Series.map — pandas 1.0.4 documentation

I want to change the column name

# Sex(female, male)Numerical value(0, 1)Convert to and column name(Sex)To Male
df['Sex'] = df['Sex'].map({'female': 0, 'male': 1})
df = df.rename(columns={'Sex': 'Male'})

Index	PassengerId	Survived	Pclass	Name	Male	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	1	22.0	1	A/5 21171	7.2500	NaN	3.0
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	0	38.0	1	PC 17599	71.2833	C85	1.0
2	3	1	3	Heikkinen, Miss. Laina	0	26.0	0	STON/O2. 3101282	7.9250	NaN	3.0
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	0	35.0	1	113803	53.1000	C123	3.0
4	5	0	3	Allen, Mr. William Henry	1	35.0	0	373450	8.0500	NaN	3.0

pandas.DataFrame.rename — pandas 1.0.4 documentation

If you pass an array with a list of column names, you can change all the column names at once.

pd.DataFrame({'c': [1, 2], 'd': [10, 20]}).columns = ['a', 'b']

Index	a	b
0	1	10
1	2	20

python - Renaming columns in pandas - Stack Overflow

I want to check the number of missing values for each column

df.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Male             0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

pandas.isnull — pandas 1.0.4 documentation pandas.DataFrame.sum — pandas 1.0.4 documentation

I want to exclude missing values

#Exclude all rows containing missing values
df_dn = df.dropna()
df_dn.count()

PassengerId    183
Survived       183
Pclass         183
Name           183
Male           183
Age            183
SibSp          183
Parch          183
Ticket         183
Fare           183
Cabin          183
Embarked       183
dtype: int64

pandas.DataFrame.dropna — pandas 1.0.5 documentation

I want to retrieve the specified column

#Extract Survived and Age columns
df[['Survived', 'Age']]

Index	Survived	Age
0	0	22.0
1	1	38.0
2	1	26.0
3	1	35.0
4	0	35.0

Indexing and selecting data — pandas 1.0.4 documentation Get / change the value of any position with pandas at, iat, loc, iloc | note.nkmk.me

I want to exclude the specified column

df_dn = df.drop('Cabin', axis='columns')

Index	PassengerId	Survived	Pclass	Name	Male	Age	SibSp	Ticket	Fare	Embarked
0	1	0	3	Braund, Mr. Owen Harris	1	22.0	1	A/5 21171	7.2500	3.0
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	0	38.0	1	PC 17599	71.2833	1.0
2	3	1	3	Heikkinen, Miss. Laina	0	26.0	0	STON/O2. 3101282	7.9250	3.0
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	0	35.0	1	113803	53.1000	3.0
4	5	0	3	Allen, Mr. William Henry	1	35.0	0	373450	8.0500	3.0

pandas.DataFrame.dropna — pandas 1.0.5 documentation

I want to apply a function to a row / column value and create a new row / column

import re

#Function to extract title
def getTitle(row):
    name = row['Name']
    p = re.compile('.*\ (.*)\.\ .*')
    surname = p.search(name)
    return surname.group(1)

df['Title'] = df.apply(getTitle, axis='columns')

Index	PassengerId	Survived	Pclass	Name	Male	Age	SibSp	Ticket	Fare	Cabin	Embarked	Title
0	1	0	3	Braund, Mr. Owen Harris	1	22.0	1	A/5 21171	7.2500	NaN	3.0	Mr
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	0	38.0	1	PC 17599	71.2833	C85	1.0	Mrs
2	3	1	3	Heikkinen, Miss. Laina	0	26.0	0	STON/O2. 3101282	7.9250	NaN	3.0	Miss
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	0	35.0	1	113803	53.1000	C123	3.0	Mrs
4	5	0	3	Allen, Mr. William Henry	1	35.0	0	373450	8.0500	NaN	3.0	Mr

pandas.DataFrame.apply — pandas 1.0.5 documentation

I want to calculate the average for each category

#Find the average age for each title
df.groupby('Title').mean()['Age']

Title
Capt        70.000000
Col         58.000000
Countess    33.000000
Don         40.000000
Dr          42.000000
Jonkheer    38.000000
L           54.000000
Lady        48.000000
Major       48.500000
Master       4.574167
Miss        21.773973
Mlle        24.000000
Mme         24.000000
Mr          32.368090
Mrs         35.728972
Ms          28.000000
Rev         43.166667
Sir         49.000000
Name: Age, dtype: float64

You can also find the number of data items for each title by using df.groupby ('Title'). Count (). How to use Pandas groupby --Qiita

I want to sort rows by value

df.sort_values(by='Age')

Index	PassengerId	Survived	Pclass	Name	Male	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Title	AgeMean
803	804	1	3	Thomas, Master. Assad Alexander	1	0.42	0	1	2625	8.5167	NaN	1.0	Master	NaN
755	756	1	2	Hamalainen, Master. Viljo	1	0.67	1	1	250649	14.5000	NaN	3.0	Master	NaN
644	645	1	3	Baclini, Miss. Eugenie	0	0.75	2	1	2666	19.2583	NaN	1.0	Miss	NaN
469	470	1	3	Baclini, Miss. Helene Barbara	0	0.75	2	1	2666	19.2583	NaN	1.0	Miss	NaN
78	79	1	2	Caldwell, Master. Alden Gates	1	0.83	0	2	248738	29.0000	NaN	3.0	Master	NaN

pandas.DataFrame.sort_values — pandas 1.0.5 documentation

Normally, DaraFrame that executed sort_values () is not changed, and the return value is obtained in the sorted state. If ʻascending = False is specified, the specified columns will be sorted in descending order. If ʻinplace = True is specified, the DataFrame that executed sort_values () will be sorted and the return value will be None.

I want to find out the unique values contained in a column

df['Survived'].unique()

array([0, 1], dtype=int64)

pandas.unique — pandas 1.0.5 documentation

I want to retrieve a line that contains a specific string

df[df['Name'].str.contains('Thomas')]

Index	PassengerId	Survived	Pclass	Name	Male	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked	Title	AgeMean
149	150	0	2	Byles, Rev. Thomas Roussel Davids	1	42.00	0	0	244310	13.0000	NaN	3.0	Rev	NaN
151	152	1	1	Pears, Mrs. Thomas (Edith Wearne)	0	22.00	1	0	113776	66.6000	C2	3.0	Mrs	NaN
159	160	0	3	Sage, Master. Thomas Henry	1	NaN	8	2	CA. 2343	69.5500	NaN	3.0	Master	NaN
186	187	1	3	O'Brien, Mrs. Thomas (Johanna "Hannah" Godfrey)	0	NaN	1	0	370365	15.5000	NaN	2.0	Mrs	NaN
252	253	0	1	Stead, Mr. William Thomas	1	62.00	0	0	113514	26.5500	C87	3.0	Mr	NaN

pandas.Series.str.contains — pandas 1.0.5 documentation python - How to filter rows containing a string pattern from a Pandas dataframe - Stack Overflow

Use the ~ operator if you want to retrieve a * value that does not * contain a specific string.

df[~df['Name'].str.contains('Thomas')]

Index	PassengerId	Survived	Pclass	Name	Male	Age	SibSp	Ticket	Fare	Cabin	Embarked	Title	AgeMean
0	1	0	3	Braund, Mr. Owen Harris	1	22.0	1	A/5 21171	7.2500	NaN	3.0	Mr	NaN
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	0	38.0	1	PC 17599	71.2833	C85	1.0	Mrs	NaN
2	3	1	3	Heikkinen, Miss. Laina	0	26.0	0	STON/O2. 3101282	7.9250	NaN	3.0	Miss	NaN
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	0	35.0	1	113803	53.1000	C123	3.0	Mrs	NaN
4	5	0	3	Allen, Mr. William Henry	1	35.0	0	373450	8.0500	NaN	3.0	Mr	NaN

python - Search for "does-not-contain" on a DataFrame in pandas - Stack Overflow

I want to add color when displaying a data frame

#value"Mr"Make the background color of the column yellow
df.style.apply(lambda x: ['background-color: yellow' if v == 'Mr' else '' for v in x])

Index	PassengerId	Survived	Pclass	Name	Male	Age	SibSp	Ticket	Fare	Cabin	Embarked	Title	AgeMean
0	1	0	3	Braund, Mr. Owen Harris	1	22.000000	1	A/5 21171	7.250000	nan	3.000000	Mr	nan
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Thayer)	0	38.000000	1	PC 17599	71.283300	C85	1.000000	Mrs	nan
2	3	1	3	Heikkinen, Miss. Laina	0	26.000000	0	STON/O2. 3101282	7.925000	nan	3.000000	Miss	nan
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	0	35.000000	1	113803	53.100000	C123	3.000000	Mrs	nan
4	5	0	3	Allen, Mr. William Henry	1	35.000000	0	373450	8.050000	nan	3.000000	Mr	nan

When opened in Jupyter Notebook, the corresponding column is displayed with a colored background. Note that when you open Jupyter Notebook on GitHub, the background color will not be added.

pandas.io.formats.style.Styler.apply — pandas 1.0.5 documentation python - Pandas style function to highlight specific columns - Stack Overflow

I want to output in CSV format

df.to_csv('output.csv', index=False)

If you do not want to include the index (line number), specify ʻindex = False`. pandas.DataFrame.to_csv — pandas 1.0.5 documentation

If you don't want to put a line break on the last line of the file, pass line_terminator =" " only on the last line python - How to stop writing a blank line at the end of csv file - pandas - Stack Overflow