In Pandas, I often ask "What should I do when I want to do it?", So I will summarize it according to the purpose.
In this sample code,
A list of Titanic survivors (train.csv
) provided by Kaggle
Read and use with pandas.read_csv ()
.
Titanic: Machine Learning from Disaster | Kaggle
import pandas as pd
df = pd.read_csv('train.csv')
pandas.read_csv — pandas 1.0.5 documentation
df.describe()
PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
#Narrow down the output columns
df['Age'].describe()
count 714.000000
mean 29.699118
std 14.526497
min 0.420000
25% 20.125000
50% 28.000000
75% 38.000000
max 80.000000
Name: Age, dtype: float64
pandas.DataFrame.describe — pandas 1.0.5 documentation
df['Age'].count()
714
You can check the number of rows / columns that contain values other than None
, NaN
, and NaT
.
# 20 < Age <Extract 40 rows
df[(20 < df['Age']) & (df['Age'] < 40)].head()
Index | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
If you want to narrow down by multiple AND / OR conditions, specify the conditions by enclosing them in ()
, as in df [(A) & (B)]
.
# Embarked(C, Q, S)Numerical value(1, 2, 3)Conversion to
df['Embarked'] = df['Embarked'].map({'C': 1, 'Q': 2, 'S': 3})
Index | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 3.0 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1.0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 3.0 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 3.0 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 3.0 |
pandas.Series.map — pandas 1.0.4 documentation
# Sex(female, male)Numerical value(0, 1)Convert to and column name(Sex)To Male
df['Sex'] = df['Sex'].map({'female': 0, 'male': 1})
df = df.rename(columns={'Sex': 'Male'})
Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 3.0 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1.0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 3.0 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 3.0 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 3.0 |
pandas.DataFrame.rename — pandas 1.0.4 documentation
If you pass an array with a list of column names, you can change all the column names at once.
pd.DataFrame({'c': [1, 2], 'd': [10, 20]}).columns = ['a', 'b']
Index | a | b |
---|---|---|
0 | 1 | 10 |
1 | 2 | 20 |
python - Renaming columns in pandas - Stack Overflow
df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Male 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
pandas.isnull — pandas 1.0.4 documentation pandas.DataFrame.sum — pandas 1.0.4 documentation
#Exclude all rows containing missing values
df_dn = df.dropna()
df_dn.count()
PassengerId 183
Survived 183
Pclass 183
Name 183
Male 183
Age 183
SibSp 183
Parch 183
Ticket 183
Fare 183
Cabin 183
Embarked 183
dtype: int64
pandas.DataFrame.dropna — pandas 1.0.5 documentation
#Extract Survived and Age columns
df[['Survived', 'Age']]
Index | Survived | Age |
---|---|---|
0 | 0 | 22.0 |
1 | 1 | 38.0 |
2 | 1 | 26.0 |
3 | 1 | 35.0 |
4 | 0 | 35.0 |
Indexing and selecting data — pandas 1.0.4 documentation Get / change the value of any position with pandas at, iat, loc, iloc | note.nkmk.me
df_dn = df.drop('Cabin', axis='columns')
Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | 3.0 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | 1.0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | 3.0 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | 3.0 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | 3.0 |
pandas.DataFrame.dropna — pandas 1.0.5 documentation
import re
#Function to extract title
def getTitle(row):
name = row['Name']
p = re.compile('.*\ (.*)\.\ .*')
surname = p.search(name)
return surname.group(1)
df['Title'] = df.apply(getTitle, axis='columns')
Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 3.0 | Mr |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1.0 | Mrs |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 3.0 | Miss |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 3.0 | Mrs |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 3.0 | Mr |
pandas.DataFrame.apply — pandas 1.0.5 documentation
#Find the average age for each title
df.groupby('Title').mean()['Age']
Title
Capt 70.000000
Col 58.000000
Countess 33.000000
Don 40.000000
Dr 42.000000
Jonkheer 38.000000
L 54.000000
Lady 48.000000
Major 48.500000
Master 4.574167
Miss 21.773973
Mlle 24.000000
Mme 24.000000
Mr 32.368090
Mrs 35.728972
Ms 28.000000
Rev 43.166667
Sir 49.000000
Name: Age, dtype: float64
You can also find the number of data items for each title by using df.groupby ('Title'). Count ()
.
How to use Pandas groupby --Qiita
df.sort_values(by='Age')
Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | AgeMean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
803 | 804 | 1 | 3 | Thomas, Master. Assad Alexander | 1 | 0.42 | 0 | 1 | 2625 | 8.5167 | NaN | 1.0 | Master | NaN |
755 | 756 | 1 | 2 | Hamalainen, Master. Viljo | 1 | 0.67 | 1 | 1 | 250649 | 14.5000 | NaN | 3.0 | Master | NaN |
644 | 645 | 1 | 3 | Baclini, Miss. Eugenie | 0 | 0.75 | 2 | 1 | 2666 | 19.2583 | NaN | 1.0 | Miss | NaN |
469 | 470 | 1 | 3 | Baclini, Miss. Helene Barbara | 0 | 0.75 | 2 | 1 | 2666 | 19.2583 | NaN | 1.0 | Miss | NaN |
78 | 79 | 1 | 2 | Caldwell, Master. Alden Gates | 1 | 0.83 | 0 | 2 | 248738 | 29.0000 | NaN | 3.0 | Master | NaN |
pandas.DataFrame.sort_values — pandas 1.0.5 documentation
Normally, DaraFrame that executed sort_values ()
is not changed, and the return value is obtained in the sorted state.
If ʻascending = False is specified, the specified columns will be sorted in descending order. If ʻinplace = True
is specified, the DataFrame that executed sort_values ()
will be sorted and the return value will be None
.
df['Survived'].unique()
array([0, 1], dtype=int64)
pandas.unique — pandas 1.0.5 documentation
df[df['Name'].str.contains('Thomas')]
Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | AgeMean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
149 | 150 | 0 | 2 | Byles, Rev. Thomas Roussel Davids | 1 | 42.00 | 0 | 0 | 244310 | 13.0000 | NaN | 3.0 | Rev | NaN |
151 | 152 | 1 | 1 | Pears, Mrs. Thomas (Edith Wearne) | 0 | 22.00 | 1 | 0 | 113776 | 66.6000 | C2 | 3.0 | Mrs | NaN |
159 | 160 | 0 | 3 | Sage, Master. Thomas Henry | 1 | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | 3.0 | Master | NaN |
186 | 187 | 1 | 3 | O'Brien, Mrs. Thomas (Johanna "Hannah" Godfrey) | 0 | NaN | 1 | 0 | 370365 | 15.5000 | NaN | 2.0 | Mrs | NaN |
252 | 253 | 0 | 1 | Stead, Mr. William Thomas | 1 | 62.00 | 0 | 0 | 113514 | 26.5500 | C87 | 3.0 | Mr | NaN |
pandas.Series.str.contains — pandas 1.0.5 documentation python - How to filter rows containing a string pattern from a Pandas dataframe - Stack Overflow
Use the ~
operator if you want to retrieve a * value that does not * contain a specific string.
df[~df['Name'].str.contains('Thomas')]
Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | AgeMean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 3.0 | Mr | NaN |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1.0 | Mrs | NaN |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 3.0 | Miss | NaN |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 3.0 | Mrs | NaN |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 3.0 | Mr | NaN |
python - Search for "does-not-contain" on a DataFrame in pandas - Stack Overflow
#value"Mr"Make the background color of the column yellow
df.style.apply(lambda x: ['background-color: yellow' if v == 'Mr' else '' for v in x])
Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | AgeMean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.000000 | 1 | 0 | A/5 21171 | 7.250000 | nan | 3.000000 | Mr | nan |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 0 | 38.000000 | 1 | 0 | PC 17599 | 71.283300 | C85 | 1.000000 | Mrs | nan |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.000000 | 0 | 0 | STON/O2. 3101282 | 7.925000 | nan | 3.000000 | Miss | nan |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.000000 | 1 | 0 | 113803 | 53.100000 | C123 | 3.000000 | Mrs | nan |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.000000 | 0 | 0 | 373450 | 8.050000 | nan | 3.000000 | Mr | nan |
When opened in Jupyter Notebook, the corresponding column is displayed with a colored background. Note that when you open Jupyter Notebook on GitHub, the background color will not be added.
pandas.io.formats.style.Styler.apply — pandas 1.0.5 documentation python - Pandas style function to highlight specific columns - Stack Overflow
df.to_csv('output.csv', index=False)
If you do not want to include the index (line number), specify ʻindex = False`. pandas.DataFrame.to_csv — pandas 1.0.5 documentation
If you don't want to put a line break on the last line of the file, pass line_terminator =" "
only on the last line
python - How to stop writing a blank line at the end of csv file - pandas - Stack Overflow
Recommended Posts