When using pandas, I often find myself wondering "How do I do this again?", so I have summarized the answers by use case.
The sample code reads the Titanic passenger list (train.csv) provided by Kaggle with pandas.read_csv().
Titanic: Machine Learning from Disaster | Kaggle
import pandas as pd
df = pd.read_csv('train.csv')
pandas.read_csv — pandas 1.0.5 documentation
df.describe()
Index | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
---|---|---|---|---|---|---|---|
count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
# Restrict the output to a single column
df['Age'].describe()
count 714.000000
mean 29.699118
std 14.526497
min 0.420000
25% 20.125000
50% 28.000000
75% 38.000000
max 80.000000
Name: Age, dtype: float64
pandas.DataFrame.describe — pandas 1.0.5 documentation
df['Age'].count()
714
count() returns the number of values other than None, NaN, and NaT. Called on a DataFrame, it gives that number for each column.
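As a rough illustration of how count() differs from the plain row count, here is a minimal sketch. The `toy` DataFrame and its values are made up for this example and are not taken from train.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the Titanic file)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0],
    'Cabin': ['C85', None, None],
})

total_rows = len(toy)                  # every row, missing values included
non_missing = toy.count()              # per-column number of non-missing values
age_non_missing = toy['Age'].count()   # same idea for a single column

print(total_rows)
print(non_missing)
print(age_non_missing)
```

Comparing len(df) with df.count() is a quick way to spot which columns have gaps.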
# Extract the rows where 20 < Age < 40
df[(20 < df['Age']) & (df['Age'] < 40)].head()
Index | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
To filter on several conditions combined with AND / OR, wrap each condition in parentheses, as in df[(A) & (B)].
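The parentheses matter because `&` and `|` bind more tightly than comparison operators such as `<` and `==`. Here is a minimal sketch of both AND and OR filters, using a made-up `toy` DataFrame rather than the Titanic data:

```python
import pandas as pd

# Hypothetical toy data (not the Titanic file)
toy = pd.DataFrame({
    'Age': [18.0, 25.0, 38.0, 61.0],
    'Pclass': [3, 1, 3, 1],
})

# AND: both conditions must hold for a row to be kept
over_20_first_class = toy[(toy['Age'] > 20) & (toy['Pclass'] == 1)]

# OR: a row is kept if either condition holds
under_20_or_over_60 = toy[(toy['Age'] < 20) | (toy['Age'] > 60)]

print(over_20_first_class)
print(under_20_or_over_60)
```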
# Convert Embarked (C, Q, S) to numeric values (1, 2, 3)
df['Embarked'] = df['Embarked'].map({'C': 1, 'Q': 2, 'S': 3})
Index | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 3.0 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1.0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 3.0 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 3.0 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 3.0 |
pandas.Series.map — pandas 1.0.4 documentation
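The converted Embarked values appear as 3.0 and 1.0 rather than 3 and 1. A likely reason is that values missing from the mapping dictionary (including NaN) become NaN, which forces the column to a floating-point dtype. Here is a small sketch on a made-up Series:

```python
import numpy as np
import pandas as pd

# Hypothetical sample containing one missing value
embarked = pd.Series(['S', 'C', np.nan, 'Q'])

# Keys absent from the dictionary (here, NaN) are mapped to NaN
codes = embarked.map({'C': 1, 'Q': 2, 'S': 3})

print(codes)
print(codes.dtype)
```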
# Convert Sex (female, male) to numeric values (0, 1) and rename the column from Sex to Male
df['Sex'] = df['Sex'].map({'female': 0, 'male': 1})
df = df.rename(columns={'Sex': 'Male'})
Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 3.0 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1.0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 3.0 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 3.0 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 3.0 |
pandas.DataFrame.rename — pandas 1.0.4 documentation
If you assign a list of column names to the columns attribute, you can change all the column names at once.
df2 = pd.DataFrame({'c': [1, 2], 'd': [10, 20]})
df2.columns = ['a', 'b']
Index | a | b |
---|---|---|
0 | 1 | 10 |
1 | 2 | 20 |
python - Renaming columns in pandas - Stack Overflow
df.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Male 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
pandas.isnull — pandas 1.0.4 documentation
pandas.DataFrame.sum — pandas 1.0.4 documentation
# Drop every row that contains a missing value
df_dn = df.dropna()
df_dn.count()
PassengerId 183
Survived 183
Pclass 183
Name 183
Male 183
Age 183
SibSp 183
Parch 183
Ticket 183
Fare 183
Cabin 183
Embarked 183
dtype: int64
pandas.DataFrame.dropna — pandas 1.0.5 documentation
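Dropping every row with any missing value leaves only 183 of the 891 rows, mostly because Cabin is so sparse. When only some columns matter, the `subset` parameter of dropna() may be a gentler option. Here is a minimal sketch on made-up data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the Titanic file)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, 54.0],
    'Cabin': [None, 'C85', 'C123', None],
})

# Keep only the rows with no missing value in any column
all_complete = toy.dropna()

# Keep the rows where Age is present, whatever the other columns contain
age_known = toy.dropna(subset=['Age'])

print(all_complete)
print(age_known)
```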
# Extract the Survived and Age columns
df[['Survived', 'Age']]
Index | Survived | Age |
---|---|---|
0 | 0 | 22.0 |
1 | 1 | 38.0 |
2 | 1 | 26.0 |
3 | 1 | 35.0 |
4 | 0 | 35.0 |
Indexing and selecting data — pandas 1.0.4 documentation
Get / set the value at any position with pandas at, iat, loc, iloc | note.nkmk.me
df_dn = df.drop('Cabin', axis='columns')
Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Embarked |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | 3.0 |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | 1.0 |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | 3.0 |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | 3.0 |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | 3.0 |
pandas.DataFrame.drop — pandas 1.0.5 documentation
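import re

# Function that extracts the title (Mr, Mrs, ...) from the Name column of a row
def getTitle(row):
    name = row['Name']
    # Capture the word that precedes '. ', e.g. 'Mr' in 'Braund, Mr. Owen Harris'
    p = re.compile(r'.* (.*)\. .*')
    match = p.search(name)
    return match.group(1)

df['Title'] = df.apply(getTitle, axis='columns')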
Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 3.0 | Mr |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1.0 | Mrs |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 3.0 | Miss |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 3.0 | Mrs |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 3.0 | Mr |
pandas.DataFrame.apply — pandas 1.0.5 documentation
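As an alternative to a row-wise apply(), the vectorized Series.str.extract can pull out the title in one call. The sketch below uses a different regular expression from the one above (everything between the comma and the first period), so it may give different results for unusual names. The sample names are copied from the table above.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
])

# Assumed pattern: a comma, optional whitespace, then the text up to the first period
titles = names.str.extract(r',\s*([^.]+)\.', expand=False)

print(titles)
```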
# Compute the mean age for each title
df.groupby('Title')['Age'].mean()
Title
Capt 70.000000
Col 58.000000
Countess 33.000000
Don 40.000000
Dr 42.000000
Jonkheer 38.000000
L 54.000000
Lady 48.000000
Major 48.500000
Master 4.574167
Miss 21.773973
Mlle 24.000000
Mme 24.000000
Mr 32.368090
Mrs 35.728972
Ms 28.000000
Rev 43.166667
Sir 49.000000
Name: Age, dtype: float64
You can also get the number of records for each title with df.groupby('Title').count().
How to use Pandas groupby - Qiita
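One detail worth knowing about per-group counts: count() ignores missing values, whereas size() counts every row in the group. Here is a minimal sketch on made-up data that also shows agg() computing several statistics at once:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the Titanic file)
toy = pd.DataFrame({
    'Title': ['Mr', 'Mrs', 'Mr', 'Miss', 'Mr'],
    'Age': [22.0, 38.0, np.nan, 26.0, 35.0],
})

age_count = toy.groupby('Title')['Age'].count()   # non-missing ages per title
group_size = toy.groupby('Title').size()          # rows per title, NaN included
summary = toy.groupby('Title')['Age'].agg(['mean', 'count'])

print(age_count)
print(group_size)
print(summary)
```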
df.sort_values(by='Age')
Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | AgeMean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
803 | 804 | 1 | 3 | Thomas, Master. Assad Alexander | 1 | 0.42 | 0 | 1 | 2625 | 8.5167 | NaN | 1.0 | Master | NaN |
755 | 756 | 1 | 2 | Hamalainen, Master. Viljo | 1 | 0.67 | 1 | 1 | 250649 | 14.5000 | NaN | 3.0 | Master | NaN |
644 | 645 | 1 | 3 | Baclini, Miss. Eugenie | 0 | 0.75 | 2 | 1 | 2666 | 19.2583 | NaN | 1.0 | Miss | NaN |
469 | 470 | 1 | 3 | Baclini, Miss. Helene Barbara | 0 | 0.75 | 2 | 1 | 2666 | 19.2583 | NaN | 1.0 | Miss | NaN |
78 | 79 | 1 | 2 | Caldwell, Master. Alden Gates | 1 | 0.83 | 0 | 2 | 248738 | 29.0000 | NaN | 3.0 | Master | NaN |
pandas.DataFrame.sort_values — pandas 1.0.5 documentation
By default, the DataFrame on which sort_values() is called is left unchanged, and the sorted result is returned.
If ascending=False is specified, the rows are sorted in descending order of the specified columns. If inplace=True is specified, the DataFrame itself is sorted and the return value is None.
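The behavior of sort_values() with `ascending` and `inplace` can be sketched as follows, using a made-up single-column DataFrame:

```python
import pandas as pd

# Hypothetical toy data (not the Titanic file)
toy = pd.DataFrame({'Age': [35.0, 0.42, 80.0]})

# Returns a new, sorted DataFrame; `toy` itself keeps its original order
oldest_first = toy.sort_values(by='Age', ascending=False)
order_after_plain_sort = toy['Age'].tolist()

# Sorts `toy` itself and returns None
result = toy.sort_values(by='Age', inplace=True)
order_after_inplace_sort = toy['Age'].tolist()

print(oldest_first)
print(order_after_plain_sort)
print(result)
print(order_after_inplace_sort)
```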
df['Survived'].unique()
array([0, 1], dtype=int64)
pandas.unique — pandas 1.0.5 documentation
df[df['Name'].str.contains('Thomas')]
Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | AgeMean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
149 | 150 | 0 | 2 | Byles, Rev. Thomas Roussel Davids | 1 | 42.00 | 0 | 0 | 244310 | 13.0000 | NaN | 3.0 | Rev | NaN |
151 | 152 | 1 | 1 | Pears, Mrs. Thomas (Edith Wearne) | 0 | 22.00 | 1 | 0 | 113776 | 66.6000 | C2 | 3.0 | Mrs | NaN |
159 | 160 | 0 | 3 | Sage, Master. Thomas Henry | 1 | NaN | 8 | 2 | CA. 2343 | 69.5500 | NaN | 3.0 | Master | NaN |
186 | 187 | 1 | 3 | O'Brien, Mrs. Thomas (Johanna "Hannah" Godfrey) | 0 | NaN | 1 | 0 | 370365 | 15.5000 | NaN | 2.0 | Mrs | NaN |
252 | 253 | 0 | 1 | Stead, Mr. William Thomas | 1 | 62.00 | 0 | 0 | 113514 | 26.5500 | C87 | 3.0 | Mr | NaN |
pandas.Series.str.contains — pandas 1.0.5 documentation python - How to filter rows containing a string pattern from a Pandas dataframe - Stack Overflow
Use the ~ operator to retrieve the rows that do *not* contain a specific string.
df[~df['Name'].str.contains('Thomas')]
Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | AgeMean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | 3.0 | Mr | NaN |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | 1.0 | Mrs | NaN |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | 3.0 | Miss | NaN |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | 3.0 | Mrs | NaN |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | 3.0 | Mr | NaN |
python - Search for "does-not-contain" on a DataFrame in pandas - Stack Overflow
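The Name column has no missing values, so the filters above work as written. For a column that does contain NaN (Cabin, for instance), str.contains() returns a missing value for those rows, and the resulting mask cannot be used directly for filtering. Passing `na=False` treats them as non-matches. Here is a minimal sketch on a made-up Series:

```python
import pandas as pd

# Hypothetical sample with one missing entry
names = pd.Series([
    'Sage, Master. Thomas Henry',
    None,
    'Allen, Mr. William Henry',
])

# na=False makes missing entries count as "does not contain"
mask = names.str.contains('Thomas', na=False)

with_thomas = names[mask]
without_thomas = names[~mask]

print(with_thomas)
print(without_thomas)
```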
# Give the cells whose value is 'Mr' a yellow background
df.style.apply(lambda x: ['background-color: yellow' if v == 'Mr' else '' for v in x])
Index | PassengerId | Survived | Pclass | Name | Male | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | AgeMean |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | 1 | 22.000000 | 1 | 0 | A/5 21171 | 7.250000 | nan | 3.000000 | Mr | nan |
1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | 0 | 38.000000 | 1 | 0 | PC 17599 | 71.283300 | C85 | 1.000000 | Mrs | nan |
2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | 0 | 26.000000 | 0 | 0 | STON/O2. 3101282 | 7.925000 | nan | 3.000000 | Miss | nan |
3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.000000 | 1 | 0 | 113803 | 53.100000 | C123 | 3.000000 | Mrs | nan |
4 | 5 | 0 | 3 | Allen, Mr. William Henry | 1 | 35.000000 | 0 | 0 | 373450 | 8.050000 | nan | 3.000000 | Mr | nan |
When the result is displayed in Jupyter Notebook, the matching cells are shown with a colored background. Note that the background color is not rendered when the notebook is viewed on GitHub.
pandas.io.formats.style.Styler.apply — pandas 1.0.5 documentation
python - Pandas style function to highlight specific columns - Stack Overflow
df.to_csv('output.csv', index=False)
If you do not want to include the index (row numbers), specify index=False.
pandas.DataFrame.to_csv — pandas 1.0.5 documentation
If you do not want a line break after the last line of the file, write only the last row with line_terminator="" (the parameter is named lineterminator in pandas 1.5 and later).
python - How to stop writing a blank line at the end of csv file - pandas - Stack Overflow
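A different way to avoid the trailing line break, and one that does not depend on the parameter name, is to build the CSV text in memory, strip the final line terminator, and write the text yourself. This is a sketch of that substitute technique, not the approach from the linked answer. The `toy` DataFrame and the file name are made up.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

# With no path given, to_csv() returns the CSV as a string;
# rstrip removes the line terminator after the last row
csv_text = toy.to_csv(index=False).rstrip('\r\n')

# Write the text as-is; newline='' prevents any newline translation
out_path = os.path.join(tempfile.mkdtemp(), 'output_no_trailing_newline.csv')
with open(out_path, 'w', newline='') as f:
    f.write(csv_text)

print(repr(csv_text))
```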