Introduction

When analyzing data, you often want to check what the dataset as a whole looks like. This post covers how to get an overview of an entire dataset in pandas. First I summarize the existing methods, and then I introduce a function of my own.

Environment

Python 3.7.4, pandas 0.25.1

Existing method

pandas.DataFrame already provides two methods for summarizing data: .info() and .describe(). Others have already covered these in detail (see "Data overview with Pandas"), so here I only show the results. (Apologies for reusing the same Titanic dataset as everyone else.)

import pandas as pd
data = pd.read_csv("train.csv")  # Read the Titanic training data

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

data.describe()
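
As a small aside on what describe() covers: by default it summarizes only numeric columns, and passing include="all" adds the object columns as well. The sketch below uses a tiny made-up frame (the name toy and its values are assumptions for illustration, not the Titanic data).

```python
import pandas as pd

# Tiny made-up frame standing in for the real data (values are illustrative only).
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Sex": ["male", "female", "female", "male"],
})

numeric_only = toy.describe()              # default: numeric columns only
everything = toy.describe(include="all")   # also summarizes non-numeric columns

print(numeric_only.columns.tolist())
print(everything.columns.tolist())
```

The count row excludes missing values, so it hints at missing data, but neither variant reports the dtype of each column.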

Self-made method

However, neither method quite scratches the itch on its own. For example, describe() does not show dtypes or missing-value counts, and calling both info() and describe() every time is tedious. So I wrote a function that combines info() and describe().

import numpy as np

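def summarize_data(df):
    # One row per column of df; each statistic becomes a column of the summary.
    df_summary = pd.DataFrame({'nunique': np.zeros(df.shape[1])}, index=df.keys())

    df_summary['nunique'] = df.nunique()             # number of distinct values
    df_summary['dtype'] = df.dtypes                  # data type of each column
    df_summary['isnull'] = df.isnull().sum()         # number of missing values
    df_summary['first_val'] = df.iloc[0]             # value in the first row
    # Numeric statistics; non-numeric columns are left as NaN.
    df_summary['max'] = df.max(numeric_only=True)
    df_summary['min'] = df.min(numeric_only=True)
    df_summary['mean'] = df.mean(numeric_only=True)
    df_summary['std'] = df.std(numeric_only=True)
    df_summary['mode'] = df.mode().iloc[0]           # first of the most frequent values

    # Do not truncate the display (note: this changes the option globally).
    pd.set_option('display.max_rows', len(df.keys()))

    return df_summary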

summarize_data(data)
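
To give a feel for the result without train.csv, here is a minimal sketch on a tiny made-up frame. The function is restated in condensed form (without the display option) so the snippet runs on its own; the frame toy and its values are assumptions for illustration.

```python
import pandas as pd

def summarize_data(df):
    # Condensed restatement of the function above, minus the display option.
    summary = pd.DataFrame(index=df.columns)
    summary["nunique"] = df.nunique()
    summary["dtype"] = df.dtypes
    summary["isnull"] = df.isnull().sum()
    summary["first_val"] = df.iloc[0]
    summary["max"] = df.max(numeric_only=True)
    summary["min"] = df.min(numeric_only=True)
    summary["mean"] = df.mean(numeric_only=True)
    summary["std"] = df.std(numeric_only=True)
    summary["mode"] = df.mode().iloc[0]
    return summary

# Made-up data: one numeric column with a missing value, one text column.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None, 38.0],
    "Sex": ["male", "female", "female", "female"],
})

result = summarize_data(toy)
print(result)
```

Each column of the input becomes one row of the summary. The numeric statistics are NaN for the text column, while nunique, isnull, first_val, and mode are filled in for both.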

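Note that in environments such as Kaggle kernels, long tables are truncated when displayed. To avoid this, the pd.set_option call just before the return in summarize_data() raises display.max_rows to the number of columns in the input, so that every row of the summary is shown.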
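
One caveat is that pd.set_option changes the setting for the rest of the session. If that is not wanted, a possible alternative (a sketch, not what the function above does) is pd.option_context, which applies the option only inside a with block. The frame wide_summary is a made-up stand-in for a long summary table.

```python
import pandas as pd

# Stand-in for a long summary table: 100 rows, more than the default display limit.
wide_summary = pd.DataFrame({"isnull": range(100)})

default_limit = pd.get_option("display.max_rows")

# Inside the block nothing is truncated; the previous setting is restored on exit.
with pd.option_context("display.max_rows", len(wide_summary)):
    full_text = repr(wide_summary)

# Outside the block the default limit applies again, so the output is truncated.
truncated_text = repr(wide_summary)

print(len(full_text.splitlines()), len(truncated_text.splitlines()))
```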

Summary

I introduced the existing methods for summarizing a pandas.DataFrame and a function of my own that combines them. Besides giving an overview at the start of an analysis, it can also be used to check whether scaling and missing-value handling were applied correctly. If you have ideas for other columns that would be useful to add, please let me know!
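
As a rough illustration of the preprocessing check mentioned above, the sketch below fills missing values, standardizes each column, and then looks at the same statistics that summarize_data() reports. The data and the preprocessing steps (median fill, z-score scaling) are assumptions chosen for the example.

```python
import pandas as pd

# Made-up numeric data with one missing value.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Hypothetical preprocessing: fill missing values with the median, then standardize.
filled = toy.fillna(toy.median())
scaled = (filled - filled.mean()) / filled.std()

# The same statistics summarize_data() reports, computed directly.
check = pd.DataFrame({
    "isnull": scaled.isnull().sum(),
    "mean": scaled.mean(),
    "std": scaled.std(),
})
print(check.round(3))
```

If the preprocessing worked, isnull is 0 for every column, mean is approximately 0, and std is approximately 1.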
