How to get an overview of your data in Pandas

Introduction

When doing data analysis, you often want to check what the data as a whole looks like. So I'll write about how to get an overview of a whole dataset in pandas. First I will go over the existing methods, and then I will introduce my own.

Environment

Python 3.7.4, pandas 0.25.1

Existing methods

pandas.DataFrame already provides the methods .info() and .describe() for summarizing data. Someone has already written these up, so please refer to that article for details (Data overview with Pandas). Here I only show the results (sorry for reusing the same Titanic data...).

import pandas as pd
data = pd.read_csv("train.csv") #Read data
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
data.describe()

[Figure: describe.png, the output of data.describe()]
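If train.csv is not at hand, the two calls can be tried on a small made-up DataFrame. The sketch below is only an illustration: the `toy` data and its values are assumptions, not the Titanic data. It shows that info() prints text (captured here through its `buf` argument), while describe() returns a DataFrame that covers only numeric columns unless `include="all"` is passed.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Sex": ["male", "female", "female", "female"],
})

# info() writes to stdout by default; pass buf= to capture the text instead
buf = io.StringIO()
toy.info(buf=buf)
info_text = buf.getvalue()

# describe() returns a DataFrame; by default only numeric columns are included
desc = toy.describe()

# include="all" adds the non-numeric columns (count, unique, top, freq)
desc_all = toy.describe(include="all")

print(info_text)
print(desc)
print(desc_all)
```

Note that neither result alone shows dtype, missing-value counts, and statistics together.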

Self-made method

However, these two alone leave something to be desired. For example, describe() does not show the dtype or the number of missing values, and it is tedious to call both info() and describe() every time. So I made a function that combines info() and describe().

import pandas as pd

def summarize_data(df):
    """Return one row per column of df, combining info() and describe()."""
    df_summary = pd.DataFrame(index=df.columns)

    df_summary['nunique'] = df.nunique()             # number of distinct values
    df_summary['dtype'] = df.dtypes                  # dtype of each column
    df_summary['isnull'] = df.isnull().sum()         # number of missing values
    df_summary['first_val'] = df.iloc[0]             # value in the first row
    df_summary['max'] = df.max(numeric_only=True)    # NaN for non-numeric columns
    df_summary['min'] = df.min(numeric_only=True)
    df_summary['mean'] = df.mean(numeric_only=True)
    df_summary['std'] = df.std(numeric_only=True)
    df_summary['mode'] = df.mode().iloc[0]           # first mode returned if tied

    # Do not truncate the display (note: this changes a global pandas option)
    pd.set_option('display.max_rows', len(df.columns))

    return df_summary

summarize_data(data)

[Figure: summarize.png, the output of summarize_data(data)]

In addition, in a Kaggle kernel and similar environments, the display is truncated when the data has many columns (each column becomes one row of the summary), so the pd.set_option call near the end of summarize_data() is there to prevent that truncation.
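Because pd.set_option changes the display setting for the whole session, a scoped alternative may be preferable in some cases. The following is a minimal sketch, not part of the original function: the helper name `full_repr` is hypothetical, and it uses `pd.option_context` so that the previous setting is restored when the block exits.

```python
import pandas as pd

def full_repr(df):
    """Return the untruncated text form of df, leaving global options as they were."""
    # display.max_rows=None means "no row limit"; it applies only inside the with-block
    with pd.option_context("display.max_rows", None):
        return repr(df)

# Hypothetical frame with more rows than the default display limit
long_frame = pd.DataFrame({"x": range(100)})

rows_before = pd.get_option("display.max_rows")
text = full_repr(long_frame)
rows_after = pd.get_option("display.max_rows")

print(text)
```

In a notebook, calling `display(df_summary)` inside the same with-block should have a similar effect, assuming IPython is available.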

Summary

I introduced the existing methods for summarizing a pandas.DataFrame and my own function that combines them. Besides giving an overview at the start of an analysis, it can also be used to check whether scaling and missing-value handling were done properly. If you can think of other columns that would be convenient to have, please let me know!
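As an illustration of that second use, here is a rough sketch that compares a summary before and after preprocessing. Everything in it is an assumption made for the example: `quick_summary` is a cut-down stand-in for summarize_data, the toy values are made up, and median filling plus standardization is just one possible preprocessing choice.

```python
import numpy as np
import pandas as pd

def quick_summary(df):
    # Cut-down stand-in for summarize_data: only the columns needed for this check
    return pd.DataFrame({
        "isnull": df.isnull().sum(),
        "mean": df.mean(numeric_only=True),
        "std": df.std(numeric_only=True),
    })

# Hypothetical toy data with one missing value
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# Fill missing values with each column's median, then standardize each column
filled = toy.fillna(toy.median())
processed = (filled - filled.mean()) / filled.std()

before = quick_summary(toy)
after = quick_summary(processed)

print(before)
print(after)
```

After processing, the isnull column should be all zeros, and mean and std should be close to 0 and 1 respectively.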
