When analyzing data, you often want to check what the dataset looks like as a whole. This post covers how to get an overview of an entire dataset in pandas. First I summarize the existing methods, and then I introduce my own.
Python 3.7.4, pandas 0.25.1
pandas.DataFrame already has the methods .info() and .describe() for summarizing data. Someone has already written these up, so please refer to that post (Data overview with Pandas). Here I only show the results briefly (apologies for reusing the same Titanic data ...).
import pandas as pd
data = pd.read_csv("train.csv")  # Read data
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
data.describe()
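The output of describe() is not reproduced here, so the following is a minimal sketch of what it returns. It uses a small made-up frame (the name `toy` and its values are assumptions for illustration, not the real train.csv) and shows the default call next to `include="all"`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv (not the real Titanic data).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Sex": ["male", "female", "female", "female"],
})

# By default describe() summarizes only the numeric columns
# (count, mean, std, min, quartiles, max).
numeric_stats = toy.describe()
print(numeric_stats)

# include="all" also covers non-numeric columns (count, unique, top, freq).
all_stats = toy.describe(include="all")
print(all_stats)
```

The `count` row excludes missing values, so comparing it with the number of rows hints at missing data, but the dtype of each column is still not shown.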
However, each of these on its own leaves something to be desired. For example, describe() does not show the dtypes or the missing-value counts, and calling both info() and describe() every time is tedious. So I wrote a function that combines info() and describe().
import numpy as np

def summarize_data(df):
    # One row per column of df; the zero-filled 'nunique' column only fixes the index
    df_summary = pd.DataFrame({'nunique': np.zeros(df.shape[1])}, index=df.keys())
    df_summary['nunique'] = df.nunique()             # number of distinct values
    df_summary['dtype'] = df.dtypes                  # column type
    df_summary['isnull'] = df.isnull().sum()         # number of missing values
    df_summary['first_val'] = df.iloc[0]             # value in the first row
    df_summary['max'] = df.max(numeric_only=True)    # numeric columns only; others become NaN
    df_summary['min'] = df.min(numeric_only=True)
    df_summary['mean'] = df.mean(numeric_only=True)
    df_summary['std'] = df.std(numeric_only=True)
    df_summary['mode'] = df.mode().iloc[0]           # first of the most frequent values
    pd.set_option('display.max_rows', len(df.keys()))  # Do not truncate the display
    return df_summary

summarize_data(data)
In addition, in a Kaggle kernel and similar environments the display is truncated when the data has many columns (the summary has one row per column), so the pd.set_option call just before the return in summarize_data() keeps the output from being truncated.
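Note that pd.set_option changes the display setting for the rest of the session. As an alternative, here is a minimal sketch using pd.option_context, which applies the setting only inside a `with` block. The frame `wide` and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical wide frame: 100 columns, so a per-column summary has 100 rows.
wide = pd.DataFrame({f"col_{i}": [i, i + 1] for i in range(100)})
per_column = wide.dtypes.to_frame(name="dtype")

default_max_rows = pd.get_option("display.max_rows")

# Inside the block every row is rendered; on exit the previous setting is restored.
with pd.option_context("display.max_rows", len(per_column)):
    full_text = repr(per_column)

print(full_text)
```

Which one to use is a matter of taste: set_option is convenient in a notebook where you always want the full view, while option_context leaves the global setting alone.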
I introduced the existing methods for summarizing a pandas.DataFrame and my own function that combines them. Besides getting an overview at the start, you can also use it to check whether scaling and missing-value handling were applied properly. If you think another column would be useful to add, please let me know!
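As a rough sketch of that kind of check, the example below compares a summary before and after preprocessing. The helper `missing_and_scale`, the toy values, and the choice of median imputation and standardization are all assumptions made for illustration, not part of the function introduced in this post.

```python
import numpy as np
import pandas as pd

def missing_and_scale(df):
    """Per-column missing count, mean and std (a cut-down, hypothetical helper)."""
    return pd.DataFrame({
        "isnull": df.isnull().sum(),
        "mean": df.mean(numeric_only=True),
        "std": df.std(numeric_only=True),
    })

# Made-up numeric data with one missing Age.
raw = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

processed = raw.copy()
# Fill the missing Age with the median (one possible choice).
processed["Age"] = processed["Age"].fillna(processed["Age"].median())
# Standardize Fare to mean 0 and standard deviation 1.
processed["Fare"] = (processed["Fare"] - processed["Fare"].mean()) / processed["Fare"].std()

before = missing_and_scale(raw)
after = missing_and_scale(processed)
print(before)
print(after)
```

If the preprocessing worked, `isnull` for Age drops to 0 and Fare shows a mean near 0 and a std near 1 in the second summary.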