Using the Titanic data as an example, this is a note on the first things I do to get a feel for a dataset's characteristics.
In most cases pandas-profiling
is probably the better choice, since it gives you far more detailed information.
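As a rough sketch of that route (my addition; the exact API can differ between pandas-profiling versions):

import pandas as pd
from pandas_profiling import ProfileReport  # pip install pandas-profiling

df = pd.read_csv("titanic.csv")  # any DataFrame works here
profile = ProfileReport(df, title="Titanic EDA")
profile.to_file("titanic_report.html")  # full HTML report with per-column stats

Below, however, is the lighter-weight manual check this post is about.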
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 100)
import warnings
warnings.filterwarnings('ignore')
import collections
Data preparation
!wget https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv
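If !wget is not available (the command above assumes a Colab-like environment), pandas can also read the CSV straight from the URL; this alternative is my own note, not part of the original steps.

# Alternative: skip the download and read directly from the URL
df = pd.read_csv("https://web.stanford.edu/class/archive/cs/cs109/cs109.1166/stuff/titanic.csv")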
Data reading & a little processing
filename = "/content/titanic.csv"
df = pd.read_csv(filename, encoding='utf-8')
# Artificially inject some NaN values for the demo
df["Name"] = [di if np.random.rand() > 0.1 else float("nan") for di in df["Name"]]
df["Sex"] = [di if np.random.rand() > 0.01 else float("nan") for di in df["Sex"]]
df["Age"] = [di if np.random.rand() > 0.05 else float("nan") for di in df["Age"]]
# Family name: take the last space-separated token of the Name field
df["f_Name"] = [str(di).split(" ")[-1] if len(str(di).split(" ")) > 1 else float("nan") for di in df["Name"]]
Running the code above gives a data frame with the injected NaNs and the new f_Name column (a quick peek is shown below).
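For a quick look (my addition):

print(df.shape)   # number of rows and columns
df.head()         # first few rows, including the new f_Name column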
I will use collections.Counter
later, but if the missing values are distinct float("nan")
objects they are not aggregated into a single key, so replace them with np.nan
first. For details, see the Stack Overflow question in the references below.
df = df.replace(float("nan"), np.nan)
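A minimal illustration of the issue (my own example, following the Stack Overflow answer linked in the references):

import collections
import numpy as np

# Each float("nan") call returns a distinct object, and NaN != NaN,
# so Counter (a dict underneath) keeps them as separate keys.
print(collections.Counter([float("nan"), float("nan"), float("nan")]))
# -> three separate nan keys, each with count 1

# np.nan is a single shared object, so the dict's identity check matches
# and all occurrences are aggregated into one key.
print(collections.Counter([np.nan, np.nan, np.nan]))
# -> Counter({nan: 3})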
Next, specify the type of each column by hand: the target, the categorical columns, and the numerical columns.
target = "Survived"
cate_list = ["Pclass", "Name", "f_Name", "Sex", "Siblings/Spouses Aboard", "Parents/Children Aboard"]
num_list = ["Age", "Fare"]
all_list = cate_list+num_list
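As a small sanity check (my addition), the manual split can be compared against the actual dtypes:

# dtypes are only a hint: Pclass etc. are integers but treated as categorical here
print(df[all_list].dtypes)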
The following is the main process.
n = df.shape[0]
max_n_unique = 10

n_unique_list = []
min_data_list = []
max_data_list = []
major_data_rate_list = []  # categorical columns only

for colname in all_list:
    if colname in cate_list:  # categorical
        n_unique = len(df[colname].unique())
        min_data = np.nan
        max_data = np.nan
        if n_unique > max_n_unique:  # if there are many categories
            c = collections.Counter(df[colname])
            c_dict = dict(c.most_common(max_n_unique - 1))
            # k_list = [k for k, v in c_dict.items()]
            v_list = [v / n for k, v in c_dict.items()]
            major_data_rate = np.sum(v_list)
        else:
            major_data_rate = np.nan
    else:  # numerical
        n_unique = np.nan
        major_data_rate = np.nan
        min_data = df[colname].min()
        max_data = df[colname].max()
    n_unique_list.append(n_unique)
    major_data_rate_list.append(major_data_rate)
    min_data_list.append(min_data)
    max_data_list.append(max_data)

have_nan = df.loc[:, all_list].isnull().any(axis=0)
nan_rate = df.loc[:, all_list].isnull().sum(axis=0) / n

summary_df = pd.DataFrame({"colname": all_list,
                           "have_nan": have_nan.values,
                           "nan_rate": nan_rate.values,
                           "n_unique": n_unique_list,
                           "major_data_rate": major_data_rate_list,
                           "min_data": min_data_list,
                           "max_data": max_data_list
                           })
This gives you a data frame that summarizes the characteristics of each variable.
major_data_rate
uses the number specified by max_n_unique
: the most frequent values, up to the top max_n_unique - 1, are treated as the "major" categories, and their share of all rows is computed. (The assumption is that everything outside those top categories will be lumped into an "others" bucket in later processing; a sketch of that step follows.)
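That later step is not shown in this post; here is a minimal sketch of how the lumping could look (the function name and defaults are my own assumptions):

def lump_minor_categories(s, max_n_unique=10, other_label="others"):
    # keep the max_n_unique - 1 most frequent values, lump everything else
    major = [k for k, _ in collections.Counter(s.dropna()).most_common(max_n_unique - 1)]
    return s.where(s.isin(major) | s.isna(), other_label)

# e.g. df["f_Name"] = lump_minor_categories(df["f_Name"])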
References
Stack Overflow: Why does collections.Counter treat numpy.nan as equal?
CS109: A Titanic Probability
GitHub: pandas-profiling