How to check for missing values (Kaggle: House Prices)

Introduction

Checking the contents of your data is one of the most important steps in data analysis. This post introduces a way to check for missing values that even non-engineers can use.

Loading the dataset

Import pandas to load the dataset. This time, we'll use data from train.csv in Kaggle's House Prices: Advanced Regression Techniques.

House Prices: Advanced Regression Techniques https://www.kaggle.com/c/house-prices-advanced-regression-techniques

import pandas as pd
data = pd.read_csv('../train.csv')

Displaying columns in descending order of missing values

Assign the dataset you want to check to df. Here, we use the train.csv data loaded above.

# How to check missing values
df = data  # Assign the dataset to df
total = df.isnull().sum()  # Number of missing values per column
percent = round(df.isnull().sum() / len(df) * 100, 2)  # Share of missing values per column (%)

missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Ratio_of_NA(%)'])
# Named col_types rather than type so the built-in type() is not shadowed
col_types = df.dtypes.to_frame('Types')
missing_data = pd.concat([missing_data, col_types], axis=1)
missing_data = missing_data.sort_values('Total', ascending=False)
missing_data.head(20)  # In a notebook, this displays the 20 columns with the most missing values

print(missing_data.head(20))
print()
print(set(missing_data['Types']))
print()
print("---Categorical col---")
print(missing_data[missing_data['Types']=="object"].index)
print()
print("---Numerical col---")
print(missing_data[missing_data['Types'] !="object"].index)

[Image: missingvalue.PNG (output of the code above)]
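To reuse the summary on other datasets, the steps above can be wrapped in a function. The following is a minimal sketch, not the original author's code: the function name missing_summary and the toy DataFrame are assumptions made for illustration, and the toy values are invented rather than taken from train.csv.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts, percentages, and dtypes, sorted by count."""
    total = df.isnull().sum()
    # The mean of a boolean frame is the fraction of True values per column
    percent = (df.isnull().mean() * 100).round(2)
    summary = pd.DataFrame({
        'Total': total,
        'Ratio_of_NA(%)': percent,
        'Types': df.dtypes,
    })
    return summary.sort_values('Total', ascending=False)


# Toy data standing in for train.csv (values are made up for illustration)
toy = pd.DataFrame({
    'LotFrontage': [65.0, np.nan, 68.0, np.nan],
    'Alley': [np.nan, np.nan, np.nan, 'Pave'],
    'SalePrice': [208500, 181500, 223500, 140000],
})

summary = missing_summary(toy)
print(summary)
```

With the real data, calling missing_summary(data).head(20) should give the same kind of table as the code above.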

Visualization of missing values

The code above tells you the percentage of missing values in each column. Sometimes, however, you also want to know where in the data the missing values occur, for example in a time-series dataset. In that case, use a heatmap.

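import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
# Jupyter Notebook only; remove the next line when running as a plain script
%matplotlib inline

df = data  # Assign the dataset to df
plt.figure(figsize=(16, 16))  # Adjust the figure size
plt.title("Missing Value")  # Set the title
sns.heatmap(df.isnull(), cbar=False)  # Draw the heatmap: one cell per row and column, marking missing entries
plt.show()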

[Image: heat.png (heatmap of missing values)]
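The question of where the missing values sit can also be answered numerically, as a complement to the heatmap. The sketch below computes the share of missing values per column within consecutive blocks of rows. It is only an illustration under stated assumptions: the function name missing_by_row_block, the block_size parameter, and the toy time-series values are all hypothetical and do not come from the original post.

```python
import numpy as np
import pandas as pd


def missing_by_row_block(df: pd.DataFrame, block_size: int) -> pd.DataFrame:
    """Percentage of missing values per column within consecutive blocks of rows."""
    # Integer block label for each row position: 0, 0, ..., 1, 1, ...
    block = np.arange(len(df)) // block_size
    return (df.isnull().groupby(block).mean() * 100).round(1)


# Toy time-ordered data (values are made up for illustration)
toy = pd.DataFrame({
    'temperature': [20.1, 20.5, np.nan, np.nan, 21.4, 21.9],
    'humidity': [55.0, 54.0, 53.0, 52.0, np.nan, np.nan],
})

blocks = missing_by_row_block(toy, block_size=2)
print(blocks)
```

Each row of the result corresponds to one block of rows in the input, so a block with a high percentage points to the stretch of the data where values are missing.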

Summary

By assigning a different dataset to df in each code snippet, you can automatically determine whether each column is a text (object) type or a numeric type, and visualize where its missing values are.
