A memo on how to delete columns that contain missing values.
Data used: Kaggle Courses, rent data from Intermediate Machine Learning (Missing Values)
Environment: Kaggle notebook
Preparation: importing the modules and reading the data
DropColumn.py
# Import os and pandas
import os
import pandas as pd
# Read the data, using the Id column as the index
X_full = pd.read_csv('../input/train.csv', index_col='Id')
X_full has the following columns:
DropColumn.py
X_full.columns
Among them, the columns that contain missing values can be found as follows:
DropColumn.py
cols_missing = [col for col in X_full.columns
                if X_full[col].isnull().any()]
cols_missing
Delete all of these columns at once.
DropColumn.py
reduced_X_full=X_full.drop(cols_missing,axis=1)
reduced_X_full
Deletion completed.
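As a small self-contained check of the steps above, the following sketch applies the same list comprehension and `drop` call to a toy DataFrame. The frame `toy` and its column names are made up for illustration and are not part of the Kaggle data. `dropna(axis=1)` is included as a one-step alternative that should give the same result here.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical, not the Kaggle data):
# columns "b" and "c" each contain one missing value
toy = pd.DataFrame({
    "a": [1, 2, 3],
    "b": [1.0, np.nan, 3.0],
    "c": ["x", None, "z"],
    "d": [10, 20, 30],
})

# Same approach as above: collect the columns with at least one missing value
toy_cols_missing = [col for col in toy.columns
                    if toy[col].isnull().any()]

# Drop them all at once
reduced_toy = toy.drop(toy_cols_missing, axis=1)

# One-step alternative: drop every column that contains a missing value
reduced_toy_alt = toy.dropna(axis=1)

print(toy_cols_missing)           # ['b', 'c']
print(list(reduced_toy.columns))  # ['a', 'd']
```

The list comprehension is handy when you also want to inspect which columns were removed, while `dropna(axis=1)` is shorter when you only need the result.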
How to use scikit-learn's SimpleImputer
SimpleImputer fills in (imputes) missing values using statistics such as the median or the mean.
For example, to impute with the median, specify `imputer = SimpleImputer(strategy='median')`.
ImputeValue.py
# Define the imputer
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='median')
# Impute the missing values in X_full
imputed_X_full = pd.DataFrame(imputer.fit_transform(X_full))
As it stands, the column names of `imputed_X_full` are just sequential integers.
ImputeValue.py
imputed_X_full.columns
Restore the original column names:
ImputeValue.py
imputed_X_full.columns=X_full.columns
imputed_X_full.columns
Imputation completed.
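Two details are worth keeping in mind with this approach. The median strategy only works on numeric data, so `fit_transform` raises an error if the DataFrame still contains string columns, and wrapping the result in `pd.DataFrame` resets the index as well as the column names. The following is a minimal sketch on a toy DataFrame that selects the numeric columns first and restores both the column names and the index. The frame `toy` and its column names are hypothetical, and whether non-numeric columns should be dropped or handled separately depends on your data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical, not the Kaggle data), indexed by Id
toy = pd.DataFrame(
    {
        "area": [50.0, np.nan, 70.0, 90.0],
        "rooms": [2.0, 3.0, np.nan, 4.0],
        "city": ["A", "B", "C", "D"],
    },
    index=pd.Index([101, 102, 103, 104], name="Id"),
)

# The median can only be computed for numeric columns
toy_num = toy.select_dtypes(include="number")

# Impute with the median, keeping the column names and the index
imputer = SimpleImputer(strategy="median")
imputed_toy = pd.DataFrame(
    imputer.fit_transform(toy_num),
    columns=toy_num.columns,
    index=toy_num.index,
)

print(imputed_toy)
```

Here the missing `area` value is replaced by 70.0 (the median of 50, 70 and 90) and the missing `rooms` value by 3.0 (the median of 2, 3 and 4).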