[Hands-on for beginners] Read Kaggle's "Forecasting Home Prices" line by line (4th: Filling In Missing Values (Complete))

Theme

This is the fourth installment of my notes on a hands-on session where everyone tackles Kaggle's famous "House Price" problem. It's more of a memo than a commentary, but I hope it helps someone somewhere. By this fourth installment, I feel like the knowledge is gradually accumulating.

Today's work

Filling in missing values (I'll finish it here this week)

What I did up to last time was "get the index (the column names) of the columns that contain missing values as a list". (By the way, I'm a bit confused: Python has several array-like concepts, and it's a hassle.)

# Fill in missing values according to data type:
# 0.0 for float64 columns, 'NA' for object columns.
# (alldata and na_col_list were created in the previous parts.)
na_float_cols = alldata[na_col_list].dtypes[alldata[na_col_list].dtypes=='float64'].index.tolist() # float64 columns
na_obj_cols = alldata[na_col_list].dtypes[alldata[na_col_list].dtypes=='object'].index.tolist() # object columns
# Substitute 0.0 where a float64 column has a missing value
for na_float_col in na_float_cols:
    alldata.loc[alldata[na_float_col].isnull(),na_float_col] = 0.0
# Substitute 'NA' where an object column has a missing value
for na_obj_col in na_obj_cols:
    alldata.loc[alldata[na_obj_col].isnull(),na_obj_col] = 'NA'
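
To try the whole step without the Kaggle files, here is a minimal sketch on a toy table. The column names and values are made up for illustration, and the way `na_col_list` is built here is only my assumption about what the previous part did.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real alldata (illustrative columns and values only)
alldata = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0],                                  # float64 with a gap
    "GarageType": pd.Series(["Attchd", np.nan, "Detchd"], dtype=object),  # object with a gap
    "LotArea": [8450, 9600, 11250],                                       # int64, nothing missing
})

# Assumed reconstruction of na_col_list: columns with at least one missing value
na_counts = alldata.isnull().sum()
na_col_list = na_counts[na_counts > 0].index.tolist()

# Split those columns by data type
na_float_cols = alldata[na_col_list].dtypes[alldata[na_col_list].dtypes == "float64"].index.tolist()
na_obj_cols = alldata[na_col_list].dtypes[alldata[na_col_list].dtypes == "object"].index.tolist()

# Fill float64 gaps with 0.0 and object gaps with 'NA'
for na_float_col in na_float_cols:
    alldata.loc[alldata[na_float_col].isnull(), na_float_col] = 0.0
for na_obj_col in na_obj_cols:
    alldata.loc[alldata[na_obj_col].isnull(), na_obj_col] = "NA"

print(alldata)
```

After the two loops, no cell in the toy table is missing any more, and the untouched `LotArea` column is unchanged.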

Numeric columns that have missing values

alldata[na_col_list].dtypes[alldata[na_col_list].dtypes=='float64'].index.tolist()

[Screenshot: output of the expression above]

Categorical (object) columns that have missing values

alldata[na_col_list].dtypes[alldata[na_col_list].dtypes=='object'].index.tolist()

[Screenshot: output of the expression above]
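
Since these one-liners pack several steps together, here is a small sketch that splits one of them apart. The table `df` and its column names are hypothetical stand-ins for `alldata[na_col_list]`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of alldata[na_col_list]
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan],
    "MasVnrArea": [np.nan, 196.0],
    "GarageType": pd.Series(["Attchd", np.nan], dtype=object),
})

dtypes = df.dtypes               # Series: index = column names, values = data types
is_float = dtypes == "float64"   # boolean Series, one True/False per column
float_cols = dtypes[is_float].index.tolist()          # keep the True entries, take their names
obj_cols = dtypes[dtypes == "object"].index.tolist()

print(float_cols)  # ['LotFrontage', 'MasVnrArea']
print(obj_cols)    # ['GarageType']

# pandas also has select_dtypes, which gives the same float64 columns here
float_cols_alt = df.select_dtypes(include="float64").columns.tolist()
```

So "index" in the one-liner means the row labels of the `dtypes` Series, which happen to be the column names of the table.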

Fill in the missing values of numeric type

for na_float_col in na_float_cols:
    alldata.loc[alldata[na_float_col].isnull(),na_float_col] = 0.0

Once again, about for ... in

Let me go over for once more. Compared with how it is written in PHP, the order of the loop variable and the collection is reversed (I'm not sure I have that right).
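
A small sketch of that comparison. The list contents are illustrative, and the PHP line appears only as a comment for reference.

```python
na_float_cols = ["LotFrontage", "MasVnrArea"]  # illustrative column names

# PHP:    foreach ($na_float_cols as $na_float_col) { ... }   collection first, variable second
# Python: for na_float_col in na_float_cols:                  variable first, collection second
seen = []
for na_float_col in na_float_cols:
    print(na_float_col)        # one column name per pass through the loop
    seen.append(na_float_col)
```

Each pass binds `na_float_col` to the next element of the list, so the loop body runs once per column name.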

I keep coming back to it, but: .isnull()

Try printing na_float_col and alldata[na_float_col]. For now, printing the values is the tried-and-true way to check what a loop is doing.
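
As a sketch of what that check looks like, here is a one-column toy table (the values are made up) with the loop variable fixed to a single name.

```python
import numpy as np
import pandas as pd

# Toy table with one numeric column and one missing value
alldata = pd.DataFrame({"LotFrontage": [65.0, np.nan, 80.0]})
na_float_col = "LotFrontage"

print(na_float_col)             # the column name, a plain string
print(alldata[na_float_col])    # the column itself, a pandas Series

mask = alldata[na_float_col].isnull()   # True where the value is missing
print(mask.tolist())            # [False, True, False]
```

`.isnull()` does not change anything; it only returns a True/False Series with the same length as the column.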

About .loc

alldata.loc[alldata[na_float_col].isnull(),na_float_col]

Set a value for the missing values

Specify the cells by row and column, and put 0.0 only where the value is missing.

alldata.loc[alldata[na_float_col].isnull(),na_float_col] = 0.0
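
The row-and-column selection can be tried on a toy table like this. The values are illustrative, and the `fillna` lines at the end are an alternative I am adding for comparison, not something the hands-on used.

```python
import numpy as np
import pandas as pd

alldata = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 80.0],
    "LotArea": [8450.0, 9600.0, 11250.0],
})
na_float_col = "LotFrontage"

rows = alldata[na_float_col].isnull()     # first argument of .loc: which rows
print(alldata.loc[rows, na_float_col])    # second argument: which column

alldata.loc[rows, na_float_col] = 0.0     # overwrite only the selected cells

# Alternative (not from the hands-on): fillna gives the same result for one column
filled = pd.DataFrame({"LotFrontage": [65.0, np.nan, 80.0]})
filled["LotFrontage"] = filled["LotFrontage"].fillna(0.0)
```

Only the cells where `rows` is True are touched; the other rows and the `LotArea` column stay as they were.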

Filling in missing values for categorical variables

Result of filling in the missing values

There are too many items to check each one in detail, but this should do.

Output of alldata: [Screenshot: output of alldata]

Dummy encoding of categorical variables

I was going to do this too, but I've run out of time, so I'll stop here as preparation for "dummy-encoding categorical variables". Is it something like turning the categories into numbers so they can be analyzed...??
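
As a rough preview of the idea, here is a minimal sketch using `pd.get_dummies`. Whether the hands-on uses this function is my assumption, and the column and its values are made up.

```python
import pandas as pd

# One categorical column after the missing value was replaced with 'NA'
df = pd.DataFrame({"GarageType": ["Attchd", "NA", "Detchd"]})

# One new 0/1 column per category; dtype=int gives integers instead of booleans
dummies = pd.get_dummies(df, columns=["GarageType"], dtype=int)

print(dummies.columns.tolist())
# ['GarageType_Attchd', 'GarageType_Detchd', 'GarageType_NA']
print(dummies)
```

Each row gets a 1 in the column for its own category and 0 elsewhere, which is one way of turning categories into numbers a model can use.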

That's it.

Filling in the missing values took longer than I expected. I wonder if this is the Python trap of packing everything into one line (hopefully it's not a trap at all).

The actual processing is getting closer, and I'm getting excited.
