Theme

This is Part 5 of my hands-on notes on Kaggle's famous "House Prices" problem, the one everybody tries. It's more of a memo than a commentary, but I hope it helps someone somewhere. I'd like to think the end is finally in sight.

Original theme: https://www.kaggle.com/c/house-prices-advanced-regression-techniques
Referenced article: https://yolo-kiyoshi.com/2018/12/17/post-1003/

Today's work

Dummy-encoding categorical variables

Roughly speaking, this replaces character strings with numbers: each category value gets its own indicator column.

Reference: https://markezine.jp/article/detail/20790
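To make "replacing strings with numbers" concrete, here is a minimal hand-written sketch in plain Python. The column name and values are made up for illustration, and this is not how pandas does it internally; it only shows the idea of one 0/1 column per category.

```python
# Hypothetical toy column; the values are made up for illustration.
street = ["Pave", "Grvl", "Pave"]

# Collect the distinct categories in a stable (sorted) order.
categories = sorted(set(street))

# Build one 0/1 column per category: 1 where the row has that category.
dummies = {
    f"Street_{cat}": [1 if value == cat else 0 for value in street]
    for cat in categories
}

for name, column in dummies.items():
    print(name, column)
# Street_Grvl [0, 1, 0]
# Street_Pave [1, 0, 1]
```

Each row has exactly one 1 across the new columns, which is why this is also called one-hot encoding.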

# Assumes alldata is the combined DataFrame built in the earlier parts
import pandas as pd
# List the categorical (object dtype) features
cat_cols = alldata.dtypes[alldata.dtypes=='object'].index.tolist()
# List the numerical features
num_cols = alldata.dtypes[alldata.dtypes!='object'].index.tolist()
# List the columns needed for splitting the data and for submission
other_cols = ['Id','WhatIsData']
# Remove the extra elements from the lists
cat_cols.remove('WhatIsData') # Remove the training/test data flag
num_cols.remove('Id') # Remove Id
# Dummy-encode the categorical variables
alldata_cat = pd.get_dummies(alldata[cat_cols])
# Combine the data
all_data = pd.concat([alldata[other_cols],alldata[num_cols],alldata_cat],axis=1)

List features of categorical variables

.dtypes: Covered in Part 3. It returns the data type of each column.
.index: Covered in Part 4. It extracts only the index labels of the selected entries (here, the column names).
.tolist(): Also appeared in Part 4. It converts what .index extracted into a Python list.

Oh, I feel like the earlier parts are starting to add up. A strange sense of progress. So here I'll just show the result of the whole line at once. Only the columns whose data type is object end up in the list.

cat_cols = alldata.dtypes[alldata.dtypes=='object'].index.tolist()

[Screenshot: output of cat_cols]
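Since the real alldata isn't reproduced in this memo, here is a small sketch of the same dtype-based selection on a toy DataFrame. The frame toy and its values are hypothetical stand-ins, not the actual competition data.

```python
import pandas as pd

# Hypothetical stand-in for alldata; names and values are made up.
toy = pd.DataFrame({
    "Id": [1, 2, 3],
    "LotArea": [8450.0, 9600.0, 11250.0],
    # Force object dtype so the example behaves the same across pandas versions.
    "Street": pd.Series(["Pave", "Grvl", "Pave"], dtype=object),
})

dtypes = toy.dtypes                  # Series: column name -> dtype
is_object = dtypes == "object"       # boolean mask, one entry per column

cat_cols = dtypes[is_object].index.tolist()    # columns with object dtype
num_cols = dtypes[~is_object].index.tolist()   # everything else

print(cat_cols)  # ['Street']
print(num_cols)  # ['Id', 'LotArea']
```

Note that Id lands in num_cols because it is an integer column, which is why it has to be removed by hand later.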

List the features of numerical variables

num_cols = alldata.dtypes[alldata.dtypes!='object'].index.tolist()

This works the same way as the categorical case, only with != instead of ==, so I'll skip the explanation.

List columns required for data splitting and submission

other_cols = ['Id','WhatIsData']

As you can see, this list holds Id together with WhatIsData, the column added in Part 2. These are the same names that get removed from cat_cols and num_cols in the next step.
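As a rough sketch of why a flag column like WhatIsData is worth keeping, the combined data can later be split back apart with it. The flag values "Train" and "Test" below are my assumption for illustration; check what Part 2 actually stored.

```python
import pandas as pd

# Hypothetical combined frame; the flag values "Train"/"Test" are assumed.
toy = pd.DataFrame({
    "Id": [1, 2, 3, 4],
    "WhatIsData": ["Train", "Train", "Test", "Test"],
    "LotArea": [8450, 9600, 11250, 9550],
})

# Select rows by flag, then drop the flag since it is not a feature.
train = toy[toy["WhatIsData"] == "Train"].drop(columns="WhatIsData")
test = toy[toy["WhatIsData"] == "Test"].drop(columns="WhatIsData")

print(train["Id"].tolist())  # [1, 2]
print(test["Id"].tolist())   # [3, 4]
```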

Remove extra elements from the list

This step removes the unneeded elements from the lists. You can confirm from the earlier output that cat_cols contained an item called WhatIsData. Likewise, Id is a numeric column, so it ended up in num_cols.

cat_cols.remove('WhatIsData') # Remove the training/test data flag
num_cols.remove('Id') # Remove Id

.remove(): Deletes the matching item from the list (the first one equal to the given value).
.remove() reference: https://www.javadrive.jp/python/list/index8.html
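A quick sketch of how list.remove() behaves, using a made-up list. Two details are easy to miss: it changes the list in place, and it raises ValueError when the value isn't there.

```python
cols = ["Id", "LotArea", "Id"]  # made-up list with a duplicate on purpose

cols.remove("Id")   # removes only the first "Id", modifying the list in place
print(cols)         # ['LotArea', 'Id']

try:
    cols.remove("SalePrice")   # this value is not in the list
except ValueError as err:
    missing_error = str(err)
    print("ValueError:", err)
```

So in the walkthrough, running the remove lines a second time on the same lists would fail, because WhatIsData and Id are already gone.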

Dummy-encode the categorical variables

alldata_cat = pd.get_dummies(alldata[cat_cols])

pd.get_dummies: As the name suggests, it converts the values in the given columns into dummy (indicator) variables.
pd.get_dummies reference: https://note.nkmk.me/python-pandas-get-dummies/

I was oddly impressed. It's so convenient: you just pass the data to one function and it does everything for you... This is the kind of thing I like about Python.

Output of `alldata_cat = pd.get_dummies(alldata[cat_cols])`. It's amazing, it really has changed.

[Screenshot: alldata_cat after dummy encoding]
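In place of the screenshot, here is a small reproducible sketch of what pd.get_dummies does to string columns. The toy frame and its values are hypothetical; the point is the shape of the result and how the new columns are named.

```python
import pandas as pd

# Hypothetical toy data; column names and values are made up.
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "CentralAir": ["Y", "Y", "N"],
})

toy_cat = pd.get_dummies(toy[["Street", "CentralAir"]])

# New columns are named <original column>_<category>.
print(toy_cat.columns.tolist())
# ['Street_Grvl', 'Street_Pave', 'CentralAir_N', 'CentralAir_Y']

# Depending on the pandas version the values are 0/1 or True/False;
# astype(int) shows them as 0/1 either way.
print(toy_cat.astype(int))
```

Two string columns with two categories each become four indicator columns, and the original string columns are gone from the result.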

Data integration

all_data = pd.concat([alldata[other_cols],alldata[num_cols],alldata_cat],axis=1)

This does exactly what it looks like: it combines alldata[other_cols], alldata[num_cols], and alldata_cat side by side (axis=1) with pd.concat. (I've gotten to the point where I can say "exactly what it looks like".)
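Here is a minimal sketch of the same column-wise concat on three tiny made-up frames, standing in for alldata[other_cols], alldata[num_cols], and alldata_cat.

```python
import pandas as pd

# Hypothetical stand-ins for the three pieces; values are made up.
ids = pd.DataFrame({"Id": [1, 2, 3]})
nums = pd.DataFrame({"LotArea": [8450, 9600, 11250]})
cats = pd.DataFrame({"Street_Grvl": [0, 1, 0], "Street_Pave": [1, 0, 1]})

# axis=1 places the frames side by side, matching rows by index label.
combined = pd.concat([ids, nums, cats], axis=1)

print(combined.shape)             # (3, 4)
print(combined.columns.tolist())  # ['Id', 'LotArea', 'Street_Grvl', 'Street_Pave']
```

Because rows are matched by index label, this works cleanly in the walkthrough: all three pieces are sliced from the same alldata, so they share the same index.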

That's it.

I think this one went at a good pace. Surprisingly, reading and understanding the code isn't taking much time anymore. It feels like I'm getting used to it. I'll keep working at it. Now that the data has been formatted, it's finally time to analyze it. I'm looking forward to it.

[Hands-on for beginners] Read kaggle's "Predicting Home Prices" line by line (Part 5: Dummy categorical variables)