Please take a look at the previous content! https://qiita.com/lindq_yu/items/4f8e3e1d28df0c693d4f
Before engineering features, confirm the properties of the data. In this dataset, PAY_AMT, BILL_AMT, AGE, and LIMIT_BAL are numerical data, while SEX, MARRIAGE, and EDUCATION are categorical data. PAY mixes the two: it has categorical states (revolving payment, paid successfully, no payment) alongside numerical values that give the length of a delay as 1 month, 2 months, 3 months, and so on.
I would like to process each of these carefully.
There are several ways to do feature engineering on categorical variables; here I use the standard one, one-hot encoding.
The description of this data does not list a value of "0" for EDUCATION or MARRIAGE, but "0" does appear in this dataset.
EDUCATION
5 and 6 are unknown. Originally these two were probably "unknown" values that carried some meaning. In a business setting (I don't really know, because I'm a student ...) I could ask the person who entered the data, the questionnaire author, or the dataset creator, but that isn't possible this time, so I merge the unknown values and "0" into others (4). There are only 14 rows with "0".
MARRIAGE
This item also has 3 as others, an answer that can cover several different meanings. Normally this is also something to check with the person in charge, but unfortunately that cannot be done, so "0" is merged into others (3).
Process the dataset based on the above.
↓ Data set description
python
#Data extraction (copy so the later assignments do not touch dataset)
category=dataset.loc[:,["SEX","MARRIAGE","EDUCATION"]].copy()
#Counting the number of SEX appearances
#print("SEX value count")
#print(category["SEX"].value_counts())
#print("")
#Counting the number of appearances of MARRIAGE
#print("MARRIAGE value count")
#print(category["MARRIAGE"].value_counts())
#print("")
#Counting the number of occurrences of EDUCATION
#print("EDUCATION")
#print(category["EDUCATION"].value_counts())
#MARRIAGE: convert "0" (unknown) to "3" (others)
category["MARRIAGE"] = category["MARRIAGE"].replace(0,3)
#EDUCATION: convert "0", "5", "6" (all unknown) to "4" (others)
category["EDUCATION"] = category["EDUCATION"].replace([0,5,6],4)
#Confirmation of category
category
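As a quick sanity check of the merging step, here is a minimal sketch on made-up toy data (not the real dataset). The names `toy`, `marriage_map`, and `education_map` are hypothetical and only serve the illustration; the mapping itself follows the rule described above.

```python
import pandas as pd

# Toy data (made up for illustration, not taken from the real dataset)
toy = pd.DataFrame({
    "MARRIAGE": [1, 2, 3, 0, 2],
    "EDUCATION": [1, 2, 0, 5, 6],
})

# Hypothetical mapping tables: undocumented or unknown code -> "others" code
marriage_map = {0: 3}
education_map = {0: 4, 5: 4, 6: 4}

cleaned = toy.copy()
cleaned["MARRIAGE"] = cleaned["MARRIAGE"].replace(marriage_map)
cleaned["EDUCATION"] = cleaned["EDUCATION"].replace(education_map)

# Only the documented codes should remain after the merge
print(sorted(cleaned["MARRIAGE"].unique().tolist()))
print(sorted(cleaned["EDUCATION"].unique().tolist()))
```

Keeping the mapping in a dictionary makes it easy to revise later if the meaning of the unknown codes turns out to be available after all.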
Click here for the converted result ↓
Convert these categorical variables with one-hot encoding and store the result in onehot_category.
python
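import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
#The encoder returns a sparse matrix by default, so convert it with toarray()
enc = OneHotEncoder(categories="auto", dtype=np.float32)
onehot_X = enc.fit_transform(category).toarray()
#Column names must follow the column order of category: SEX, MARRIAGE, EDUCATION
onehot_category= pd.DataFrame(data = onehot_X,columns = ["male","female","married", "single","MARR-others","graduate school","university", "high school", "EDU-others"],index = category.index)
onehot_category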
Conversion completed!
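Hand-typed column names can silently drift out of step with the order in which the encoder emits its columns. One way to guard against that is to build the names from the fitted encoder's `categories_` attribute. The following is a minimal sketch on made-up toy data; the `<column>_<code>` naming scheme and the variable names are assumptions for illustration, not the naming used in this article.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data (made up); the column order mirrors SEX, MARRIAGE, EDUCATION
toy = pd.DataFrame({
    "SEX": [1, 2, 2, 1],
    "MARRIAGE": [1, 2, 3, 1],
    "EDUCATION": [1, 2, 3, 4],
})

enc = OneHotEncoder(categories="auto", dtype=np.float32)
onehot = enc.fit_transform(toy).toarray()

# Build "<column>_<code>" names in exactly the order the encoder emits them
names = [
    f"{col}_{code}"
    for col, codes in zip(toy.columns, enc.categories_)
    for code in codes
]
onehot_toy = pd.DataFrame(onehot, columns=names, index=toy.index)
print(onehot_toy.columns.tolist())
```

Each row has exactly one 1 per original column, so every row of `onehot_toy` sums to 3 here.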
Numerical data is left untouched. Numerical data can also be transformed, but some models are more accurate without the transformation, so I leave it as is for now.
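For reference, this is roughly what such a transformation could look like. The article does not say which one it has in mind; standardization with scikit-learn's `StandardScaler` is assumed here purely as one common example, and the toy values are made up.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy numerical data (made up for illustration)
toy = pd.DataFrame({
    "AGE": [24, 35, 41, 52],
    "LIMIT_BAL": [20000, 120000, 90000, 500000],
})

# Standardization: subtract each column's mean, divide by its standard deviation
scaler = StandardScaler()
scaled = pd.DataFrame(
    scaler.fit_transform(toy), columns=toy.columns, index=toy.index
)
print(scaled.round(2))
```

Tree-based models are generally insensitive to this kind of rescaling, which is one reason leaving the raw values in place is a reasonable starting point.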
PAY
- Number of revolving payments (Revo)
- Number of times you paid successfully (Could)
- Number of times payment could not be made (Could not)
- Number of times no payment was made (Not)
- Number of months you couldn't pay
Create variables for the number of revolving payments (Revo), the number of successful payments (Could), the number of times payment could not be made (Could not), and the number of times no payment was made (Not). In the continuous PAY columns, months with no payment and months paid successfully are set to 0, on the assumption that, like revolving payment, they involve no delay. The values that remain are then the number of months of delay.
python
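#Column names PAY_1 ... PAY_6
l = ["PAY_" + str(i) for i in range(1,7)]
PAY=dataset.loc[:,l].copy()
#Count each status over the six monthly columns only
PAY["Revo"] = (PAY[l] == 0).sum(axis=1)
PAY["Could"] = (PAY[l] == -1).sum(axis=1)
PAY["Not"] = (PAY[l] == -2).sum(axis=1)
#The remaining months are the ones with a delay
PAY["Could not"] =6-PAY["Not"]-PAY["Could"]-PAY["Revo"]
#Treat paid (-1) and no-payment (-2) months as zero months of delay
PAY[l] = PAY[l].replace([-1,-2],0)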
Complete!
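To make the counting rule concrete, here is a small worked example for a single made-up customer. The row values and the names `toy`, `counts`, and `delay` are hypothetical; the logic follows the description above.

```python
import pandas as pd

cols = ["PAY_" + str(i) for i in range(1, 7)]

# One made-up customer:
# 0 = revolving, -1 = paid, -2 = no payment, 2 and 3 = months of delay
toy = pd.DataFrame([[0, -1, -2, 2, 3, 0]], columns=cols)

counts = pd.DataFrame({
    "Revo": (toy[cols] == 0).sum(axis=1),
    "Could": (toy[cols] == -1).sum(axis=1),
    "Not": (toy[cols] == -2).sum(axis=1),
})
# Whatever is left out of the six months had a delay
counts["Could not"] = len(cols) - counts.sum(axis=1)

# Months without a delay become 0; positive values stay as months of delay
delay = toy[cols].replace([-1, -2], 0)

print(counts.iloc[0].to_dict())
print(delay.iloc[0].tolist())
```

For this row the counts are Revo 2, Could 1, Not 1, and Could not 2, and the six delay columns become 0, 0, 0, 2, 3, 0.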
Merge the created variables and the adjusted variables.
python
#Numerical data
l = ["AGE", "LIMIT_BAL"]
for i in range(1,7):
    l.append("PAY_AMT" + str(i))
for i in range(1,7):
    l.append("BILL_AMT" + str(i))
merge_data = dataset.loc[:,l]
#Category data
merge_data = merge_data.join(onehot_category)
#PAY
merge_data = merge_data.join(PAY)
merge_data
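One thing worth checking after a merge like this: `DataFrame.join` aligns rows by index label, not by position. If one frame keeps an ID-like index while another has a default 0, 1, 2, ... index, rows get shifted and NaN values appear. Whether that applies depends on how `dataset` was loaded, which this article does not show, so the following is only a sketch on made-up toy frames.

```python
import pandas as pd

# Toy frames (made up): "left" has an ID-like index, "right" a default one
left = pd.DataFrame({"AGE": [24, 35, 41]}, index=[1, 2, 3])
right = pd.DataFrame({"male": [1.0, 0.0, 1.0]})

# join matches on index labels, so label 3 finds no partner and becomes NaN
misaligned = left.join(right)
print(misaligned["male"].isna().sum())

# Giving "right" the same index keeps the rows lined up by position
right_aligned = right.copy()
right_aligned.index = left.index
aligned = left.join(right_aligned)

# Cheap checks: no missing values, and no rows gained or lost
assert not aligned.isna().any().any()
assert len(aligned) == len(left)
```

Running the same two assertions on the merged result is a quick way to confirm that every row kept its one-hot and PAY features.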
So far, we have been doing feature engineering. Next time, I would like to put it in a machine learning model!