- This is a beginner-level memo-style article on self-studying Python, machine learning, and the like.
- The style is extremely simple: "study by transcribing code that interests you."
- Constructive comments are welcome (an LGTM & stock would be appreciated if you like it).
- The theme this time is **IBM HR Analytics Employee Attrition & Performance**. According to the description on Kaggle, the task is to look for the **"reasons why employees leave"**.
- This time I transcribed the code while following the YouTube video below.
Link:Predict Employee Attrition Using Machine Learning & Python
The data was taken from Kaggle.
Link:IBM HR Analytics Employee Attrition & Performance
As in the YouTube video, the analysis uses Google Colaboratory (what a convenient age we live in).
Now, let's get started.
#Loading the library
import numpy as np
import pandas as pd
import seaborn as sns
These are the basic libraries to load first. I expect to keep adding other libraries as they become necessary.
Next, to read the data, load the CSV file downloaded from the Kaggle site into Google Colab.
#Data upload
from google.colab import files
uploaded = files.upload()
Running this lets you upload locally stored files into Google Colab. I usually upload files to Google Drive and read them by mounting the drive, so this way is simpler and handier.
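For reference, the Google Drive route I mentioned looks roughly like this (a sketch of my usual setup; the path under MyDrive is only an example and depends on where you store the CSV).

#(Reference) Loading the file via Google Drive instead of files.upload()
from google.colab import drive
drive.mount('/content/drive')
#Example path; adjust it to wherever the CSV actually lives in your Drive
df = pd.read_csv('/content/drive/MyDrive/WA_Fn-UseC_-HR-Employee-Attrition.csv')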
I will read the uploaded data.
#Data reading
df = pd.read_csv('WA_Fn-UseC_-HR-Employee-Attrition.csv')
#Data confirmation
df.head(7)
This is familiar code. From here on, we check the contents of the data.
The following snippets were actually run in separate cells, but I have grouped them together here.
#Check the number of rows / columns in the data frame
df.shape
#Check the data type of the contents of each column
df.dtypes
#Confirmation of missing values
df.isna().sum()
df.isnull().values.any()
#Confirmation of basic statistics
df.describe()
#Check the number of employees who left vs. stayed (get a feel for the explained variable)
df['Attrition'].value_counts() #Figure 1
#Visualize the number of employees who left vs. stayed
sns.countplot(x='Attrition', data=df)
#Visualize the number of employees who left vs. stayed, by age
import matplotlib.pyplot as plt
plt.subplots(figsize=(12,4))
sns.countplot(x='Age', hue='Attrition', data=df, palette='colorblind') #Figure 2
【Figure 1】
【Figure 2】
Up to this point it is the usual data inspection. As a first step, I think it is essential to get a firm grasp of what is in the data.
Next, check the unique values of the object-type columns found in the dtype check above.
for column in df.columns:
    if df[column].dtype == object:
        print(str(column) + ':' + str(df[column].unique()))
        print(df[column].value_counts())
        print('___________________________________________')
Columns that are not meaningful for the prediction are removed with `.drop()`.
df = df.drop('Over18', axis=1)
df = df.drop('EmployeeNumber', axis=1)
df = df.drop('StandardHours', axis=1)
df = df.drop('EmployeeCount', axis=1)
This part is self-explanatory: columns that cannot explain why someone leaves are dropped from df.
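Incidentally, the four separate calls above can also be written as one (an equivalent alternative, not how the video does it).

#Equivalent: drop the four columns in a single call
df = df.drop(['Over18', 'EmployeeNumber', 'StandardHours', 'EmployeeCount'], axis=1)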
I think this is also a familiar step. Check the correlation between the columns and visualize it as a heatmap.
#Note: on recent pandas versions, object columns may need to be excluded explicitly, e.g. df.corr(numeric_only=True)
df.corr()
plt.figure(figsize=(14, 14))
sns.heatmap(df.corr(), annot=True, fmt='.0%')
This time, the following two arguments are specified when creating the heatmap.

Item | Description |
---|---|
annot | When set to True, the value is written into each cell. |
fmt | Format string for the values written when annot=True (or when an array is passed to annot). |
Reference: Create a heatmap with Seaborn
from sklearn.preprocessing import LabelEncoder
for column in df.columns:
    if df[column].dtype != object:
        #Skip columns that are already numeric
        continue
    df[column] = LabelEncoder().fit_transform(df[column])
Here, sklearn's LabelEncoder is used to replace the object-type data with numbers (character data is converted to discrete values 0, 1, ... before being passed to the classifier).
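As a minimal illustration of what LabelEncoder does (toy values of my own, not the actual dataset; classes are numbered in alphabetical order):

#Toy example: 'No' -> 0, 'Yes' -> 1 because the classes are sorted alphabetically
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
print(le.fit_transform(['Yes', 'No', 'No', 'Yes'])) #-> [1 0 0 1]
print(le.classes_) #-> ['No' 'Yes']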
After the replacement, reorder the df columns to make the later analysis easier: moving Age to the end leaves the target, Attrition, as the first column.
#Duplicate Age to new column
df['Age_Years'] = df['Age']
#Drop the Age column
df = df.drop('Age', axis=1)
Now for the main event (it goes without saying that the preprocessing above matters).
#Split df into explanatory variables and the explained (target) variable
X = df.iloc[:, 1:df.shape[1]].values
Y = df.iloc[:, 0].values
#Split into training data and test data (test size 25%)
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=0)
#Classification by random forest
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
forest.fit(X_train, Y_train)
Let's go through this from the top.
- `iloc[]` is used to separate the explanatory variables from the explained variable.
- The training data and test data are split with sklearn's `train_test_split`.
The arguments of train_test_split are as follows.
Item | Description |
---|---|
arrays | NumPy arrays, lists of the same length, matrices, or pandas data frames to be split. |
test_size | A float or an integer. As a float, the fraction of test data, between 0.0 and 1.0. As an integer, the number of records in the test data. If omitted or None, it is set to complement train_size; if train_size is also unset, the default 0.25 is used. |
train_size | A float or an integer. As a float, the fraction of training data, between 0.0 and 1.0. As an integer, the number of records in the training data. If omitted or None, it is the remainder of the dataset after subtracting test_size. |
random_state | An integer or RandomState instance used to seed the random number generator. If omitted, NumPy's np.random is used. |
(See: [Create training and test data with scikit-learn](https://pythondatascience.plavox.info/scikit-learn/%e3%83%88%e3%83%ac%e3%83%bc%e3%83%8b%e3%83%b3%e3%82%b0%e3%83%87%e3%83%bc%e3%82%bf%e3%81%a8%e3%83%86%e3%82%b9%e3%83%88%e3%83%87%e3%83%bc%e3%82%bf))
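As a quick sanity check of the test_size behavior described in the table (a toy example I added, not part of the video):

#Toy example: splitting 10 records with test_size=0.3 gives 7 train / 3 test
import numpy as np
from sklearn.model_selection import train_test_split
data = np.arange(10)
a_train, a_test = train_test_split(data, test_size=0.3, random_state=0)
print(len(a_train), len(a_test)) #-> 7 3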
- Classification uses a Random Forest. The arguments here are:
  - `n_estimators`: the number of trees (default is 100)
  - `criterion`: `gini` or `entropy` (default is `gini`)
After this, the model is trained with `forest.fit(...)`.
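As a small digression that is not part of the transcription: a trained RandomForestClassifier exposes `feature_importances_`, which can give a rough hint about the "reasons why employees leave" that this theme is after. A minimal sketch, assuming df still holds the encoded columns used to build X:

#(Reference) Which features does the trained forest weight most?
importances = pd.Series(forest.feature_importances_, index=df.columns[1:])
print(importances.sort_values(ascending=False).head(10))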
Now let's look at the accuracy.
forest.score(X_train, Y_train)
Note that `forest.score(X_train, Y_train)` above is the accuracy on the training data. Next, we use `confusion_matrix` (a confusion matrix) to calculate the accuracy on the test data.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(Y_test, forest.predict(X_test)) #cm: confusion matrix on the test data
TN = cm[0][0] #true negatives
TP = cm[1][1] #true positives
FN = cm[1][0] #false negatives
FP = cm[0][1] #false positives
print(cm)
print('Model Testing Accuracy = {}'.format( (TP + TN) / (TP + TN + FN + FP)))
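As a cross-check of the manual calculation above (my own addition, not in the video), sklearn can report the same number directly.

#(Reference) Cross-check the manual accuracy with sklearn's built-in metric
from sklearn.metrics import accuracy_score
print(accuracy_score(Y_test, forest.predict(X_test)))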
And with that, simple as it is, we have a complete transcription of binary classification using sklearn.
The content is not that difficult, but it made me realize there are still parts I do not fully understand, so I will keep studying.
That's all.