Last time, I summarized what I had learned about the theory of logistic regression, and tried to deepen my understanding by building a binary classifier with my own logistic regression implementation: https://qiita.com/Fumio-eisan/items/e2c625c4d28d74cf02f3
This time, I fit a model to an actual dataset. I cover the basics of data preprocessing (creating dummy variables, dropping and combining columns), data interpretation, and multicollinearity, a common problem in multivariate analysis. The focus is on implementation.
This time, I used a dataset from a 1974 survey of married women on whether they had had extramarital affairs (the Fair dataset bundled with statsmodels).
affair.ipynb
import pandas as pd
import statsmodels.api as sm

# Load the Fair affairs dataset that ships with statsmodels
df = sm.datasets.fair.load_pandas().data
df.head()
Looking at the data, you can see explanatory variables such as years married, age, and number of children. The last column, affairs, is numeric: 0 means the respondent has not had an affair, and a value of 1 or more means she has had (or was having) one.
Let us evaluate the difference between respondents with and without affairs. Since the affairs values vary, first split the data into an affair group (1 or more) and a no-affair group (0).
affair.ipynb
def affair_check(x):
    # Map any nonzero affairs value to 1, otherwise 0
    if x != 0:
        return 1
    else:
        return 0

df['Had_Affair'] = df['affairs'].apply(affair_check)
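As an aside (not the post's code), the same flag can be created with a one-liner, which avoids the helper function:

# Nonzero affairs -> 1, zero -> 0
df['Had_Affair'] = (df['affairs'] != 0).astype(int)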
Next, interpret the data to look for variables likely to matter for the prediction model. To do this, split by affair (1) versus no affair (0) and draw a histogram for each variable. Take the axes returned by subplots and pass each one as an argument to the plot it should hold.
affair.ipynb
import matplotlib.pyplot as plt
import seaborn as sns

fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(10, 8))
sns.countplot(x='age', hue='Had_Affair', data=df, ax=axes[0,0])
sns.countplot(x='yrs_married', hue='Had_Affair', data=df, ax=axes[0,1])
sns.countplot(x='children', hue='Had_Affair', data=df, ax=axes[0,2])
sns.countplot(x='rate_marriage', hue='Had_Affair', data=df, ax=axes[1,0])
sns.countplot(x='religious', hue='Had_Affair', data=df, ax=axes[1,1])
sns.countplot(x='educ', hue='Had_Affair', data=df, ax=axes[1,2])
sns.countplot(x='occupation', hue='Had_Affair', data=df, ax=axes[2,0])
sns.countplot(x='occupation_husb', hue='Had_Affair', data=df, ax=axes[2,1])
plt.tight_layout()
Now that everything is shown at once, the data can be interpreted. Basically, I think you should focus on the variables where the peaks differ between the affair group and the no-affair group.
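A quick numeric complement to the histograms (a sketch; the original post relies on the plots alone) is to compare the mean of each variable between the two groups:

# Per-group means: large gaps hint at variables worth keeping
df.groupby('Had_Affair').mean()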
Next, preprocess the data to build a prediction model. In this dataset, the categorical variables are the respondent's occupation and the husband's occupation. We encode these as dummy variables, i.e., one 0/1 column per category. The implementation is as follows.
affair.ipynb
# One 0/1 column per occupation category (get_dummies names the columns
# after the category values 1.0-6.0, so rename them to occ1..occ6 etc.)
occ_dummies = pd.get_dummies(df['occupation'])
hus_occ_dummies = pd.get_dummies(df['occupation_husb'])
occ_dummies.columns = ['occ1','occ2','occ3','occ4','occ5','occ6']
hus_occ_dummies.columns = ['hocc1','hocc2','hocc3','hocc4','hocc5','hocc6']
occ_dummies
The encoding worked as expected.
Next, remove the columns that are no longer needed and join the ones we do need. Drop the two occupation columns (they are replaced by the dummies) and Had_Affair (it becomes the target Y).
affair.ipynb
X = df.drop(['occupation','occupation_husb','Had_Affair'],axis=1)
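One detail not shown here: X still contains the raw affairs column, which directly determines Had_Affair. The full notebook presumably drops it as well (the roughly 70% accuracy reported below is consistent with that); a hedged sketch of the missing step:

# affairs defines the target, so it must not remain as a feature
# (assumed step; not shown in the excerpt above)
X = X.drop('affairs', axis=1)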
Then, concatenate the two sets of dummy variables.
affair.ipynb
dummies = pd.concat([occ_dummies,hus_occ_dummies],axis=1)
Finally, combine the dummy variables with the original data.
affair.ipynb
XX = pd.concat([X, dummies], axis=1)
Next, let us consider multicollinearity. This problem emerges as the number of explanatory variables grows: when some explanatory variables are strongly correlated with one another, the phenomenon is called multicollinearity. When multicollinearity is severe, the accuracy of the regression equation can degrade sharply and the analysis results can become unstable.
For example, in a model that predicts house prices, "number of rooms" and "floor area" can be expected to correlate strongly. In such cases, you can avoid multicollinearity by excluding one of the variables.
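Before deciding what to drop, it can help to quantify multicollinearity with the variance inflation factor (VIF). The following is a sketch using statsmodels, not part of the original post; with both complete dummy groups still in XX, the exact linear dependence between them shows up as infinite or extremely large VIFs.

import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF of column i = 1 / (1 - R^2) from regressing it on the other columns;
# values far above ~10 are commonly read as problematic
vals = XX.values.astype(float)
vif = [variance_inflation_factor(vals, i) for i in range(vals.shape[1])]
print(dict(zip(XX.columns, np.round(vif, 1))))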
This time, I build the model after excluding occ1 and hocc1 (= student) from the occupation dummies. Since we will later compare against the full matrix, we first keep a copy of it.
affair.ipynb
X2 = XX.copy()  # assumed: keep the full matrix (still with occ1/hocc1) for the later comparison
XX = XX.drop('occ1', axis=1)
XX = XX.drop('hocc1', axis=1)
Dropping one dummy from each group removes the redundancy: within each group the six columns always sum to 1, so any one of them is an exact linear combination of the other five (the so-called dummy variable trap).
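As a quick check (a sketch, not from the original post), the dependence is easy to verify directly:

# Every row has exactly one occupation category set, so the six
# columns of each group always sum to 1 (the dummy variable trap)
print((occ_dummies.sum(axis=1) == 1).all())       # True
print((hus_occ_dummies.sum(axis=1) == 1).all())   # True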
Now let us fit and evaluate the model. This time I make a simple prediction with scikit-learn's LogisticRegression: train the first model only on the training data, then predict on the test data.
affair.ipynb
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

Y = df['Had_Affair']  # target: the 0/1 affair flag created earlier

X_train, X_test, Y_train, Y_test = train_test_split(XX, Y)
model2 = LogisticRegression()
model2.fit(X_train, Y_train)
class_predict = model2.predict(X_test)
print(metrics.accuracy_score(Y_test, class_predict))
0.707286432160804
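For context (a sanity check I added, not in the original post): roughly a third of the respondents in this dataset report an affair, so the majority-class baseline of always predicting 0 already scores around 68%, and the model above improves on that only slightly.

# Accuracy of always predicting "no affair" = share of 0s in Y
print(1 - Y.mean())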
The accuracy was about 70%. Next, what happens if we do not delete the columns we removed to avoid multicollinearity, i.e., if we use the data as it is?
affair.ipynb
X2_train, X2_test, Y2_train, Y2_test = train_test_split(X2, Y)
model3 = LogisticRegression()
model3.fit(X2_train, Y2_train)
class_predict2 = model3.predict(X2_test)
print(metrics.accuracy_score(Y2_test,class_predict2))
0.9748743718592965
This time the accuracy was high, at about 97%. In this case, we can see that it was better to leave the data as it was, since it did not cause harmful multicollinearity.
In other words, whether multicollinearity needs to be handled seems to be something to check empirically: compare a model trained on all the data against one trained after deleting columns, and see which performs better.
I interpreted the data with pandas and matplotlib, and performed preprocessing with multicollinearity in mind. Because this is a tutorial-style dataset, everything went smoothly, but tasks like drawing graphs and combining dataframes still took a fair amount of pandas work. Also, since running logistic regression itself is very simple, it was convenient to be able to compute results without knowing what happens inside.
The full program is here. https://github.com/Fumio-eisan/affairs_20200412