I tried analyzing data with Python, following Udemy's [Taken by 50,000 people worldwide] Practical Python Data Science course. The data used here is a sample dataset included in the statsmodels library, taken from a 1974 survey that asked married women whether or not they had had an extramarital affair.
The goal this time is to use this sample data to build a machine-learning model that predicts the presence or absence of an affair, and to examine which attributes influence the result.
***No particular intent went into choosing this data. Given that self-reported answers may contain falsehoods, I make no claims about the data's credibility and treat it strictly as sample data.***
Environment: Python 3, scikit-learn 0.21.2 (different from the version used in the Udemy course), Jupyter Notebook + Anaconda
**Not explained**: environment setup / basic syntax of Python, pandas, NumPy, and matplotlib (anything else is explained in comments) / the mathematical background
**Explained**: logistic regression / explanatory and objective variables / data preparation and visualization / data preprocessing / model building with scikit-learn / summary
Logistic regression is a regression method in which the objective variable (the value you want to predict) is mapped into the range between 0 and 1. Specifically, the sigmoid function squashes any real value into that range, which is why logistic regression is used for probability prediction and binary classification. I used it here because the presence or absence of an affair is a binary classification of 1 and 0.
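To make this concrete, here is a minimal sketch of the sigmoid function (my own illustration, not code from the course):

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(-10))  # close to 0
print(sigmoid(0))    # exactly 0.5
print(sigmoid(10))   # close to 1
```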
#Required library import
import numpy as np
import pandas as pd
from pandas import Series,DataFrame
import math
#seaborn is a library for drawing nice-looking graphs. It seems to be popular.
#set_style changes the style; here we choose 'whitegrid' for a grid on a white background.
#If that's too much trouble, sns.set() alone already looks good
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
#Import the modules needed from scikit-learn
#cross_validation can only be used with older versions;
#from 0.20 onward, use model_selection instead
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
#Module used when evaluating a model
from sklearn import metrics
#Import statsmodels to use its sample data
#Outside of Anaconda, you may need to install it first
import statsmodels.api as sm
Now that we're ready, let's take a look at the data overview.
#Load sample data into Pandas DataFrame
df = sm.datasets.fair.load_pandas().data
#Let's start with an overview of the data
df.info()
#output
# RangeIndex: 6366 entries, 0 to 6365
# Data columns (total 9 columns):
# rate_marriage 6366 non-null float64
# age 6366 non-null float64
# yrs_married 6366 non-null float64
# children 6366 non-null float64
# religious 6366 non-null float64
# educ 6366 non-null float64
# occupation 6366 non-null float64
# occupation_husb 6366 non-null float64
# affairs 6366 non-null float64
# dtypes: float64(9)
# memory usage: 447.7 KB
#Next, let's look at the first 5 rows
df.head()
rate_marriage | age | yrs_married | children | religious | educ | occupation | occupation_husb | affairs |
---|---|---|---|---|---|---|---|---|
3 | 32 | 9.0 | 3 | 3 | 17 | 2 | 5 | 0.1111 |
3 | 27 | 13.0 | 3 | 1 | 14 | 3 | 4 | 3.2308 |
4 | 22 | 2.5 | 0 | 1 | 16 | 3 | 5 | 1.4000 |
4 | 37 | 16.5 | 4 | 3 | 16 | 5 | 5 | 0.7273 |
5 | 27 | 9.0 | 1 | 1 | 14 | 3 | 4 | 4.6667 |
There are 6366 rows and 9 columns in total (the objective variable affairs plus the explanatory variables), and you can see that no nulls exist. To supplement the column names:
・rate_marriage: self-rating of the marriage ・educ: level of education ・children: number of children ・religious: degree of religiousness ・occupation: wife's occupation ・occupation_husb: husband's occupation. The details can be checked on the statsmodels website.
The ***objective variable*** is the variable you want to predict; in this case that is affairs, which records the presence or absence of an affair. ***Explanatory variables*** are the variables used to predict the objective variable; this time, all variables except affairs.
To classify the presence or absence of an affair we need a binary variable, but the objective variable affairs is a continuous real value, because the survey question asked how much time was spent in affairs. So we add a new Had_Affair column that stores the result of a function converting any non-zero value to 1.
#Set Had_Affair to 1 if affairs is non-zero, 0 otherwise.
def affair_check(x):
if x != 0:
return 1
else:
return 0
#apply applies the function to each value of the specified column.
df['Had_Affair'] = df['affairs'].apply(affair_check)
#Output the first 5 rows
df.head()
rate_marriage | age | yrs_married | children | religious | educ | occupation | occupation_husb | affairs | Had_Affair |
---|---|---|---|---|---|---|---|---|---|
3 | 32 | 9.0 | 3 | 3 | 17 | 2 | 5 | 0.1111 | 1 |
3 | 27 | 13.0 | 3 | 1 | 14 | 3 | 4 | 3.2308 | 1 |
4 | 22 | 2.5 | 0 | 1 | 16 | 3 | 5 | 1.4000 | 1 |
4 | 37 | 16.5 | 4 | 3 | 16 | 5 | 5 | 0.7273 | 1 |
5 | 27 | 9.0 | 1 | 1 | 14 | 3 | 4 | 4.6667 | 1 |
I was able to add it. Now let's visualize the data to get a rough idea of which explanatory variables have an influence. Group by Had_Affair and compute the mean of each column.
df.groupby('Had_Affair').mean()
Had_Affair | rate_marriage | age | yrs_married | children | religious | educ | occupation | occupation_husb | affairs |
---|---|---|---|---|---|---|---|---|---|
0 | 4.330 | 28.39 | 7.989 | 1.239 | 2.505 | 14.32 | 3.405 | 3.834 | 0.000 |
1 | 3.647 | 30.54 | 11.152 | 1.729 | 2.262 | 13.97 | 3.464 | 3.885 | 2.187 |
You can see that the Had_Affair = 1 group in the second row has longer marriages and a lower self-rating of the marriage. Now let's visualize the relationship with the length of marriage in a histogram using seaborn (like a fashionable matplotlib).
#Aggregate counts and visualize them with seaborn's countplot; the arguments are the x-axis column, the target DataFrame, hue='Had_Affair' to split into the two classes, and the color palette
sns.countplot('yrs_married',data=df.sort_values('yrs_married'),hue='Had_Affair',palette='coolwarm')
There seems to be a relationship between the length of marriage and the presence or absence of an affair. Next, let's visualize the length of marriage against the rate of having an affair.
#barplot plots the mean on the y-axis. Since Had_Affair takes the values 1 and 0, its mean equals the proportion of 1s.
sns.barplot(data=df, x='yrs_married', y='Had_Affair')
Once married life exceeds about 9 years, the rate of having an affair exceeds 40%. It looks like you could make some rough predictions just by inspecting the data like this, but let's move on to the next step.
Now that visualization is done, we preprocess the data. Specifically, to fit the machine-learning model, we separate the explanatory variables from the objective variable, put the data values into a consistent format, and deal with missing values.
First, let's put the data values in order. The numbers in the occupation and occupation_husb columns are just labels assigned for convenience to categorize occupations; the numbers themselves carry no quantitative meaning. Therefore, we create a new column for each occupation category, split into two values: 1 if the record falls in that category, 0 otherwise. It sounds tedious, but it's a snap with pandas' dummy-variable generation function.
Then, since the original occupation columns are no longer needed, we delete them, assign the objective variable to Y, assign the explanatory variables to X, and also drop affairs, the source of the objective variable.
The output is really a single table, but since it has too many columns to read comfortably on Qiita, it is split in two below.
#Use pandas' dummy-variable function. scikit-learn seems to have an equivalent as well.
occ_dummies = pd.get_dummies(df['occupation'])
hus_occ_dummies = pd.get_dummies(df['occupation_husb'])
#Name the categories. The original column names would actually be easier to read, but I gave up because it was troublesome.
occ_dummies.columns = ['occ1','occ2','occ3','occ4','occ5','occ6']
hus_occ_dummies.columns = ['hocc1','hocc2','hocc3','hocc4','hocc5','hocc6']
#Delete the occupation columns that are no longer needed and the objective variable Had_Affair; also drop affairs.
#For axis, 0 specifies rows and 1 specifies columns.
#Unless you pass inplace=True, drop does not delete from the original DataFrame.
X = df.drop(['occupation','occupation_husb','Had_Affair','affairs'],axis=1)
#Combine the dummy variables into the DataFrame of the explanatory variable X.
dummies = pd.concat([occ_dummies,hus_occ_dummies],axis=1)
X = pd.concat([X,dummies],axis=1)
#Assign the objective variable to Y
Y = df.Had_Affair
#output
X.head()
rate_marriage | age | yrs_married | children | religious | educ |
---|---|---|---|---|---|
3 | 32 | 9.0 | 3 | 3 | 17 |
3 | 27 | 13.0 | 3 | 1 | 14 |
4 | 22 | 2.5 | 0 | 1 | 16 |
4 | 37 | 16.5 | 4 | 3 | 16 |
5 | 27 | 9.0 | 1 | 1 | 14 |
occ1 | occ2 | occ3 | occ4 | occ5 | occ6 | hocc1 | hocc2 | hocc3 | hocc4 | hocc5 | hocc6 |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
Analysis can apparently break down when explanatory variables are strongly correlated with each other. This is called ***multicollinearity***; I couldn't fully understand the details even after googling, so I'll look into it while studying statistics around next month.
For now, the strongly correlated columns in this data are the dummy-variable occupation columns, and it seems this can be handled by dropping one dummy column from each set.
#Dropping one dummy from each set deals with this for the time being
X = X.drop('occ1',axis=1)
X = X.drop('hocc1',axis=1)
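As an aside (my own note, not from the course): pd.get_dummies can drop the first category for you with drop_first=True, which avoids this dummy-variable trap in a single step. A minimal sketch:

```python
import pandas as pd

# drop_first=True omits the first category of each dummy set, so the
# remaining columns are read relative to that dropped reference category
occ_dummies = pd.get_dummies(df['occupation'], prefix='occ', drop_first=True)
hus_occ_dummies = pd.get_dummies(df['occupation_husb'], prefix='hocc', drop_first=True)
```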
Since the objective variable Y is a Series, convert it to a 1-D array so it fits the model. This completes the data preprocessing.
type(Y)
Y = np.ravel(Y)
Build a logistic regression model using scikit-learn.
#Create an instance of the LogisticRegression class.
log_model = LogisticRegression()
#Create a model using the data.
log_model.fit(X,Y)
#Let's check the accuracy of the model.
log_model.score(X,Y)
#output
#0.7260446120012567
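Note that score here is computed on the same data the model was trained on. Since train_test_split was already imported above, a held-out evaluation might look like the following sketch (my own addition; the exact number will vary with the random split):

```python
# Hold out part of the data, train on the rest, and evaluate on the held-out part
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

holdout_model = LogisticRegression()
holdout_model.fit(X_train, Y_train)

# Accuracy on data the model has never seen
predictions = holdout_model.predict(X_test)
print(metrics.accuracy_score(Y_test, predictions))
```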
The accuracy of this model is about 73% on the training data. Perhaps that's reasonable, given that the parameters are all defaults. Now let's display the regression coefficients and explore which variables contribute to the prediction.
#Create a DataFrame that stores each variable name and its coefficient.
#coef_ holds the regression coefficients.
coeff_df = DataFrame([X.columns, log_model.coef_[0]]).T
coeff_df
variable | coefficient |
---|---|
rate_marriage | -0.72992 |
age | -0.05343 |
yrs_married | 0.10210 |
children | 0.01495 |
religious | -0.37498 |
educ | 0.02590 |
occ2 | 0.27846 |
occ3 | 0.58384 |
occ4 | 0.35833 |
occ5 | 0.99972 |
occ6 | 0.31673 |
hocc2 | 0.48310 |
hocc3 | 0.65189 |
hocc4 | 0.42345 |
hocc5 | 0.44224 |
hocc6 | 0.39460 |
These are the regression coefficients the model learned for the explanatory variables. If a coefficient is positive, the higher that variable's value, the greater the chance of infidelity; if negative, the opposite. From this table, the possibility of infidelity appears to decrease as the self-rating of the marriage and religiousness increase, and to increase with the number of years married. Coefficients are also shown per occupation, but since one dummy column from each set was deleted to deal with multicollinearity, they should be read relative to that reference level. (Incidentally, occ5, which has a fairly high value, is the managerial occupation, so it may be an intuitively convincing result.)
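To make the table easier to scan, one option (my own addition, not from the course) is to name the columns and sort by coefficient value:

```python
# Name the columns, coerce the coefficients to float, and sort descending
coeff_df.columns = ['variable', 'coefficient']
coeff_df['coefficient'] = coeff_df['coefficient'].astype(float)
print(coeff_df.sort_values('coefficient', ascending=False))
```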
To improve accuracy, you could normalize the features and experiment with the parameters. But given the credibility of the data, I thought it would be more instructive to look at the regression coefficients of the model and analyze how each attribute relates to the result.
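For reference, a minimal sketch of what the normalization step could look like using scikit-learn's StandardScaler (my own addition; whether it actually improves accuracy here is untested):

```python
from sklearn.preprocessing import StandardScaler

# Standardize each explanatory variable to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

scaled_model = LogisticRegression()
scaled_model.fit(X_scaled, Y)
print(scaled_model.score(X_scaled, Y))
```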
Posting the tables output by the DataFrames to Qiita was very hard and took about five hours.
At first I tried converting them to matplotlib tables and exporting them as images, but the index text came out tiny and I couldn't figure out how to fix it, so I gave up. Next I tried pytablewriter, a library that converts a DataFrame to Markdown, but since it isn't distributed through Anaconda I had no choice but to install it with pip, and the import then failed with a "cannot import name" error.
Oh, of course! When you think about it, Anaconda and pip can pull in different versions of dependent libraries, which is a likely source of trouble. I had never paid attention to this before, and when I checked conda list there were countless pypi entries I had simply never noticed. Wondering whether other ecosystems, such as npm and yarn, have the same problem, I asked an engineer friend and got the grateful answer, "The libraries are saved in the same place!", so the truth remains in the dark. The usual countermeasure is to create a separate Anaconda environment, or a separate environment that installs only with pip; I chose the latter and reinstalled the libraries from scratch with pip. Installing statsmodels with pip can raise errors in some cases (it's easier with Anaconda), but after some struggle everything was resolved safely.
***I respect the posters who whip up Markdown tables quickly. I would love to know if there is a good way.***
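One possibility I found afterwards (an assumption on my part that your pandas version supports it; it also needs the tabulate package): newer pandas releases can emit Markdown directly.

```python
# DataFrame.to_markdown requires pandas >= 1.0 and the 'tabulate' package
print(coeff_df.to_markdown())
```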