Logistic regression is similar to linear regression, but is used when the objective variable is binary: for example, whether a person buys a product, whether they move, whether they change jobs, and so on.
Create a prediction model using the logistic (sigmoid) function σ(t) = 1 / (1 + e^(-t)).
The logistic function takes values between 0 and 1 and increases monotonically.
The relationship between the objective variable y and the explanatory variables x is y = 1 / (1 + e^(-(ax + b))); in other words, the linear predictor ax + b goes into the exponent with a minus sign, so the output is squashed into the range 0 to 1.
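As a quick check of this shape, here is a minimal sketch (the file name and code are illustrative additions, not from the original text) that plots the logistic function with numpy and matplotlib:
{plot_sigmoid.py}
import numpy as np
import matplotlib.pyplot as plt

# Logistic (sigmoid) function: values between 0 and 1, monotonically increasing
def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-10, 10, 200)
plt.plot(t, sigmoid(t))
plt.xlabel('t')
plt.ylabel('sigmoid(t)')
plt.show()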
Use the affair dataset (loaded via statsmodels) with scikit-learn.
{get_affair_dataset.py}
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression  # For logistic regression
from sklearn.model_selection import train_test_split  # For train/test split (sklearn.cross_validation in older versions)
import statsmodels.api as sm
df = sm.datasets.fair.load_pandas().data  # Load the affair dataset
{describe_affair.py}
df.head()
rate_marriage: marriage happiness rating, age: age, yrs_married: years of marriage, children: number of children, religious: religiousness, educ: level of education, occupation: wife's occupation, occupation_husb: husband's occupation, affairs: extent of affairs (a value greater than 0 means the person has had an affair), Had_Affair: affair flag (set to 1 if affairs > 0).
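Note that the fair dataset itself does not contain a Had_Affair column; it has to be derived from affairs. A minimal sketch, assuming the definition above (1 if affairs > 0; the file name is illustrative):
{make_had_affair_flag.py}
# Create the Had_Affair flag: 1 if affairs is greater than 0, otherwise 0
df['Had_Affair'] = (df['affairs'] > 0).astype(int)
df['Had_Affair'].value_counts()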
{easy_display1.py}
# Age versus presence of an affair
sns.countplot(x='age', data=df.sort_values('age'), hue='Had_Affair', palette='coolwarm')
{easy_display2.py}
# Years of marriage versus presence of an affair
sns.countplot(x='yrs_married', data=df.sort_values('yrs_married'), hue='Had_Affair', palette='coolwarm')
{easy_display3.py}
# Number of children versus presence of an affair
sns.countplot(x='children', data=df.sort_values('children'), hue='Had_Affair', palette='coolwarm')
The affair rate appears higher for older people, longer marriages, and those with children.
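These tendencies can also be checked numerically; a minimal sketch (a common check, not in the original text) that compares the mean of each column by affair flag:
{groupby_check.py}
# Mean of each column, grouped by whether the person had an affair
df.groupby('Had_Affair').mean()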
Before running the regression, the occupation columns need attention: they are categorical variables, so replace them with dummy variables. A categorical variable is one where the numeric value is just a label, so its magnitude has no meaning.
{change_dummy_value.py}
# Convert to dummy variables with pandas get_dummies
occ_dummies = pd.get_dummies(df.occupation)
hus_occ_dummies = pd.get_dummies(df.occupation_husb)
#Column name set
occ_dummies.columns = ['occ1','occ2','occ3','occ4','occ5','occ6']
hus_occ_dummies.columns = ['hocc1','hocc2','hocc3','hocc4','hocc5','hocc6']
occ_dummies.head()
As shown above, each row gets a 0/1 flag indicating which of occ1 to occ6 applies.
Next, get the explanatory variables.
{get_x.py}
# Set X by dropping the occupation columns and the affair flag from the original data frame
X = df.drop(['occupation', 'occupation_husb', 'Had_Affair'], axis=1)
# Prepare a data frame holding the occupation dummy variables
dummies = pd.concat([occ_dummies, hus_occ_dummies], axis=1)
# Join the dummy-variable data frame to the data frame with the occupation columns removed
X = pd.concat([X, dummies], axis=1)
X.head()
This is the explanatory variable data set so far.
When one explanatory variable can be expressed by one or more of the others, the variables are said to be multicollinear. In this case, occ1 is uniquely determined by the values of occ2 to occ6 (if any of occ2 to occ6 is 1, then occ1 = 0; otherwise occ1 = 1). When this happens, the inverse matrix cannot be computed, or even if it can, the reliability of the result is low. To eliminate this, drop occ1 and hocc1.
{drop_nonavailable_value.py}
X = X.drop('occ1', axis=1)
X = X.drop('hocc1', axis=1)
# affairs is used to create the objective variable, so exclude it from the explanatory variables as well
X = X.drop('affairs', axis=1)
X.head()
Final shape
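As an aside, pandas can drop the first dummy level automatically; here is a sketch of the same idea using get_dummies with drop_first=True (an alternative not used in this post; the file name is illustrative):
{drop_first_alternative.py}
# drop_first=True omits the first category of each variable,
# avoiding the multicollinearity handled manually above
occ_dummies_alt = pd.get_dummies(df.occupation, prefix='occ', drop_first=True)
hus_occ_dummies_alt = pd.get_dummies(df.occupation_husb, prefix='hocc', drop_first=True)
occ_dummies_alt.head()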
{do_logistic_regression.py}
#Objective variable set
Y = df.Had_Affair
Y = np.ravel(Y)  # Make Y a one-dimensional array with np.ravel
# Run the logistic regression
log_model = LogisticRegression()  # Create the model instance
log_model.fit(X, Y)  # Fit the model
log_model.score(X, Y)  # Check the model's accuracy on the training data (72.6%)
> 0.7260446120012567
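Since the model is built on the logistic function, it can also return the estimated probability (a value between 0 and 1) rather than only the 0/1 class. A small sketch, not in the original text:
{confirm_probability.py}
# Predicted class probabilities for the first five rows;
# the second column is the estimated probability of Had_Affair = 1
log_model.predict_proba(X)[:5]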
Check the coefficient of each variable
{confirm_coefficient.py}
# The coefficients are stored in the fitted model's .coef_[0]
coeff_df = pd.DataFrame([X.columns, log_model.coef_[0]]).T
coeff_df
Variables with a large coefficient have a large influence. However, since the explanatory variables are not measured in the same units, the coefficients cannot simply be compared side by side. For example, the coefficient of occ5 is roughly 9 times that of yrs_married, but that alone does not settle which matters more without taking the scale of years of marriage into account.
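One common way to make the coefficients more comparable (not covered in the original text) is to standardize the explanatory variables before fitting, so that every column has mean 0 and standard deviation 1. A minimal sketch, with an illustrative file name:
{standardize_and_refit.py}
from sklearn.preprocessing import StandardScaler

# Standardize each explanatory variable, then refit;
# the coefficients are now on a comparable scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
log_model_std = LogisticRegression()
log_model_std.fit(X_scaled, Y)
coeff_std_df = pd.DataFrame([X.columns, log_model_std.coef_[0]]).T
coeff_std_df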
As usual, here is how to split the data into train and test sets.
{do_logistic_regression_train_test.py}
#Data preparation for train and test
X_train, X_test, Y_train, Y_test = train_test_split(X, Y)
log_model2 = LogisticRegression()
log_model2.fit(X_train, Y_train) #Model creation with train data
class_predict = log_model2.predict(X_test) #predict test data
from sklearn import metrics #For checking prediction accuracy
metrics.accuracy_score(Y_test, class_predict) #Accuracy check
>0.73115577889447236
You can see that the accuracy is about 73%.
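For context, it can be worth comparing this figure against simply predicting the more common class every time (a check not in the original text). A minimal sketch:
{baseline_check.py}
# Accuracy of the majority-class baseline on the test set
baseline = max(Y_test.mean(), 1 - Y_test.mean())
print(baseline)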