In Python, scikit-learn and statsmodels are the main libraries used for logistic regression models. statsmodels has advantages that scikit-learn lacks, such as automatically performing significance tests on the coefficients, but it does not come with the typical model-evaluation methods such as the holdout method or cross-validation. So this time, let's write code that implements k-fold cross-validation with statsmodels.
See here for an implementation of the holdout method with statsmodels.
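As a side note, here is a minimal sketch of the coefficient significance tests that statsmodels prints for you. The data here is synthetic and made up purely for illustration; it is not the crowdfunding data used below.

import numpy as np
import statsmodels.api as sm

# Synthetic data: two features plus a constant, with a known true model
rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(200, 2)))
prob = 1 / (1 + np.exp(-(X @ np.array([0.5, 1.0, -0.8]))))
y = rng.binomial(1, prob)

results = sm.Logit(y, X).fit(disp=0)
print(results.summary())  # coef, std err, z and P>|z| for every coefficient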
sample.ipynb
import numpy as np
import pandas as pd
import statsmodels.api as sm
For the data, I will use crowdfunding data that I collected myself for my graduation research. It is available on my GitHub page, so download it to your environment if needed.
sample.ipynb
# Read the csv file
cultured = pd.read_csv("path to cultured.csv")

# Create the objective variable: 0 = crowdfunding failure, 1 = crowdfunding success
cultured["achievement"] = cultured["Total amount of support"] // cultured["Target amount"]
cultured["target"] = 0
cultured.loc[cultured["achievement"] >= 1, "target"] = 1

# Split into the objective variable (y) and the explanatory variables (x)
# Create a constant term with add_constant
y = cultured["target"]
x_pre = cultured[["Target amount", "Number of supporters", "word count", "Number of activity reports"]]
x = sm.add_constant(x_pre)
This data is for predicting whether a crowdfunding project succeeds (y = 1) or fails (y = 0) from the explanatory variables target amount, number of supporters, word count, and number of activity reports. scikit-learn's logistic regression adds the intercept (constant term) automatically, but statsmodels does not have that function, so it is created here with add_constant(). The explanatory variables (x) therefore consist of the const column plus the four features above.
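For reference, here is a tiny made-up example (the values are hypothetical) of what add_constant() does: it prepends a column of ones named const, which acts as the intercept term.

import pandas as pd
import statsmodels.api as sm

demo = pd.DataFrame({"Target amount": [100000, 300000],
                     "Number of supporters": [12, 45]})
print(sm.add_constant(demo))
#    const  Target amount  Number of supporters
# 0    1.0         100000                    12
# 1    1.0         300000                    45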
First, create a train_test function that splits the data.
sample.ipynb
# Split the data using the remainder when the row index is divided by the number of folds k
# k: number of folds, r: remainder when the index is divided by k
def train_test(x, y, k, r):
    # Create an ndarray from 0 up to the number of rows of x
    # Treat this as the row index
    idx = np.arange(0, x.shape[0])
    # Rows whose index modulo k equals r go to the test set, the rest to the training set
    idx_test = idx[np.fmod(idx, k) == r]
    idx_train = idx[np.fmod(idx, k) != r]
    # Keep only the rows whose index is in idx_test (idx_train) for x_test (x_train); same for y
    x_test = x.iloc[idx_test, :]
    x_train = x.iloc[idx_train, :]
    y_test = y[idx_test]
    y_train = y[idx_train]
    return x_train, x_test, y_train, y_test
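As a quick sanity check, calling train_test on a tiny made-up frame (not the real data) shows the split pattern: with k=3 and r=0, the rows whose index leaves remainder 0 when divided by 3 go to the test set.

demo_x = pd.DataFrame({"a": range(6)})
demo_y = pd.Series([0, 1, 0, 1, 0, 1])
x_tr, x_te, y_tr, y_te = train_test(demo_x, demo_y, k=3, r=0)
print(x_te.index.tolist())  # [0, 3]
print(y_te.tolist())        # [0, 1]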
Implement k-fold cross-validation using the train_test function.
sample.ipynb
def cross_validation(x, y, k):
    # List to hold the accuracy for each fold
    scores = []
    # Loop over every remainder the index can take
    # With 5 folds, the remainder when the index is divided by 5 is 0, 1, 2, 3 or 4
    for r in range(k):
        X_train, X_test, y_train, y_test = train_test(x, y, k, r)
        # Fit the model on the training data
        model = sm.Logit(y_train, X_train)
        results = model.fit()
        # Store the predictions for the test data in pred
        # Note that the output is the probability that the objective variable is 1 (here, the probability of success)
        pred = results.predict(X_test)
        # Convert probabilities greater than 0.5 to 1 and the rest to 0, using a list comprehension
        result = [1 if i > 0.5 else 0 for i in pred]
        # y_test keeps its original row labels (r, r+k, ...), so reset the index for positional access
        y_test_re = y_test.reset_index(drop=True)
        # Initialise the counter
        count = 0
        # Add 1 to count whenever y_test matches the predicted value
        for i in range(len(X_test)):
            if y_test_re[i] == result[i]:
                count += 1
        # Append the accuracy for this remainder r to scores
        scores.append(count / len(y_test))
    # Return the average of the k accuracies stored in scores
    return sum(scores) / len(scores)
sample.ipynb
cross_validation(x, y, 5)
Running 5-fold cross-validation... the average accuracy was 0.8485 in my environment!
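Since the number of folds is just an argument, the same evaluation can be rerun with a different k (the score will naturally change a little). If the per-fold optimisation messages get noisy, passing disp=0 to fit() inside the function silences them.

cross_validation(x, y, 10)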