In Python, scikit-learn and statsmodels are the main libraries for fitting logistic regression models. statsmodels has advantages that scikit-learn does not, such as automatically performing significance tests on the coefficients, but it does not come with the holdout method or cross-validation, the standard model evaluation methods. So this time, let's write code that implements the holdout method with statsmodels.
See here for an implementation of k-fold cross-validation using statsmodels.
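As a quick illustration of the significance-test output mentioned above, here is a minimal sketch on synthetic data (not the crowdfunding dataset used below):
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
# A constant column plus two random features
X = sm.add_constant(rng.normal(size=(200, 2)))
y = (X @ np.array([0.5, 1.0, -1.0]) + rng.normal(size=200) > 0).astype(int)

results = sm.Logit(y, X).fit(disp=0)
# The summary table reports coef, std err, z, and P>|z| for every term
print(results.summary())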
sample.ipynb
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
For the data, I will use crowdfunding data that I collected independently for my graduation research. The data is on my GitHub page, so download it to your environment if necessary.
sample.ipynb
# Read the csv file (replace the path with wherever you saved the data)
cultured = pd.read_csv("path to csv")
# Create the objective variable: 0 = crowdfunding failure, 1 = crowdfunding success
cultured["achievement"] = cultured["Total amount of support"] // cultured["Target amount"]
cultured["target"] = 0
cultured.loc[cultured['achievement'] >= 1, 'target'] = 1
# Split into the objective variable (y) and the explanatory variables (x)
# Create a constant term with add_constant
y = cultured["target"]
x_pre = cultured[["Target amount","Number of supporters","word count","Number of activity reports"]]
x = sm.add_constant(x_pre)
This data is for predicting whether a crowdfunding project succeeds (y = 1) or fails (y = 0) from four explanatory variables: target amount, number of supporters, word count, and number of activity reports. scikit-learn's logistic regression generates the constant term automatically, but statsmodels has no such feature, so it is created with add_constant(). After this step, x contains a const column of ones alongside the four explanatory variables.
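To see what add_constant() actually does, here is a quick illustration on a toy DataFrame (not the crowdfunding data):
import pandas as pd
import statsmodels.api as sm

toy = pd.DataFrame({"Target amount": [1000, 5000], "word count": [120, 340]})
# add_constant prepends a column of ones named "const"
print(sm.add_constant(toy))
#    const  Target amount  word count
# 0    1.0           1000         120
# 1    1.0           5000         340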
sample.ipynb
# Holdout method
def hold_out(x, y):
    # Split the data into training data and test data
    # test_size is the ratio of test data to the total data
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
    # Train on the training data
    model = sm.Logit(y_train, X_train)
    results = model.fit()
    # Store the predictions for the test data in pred
    # Note that the output is the probability that the objective variable is 1 (here, the probability of success)
    pred = results.predict(X_test)
    # Convert probabilities greater than 0.5 to 1 and the rest to 0
    # using a list comprehension
    result = [1 if i > 0.5 else 0 for i in pred]
    # train_test_split shuffles the index order, so reset it
    y_test_re = y_test.reset_index(drop=True)
    # Initialize the counter
    count = 0
    # Add 1 to count whenever y_test matches the predicted value
    for i in range(len(y_test)):
        if y_test_re[i] == result[i]:
            count += 1
    # The return value is the prediction accuracy
    return count / len(y_test)
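As an aside, the thresholding and counting loop can be collapsed into a single vectorized comparison with numpy; a sketch of the equivalent computation inside the function:
# Equivalent, more concise accuracy calculation (a sketch)
pred_labels = (pred > 0.5).astype(int)
accuracy = np.mean(pred_labels == y_test.values)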
sample.ipynb
hold_out(x,y)
When you execute the function, it returns the prediction accuracy. It was 0.878 in my environment!
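For reference, the same idea extends to k-fold cross-validation by borrowing KFold from scikit-learn. This is my own minimal sketch (not the implementation from the article linked above), assuming the same x and y as before:
from sklearn.model_selection import KFold
import numpy as np
import statsmodels.api as sm

def cross_validate(x, y, k=5):
    # Split the data into k folds, training on k-1 and testing on the remaining one
    kf = KFold(n_splits=k, shuffle=True, random_state=42)
    scores = []
    for train_idx, test_idx in kf.split(x):
        X_train, X_test = x.iloc[train_idx], x.iloc[test_idx]
        y_train, y_test = y.iloc[train_idx], y.iloc[test_idx]
        results = sm.Logit(y_train, X_train).fit(disp=0)
        pred = results.predict(X_test)
        labels = (pred > 0.5).astype(int)
        scores.append(np.mean(labels.values == y_test.values))
    # Return the accuracy averaged over the k folds
    return np.mean(scores)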