|  | Effective P | No effect 1-P | Odds P/(1-P) |
|---|---|---|---|
| Chemical A | 0.2 | 0.8 | 0.250 |
| Chemical B | 0.05 | 0.95 | 0.053 |
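The odds in the table follow directly from the definition odds = P / (1 - P); a minimal check of the tabulated values:

```python
# Odds = P / (1 - P): the ratio of the probability that an event occurs
# to the probability that it does not.
p_a = 0.2   # Chemical A: effective probability
p_b = 0.05  # Chemical B: effective probability

odds_a = p_a / (1 - p_a)
odds_b = p_b / (1 - p_b)

print(round(odds_a, 3))  # 0.25
print(round(odds_b, 3))  # 0.053
```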
As a concrete example, let's assume a model that predicts the pass/fail of a test (1 if pass, 0 if fail) from the number of study hours.
```python
# Libraries used for numerical calculation
import numpy as np
import pandas as pd
import scipy as sp
from scipy import stats

# Library for drawing graphs
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns

# Libraries for estimating statistical models
import statsmodels.formula.api as smf
import statsmodels.api as sm

# Specify the number of display digits
%precision 3

# Data acquisition
url = 'https://raw.githubusercontent.com/yumi-ito/sample_data/master/6-3-1-logistic-regression.csv'

# Data reading
df = pd.read_csv(url)

# Output the first 5 rows of the data
df.head()

# Output basic statistics of the data
df.describe().apply(lambda s: s.apply(lambda x: format(x, 'g')))
```
```python
# Draw a bar chart
sns.set()
sns.barplot(x = "hours", y = "result", data = df, palette='summer_r')
```
`sns.set()` specifies the style of the graph area; called with no arguments it applies the default style. Alternatively, `sns.set(style="whitegrid")` draws gray grid lines on a white background. `sns.barplot()` takes the x column name, the y column name, the DataFrame, and a color palette name. An error bar showing a confidence interval is drawn by default; pass `ci=None` as an argument to hide it.

```python
# Calculate the average pass rate for each number of study hours
pass_rate = df.groupby("hours").mean()
```

`groupby()` groups the rows whose values match in the column specified as its argument, and `.mean()` computes the average of each group.

```python
# Estimate the model
mod_glm = smf.glm(formula = "result ~ hours",
                  data = df,
                  family = sm.families.Binomial()).fit()
```
The model is estimated with the `smf.glm()` function; glm is an abbreviation for Generalized Linear Models. `formula` specifies the structure of the model: the objective variable `result` and the explanatory variable `hours`. `family` specifies the probability distribution; since this example follows the binomial distribution, apply `sm.families.Binomial()`.

```python
# Output a summary of the estimation results
mod_glm.summary()
```
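Under the hood, the fitted model maps the linear predictor through the logistic (inverse-logit) function to produce a probability. A minimal sketch with hypothetical coefficients (`b0` and `b1` below are made-up values for illustration, not the estimates from the summary above):

```python
import numpy as np

# Hypothetical intercept and hours coefficient, for illustration only
b0, b1 = -4.0, 0.9

def predict_pass_prob(hours):
    """Inverse-logit: converts the linear predictor b0 + b1*hours into a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * hours)))

print(round(predict_pass_prob(3), 3))  # 0.214
print(round(predict_pass_prob(8), 3))  # 0.961
```

Whatever the coefficients, the output is always squeezed into the interval (0, 1), which is what makes the logistic function suitable for modeling a probability.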
```python
# Draw the regression curve
sns.lmplot(x = "hours", y = "result",
           data = df,
           logistic = True,
           scatter_kws = {"color": "green"},
           line_kws = {"color": "black"},
           x_jitter = 0.1, y_jitter = 0.02)
```
When `logistic=True`, y is assumed to be a binary variable and statsmodels is used to estimate a logistic regression model. A binary variable is a variable that can take only two values, 0 and 1. `scatter_kws` and `line_kws` are additional keyword arguments passed to `plt.scatter` and `plt.plot`; here they specify the colors of the scatterplot dots and the regression curve. `x_jitter` and `y_jitter` scatter the dots slightly up and down, purely for appearance: since pass/fail is only 1 or 0, the dots would otherwise overlap.

```python
# Create a DataFrame with a column named hours containing an arithmetic progression (0-9)
predicted_value = pd.DataFrame({"hours": np.arange(0, 10, 1)})

# Calculate the predicted pass rate
pred = mod_glm.predict(predicted_value)
```
For the estimated model `mod_glm`, use the `predict()` function to calculate predicted values for the created DataFrame `predicted_value`.

```python
# Get the pass rates for 1 hour and 2 hours of study
pred_1 = pred[1]
pred_2 = pred[2]

# Calculate the odds for each
odds_1 = pred_1 / (1 - pred_1)
odds_2 = pred_2 / (1 - pred_2)

# Calculate the log odds ratio
print("Log odds ratio:", round(np.log(odds_2 / odds_1), 3))
```
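The comparison above illustrates a general identity: in logistic regression, the log odds ratio for a one-unit increase in the explanatory variable equals the regression coefficient exactly, since the odds under the model are $e^{b_0 + b_1 x}$. A minimal check with hypothetical coefficients (not the estimated values above):

```python
import numpy as np

# Hypothetical intercept and hours coefficient, for illustration only
b0, b1 = -4.0, 0.9

def prob(hours):
    # Logistic model: probability via the inverse-logit function
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * hours)))

odds_1h = prob(1) / (1 - prob(1))
odds_2h = prob(2) / (1 - prob(2))

# The log odds ratio for a one-hour increase recovers the coefficient b1
print(round(np.log(odds_2h / odds_1h), 3))  # 0.9
```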
```python
# Get the coefficient of the model
value = mod_glm.params["hours"]
print("Model coefficient:", round(value, 3))
```
`numpy.exp(x)` returns $e$, the base of the natural logarithm, raised to the power $x$.

```python
# Apply exp to the regression coefficient
exp = np.exp(mod_glm.params["hours"])
print("Coefficient exp:", round(exp, 3))

# Calculate the odds ratio
odds = odds_2 / odds_1
print("Odds ratio:", round(odds, 3))
```
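Because exp(coefficient) is the odds ratio for a one-unit increase, odds ratios compound multiplicatively over larger increases: k extra hours multiply the odds by exp(b1) to the k-th power. A sketch with a hypothetical coefficient (b1 = 0.9 is a made-up value, not the estimate above):

```python
import numpy as np

b1 = 0.9  # hypothetical hours coefficient, for illustration only
odds_ratio = np.exp(b1)

# k extra hours of study multiply the odds of passing by exp(b1)**k
for k in (1, 2, 3):
    print(k, round(odds_ratio ** k, 3))  # 1 -> 2.46, 2 -> 6.05, 3 -> 14.88
```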