Answers in Python to the end-of-chapter problems of Chapter 8 of [Econometrics](https://www.amazon.co.jp/Econometrics-New-Liberal-Arts-Selection/dp/4641053855/ref=asc_df_4641053855_nodl/?tag=jpgo-22&linkCode=df0&hvadid=342654745883&hvpos=&hvnetw=g&hvrand=3840609228244811524&hvpone=&hvptwo=&hvqmt=&hvdev=m&hvdvcmdl=&hvlocint=&hvlocphy=1009363&hvtargid=pla-796815219068&psc=1&th=1&psc=1).
Basically, I wrote these answers while consulting the official documentation and GitHub repositories of statsmodels and linearmodels. I am not very familiar with Python, so if you find any mistakes, please point them out.
The end-of-chapter problem in Chapter 8 is a reproduction of Tables 8-1 and 8-5.
What reproducing Table 8-1 requires is deciding how to handle a binary explained variable. As the means to do so, we compare the linear probability model, the probit model, and the logit model.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
sns.set()
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.formula.api import logit, probit
#Read and confirm data
df = pd.read_stata('piaac.dta')
df.head()
In the analysis of Table 8-1, the sample is restricted to women and [lfs] is regressed on [educ], [age], [couple], and [child]. Keep only the columns and records that are needed.
df_1 = df[df['gender']== 'Female']
df_1 = df_1[['lfs','educ','age','couple','child']]
The problem here is that the explained variable [lfs] is a qualitative variable. For now, let's check what values it takes.
#Check the values of ['lfs']
df_1['lfs'].unique()
#[Employed, Out of the labour force, Unemployed, Not known]
#Categories (4, object): [Employed < Unemployed < Out of the labour force < Not known]
#Also check ['age']
df_1['age'].unique()
#[45, 29, 35, 48, 60, ..., 41, 37, 19, 58, 28]
#Length: 50
#Categories (50, int64): [16 < 17 < 18 < 19 ... 62 < 63 < 64 < 65]
lfs takes four values: Employed, Unemployed, Out of the labour force, and Not known. How to interpret these is inherently a delicate matter, but Table 8-1 appears to count Not known as not working rather than dropping it. What to do when the explained variable is not observed (so that sample selection becomes a problem) will be considered when implementing the Heckit model. Also, Out of the labour force literally means not in the labour force, but since the data cover only people aged 16 and over, these people could in principle work; they are choosing not to work for some reason, such as low offered wages. Here we consider the case where the explained variable is binary, so create a dummy variable with Employed = 1 and everything else = 0 for the analysis.
#Create the dummy variable column ['emp_dummy']
df_1.loc[df_1['lfs']=='Employed','emp_dummy']=1
df_1.loc[df_1['lfs']!='Employed','emp_dummy']=0
# Check the data type just in case
df_1.dtypes
#lfs           category
#educ          category
#age           category
#couple         float32
#child          float32
#emp_dummy      float64
#dtype: object
For some reason, educ and age are categorical... Convert them to numeric variables.
#Change data type
df_1[['educ','age']] = df_1[['educ','age']].apply(pd.to_numeric)
First, ignore the missing values and perform the analysis.
#OLS estimation
result_1_1 = ols(formula = 'emp_dummy ~ educ + age + couple + child',
data = df_1).fit(cov_type='HC1',use_t=True)
To use White standard errors with a degrees-of-freedom correction, specify HC1 for cov_type; specify HC0 to use the ordinary White standard errors.
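For example, a minimal variation of the code above that uses the uncorrected White (HC0) standard errors instead would look like this (the result name `result_1_1_hc0` is mine):

```python
#OLS estimation with HC0 (uncorrected White) standard errors, for comparison
result_1_1_hc0 = ols(formula='emp_dummy ~ educ + age + couple + child',
                     data=df_1).fit(cov_type='HC0', use_t=True)
result_1_1_hc0.summary()
```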
#Check the result
result_1_1.summary()
|  | coef | std err | t | P>t | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.3759 | 0.105 | 3.566 | 0.000 | 0.169 | 0.583 |
| educ | 0.0195 | 0.006 | 3.312 | 0.001 | 0.008 | 0.031 |
| age | 0.0020 | 0.001 | 1.726 | 0.085 | -0.000 | 0.004 |
| couple | -0.1178 | 0.031 | -3.743 | 0.000 | -0.180 | -0.056 |
| child | 0.0137 | 0.013 | 1.067 | 0.286 | -0.012 | 0.039 |
The results can be said to match exactly, so Table 8-1 in the textbook also appears to run the analysis with the missing values ignored.
Let's also try an analysis that handles the missing values. This is a delicate issue in its own right, but for now, fill them in with the mean.
#Fill in the missing values with the mean of each column
df_1['educ'] = df_1['educ'].fillna(df_1['educ'].mean())
df_1['age'] = df_1['age'].fillna(df_1['age'].mean())
df_1['couple'] = df_1['couple'].fillna(df_1['couple'].mean())
df_1['child'] = df_1['child'].fillna(df_1['child'].mean())
#OLS estimation
result_1_nonan = ols(formula = 'emp_dummy ~ educ + age + couple + child',
data = df_1).fit(cov_type='HC1',use_t=True)
#Check the result
result_1_nonan.summary()
|  | coef | std err | t | P>t | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.0189 | 0.062 | 0.302 | 0.763 | -0.104 | 0.141 |
| educ | 0.0450 | 0.004 | 10.951 | 0.000 | 0.037 | 0.053 |
| age | 0.0032 | 0.001 | 3.895 | 0.000 | 0.002 | 0.005 |
| couple | -0.1035 | 0.022 | -4.735 | 0.000 | -0.146 | -0.061 |
| child | -0.0006 | 0.009 | -0.059 | 0.953 | -0.019 | 0.018 |
The values have changed a little. It seems that the textbook analysis does indeed drop all records containing missing values.
From here on, follow the textbook and ignore all missing values.
#Data shaping
df = pd.read_stata('piaac.dta')
df_2 = df[df['gender']== 'Female']
df_2 = df_2[['lfs','educ','age','couple','child']]
df_2.loc[df_2['lfs']=='Employed','emp_dummy']=1
df_2.loc[df_2['lfs']!='Employed','emp_dummy']=0
df_2[['educ','age']] = df_2[['educ','age']].apply(pd.to_numeric)
To estimate a probit model with statsmodels, you can simply use "probit" instead of "ols".
###probit model
result_1_2 = probit(formula = 'emp_dummy ~ educ + age + couple + child',
data = df_2).fit()
#Check the result
result_1_2.summary().tables[1]
|  | coef | std err | z | P>z | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -0.3443 | 0.287 | -1.198 | 0.231 | -0.908 | 0.219 |
| educ | 0.0537 | 0.016 | 3.340 | 0.001 | 0.022 | 0.085 |
| age | 0.0053 | 0.003 | 1.789 | 0.074 | -0.001 | 0.011 |
| couple | -0.3391 | 0.097 | -3.489 | 0.000 | -0.530 | -0.149 |
| child | 0.0397 | 0.032 | 1.256 | 0.209 | -0.022 | 0.102 |
This time, the result is almost the same as Table 8-1.
Next is the calculation of the marginal effects. With statsmodels, the marginal effects of the probit model can be obtained with the "get_margeff" method.
#Marginal effect
result_1_2.get_margeff().summary().tables[1]
|  | dy/dx | std err | z | P>z | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| educ | 0.0197 | 0.006 | 3.372 | 0.001 | 0.008 | 0.031 |
| age | 0.0019 | 0.001 | 1.794 | 0.073 | -0.000 | 0.004 |
| couple | -0.1243 | 0.035 | -3.524 | 0.000 | -0.193 | -0.055 |
| child | 0.0145 | 0.012 | 1.258 | 0.209 | -0.008 | 0.037 |
This is the same result as the textbook.
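As a sanity check on what "get_margeff" reports (by default, average marginal effects over the sample), the probit AMEs can also be computed by hand as the sample average of $\phi(X_{i}'\hat{\beta})$ times each coefficient. A minimal sketch using the fitted result above; the entry for the intercept is not a marginal effect and can be ignored:

```python
from scipy.stats import norm

#Design matrix actually used in estimation (including the intercept column)
X = result_1_2.model.exog
b = result_1_2.params.values
#Average marginal effect: mean of phi(x_i'b) times each coefficient
ame = norm.pdf(X @ b).mean() * b
print(pd.Series(ame, index=result_1_2.params.index))
```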
Just like with probit, simply use "logit". The marginal effects can be calculated in the same way with the "get_margeff" method.
#Logit model
result_1_4 = logit(formula = 'emp_dummy ~ educ + age + couple + child',
data = df_2).fit()
#Check the result
result_1_4.summary().tables[1]
#Marginal effect
result_1_4.get_margeff().summary().tables[1]
|  | coef | std err | z | P>z | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | -0.5781 | 0.471 | -1.227 | 0.220 | -1.502 | 0.345 |
| educ | 0.0869 | 0.026 | 3.307 | 0.001 | 0.035 | 0.138 |
| age | 0.0088 | 0.005 | 1.806 | 0.071 | -0.001 | 0.018 |
| couple | -0.5587 | 0.163 | -3.424 | 0.001 | -0.879 | -0.239 |
| child | 0.0716 | 0.058 | 1.242 | 0.214 | -0.041 | 0.185 |

|  | dy/dx | std err | z | P>z | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| educ | 0.0195 | 0.006 | 3.345 | 0.001 | 0.008 | 0.031 |
| age | 0.0020 | 0.001 | 1.812 | 0.070 | -0.000 | 0.004 |
| couple | -0.1255 | 0.036 | -3.463 | 0.001 | -0.197 | -0.054 |
| child | 0.0161 | 0.013 | 1.244 | 0.213 | -0.009 | 0.041 |
statsmodels has a useful class called "GenericLikelihoodModel". The official documentation says: "The GenericLikelihoodModel class eases the process by providing tools such as automatic numeric differentiation and a unified interface to scipy optimization functions. Using statsmodels, users can fit new MLE models simply by 'plugging-in' a log-likelihood function."
It is convenient when you want to perform maximum likelihood estimation with a likelihood function you have written yourself.
What follows is essentially the official documentation's example with only the dataset changed, so please also refer to the official documentation.
#import
from scipy import stats
from statsmodels.base.model import GenericLikelihoodModel
#Drop records that contain missing values
df_4 = df_2.dropna()
#Create a matrix of explanatory variables.
exog = df_4.loc[:,['educ','age','couple','child']]
exog = sm.add_constant(exog,prepend=True)
#Create a vector of explained variables.
endog = df_4.loc[:,'emp_dummy']
Assume the latent variable is determined by $Y_{i}^{*} = X_{i}'\beta + e_{i}$, with $Y_{i} = 1$ if $Y_{i}^{*} > 0$ and $Y_{i} = 0$ otherwise. The log-likelihood function of a general binary choice model is

$$\ln L(\beta) = \sum_{i=1}^{n}\left\{ Y_{i}\ln F(X_{i}'\beta) + (1-Y_{i})\ln\bigl(1-F(X_{i}'\beta)\bigr)\right\},$$

which, when $F$ is symmetric around zero, can be written compactly as $\sum_{i=1}^{n}\ln F(q_{i}X_{i}'\beta)$ with $q_{i}=2Y_{i}-1$. So all we have to do is plug the standard normal distribution function in for $F(\cdot)$ and hand it to the class.
class MyProbit(GenericLikelihoodModel):
    def loglike(self, params):
        exog = self.exog
        endog = self.endog
        q = 2 * endog - 1
        #Log-likelihood of the probit model: sum of log Phi(q_i * x_i'beta)
        return stats.norm.logcdf(q * np.dot(exog, params)).sum()
It seems that you can estimate with just this.
#Fits to probit models
result_mine = MyProbit(endog, exog).fit(maxiter=1000)
#Check the result
result_mine.summary().tables[1]
|  | coef | std err | z | P>z | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | -0.3444 | 0.287 | -1.198 | 0.231 | -0.908 | 0.219 |
| educ | 0.0537 | 0.016 | 3.340 | 0.001 | 0.022 | 0.085 |
| age | 0.0053 | 0.003 | 1.790 | 0.073 | -0.001 | 0.011 |
| couple | -0.3391 | 0.097 | -3.489 | 0.000 | -0.530 | -0.149 |
| child | 0.0397 | 0.032 | 1.256 | 0.209 | -0.022 | 0.102 |
The results are the same as when using the probit model from statsmodels. There is no officially merged module for Tobit models either, but with GenericLikelihoodModel it seems you can build one simply by writing down the likelihood function.
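For instance, a type-I Tobit model with left-censoring at zero could be sketched roughly as follows. This is only a sketch under my own assumptions (it follows the pattern of the official negative binomial example and appends the error standard deviation `sigma` as an extra parameter; it is not part of the textbook's problem); in practice you would want to check it carefully, for example by reparameterizing in log(sigma) to keep the scale positive.

```python
class MyTobit(GenericLikelihoodModel):
    #Rough sketch of a type-I Tobit model, left-censored at 0 (not validated)
    def loglike(self, params):
        beta = params[:-1]
        sigma = params[-1]
        xb = np.dot(self.exog, beta)
        censored = self.endog <= 0
        #Censored observations contribute log P(y* <= 0) = log Phi(-x'b/sigma),
        #uncensored ones the log normal density of y with mean x'b and sd sigma
        ll = np.where(censored,
                      stats.norm.logcdf(-xb / sigma),
                      stats.norm.logpdf(self.endog, loc=xb, scale=sigma))
        return ll.sum()

    def fit(self, start_params=None, maxiter=1000, **kwargs):
        #Register the extra scale parameter so it appears in the summary
        self.exog_names.append('sigma')
        if start_params is None:
            start_params = np.append(np.zeros(self.exog.shape[1]), 1.0)
        return super().fit(start_params=start_params, maxiter=maxiter, **kwargs)
```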
Next is the Heckit model. As far as I could tell from the documentation, neither statsmodels nor linearmodels has a module that performs the two-step estimation or maximum likelihood estimation out of the box (if there is one, I would be grateful to hear about it). However, although not officially merged, there is a module called [Heckman] in a statsmodels branch.
Use this.
import Heckman
#Data reading and formatting
df = pd.read_stata('piaac.dta')
df_3 = df[df['gender']== 'Female']
df_3 = df_3[['educ','couple','child','age']]
df_3 = df_3.dropna()
df_3['wage'] = df['wage']
df_3[['educ','age']] = df_3[['educ','age']].apply(pd.to_numeric)
df_3['experience'] = df_3['age']-df_3['educ']-6
df_3['experience_2'] = (df_3['experience']**2)/100
df_3['logwage'] = np.log(df_3['wage'])
#Data confirmation
df_3
For some people, wages are not observed. It may be that they are not working in the first place because the offered wage is below their reservation wage. If we simply regress observed wages on the explanatory variables as they are, sample selection is likely to be a problem.
Anyway, in order to reproduce the table, let's first run OLS as is.
#OLS estimation
result_5_1 = ols(formula = 'logwage ~ educ + experience + experience_2',
data = df_3).fit(cov_type='HC1',use_t=True)
#Check the result
result_5_1.summary()
|  | coef | std err | t | P>t | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 5.8310 | 0.236 | 24.726 | 0.000 | 5.368 | 6.294 |
| educ | 0.0862 | 0.014 | 6.074 | 0.000 | 0.058 | 0.114 |
| experience | 0.0069 | 0.009 | 0.762 | 0.446 | -0.011 | 0.025 |
| experience_2 | -0.0126 | 0.015 | -0.820 | 0.413 | -0.043 | 0.018 |

No. Observations: 984
There are 1,747 observations in total, but only 984 of them are used.
#Create a vector of explained variables.
endog = df_3['logwage']
#Create a matrix of explanatory variables for the selection expression.
exog_select = df_3[['educ','experience','experience_2','couple','child']]
exog_select = sm.add_constant(exog_select,prepend=True)
#Create a matrix of explanatory variables for the second-stage equation.
exog = df_3[['educ','experience','experience_2']]
exog = sm.add_constant(exog,prepend=True)
#Heckman's two-step estimation
result_5_2 = Heckman.Heckman(endog,exog,exog_select).fit(method='twostep')
#Check the result
result_5_2.summary()
#Maximum likelihood estimation
result_5_4 = Heckman.Heckman(endog,exog,exog_select).fit(method='mle',
method_mle='nm',
maxiter_mle=2000)
#Check the result
result_5_4.summary()
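As a rough cross-check on the two-step results, Heckman's procedure can also be coded by hand: first a probit for whether the wage is observed, then OLS of log wages on the selected sample with the inverse Mills ratio added as a regressor. Below is a minimal sketch under my own assumptions (the names `work_dummy` and `imr` are mine, and the second-stage standard errors shown this way do not include the correction for the estimated Mills ratio):

```python
from scipy.stats import norm

#First stage: probit for whether the wage is observed
df_3['work_dummy'] = df_3['wage'].notna().astype(float)
first = probit(formula='work_dummy ~ educ + experience + experience_2 + couple + child',
               data=df_3).fit()

#Inverse Mills ratio evaluated at the first-stage linear index
X_sel = sm.add_constant(df_3[['educ', 'experience', 'experience_2', 'couple', 'child']],
                        prepend=True)
xb = np.dot(X_sel, first.params)
df_3['imr'] = norm.pdf(xb) / norm.cdf(xb)

#Second stage: OLS on the observed-wage sample, adding the inverse Mills ratio
second = ols(formula='logwage ~ educ + experience + experience_2 + imr',
             data=df_3).fit()
second.summary().tables[1]
```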
In preparing this answer, I referred to the following materials.
Bruce E. Hansen (2020) "Econometrics"