Solve Wooldridge exercises in Python

Introduction

Recently I started studying econometrics, so I will be working through the textbook's exercises in Python.

About the textbook

Solutions

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
import statsmodels.formula.api as smf
%matplotlib inline
import warnings
warnings.simplefilter('ignore')

Introduction

This notebook contains examples from Introductory Econometrics: A Modern Approach, 6e by Jeffrey M. Wooldridge. Each example illustrates how to load data, build econometric models, and compute estimates with Python.

Now, install and load the wooldridge package and let's get started!

# !pip install wooldridge
from wooldridge import dataWoo

Chapter 2: The Simple Regression Model

Example 2.10: A Log Wage Equation

Load the wage1 data and check out the documentation.

dataWoo('wage1', description=True)
name of dataset: wage1
no of variables: 24
no of observations: 526

+----------+---------------------------------+
 | variable | label                           |
+----------+---------------------------------+
 | wage     | average hourly earnings         |
 | educ     | years of education              |
 | exper    | years potential experience      |
 | tenure   | years with current employer     |
 | nonwhite | =1 if nonwhite                  |
 | female   | =1 if female                    |
 | married  | =1 if married                   |
 | numdep   | number of dependents            |
 | smsa     | =1 if live in SMSA              |
 | northcen | =1 if live in north central U.S |
 | south    | =1 if live in southern region   |
 | west     | =1 if live in western region    |
 | construc | =1 if work in construc. indus.  |
 | ndurman  | =1 if in nondur. manuf. indus.  |
 | trcommpu | =1 if in trans, commun, pub ut  |
 | trade    | =1 if in wholesale or retail    |
 | services | =1 if in services indus.        |
 | profserv | =1 if in prof. serv. indus.     |
 | profocc  | =1 if in profess. occupation    |
 | clerocc  | =1 if in clerical occupation    |
 | servocc  | =1 if in service occupation     |
 | lwage    | log(wage)                       |
 | expersq  | exper^2                         |
 | tenursq  | tenure^2                        |
+----------+---------------------------------+

These are data from the 1976 Current Population Survey, collected by
Henry Farber when he and I were colleagues at MIT in 1988.

The documentation indicates these are data from the 1976 Current Population Survey, collected by Henry Farber when he and Wooldridge were colleagues at MIT in 1988.

wage: average hourly earnings

educ: years of education

First, make a scatter-plot of the two variables and look for possible patterns in the relationship between them.

wage1 = dataWoo('wage1')
plt.scatter(wage1.educ, wage1.wage)
plt.title("Wages vs. Education, 1976")
plt.xlabel("years of education")
plt.ylabel("Hourly wages")
plt.show()

output_9_0.png

It appears that, on average, more years of education lead to higher wages.

The example in the text investigates the percentage change in wages associated with one more year of education, so we use log(wage) as the dependent variable.
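In a log-level model the slope is a semi-elasticity: 100·β1 approximates the percentage change in wage per extra year of education, and 100·(exp(β1) − 1) is the exact implied change. A quick sketch, using the textbook's estimate of roughly 0.083 as an illustrative value:

```python
import math

b1 = 0.083  # slope on educ in a log(wage) regression (illustrative value from the text)

approx_pct = 100 * b1                 # approximate % change in wage per year of educ
exact_pct = 100 * (math.exp(b1) - 1)  # exact % change implied by the log model

print(f"approximate: {approx_pct:.1f}%, exact: {exact_pct:.1f}%")
```

For small coefficients the two are nearly identical; the gap widens as the coefficient grows.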

Build a linear model to estimate the relationship between the log of wage (lwage) and education (educ).

\widehat{log(wage)} = \beta_0 + \beta_1 educ
y = wage1.lwage
x = wage1.educ

# sm.OLS takes (endog, exog) in that order; add a constant for the intercept
model = sm.OLS(y, sm.add_constant(x))
results = model.fit()

Print the estimated coefficients.

print(results.params)
const    0.583773
educ     0.082744
dtype: float64

This matches the textbook's fitted line, \widehat{log(wage)} = 0.584 + 0.083\,educ.

Plot the log(wage) vs educ, adding a line representing the least squares fit.

# Fit with a constant and pull out the intercept and slope for plotting
fit = sm.OLS(wage1.lwage, sm.add_constant(wage1.educ)).fit()
intercept, slope = fit.params


def abline(slope, intercept):
    """Plot a line from slope and intercept"""
    axes = plt.gca()
    x_vals = np.array(axes.get_xlim())
    y_vals = intercept + slope * x_vals
    plt.plot(x_vals, y_vals, '--')


plt.scatter(x, y)
abline(slope, intercept)
plt.title("A Log Wage Equation")
plt.xlabel("years of education")
plt.ylabel("log of average hourly wages")
plt.show()

output_15_0.png

Chapter 3: Multiple Regression Analysis: Estimation

Example 3.2: Hourly Wage Equation

Check the documentation for variable information

dataWoo('wage1', description=True)
name of dataset: wage1
no of variables: 24
no of observations: 526

+----------+---------------------------------+
 | variable | label                           |
+----------+---------------------------------+
 | wage     | average hourly earnings         |
 | educ     | years of education              |
 | exper    | years potential experience      |
 | tenure   | years with current employer     |
 | nonwhite | =1 if nonwhite                  |
 | female   | =1 if female                    |
 | married  | =1 if married                   |
 | numdep   | number of dependents            |
 | smsa     | =1 if live in SMSA              |
 | northcen | =1 if live in north central U.S |
 | south    | =1 if live in southern region   |
 | west     | =1 if live in western region    |
 | construc | =1 if work in construc. indus.  |
 | ndurman  | =1 if in nondur. manuf. indus.  |
 | trcommpu | =1 if in trans, commun, pub ut  |
 | trade    | =1 if in wholesale or retail    |
 | services | =1 if in services indus.        |
 | profserv | =1 if in prof. serv. indus.     |
 | profocc  | =1 if in profess. occupation    |
 | clerocc  | =1 if in clerical occupation    |
 | servocc  | =1 if in service occupation     |
 | lwage    | log(wage)                       |
 | expersq  | exper^2                         |
 | tenursq  | tenure^2                        |
+----------+---------------------------------+

These are data from the 1976 Current Population Survey, collected by
Henry Farber when he and I were colleagues at MIT in 1988.

lwage: log of the average hourly earnings

educ: years of education

exper: years of potential experience

tenure: years with current employer

Plot each of these variables against lwage and compare their distributions and the slopes (\hat{\beta}_1) of the simple regression lines.

y = wage1.lwage
x = wage1.educ

# Simple regression of lwage on educ, keeping the intercept and slope for the plot
fit = smf.ols("lwage ~ educ", data=wage1).fit()
intercept, slope = fit.params

plt.scatter(x, y)
abline(slope, intercept)
plt.title("A Log Wage Equation")
plt.xlabel("years of education")
plt.ylabel("log of average hourly wages")
plt.show()

output_19_0.png

y = wage1.lwage
x = wage1.exper

# Simple regression of lwage on exper
fit = smf.ols("lwage ~ exper", data=wage1).fit()
intercept, slope = fit.params

plt.scatter(x, y)
abline(slope, intercept)
plt.title("A Log Wage Equation")
plt.xlabel("years of potential experience")
plt.ylabel("log of average hourly wages")
plt.show()

output_20_0.png

y = wage1.lwage
x = wage1.tenure

# Simple regression of lwage on tenure
fit = smf.ols("lwage ~ tenure", data=wage1).fit()
intercept, slope = fit.params

plt.scatter(x, y)
abline(slope, intercept)
plt.title("A Log Wage Equation")
plt.xlabel("years with current employer")
plt.ylabel("log of average hourly wages")
plt.show()

output_21_0.png
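The three near-identical cells above could also be collapsed into a loop. A minimal sketch, using a small synthetic DataFrame in place of wage1 so it runs on its own:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for wage1 (illustrative data, not the CPS sample)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "educ": rng.integers(8, 18, n),
    "exper": rng.integers(0, 40, n),
    "tenure": rng.integers(0, 25, n),
})
df["lwage"] = 0.5 + 0.08 * df.educ + 0.01 * df.exper + 0.02 * df.tenure + rng.normal(0, 0.3, n)

# One simple regression per explanatory variable, collecting the fitted slopes
slopes = {}
for var in ["educ", "exper", "tenure"]:
    fit = smf.ols(f"lwage ~ {var}", data=df).fit()
    slopes[var] = fit.params[var]

print(slopes)
```

With wage1 loaded, replacing df with wage1 reproduces the three regressions above (the plotting calls can go inside the loop the same way).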

Estimate the model regressing log(wage) on educ, exper, and tenure.

\widehat{log(wage)} = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 tenure
results = smf.ols("lwage ~ educ + exper + tenure", data=wage1).fit()
results.params
Intercept    0.284360
educ         0.092029
exper        0.004121
tenure       0.022067
dtype: float64
coef_df = pd.DataFrame(results.params)
coef_df.drop(index=['Intercept'], inplace=True)
coef_df.rename(columns={0: "coef"}, inplace=True)
coef_df
            coef
educ    0.092029
exper   0.004121
tenure  0.022067
coef_df.sort_values(by='coef', ascending=False).plot.bar()

output_26_1.png

Chapter 4: Multiple Regression Analysis: Inference

Example 4.1 Hourly Wage Equation

Using the same model estimated in Example 3.2, examine and compare the standard errors associated with each coefficient. In the textbook these appear in parentheses below the corresponding coefficients.

results = smf.ols("lwage ~ educ + exper + tenure", data=wage1).fit()
results.summary()
OLS Regression Results
Dep. Variable: lwage R-squared: 0.316
Model: OLS Adj. R-squared: 0.312
Method: Least Squares F-statistic: 80.39
Date: Sat, 09 May 2020 Prob (F-statistic): 9.13e-43
Time: 15:36:55 Log-Likelihood: -313.55
No. Observations: 526 AIC: 635.1
Df Residuals: 522 BIC: 652.2
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.2844 0.104 2.729 0.007 0.080 0.489
educ 0.0920 0.007 12.555 0.000 0.078 0.106
exper 0.0041 0.002 2.391 0.017 0.001 0.008
tenure 0.0221 0.003 7.133 0.000 0.016 0.028
Omnibus: 11.534 Durbin-Watson: 1.769
Prob(Omnibus): 0.003 Jarque-Bera (JB): 20.941
Skew: 0.021 Prob(JB): 2.84e-05
Kurtosis: 3.977 Cond. No. 135.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Compute the t statistic for exper by hand: the coefficient divided by its standard error.

results.params['exper']/results.bse['exper']
2.3914371073142298
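The two-sided p-value follows from the t distribution with n − k − 1 = 522 degrees of freedom. A sketch using scipy, with the t statistic taken from the summary above:

```python
from scipy import stats

t_stat = 2.3914   # t statistic for exper from the summary table
df = 526 - 3 - 1  # observations minus slope parameters minus 1

p_value = 2 * stats.t.sf(t_stat, df)  # two-sided p-value
print(round(p_value, 3))  # 0.017, matching P>|t| in the summary
```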

Example 4.7 Effect of Job Training on Firm Scrap Rates

Load the jtrain data set.

dataWoo('jtrain', description=True)
name of dataset: jtrain
no of variables: 30
no of observations: 471

+----------+---------------------------------+
 | variable | label                           |
+----------+---------------------------------+
 | year     | 1987, 1988, or 1989             |
 | fcode    | firm code number                |
 | employ   | # employees at plant            |
 | sales    | annual sales, $                 |
 | avgsal   | average employee salary         |
 | scrap    | scrap rate (per 100 items)      |
 | rework   | rework rate (per 100 items)     |
 | tothrs   | total hours training            |
 | union    | =1 if unionized                 |
 | grant    | = 1 if received grant           |
 | d89      | = 1 if year = 1989              |
 | d88      | = 1 if year = 1988              |
 | totrain  | total employees trained         |
 | hrsemp   | tothrs/totrain                  |
 | lscrap   | log(scrap)                      |
 | lemploy  | log(employ)                     |
 | lsales   | log(sales)                      |
 | lrework  | log(rework)                     |
 | lhrsemp  | log(1 + hrsemp)                 |
 | lscrap_1 | lagged lscrap; missing 1987     |
 | grant_1  | lagged grant; assumed 0 in 1987 |
 | clscrap  | lscrap - lscrap_1; year > 1987  |
 | cgrant   | grant - grant_1                 |
 | clemploy | lemploy - lemploy[_n-1]         |
 | clsales  | lavgsal - lavgsal[_n-1]         |
 | lavgsal  | log(avgsal)                     |
 | clavgsal | lavgsal - lavgsal[_n-1]         |
 | cgrant_1 | cgrant[_n-1]                    |
 | chrsemp  | hrsemp - hrsemp[_n-1]           |
 | clhrsemp | lhrsemp - lhrsemp[_n-1]         |
+----------+---------------------------------+

H. Holzer, R. Block, M. Cheatham, and J. Knott (1993), “Are Training
Subsidies Effective? The Michigan Experience,” Industrial and Labor
Relations Review 46, 625-636. The authors kindly provided the data.
jtrain = dataWoo('jtrain')
jtrain_subset = jtrain[jtrain.year == 1987][['year', 'union', 'lscrap',
                                             'hrsemp', 'lsales', 'lemploy']]
jtrain_subset.head()
year union lscrap hrsemp lsales lemploy
0 1987 0 NaN 12.0 17.665659 4.605170
3 1987 0 NaN 12.0 14.260197 2.484907
6 1987 0 NaN 37.5 13.527828 2.995732
9 1987 0 NaN 0.0 16.982714 5.298317
12 1987 0 NaN NaN 15.607270 NaN
jtrain_subset.isnull().sum()
year         0
union        0
lscrap     103
hrsemp      28
lsales      38
lemploy     13
dtype: int64
jtrain_clean = jtrain_subset.dropna()
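Dropping the rows by hand works; note also that statsmodels' formula interface drops rows with missing values by default, so fitting on the subset directly uses the same complete cases. A self-contained sketch of that behavior on synthetic data (not jtrain):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Tiny frame with one missing value in each column
df = pd.DataFrame({
    "y": [1.0, 2.0, np.nan, 4.0, 5.0, 6.0],
    "x": [0.5, np.nan, 1.5, 2.0, 2.5, 3.0],
})

fit_auto = smf.ols("y ~ x", data=df).fit()             # NaN rows dropped automatically
fit_manual = smf.ols("y ~ x", data=df.dropna()).fit()  # explicit dropna, same estimates

print(fit_auto.nobs, fit_manual.nobs)  # both fit on the 4 complete rows
```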

Now create the linear model regressing lscrap (the log of the scrap rate) on hrsemp (total training hours per employee), lsales (the log of annual sales), and lemploy (the log of the number of employees).

\widehat{lscrap} = \beta_0 + \beta_1 hrsemp + \beta_2 lsales + \beta_3 lemploy
results = smf.ols("lscrap ~ hrsemp + lsales + lemploy",
                  data=jtrain_clean).fit()
results.summary()
OLS Regression Results
Dep. Variable: lscrap R-squared: 0.310
Model: OLS Adj. R-squared: 0.257
Method: Least Squares F-statistic: 5.838
Date: Sat, 09 May 2020 Prob (F-statistic): 0.00215
Time: 15:36:55 Log-Likelihood: -70.198
No. Observations: 43 AIC: 148.4
Df Residuals: 39 BIC: 155.4
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 11.7443 4.575 2.567 0.014 2.491 20.997
hrsemp -0.0422 0.019 -2.259 0.030 -0.080 -0.004
lsales -0.9506 0.370 -2.570 0.014 -1.699 -0.203
lemploy 0.9921 0.357 2.780 0.008 0.270 1.714
Omnibus: 1.109 Durbin-Watson: 2.163
Prob(Omnibus): 0.574 Jarque-Bera (JB): 0.934
Skew: 0.093 Prob(JB): 0.627
Kurtosis: 2.302 Cond. No. 417.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Chapter 5: Multiple Regression Analysis: OLS Asymptotics

Example 5.1: Housing Prices and Distance From an Incinerator

Load the hprice3 data set.

dataWoo('hprice3', description=True)
name of dataset: hprice3
no of variables: 19
no of observations: 321

+----------+---------------------------------+
 | variable | label                           |
+----------+---------------------------------+
 | year     | 1978, 1981                      |
 | age      | age of house                    |
 | agesq    | age^2                           |
 | nbh      | neighborhood, 1-6               |
 | cbd      | dist. to cent. bus. dstrct, ft. |
 | inst     | dist. to interstate, ft.        |
 | linst    | log(inst)                       |
 | price    | selling price                   |
 | rooms    | # rooms in house                |
 | area     | square footage of house         |
 | land     | square footage lot              |
 | baths    | # bathrooms                     |
 | dist     | dist. from house to incin., ft. |
 | ldist    | log(dist)                       |
 | lprice   | log(price)                      |
 | y81      | =1 if year = 1981               |
 | larea    | log(area)                       |
 | lland    | log(land)                       |
 | linstsq  | linst^2                         |
+----------+---------------------------------+

hprice3 = dataWoo('hprice3')

Next, model the log(price) against the log(dist) to estimate the percentage relationship between the two.

price_dist_model = smf.ols("lprice ~ ldist", data=hprice3).fit()

Create another model that controls for "quality" variables, such as square footage area per house.

price_area_model = smf.ols("lprice ~ ldist + larea", data=hprice3).fit()
price_dist_model.summary()
OLS Regression Results
Dep. Variable: lprice R-squared: 0.120
Model: OLS Adj. R-squared: 0.117
Method: Least Squares F-statistic: 43.48
Date: Sat, 09 May 2020 Prob (F-statistic): 1.78e-10
Time: 15:36:55 Log-Likelihood: -169.60
No. Observations: 321 AIC: 343.2
Df Residuals: 319 BIC: 350.7
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 8.2575 0.474 17.427 0.000 7.325 9.190
ldist 0.3172 0.048 6.594 0.000 0.223 0.412
Omnibus: 3.073 Durbin-Watson: 0.944
Prob(Omnibus): 0.215 Jarque-Bera (JB): 2.868
Skew: 0.228 Prob(JB): 0.238
Kurtosis: 3.079 Cond. No. 205.


Warnings:

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
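The ldist coefficient in this log-log model is an elasticity: a 1% increase in distance from the incinerator is associated with roughly a 0.32% higher price. For a larger change, say doubling the distance, compound the effect (coefficient taken from the summary above):

```python
beta_ldist = 0.3172  # elasticity of price with respect to distance, from the summary

# Doubling distance multiplies price by 2**beta; express as a percentage change
pct_change = 100 * (2 ** beta_ldist - 1)
print(f"{pct_change:.1f}%")  # about 24.6%
```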

price_area_model.summary()
OLS Regression Results
Dep. Variable: lprice R-squared: 0.474
Model: OLS Adj. R-squared: 0.471
Method: Least Squares F-statistic: 143.2
Date: Sat, 09 May 2020 Prob (F-statistic): 4.53e-45
Time: 15:36:55 Log-Likelihood: -87.041
No. Observations: 321 AIC: 180.1
Df Residuals: 318 BIC: 191.4
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 3.4939 0.491 7.121 0.000 2.529 4.459
ldist 0.1962 0.038 5.142 0.000 0.121 0.271
larea 0.7837 0.054 14.625 0.000 0.678 0.889
Omnibus: 13.639 Durbin-Watson: 1.118
Prob(Omnibus): 0.001 Jarque-Bera (JB): 18.845
Skew: -0.334 Prob(JB): 8.09e-05
Kurtosis: 3.982 Cond. No. 345.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Chapter 6: Multiple Regression: Further Issues

Example 6.1: Effects of Pollution on Housing Prices, standardized.

Load the hprice2 data and view the documentation.

dataWoo('hprice2', description=True)
name of dataset: hprice2
no of variables: 12
no of observations: 506

+----------+-------------------------------+
 | variable | label                         |
+----------+-------------------------------+
 | price    | median housing price, $       |
 | crime    | crimes committed per capita   |
 | nox      | nit ox concen; parts per 100m |
 | rooms    | avg number of rooms           |
 | dist     | wght dist to 5 employ centers |
 | radial   | access. index to rad. hghwys  |
 | proptax  | property tax per $1000        |
 | stratio  | average student-teacher ratio |
 | lowstat  | perc of people 'lower status' |
 | lprice   | log(price)                    |
 | lnox     | log(nox)                      |
 | lproptax | log(proptax)                  |
+----------+-------------------------------+

D. Harrison and D.L. Rubinfeld (1978), “Hedonic Housing Prices and the
Demand for Clean Air,” by Harrison, D. and D.L.Rubinfeld, Journal of
Environmental Economics and Management 5, 81-102. Diego Garcia, a
former Ph.D. student in economics at MIT, kindly provided these data,
which he obtained from the book Regression Diagnostics: Identifying
Influential Data and Sources of Collinearity, by D.A. Belsey, E. Kuh,
and R. Welsch, 1990. New York: Wiley.
price = \beta_0 + \beta_1nox + \beta_2crime + \beta_3rooms + \beta_4dist + \beta_5stratio + \mu

Estimate the usual lm model.

hprice2 = dataWoo('hprice2')
housing_level = smf.ols("price ~ nox + crime + rooms + dist + stratio",
                        data=hprice2).fit()

Estimate the same model, but with standardized (beta) coefficients, z-scoring each variable with scikit-learn's scale function:

\widehat{zprice} = \beta_1znox + \beta_2zcrime + \beta_3zrooms + \beta_4zdist + \beta_5zstratio
from sklearn.preprocessing import scale

zprice = pd.DataFrame({"zprice":scale(hprice2["price"])})
znox = pd.DataFrame({"znox":scale(hprice2["nox"])})
zcrime = pd.DataFrame({"zcrime":scale(hprice2["crime"])})
zrooms = pd.DataFrame({"zrooms":scale(hprice2["rooms"])})
zdist = pd.DataFrame({"zdist":scale(hprice2["dist"])})
zstratio = pd.DataFrame({"zstratio":scale(hprice2["stratio"])})
hprice2 = pd.concat([hprice2, zprice, znox, zcrime, zrooms, zdist, zstratio],axis=1)

housing_standardized = smf.ols("zprice ~ znox + zcrime + zrooms + zdist + zstratio"
                               , data = hprice2).fit()
housing_level.summary()
OLS Regression Results
Dep. Variable: price R-squared: 0.636
Model: OLS Adj. R-squared: 0.632
Method: Least Squares F-statistic: 174.5
Date: Sat, 09 May 2020 Prob (F-statistic): 3.61e-107
Time: 15:36:56 Log-Likelihood: -5080.8
No. Observations: 506 AIC: 1.017e+04
Df Residuals: 500 BIC: 1.020e+04
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 2.087e+04 5054.599 4.129 0.000 1.09e+04 3.08e+04
nox -2706.4326 354.087 -7.643 0.000 -3402.114 -2010.751
crime -153.6010 32.929 -4.665 0.000 -218.297 -88.905
rooms 6735.4983 393.604 17.112 0.000 5962.177 7508.819
dist -1026.8063 188.108 -5.459 0.000 -1396.386 -657.227
stratio -1149.2038 127.429 -9.018 0.000 -1399.566 -898.842
Omnibus: 272.145 Durbin-Watson: 0.865
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2647.578
Skew: 2.150 Prob(JB): 0.00
Kurtosis: 13.348 Cond. No. 432.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
housing_standardized.summary()
OLS Regression Results
Dep. Variable: zprice R-squared: 0.636
Model: OLS Adj. R-squared: 0.632
Method: Least Squares F-statistic: 174.5
Date: Sat, 09 May 2020 Prob (F-statistic): 3.61e-107
Time: 15:36:56 Log-Likelihood: -462.53
No. Observations: 506 AIC: 937.1
Df Residuals: 500 BIC: 962.4
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 9.368e-17 0.027 3.47e-15 1.000 -0.053 0.053
znox -0.3404 0.045 -7.643 0.000 -0.428 -0.253
zcrime -0.1433 0.031 -4.665 0.000 -0.204 -0.083
zrooms 0.5139 0.030 17.112 0.000 0.455 0.573
zdist -0.2348 0.043 -5.459 0.000 -0.319 -0.150
zstratio -0.2703 0.030 -9.018 0.000 -0.329 -0.211
Omnibus: 272.145 Durbin-Watson: 0.865
Prob(Omnibus): 0.000 Jarque-Bera (JB): 2647.578
Skew: 2.150 Prob(JB): 0.00
Kurtosis: 13.348 Cond. No. 3.33


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Example 6.2: Effects of Pollution on Housing Prices, Quadratic Interactive Term

Modify the housing model from example 4.5, adding a quadratic term in rooms:

log(price) = \beta_0 + \beta_1log(nox) + \beta_2log(dist) + \beta_3rooms + \beta_4rooms^2 + \beta_5stratio + \mu
ldist = pd.DataFrame({"ldist":np.log(hprice2["dist"])})
roomssq = pd.DataFrame({"roomssq":hprice2["rooms"]**2})
hprice2 = pd.concat([hprice2, ldist, roomssq],axis=1)

housing_model_4_5 = smf.ols("lprice ~ lnox + ldist + rooms + stratio"
                            , data=hprice2).fit()
housing_model_6_2 = smf.ols("lprice ~ lnox + ldist + rooms + roomssq + stratio"
                            , data=hprice2).fit()
housing_model_4_5.summary()
OLS Regression Results
Dep. Variable: lprice R-squared: 0.584
Model: OLS Adj. R-squared: 0.581
Method: Least Squares F-statistic: 175.9
Date: Sat, 09 May 2020 Prob (F-statistic): 5.53e-94
Time: 15:36:56 Log-Likelihood: -43.495
No. Observations: 506 AIC: 96.99
Df Residuals: 501 BIC: 118.1
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 11.0839 0.318 34.843 0.000 10.459 11.709
lnox -0.9535 0.117 -8.168 0.000 -1.183 -0.724
ldist -0.1343 0.043 -3.117 0.002 -0.219 -0.050
rooms 0.2545 0.019 13.736 0.000 0.218 0.291
stratio -0.0525 0.006 -8.894 0.000 -0.064 -0.041
Omnibus: 61.317 Durbin-Watson: 0.682
Prob(Omnibus): 0.000 Jarque-Bera (JB): 480.143
Skew: 0.051 Prob(JB): 5.47e-105
Kurtosis: 7.771 Cond. No. 560.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
housing_model_6_2.summary()
OLS Regression Results
Dep. Variable: lprice R-squared: 0.603
Model: OLS Adj. R-squared: 0.599
Method: Least Squares F-statistic: 151.8
Date: Sat, 09 May 2020 Prob (F-statistic): 7.89e-98
Time: 15:36:56 Log-Likelihood: -31.806
No. Observations: 506 AIC: 75.61
Df Residuals: 500 BIC: 101.0
Df Model: 5
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 13.3855 0.566 23.630 0.000 12.273 14.498
lnox -0.9017 0.115 -7.862 0.000 -1.127 -0.676
ldist -0.0868 0.043 -2.005 0.045 -0.172 -0.002
rooms -0.5451 0.165 -3.295 0.001 -0.870 -0.220
roomssq 0.0623 0.013 4.862 0.000 0.037 0.087
stratio -0.0476 0.006 -8.129 0.000 -0.059 -0.036
Omnibus: 56.649 Durbin-Watson: 0.691
Prob(Omnibus): 0.000 Jarque-Bera (JB): 384.168
Skew: -0.100 Prob(JB): 3.79e-84
Kurtosis: 7.264 Cond. No. 2.30e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.3e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
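With a negative linear term and a positive quadratic term, the estimated effect of rooms on log(price) is U-shaped, and the turning point sits at |β_rooms| / (2·β_roomssq). Using the coefficients from the summary above:

```python
b_rooms = -0.5451   # coefficient on rooms
b_roomssq = 0.0623  # coefficient on rooms squared

turning_point = -b_rooms / (2 * b_roomssq)
print(round(turning_point, 2))  # about 4.37 rooms
```

Taken literally, the model predicts price falling in rooms below this point and rising above it; most houses in the sample are on the rising side.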

Chapter 7: Multiple Regression Analysis with Qualitative Information

Example 7.4: Housing Price Regression, Qualitative Binary variable

This time, use the hprice1 data.

dataWoo('hprice1', description=True)
name of dataset: hprice1
no of variables: 10
no of observations: 88

+----------+------------------------------+
 | variable | label                        |
+----------+------------------------------+
 | price    | house price, $1000s          |
 | assess   | assessed value, $1000s       |
 | bdrms    | number of bdrms              |
 | lotsize  | size of lot in square feet   |
 | sqrft    | size of house in square feet |
 | colonial | =1 if home is colonial style |
 | lprice   | log(price)                   |
 | lassess  | log(assess                   |
 | llotsize | log(lotsize)                 |
 | lsqrft   | log(sqrft)                   |
+----------+------------------------------+

Collected from the real estate pages of the Boston Globe during 1990.
These are homes that sold in the Boston, MA area.
\widehat{log(price)} = \beta_0 + \beta_1log(lotsize) + \beta_2log(sqrft) + \beta_3bdrms + \beta_4colonial

Estimate the coefficients of the above linear model on the hprice data set.

hprice1 = dataWoo('hprice1')
housing_qualitative = smf.ols("lprice ~ llotsize + lsqrft + bdrms + colonial",
                              data=hprice1).fit()
housing_qualitative.summary()
OLS Regression Results
Dep. Variable: lprice R-squared: 0.649
Model: OLS Adj. R-squared: 0.632
Method: Least Squares F-statistic: 38.38
Date: Sat, 09 May 2020 Prob (F-statistic): 3.74e-18
Time: 15:36:56 Log-Likelihood: 26.619
No. Observations: 88 AIC: -43.24
Df Residuals: 83 BIC: -30.85
Df Model: 4
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -1.3496 0.651 -2.073 0.041 -2.644 -0.055
llotsize 0.1678 0.038 4.395 0.000 0.092 0.244
lsqrft 0.7072 0.093 7.620 0.000 0.523 0.892
bdrms 0.0268 0.029 0.934 0.353 -0.030 0.084
colonial 0.0538 0.045 1.202 0.233 -0.035 0.143
Omnibus: 13.728 Durbin-Watson: 2.077
Prob(Omnibus): 0.001 Jarque-Bera (JB): 50.828
Skew: -0.053 Prob(JB): 9.18e-12
Kurtosis: 6.722 Cond. No. 411.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Chapter 8: Heteroskedasticity

Example 8.9: Determinants of Personal Computer Ownership

\widehat{PC} = \beta_0 + \beta_1 hsGPA + \beta_2 ACT + \beta_3 parcoll

Christopher Lemmon, a former MSU undergraduate, collected these data from a survey he took of MSU students in Fall 1994. Load gpa1 and combine fathcoll and mothcoll into a new variable, parcoll, indicating whether either parent went to college.

dataWoo('gpa1', description=True)
name of dataset: gpa1
no of variables: 29
no of observations: 141

+----------+--------------------------------+
 | variable | label                          |
+----------+--------------------------------+
 | age      | in years                       |
 | soph     | =1 if sophomore                |
 | junior   | =1 if junior                   |
 | senior   | =1 if senior                   |
 | senior5  | =1 if fifth year senior        |
 | male     | =1 if male                     |
 | campus   | =1 if live on campus           |
 | business | =1 if business major           |
 | engineer | =1 if engineering major        |
 | colGPA   | MSU GPA                        |
 | hsGPA    | high school GPA                |
 | ACT      | 'achievement' score            |
 | job19    | =1 if job <= 19 hours          |
 | job20    | =1 if job >= 20 hours          |
 | drive    | =1 if drive to campus          |
 | bike     | =1 if bicycle to campus        |
 | walk     | =1 if walk to campus           |
 | voluntr  | =1 if do volunteer work        |
 | PC       | =1 of pers computer at sch     |
 | greek    | =1 if fraternity or sorority   |
 | car      | =1 if own car                  |
 | siblings | =1 if have siblings            |
 | bgfriend | =1 if boy- or girlfriend       |
 | clubs    | =1 if belong to MSU club       |
 | skipped  | avg lectures missed per week   |
 | alcohol  | avg # days per week drink alc. |
 | gradMI   | =1 if Michigan high school     |
 | fathcoll | =1 if father college grad      |
 | mothcoll | =1 if mother college grad      |
+----------+--------------------------------+

Christopher Lemmon, a former MSU undergraduate, collected these data
from a survey he took of MSU students in Fall 1994.
gpa1 = dataWoo('gpa1')

# =1 if either parent graduated from college; each comparison needs its own
# parentheses, since `|` binds more tightly than `==` in Python
gpa1['parcoll'] = ((gpa1.fathcoll == 1) | (gpa1.mothcoll == 1)).astype(int)
GPA_OLS = smf.ols("PC ~ hsGPA + ACT + parcoll", data=gpa1).fit()
GPA_OLS.summary()
OLS Regression Results
Dep. Variable: PC R-squared: 0.023
Model: OLS Adj. R-squared: 0.002
Method: Least Squares F-statistic: 1.087
Date: Sat, 09 May 2020 Prob (F-statistic): 0.357
Time: 15:36:56 Log-Likelihood: -97.630
No. Observations: 141 AIC: 203.3
Df Residuals: 137 BIC: 215.1
Df Model: 3
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 0.1107 0.492 0.225 0.822 -0.862 1.083
hsGPA 0.0489 0.138 0.354 0.724 -0.224 0.322
ACT 0.0014 0.016 0.090 0.929 -0.030 0.032
parcoll 0.1463 0.085 1.728 0.086 -0.021 0.314
Omnibus: 1290.376 Durbin-Watson: 1.780
Prob(Omnibus): 0.000 Jarque-Bera (JB): 21.607
Skew: 0.405 Prob(JB): 2.03e-05
Kurtosis: 1.262 Cond. No. 298.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Chapter 9: More on Specification and Data Issues

Example 9.8: R&D Intensity and Firm Size

rdintens = \beta_0 + \beta_1sales + \beta_2profmarg + \mu

From Businessweek R&D Scoreboard, October 25, 1991. Load the data and estimate the model.

rdchem = dataWoo("rdchem")

all_rdchem = smf.ols("rdintens ~ sales + profmarg", data=rdchem).fit()
all_rdchem.summary()
OLS Regression Results
Dep. Variable: rdintens R-squared: 0.076
Model: OLS Adj. R-squared: 0.012
Method: Least Squares F-statistic: 1.195
Date: Sat, 09 May 2020 Prob (F-statistic): 0.317
Time: 15:36:56 Log-Likelihood: -63.725
No. Observations: 32 AIC: 133.4
Df Residuals: 29 BIC: 137.8
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 2.6253 0.586 4.484 0.000 1.428 3.823
sales 5.338e-05 4.41e-05 1.211 0.236 -3.68e-05 0.000
profmarg 0.0446 0.046 0.966 0.342 -0.050 0.139
Omnibus: 20.499 Durbin-Watson: 1.694
Prob(Omnibus): 0.000 Jarque-Bera (JB): 28.349
Skew: 1.674 Prob(JB): 6.98e-07
Kurtosis: 6.170 Cond. No. 1.49e+04


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.49e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

One firm's annual sales dwarf those of the rest of the sample (note the large condition number above). So, we can estimate the model without that data point to gain a better understanding of how sales and profmarg describe rdintens for most firms, subsetting the DataFrame to keep only observations with sales below the maximum.

rdchem_sub = rdchem[rdchem.sales < max(rdchem.sales)]

smallest_rdchem = smf.ols("rdintens ~ sales + profmarg", data=rdchem_sub).fit()
smallest_rdchem.summary()
OLS Regression Results
Dep. Variable: rdintens R-squared: 0.173
Model: OLS Adj. R-squared: 0.114
Method: Least Squares F-statistic: 2.925
Date: Sat, 09 May 2020 Prob (F-statistic): 0.0702
Time: 15:36:56 Log-Likelihood: -60.496
No. Observations: 31 AIC: 127.0
Df Residuals: 28 BIC: 131.3
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 2.2969 0.592 3.881 0.001 1.085 3.509
sales 0.0002 8.42e-05 2.204 0.036 1.31e-05 0.000
profmarg 0.0478 0.044 1.075 0.291 -0.043 0.139
Omnibus: 19.377 Durbin-Watson: 1.694
Prob(Omnibus): 0.000 Jarque-Bera (JB): 25.558
Skew: 1.623 Prob(JB): 2.82e-06
Kurtosis: 6.040 Cond. No. 8.56e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.56e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
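Dropping the maximum-sales firm is a judgment call; influence statistics such as Cook's distance make the choice systematic. Here is a minimal sketch on synthetic data with one planted outlier (the variable names are hypothetical, not from rdchem):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 32
x = rng.normal(0, 1, n)
y = 1.0 + 0.5 * x + rng.normal(0, 1, n)
x[-1], y[-1] = 10.0, -20.0                 # plant one high-leverage outlier
df = pd.DataFrame({"y": y, "x": x})

fit = smf.ols("y ~ x", data=df).fit()
# get_influence() exposes leverage, studentized residuals, Cook's distance, ...
cooks_d = fit.get_influence().cooks_distance[0]

# The planted point dominates Cook's distance
print(int(np.argmax(cooks_d)))  # -> 31
```

Applied to rdchem, the same diagnostic would single out the largest firm before any ad-hoc filtering.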

Chapter 10: Basic Regression Analysis with Time Series Data

Example 10.2: Effects of Inflation and Deficits on Interest Rates

\widehat{i3}_t = \hat{\beta}_0 + \hat{\beta}_1 inf_t + \hat{\beta}_2 def_t

Data from the Economic Report of the President, 2004, Tables B-64, B-73, and B-79.

dataWoo("intdef", description=True)
name of dataset: intdef
no of variables: 13
no of observations: 56

+----------+----------------------------------+
 | variable | label                            |
+----------+----------------------------------+
 | year     | 1948 to 2003                     |
 | i3       | 3 month T-bill rate              |
 | inf      | CPI inflation rate               |
 | rec      | federal receipts, % GDP          |
 | out      | federal outlays, % GDP           |
 | def      | out - rec                        |
 | i3_1     | i3[_n-1]                         |
 | inf_1    | inf[_n-1]                        |
 | def_1    | def[_n-1]                        |
 | ci3      | i3 - i3_1                        |
 | cinf     | inf - inf_1                      |
 | cdef     | def - def_1                      |
 | y77      | =1 if year >= 1977; change in FY |
+----------+----------------------------------+

intdef = dataWoo("intdef")

# "def" is a Python keyword and breaks the formula parser, so copy both
# columns to safe names before estimating
intdef['infla'] = intdef['inf']
intdef['defla'] = intdef['def']
tbill_model = smf.ols("i3 ~ infla + defla", data=intdef).fit()
tbill_model.summary()
tbill_model.summary()
OLS Regression Results
Dep. Variable: i3 R-squared: 0.602
Model: OLS Adj. R-squared: 0.587
Method: Least Squares F-statistic: 40.09
Date: Sat, 09 May 2020 Prob (F-statistic): 2.48e-11
Time: 15:36:56 Log-Likelihood: -112.16
No. Observations: 56 AIC: 230.3
Df Residuals: 53 BIC: 236.4
Df Model: 2
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 1.7333 0.432 4.012 0.000 0.867 2.600
infla 0.6059 0.082 7.376 0.000 0.441 0.771
defla 0.5131 0.118 4.334 0.000 0.276 0.751
Omnibus: 0.260 Durbin-Watson: 0.716
Prob(Omnibus): 0.878 Jarque-Bera (JB): 0.015
Skew: -0.028 Prob(JB): 0.992
Kurtosis: 3.058 Cond. No. 9.28


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
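Renaming the columns works, but patsy (the formula engine behind statsmodels) also provides Q() to quote column names that clash with Python syntax. A sketch on a toy DataFrame (the data here is synthetic, only the column names mimic intdef):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 56
toy = pd.DataFrame({
    "inf": rng.normal(4, 2, n),
    "def": rng.normal(2, 1, n),   # "def" is a keyword, unusable directly in a formula
})
toy["i3"] = 1.7 + 0.6 * toy["inf"] + 0.5 * toy["def"] + rng.normal(0, 1, n)

# Q("...") looks the column up by its literal name, sidestepping the keyword clash
m = smf.ols('i3 ~ Q("inf") + Q("def")', data=toy).fit()
print(m.params.index.tolist())
```

Either approach gives the same estimates; renaming just keeps the formula easier to read.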

Example 10.11: Seasonal Effects of Antidumping Filings

C.M. Krupp and P.S. Pollard (1999), "Market Responses to Antidumping Laws: Some Evidence from the U.S. Chemical Industry," Canadian Journal of Economics 29, 199-227. Dr. Krupp kindly provided the data. They are monthly data covering February 1978 through December 1988.

dataWoo("barium", description=True)
name of dataset: barium
no of variables: 31
no of observations: 131

+----------+---------------------------------+
 | variable | label                           |
+----------+---------------------------------+
 | chnimp   | Chinese imports, bar. chl.      |
 | bchlimp  | total imports bar. chl.         |
 | befile6  | =1 for all 6 mos before filing  |
 | affile6  | =1 for all 6 mos after filing   |
 | afdec6   | =1 for all 6 mos after decision |
 | befile12 | =1 all 12 mos before filing     |
 | affile12 | =1 all 12 mos after filing      |
 | afdec12  | =1 all 12 mos after decision    |
 | chempi   | chemical production index       |
 | gas      | gasoline production             |
 | rtwex    | exchange rate index             |
 | spr      | =1 for spring months            |
 | sum      | =1 for summer months            |
 | fall     | =1 for fall months              |
 | lchnimp  | log(chnimp)                     |
 | lgas     | log(gas)                        |
 | lrtwex   | log(rtwex)                      |
 | lchempi  | log(chempi)                     |
 | t        | time trend                      |
 | feb      | =1 if month is feb              |
 | mar      | =1 if month is march            |
 | apr      |                                 |
 | may      |                                 |
 | jun      |                                 |
 | jul      |                                 |
 | aug      |                                 |
 | sep      |                                 |
 | oct      |                                 |
 | nov      |                                 |
 | dec      |                                 |
 | percchn  | % imports from china            |
+----------+---------------------------------+

barium = dataWoo("barium")

barium_imports = smf.ols(
    "lchnimp ~ lchempi + lgas + lrtwex + befile6 + affile6 + afdec6",
    data=barium).fit()
barium_imports.summary()
OLS Regression Results
Dep. Variable: lchnimp R-squared: 0.305
Model: OLS Adj. R-squared: 0.271
Method: Least Squares F-statistic: 9.064
Date: Sat, 09 May 2020 Prob (F-statistic): 3.25e-08
Time: 15:36:56 Log-Likelihood: -114.79
No. Observations: 131 AIC: 243.6
Df Residuals: 124 BIC: 263.7
Df Model: 6
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept -17.8030 21.045 -0.846 0.399 -59.458 23.852
lchempi 3.1172 0.479 6.505 0.000 2.169 4.066
lgas 0.1964 0.907 0.217 0.829 -1.598 1.991
lrtwex 0.9830 0.400 2.457 0.015 0.191 1.775
befile6 0.0596 0.261 0.228 0.820 -0.457 0.576
affile6 -0.0324 0.264 -0.123 0.903 -0.556 0.491
afdec6 -0.5652 0.286 -1.978 0.050 -1.131 0.001
Omnibus: 9.160 Durbin-Watson: 1.458
Prob(Omnibus): 0.010 Jarque-Bera (JB): 9.978
Skew: -0.491 Prob(JB): 0.00681
Kurtosis: 3.930 Cond. No. 9.62e+03


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 9.62e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Estimate a new model, barium_seasonal, which accounts for seasonality by adding the monthly dummy variables contained in the data.

barium_seasonal = smf.ols(
    "lchnimp ~ lchempi + lgas + lrtwex + befile6 + affile6 + afdec6 + \
    feb + mar + apr + may + jun + jul + aug + sep + oct + nov + dec",
    data=barium).fit()
barium_seasonal.summary()
OLS Regression Results
Dep. Variable: lchnimp R-squared: 0.358
Model: OLS Adj. R-squared: 0.262
Method: Least Squares F-statistic: 3.712
Date: Sat, 09 May 2020 Prob (F-statistic): 1.28e-05
Time: 15:36:56 Log-Likelihood: -109.54
No. Observations: 131 AIC: 255.1
Df Residuals: 113 BIC: 306.8
Df Model: 17
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 16.7788 32.429 0.517 0.606 -47.468 81.026
lchempi 3.2651 0.493 6.624 0.000 2.288 4.242
lgas -1.2781 1.389 -0.920 0.359 -4.030 1.474
lrtwex 0.6630 0.471 1.407 0.162 -0.271 1.597
befile6 0.1397 0.267 0.524 0.602 -0.389 0.668
affile6 0.0126 0.279 0.045 0.964 -0.539 0.565
afdec6 -0.5213 0.302 -1.726 0.087 -1.120 0.077
feb -0.4177 0.304 -1.372 0.173 -1.021 0.185
mar 0.0591 0.265 0.223 0.824 -0.465 0.584
apr -0.4515 0.268 -1.682 0.095 -0.983 0.080
may 0.0333 0.269 0.124 0.902 -0.500 0.567
jun -0.2063 0.269 -0.766 0.445 -0.740 0.327
jul 0.0038 0.279 0.014 0.989 -0.548 0.556
aug -0.1571 0.278 -0.565 0.573 -0.708 0.394
sep -0.1342 0.268 -0.501 0.617 -0.664 0.396
oct 0.0517 0.267 0.194 0.847 -0.477 0.580
nov -0.2463 0.263 -0.937 0.351 -0.767 0.274
dec 0.1328 0.271 0.489 0.626 -0.405 0.671
Omnibus: 9.169 Durbin-Watson: 1.325
Prob(Omnibus): 0.010 Jarque-Bera (JB): 9.324
Skew: -0.540 Prob(JB): 0.00945
Kurtosis: 3.736 Cond. No. 1.47e+04


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.47e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
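Listing eleven dummies by hand is error-prone; patsy's C() expands a categorical variable into the same set of dummies automatically, dropping one base category. A sketch on synthetic monthly data (variables hypothetical, not the barium data):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n = 131
month = (np.arange(n) % 12) + 1            # hypothetical month index, 1..12
x = rng.normal(0, 1, n)
y = 0.3 * x + rng.normal(0, 1, n)
df = pd.DataFrame({"y": y, "x": x, "month": month})

# C(month) expands into eleven dummies, with month 1 as the base category
seasonal = smf.ols("y ~ x + C(month)", data=df).fit()
print(len(seasonal.params))  # -> 13 (intercept + x + 11 month dummies)
```

With the barium data, "lchnimp ~ ... + C(month)" would need a month column first, but then reproduces the dummy specification above.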

Now, compare the two nested models with an F test using anova_lm.

from statsmodels.stats.api import anova_lm

anova_lm(barium_imports, barium_seasonal)
df_resid ssr df_diff ss_diff F Pr(>F)
0 124.0 44.247088 0.0 NaN NaN NaN
1 113.0 40.843896 11.0 3.403192 0.855943 0.585201

With F = 0.86 and a p-value of 0.585, the eleven seasonal dummies are jointly insignificant: there is little evidence of seasonality in these imports.
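The F statistic that anova_lm reports can be reproduced by hand from the two residual sums of squares, a useful sanity check on the nested-model comparison:

```python
from scipy import stats

# SSRs and degrees of freedom taken from the anova table above
ssr_r, ssr_ur = 44.247088, 40.843896   # restricted (no dummies) / unrestricted
q, df_ur = 11, 113                     # number of restrictions, residual df

# Standard F statistic for q exclusion restrictions
F = ((ssr_r - ssr_ur) / q) / (ssr_ur / df_ur)
p = stats.f.sf(F, q, df_ur)
print(round(F, 4), round(p, 4))  # -> 0.8559 0.5852
```

This matches the F and Pr(>F) columns of the anova output exactly.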

In conclusion

I would be grateful for any comments or corrections, on Qiita or on GitHub. I am also on Twitter.
