2
THE CLASSICAL MULTIPLE
LINEAR REGRESSION
MODEL
2.1 INTRODUCTION
An econometric study begins with a set of propositions about some aspect of the
economy. The theory specifies a set of precise, deterministic relationships among
variables. Familiar examples are demand equations, production functions, and
macroeconomic models. The empirical investigation provides estimates of unknown
parameters in the model, such as elasticities or the effects of monetary policy, and
usually attempts to measure the validity of the theory against the behavior of
observable data. Once suitably constructed, the model might then be used for
prediction or analysis of behavior. This book will develop a large number of models
and techniques used in this framework.
The linear regression model is the single most useful tool in the econometrician's
kit. Though to an increasing degree in the contemporary literature, it is often only
the departure point for the full analysis, it remains the device used to begin almost all
empirical research. This chapter will develop the model. The next several chapters will
discuss more elaborate specifications and complications that arise in the application of
techniques that are based on the simple models presented here.
2.2 THE LINEAR REGRESSION MODEL
The multiple linear regression model is used to study the relationship between a
dependent variable and one or more independent variables. The generic form of the
linear regression model is
$$y = f(x_1, x_2, \ldots, x_K) + \varepsilon = x_1\beta_1 + x_2\beta_2 + \cdots + x_K\beta_K + \varepsilon \tag{2-1}$$
where $y$ is the dependent or explained variable and $x_1, \ldots, x_K$ are the independent
or explanatory variables. One's theory will specify $f(x_1, x_2, \ldots, x_K)$. This function is
commonly called the population regression equation of $y$ on $x_1, \ldots, x_K$. In this
setting, $y$ is the regressand and $x_k$, $k = 1, \ldots, K$, are the regressors or covariates. The
underlying theory will specify the dependent and independent variables in the model.
It is not always obvious which is appropriately defined as each of these. For example,
a demand equation, quantity $= \beta_1 + \text{price} \times \beta_2 + \text{income} \times \beta_3 + \varepsilon$, and an inverse
demand equation, price $= \gamma_1 + \text{quantity} \times \gamma_2 + \text{income} \times \gamma_3 + u$, are equally valid
representations of a market. For modeling purposes, it will often prove useful to think in
terms of “autonomous variation.” One can conceive of movement of the independent
variables outside the relationships defined by the model while movement of the
dependent variable is considered in response to some independent or exogenous stimulus.¹
The term $\varepsilon$ is a random disturbance, so named because it “disturbs” an otherwise
stable relationship. The disturbance arises for several reasons, primarily because we
cannot hope to capture every influence on an economic variable in a model, no matter
how elaborate. The net effect, which can be positive or negative, of these omitted factors
is captured in the disturbance. There are many other contributors to the disturbance
in an empirical model. Probably the most significant is errors of measurement. It is
easy to theorize about the relationships among precisely defined variables; it is quite
another to obtain accurate measures of these variables. For example, the difficulty of
obtaining reasonable measures of profits, interest rates, capital stocks, or, worse yet,
flows of services from capital stocks is a recurrent theme in the empirical literature.
At the extreme, there may be no observable counterpart to the theoretical variable.
The literature on the permanent income model of consumption [e.g., Friedman (1957)]
provides an interesting example.
We assume that each observation in a sample $(y_i, x_{i1}, x_{i2}, \ldots, x_{iK})$, $i = 1, \ldots, n$, is
generated by an underlying process described by
$$y_i = x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{iK}\beta_K + \varepsilon_i.$$
The observed value of $y_i$ is the sum of two parts, a deterministic part and the random
part, $\varepsilon_i$. Our objective is to estimate the unknown parameters of the model, use the
data to study the validity of the theoretical propositions, and perhaps use the model to
predict the variable $y$. How we proceed from here depends crucially on what we assume
about the stochastic process that has led to our observations of the data in hand.
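The generating process just described can be made concrete with a small simulation. The following is only a sketch under stated assumptions: the sample size, the coefficient values, the way the regressors are drawn, and the normal distribution used for the disturbances are all hypothetical choices made for illustration, not part of the model as stated above.

```python
import numpy as np

rng = np.random.default_rng(0)

n, K = 100, 3                        # hypothetical sample size and number of regressors
beta = np.array([1.0, 0.5, -2.0])    # hypothetical "true" parameters (unknown in practice)

# First column is a constant; the remaining regressors are drawn at random.
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])

deterministic = X @ beta             # x_i1*beta_1 + ... + x_iK*beta_K for each i
eps = rng.normal(scale=1.0, size=n)  # disturbances (normality assumed only for this sketch)
y = deterministic + eps              # observed y_i = deterministic part + random part

print(y[:5])
```

In an actual study only `y` and `X` would be observed; `beta` and `eps` are known here solely because the data were simulated.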
Example 2.1 Keynes's Consumption Function
Example 1.1 discussed a model of consumption proposed by Keynes in his General Theory
(1936). The theory that consumption, $C$, and income, $X$, are related certainly seems consistent
with the observed “facts” in Figures 1.1 and 2.1. (These data are in Data Table F2.1.) Of
course, the linear function is only approximate. Even ignoring the anomalous wartime years,
consumption and income cannot be connected by any simple deterministic relationship.
The linear model, $C = \alpha + \beta X$, is intended only to represent the salient features of this part
of the economy. It is hopeless to attempt to capture every influence in the relationship. The
next step is to incorporate the inherent randomness in its real world counterpart. Thus, we
write $C = f(X, \varepsilon)$, where $\varepsilon$ is a stochastic element. It is important not to view $\varepsilon$ as a catchall
for the inadequacies of the model. The model including $\varepsilon$ appears adequate for the data
not including the war years, but for 1942-1945, something systematic clearly seems to be
missing. Consumption in these years could not rise to rates historically consistent with these
levels of income because of wartime rationing. A model meant to describe consumption in
this period would have to accommodate this influence.
It remains to establish how the stochastic element will be incorporated in the equation.
The most frequent approach is to assume that it is additive. Thus, we recast the equation
in stochastic terms: $C = \alpha + \beta X + \varepsilon$. This equation is an empirical counterpart to Keynes's
theoretical model. But what of those anomalous years of rationing? If we were to ignore
our intuition and attempt to “fit” a line to all these data (the next chapter will discuss
at length how we should do that), we might arrive at the dotted line in the figure as our best
guess. This line, however, is obviously being distorted by the rationing. A more appropriate
¹By this definition, it would seem that in our demand relationship, only income would be an independent
variable while both price and quantity would be dependent. That makes sense: in a market, price and quantity
are determined at the same time, and do change only when something outside the market changes. We will
return to this specific case in Chapter 15.
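The model in (2-1) as it applies to all $n$ observations can now be written
$$\mathbf{y} = \mathbf{x}_1\beta_1 + \cdots + \mathbf{x}_K\beta_K + \boldsymbol{\varepsilon}, \tag{2-2}$$
or in the form of Assumption 1,
ASSUMPTION: $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$. (2-3)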
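A NOTATIONAL CONVENTION.
Henceforth, to avoid a possibly confusing and cumbersome notation, we will use a
boldface $\mathbf{x}$ to denote a column or a row of $\mathbf{X}$. Which applies will be clear from the
context. In (2-2), $\mathbf{x}_k$ is the $k$th column of $\mathbf{X}$. Subscripts $j$ and $k$ will be used to denote
columns (variables). It will often be convenient to refer to a single observation in (2-3),
which we would write
$$y_i = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i. \tag{2-4}$$
Subscripts $i$ and $t$ will generally be used to denote rows (observations) of $\mathbf{X}$. In (2-4),
$\mathbf{x}_i$ is a column vector that is the transpose of the $i$th $1 \times K$ row of $\mathbf{X}$.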
Our primary interest is in estimation and inference about the parameter vector $\boldsymbol{\beta}$.
Note that the simple regression model in Example 2.1 is a special case in which $\mathbf{X}$ has
only two columns, the first of which is a column of 1s. The assumption of linearity of the
regression model includes the additive disturbance. For the regression to be linear in
the sense described here, it must be of the form in (2-1) either in the original variables
or after some suitable transformation. For example, the model
$$y = A x^{\beta} e^{\varepsilon}$$
is linear (after taking logs on both sides of the equation), whereas
$$y = A x^{\beta} + \varepsilon$$
is not. The observed dependent variable is thus the sum of two components, a
deterministic element $\alpha + \beta x$ and a random variable $\varepsilon$. It is worth emphasizing that neither
of the two parts is directly observed because $\alpha$ and $\beta$ are unknown.
The linearity assumption is not so narrow as it might first appear. In the regression
context, linearity refers to the manner in which the parameters and the disturbance enter
the equation, not necessarily to the relationship among the variables. For example, the
equations $y = \alpha + \beta x + \varepsilon$, $y = \alpha + \beta\cos(x) + \varepsilon$, $y = \alpha + \beta/x + \varepsilon$, and $y = \alpha + \beta\ln x + \varepsilon$
are all linear in some function of $x$ by the definition we have used here. In the examples,
only $x$ has been transformed, but $y$ could have been as well, as in $y = A x^{\beta} e^{\varepsilon}$, which is a
linear relationship in the logs of $x$ and $y$: $\ln y = \alpha + \beta\ln x + \varepsilon$. The variety of functions
is unlimited. This aspect of the model is used in a number of commonly used functional
forms. For example, the loglinear model is
$$\ln y = \beta_1 + \beta_2\ln x_2 + \beta_3\ln x_3 + \cdots + \beta_K\ln x_K + \varepsilon.$$
This equation is also known as the constant elasticity form, as in this equation the
elasticity of $y$ with respect to changes in $x_k$ is $\partial\ln y/\partial\ln x_k = \beta_k$, which does not vary
with $x_k$. The loglinear form is often used in models of demand and production. Different
values of $\beta$ produce widely varying functions.
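The constant elasticity property can be illustrated with simulated data. The sketch below assumes a single regressor, hypothetical values for $A$ and $\beta$, normally distributed disturbances, and a least squares fit (the fitting method itself is the subject of the next chapter, so its use here is an assumption of the illustration rather than something developed in this section).

```python
import numpy as np

rng = np.random.default_rng(1)

n = 5000
A, beta = 2.0, 0.7                     # hypothetical scale and elasticity
x = rng.uniform(1.0, 10.0, size=n)
eps = rng.normal(scale=0.1, size=n)    # disturbance (normality is an assumption here)
y = A * x**beta * np.exp(eps)          # multiplicative model y = A x^beta e^eps

# After taking logs: ln y = ln A + beta * ln x + eps, which is linear in (ln A, beta).
Z = np.column_stack([np.ones(n), np.log(x)])
coef, *_ = np.linalg.lstsq(Z, np.log(y), rcond=None)
lnA_hat, beta_hat = coef

print(lnA_hat, beta_hat)
```

The slope on `np.log(x)` estimates the elasticity directly, and it is the same number wherever in the range of `x` it is evaluated.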
Example 2.3 The U.S. Gasoline Market
Data on the U.S. gasoline market for the years 1960-1995 are given in Table F2.2 in
Appendix F. We will use these data to obtain, among other things, estimates of the income,
own price, and cross-price elasticities of demand in this market. These data also present an
interesting question on the issue of holding “all other things constant,” that was suggested
in Example 2.2. In particular, consider a somewhat abbreviated model of per capita gasoline
consumption:
$$\ln(G/\text{pop}) = \beta_1 + \beta_2\ln\text{income} + \beta_3\ln\text{price}_G + \beta_4\ln P_{\text{newcars}} + \beta_5\ln P_{\text{usedcars}} + \varepsilon.$$
This model will provide estimates of the income and price elasticities of demand for gasoline
and an estimate of the elasticity of demand with respect to the prices of new and used cars.
What should we expect for the sign of $\beta_4$? Cars and gasoline are complementary goods, so if
the prices of new cars rise, ceteris paribus, gasoline consumption should fall. Or should it? If
the prices of new cars rise, then consumers will buy fewer of them; they will keep their used
cars longer and buy fewer new cars. If older cars use more gasoline than newer ones, then
the rise in the prices of new cars would lead to higher gasoline consumption than otherwise,
not lower. We can use the multiple regression model and the gasoline data to attempt to
answer the question.
A semilog model is often used to model growth rates:
$$\ln y_t = \mathbf{x}_t'\boldsymbol{\beta} + \delta t + \varepsilon_t.$$
In this model, the autonomous (at least not explained by the model itself) proportional,
per period growth rate is $d\ln y/dt = \delta$. Other variations of the general form
$$f(y_t) = g(\mathbf{x}_t'\boldsymbol{\beta} + \varepsilon_t)$$
will allow a tremendous variety of functional forms, all of which fit into our definition
of a linear model.
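The growth rate interpretation of the semilog form can be checked on simulated data. This is a minimal sketch that assumes a hypothetical growth rate, drops the other regressors in favor of a constant for brevity, and again borrows a least squares fit as an illustrative device.

```python
import numpy as np

rng = np.random.default_rng(2)

T = 200
t = np.arange(T)
delta = 0.02                                   # hypothetical per period growth rate
ln_y = 1.0 + delta * t + rng.normal(scale=0.05, size=T)

# Regress ln y on a constant and the time trend; the slope estimates delta.
Z = np.column_stack([np.ones(T), t])
coef, *_ = np.linalg.lstsq(Z, ln_y, rcond=None)
delta_hat = coef[1]

print(delta_hat)
```

Because the dependent variable is in logs, the slope on `t` reads as a proportional change per period rather than a change in the level of `y`.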
The linear regression model is sometimes interpreted as an approximation to some
unknown, underlying function. (See Section A.8.1 for discussion.) By this interpretation,
however, the linear model, even with quadratic terms, is fairly limited in that such
an approximation is likely to be useful only over a small range of variation of the
independent variables. The translog model discussed in Example 2.4, in contrast, has
proved far more effective as an approximating function.
Example 2.4 The Translog Model
Modern studies of demand and production are usually done in the context of a flexible
functional form. Flexible functional forms are used in econometrics because they allow analysts
to model second-order effects such as elasticities of substitution, which are functions of the
second derivatives of production, cost, or utility functions. The linear model restricts these to
equal zero, whereas the loglinear model (e.g., the Cobb-Douglas model) restricts the
interesting elasticities to the uninteresting values of -1 or +1. The most popular flexible functional
form is the translog model, which is often interpreted as a second-order approximation to
an unknown functional form. [See Berndt and Christensen (1973).] One way to derive it is
as follows. We first write $y = g(x_1, \ldots, x_K)$. Then, $\ln y = \ln g(\ldots) = f(\ldots)$. Since by a trivial
transformation $x_k = \exp(\ln x_k)$, we interpret the function as a function of the logarithms of
the $x$'s. Thus, $\ln y = f(\ln x_1, \ldots, \ln x_K)$.
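Now, expand this function in a second-order Taylor series around the point $\mathbf{x} = [1, 1, \ldots, 1]'$
so that at the expansion point, the log of each variable is a convenient zero. Then
$$\ln y = f(\mathbf{0}) + \sum_{k=1}^{K}\left[\partial f(\cdot)/\partial\ln x_k\right]_{\ln\mathbf{x}=\mathbf{0}}\ln x_k + \frac{1}{2}\sum_{k=1}^{K}\sum_{l=1}^{K}\left[\partial^2 f(\cdot)/\partial\ln x_k\,\partial\ln x_l\right]_{\ln\mathbf{x}=\mathbf{0}}\ln x_k\ln x_l + \varepsilon.$$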
The disturbance in this model is assumed to embody the familiar factors and the error of
approximation to the unknown function. Since the function and its derivatives evaluated at
the fixed value $\mathbf{0}$ are constants, we interpret them as the coefficients and write
$$\ln y = \beta_0 + \sum_{k=1}^{K}\beta_k\ln x_k + \frac{1}{2}\sum_{k=1}^{K}\sum_{l=1}^{K}\gamma_{kl}\ln x_k\ln x_l + \varepsilon.$$
This model is linear by our definition but can, in fact, mimic an impressive amount of curvature
when it is used to approximate another function. An interesting feature of this formulation
is that the loglinear model is a special case, $\gamma_{kl} = 0$. Also, there is an interesting test of the
underlying theory possible because if the underlying function were assumed to be continuous
and twice continuously differentiable, then by Young's theorem it must be true that $\gamma_{kl} = \gamma_{lk}$.
We will see in Chapter 14 how this feature is studied in practice.
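To see that the translog form is linear in its parameters, it helps to build its regressors explicitly. The function below is a hypothetical helper written for this illustration. It imposes the symmetry $\gamma_{kl} = \gamma_{lk}$ as an assumption of the sketch, so that each cross product $\ln x_k \ln x_l$ with $k \neq l$ appears once (the two off-diagonal terms of the double sum, each weighted by 1/2, share the same regressor), while each squared term keeps the factor 1/2.

```python
import numpy as np

def translog_design(X):
    """Return the regressor matrix for a translog model (illustrative helper).

    X is an (n, K) array of strictly positive variables. Columns are:
    a constant, ln x_k for each k, 0.5*(ln x_k)^2 for each k, and
    ln x_k * ln x_l for each pair k < l (symmetry gamma_kl = gamma_lk imposed).
    """
    L = np.log(X)
    n, K = L.shape
    cols = [np.ones(n)]
    cols += [L[:, k] for k in range(K)]
    cols += [0.5 * L[:, k] ** 2 for k in range(K)]
    cols += [L[:, k] * L[:, l] for k in range(K) for l in range(k + 1, K)]
    return np.column_stack(cols)

rng = np.random.default_rng(3)
X = rng.uniform(0.5, 2.0, size=(50, 2))   # hypothetical data, K = 2
Z = translog_design(X)
print(Z.shape)                            # (50, 6)
```

With $K$ variables the matrix has $1 + K + K(K+1)/2$ columns. Setting the coefficients on the last $K(K+1)/2$ columns to zero gives back the loglinear model.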
Despite its great flexibility, the linear model does not include all the situations we
encounter in practice. For a simple example, there is no transformation that will reduce
$y = \alpha + 1/(\beta_1 + \beta_2 x) + \varepsilon$ to linearity. The methods we consider in this chapter are not
appropriate for estimating the parameters of such a model. Relatively straightforward
techniques have been developed for nonlinear models such as this, however. We shall
treat them in detail in Chapter 9.
2.3.2 FULL RANK
Assumption 2 is that there are no exact linear relationships among the variables.
ASSUMPTION: $\mathbf{X}$ is an $n \times K$ matrix with rank $K$. (2-5)
Hence, $\mathbf{X}$ has full column rank; the columns of $\mathbf{X}$ are linearly independent and there
are at least $K$ observations. [See (A-42) and the surrounding text.] This assumption is
known as an identification condition. To see the need for this assumption, consider an
example.
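The rank condition can also be checked numerically. The sketch below uses simulated, purely hypothetical income data in which one column is the exact sum of two others, in the spirit of the example that follows. It shows that the rank falls short of the number of columns and that two different coefficient vectors then produce identical values of $\mathbf{X}\boldsymbol{\beta}$.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200

nonlabor = rng.uniform(0.0, 10.0, size=n)   # hypothetical nonlabor income
salary = rng.uniform(20.0, 60.0, size=n)    # hypothetical salary
total = nonlabor + salary                   # exact linear combination of the two

X_short = np.column_stack([np.ones(n), nonlabor, salary, total])  # K = 4 columns
X_full = X_short[:, :3]                     # drop the redundant column, K = 3

rank_short = np.linalg.matrix_rank(X_short)
rank_full = np.linalg.matrix_rank(X_full)
print(rank_short, rank_full)                # 3 3

# Shifting the coefficients by any a leaves X @ beta unchanged when rank < K.
beta = np.array([1.0, 0.2, 0.3, 0.4])
a = 5.0
beta_alt = beta + np.array([0.0, a, a, -a])
same_fit = np.allclose(X_short @ beta, X_short @ beta_alt)
print(same_fit)
```

Since the data cannot distinguish `beta` from `beta_alt`, the parameters of the four-column model are not identified; the three-column model has full column rank.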
Example 2.5 Short Rank
Suppose that a cross-section model specifies
$$C = \beta_1 + \beta_2\,\text{nonlabor income} + \beta_3\,\text{salary} + \beta_4\,\text{total income} + \varepsilon,$$
where total income is exactly equal to salary plus nonlabor income. Clearly, there is an exact
linear dependency in the model. Now let
$$\beta_2' = \beta_2 + a,$$
$$\beta_3' = \beta_3 + a,$$
and
$$\beta_4' = \beta_4 - a,$$
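The two assumptions imply that
$$E[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}' \mid \mathbf{X}] = \begin{bmatrix} E[\varepsilon_1\varepsilon_1 \mid \mathbf{X}] & E[\varepsilon_1\varepsilon_2 \mid \mathbf{X}] & \cdots & E[\varepsilon_1\varepsilon_n \mid \mathbf{X}] \\ E[\varepsilon_2\varepsilon_1 \mid \mathbf{X}] & E[\varepsilon_2\varepsilon_2 \mid \mathbf{X}] & \cdots & E[\varepsilon_2\varepsilon_n \mid \mathbf{X}] \\ \vdots & \vdots & \ddots & \vdots \\ E[\varepsilon_n\varepsilon_1 \mid \mathbf{X}] & E[\varepsilon_n\varepsilon_2 \mid \mathbf{X}] & \cdots & E[\varepsilon_n\varepsilon_n \mid \mathbf{X}] \end{bmatrix} = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix},$$
which we summarize in Assumption 4:
ASSUMPTION: $E[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}' \mid \mathbf{X}] = \sigma^2\mathbf{I}$. (2-9)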
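By using the variance decomposition formula in (B-70), we find
$$\text{Var}[\boldsymbol{\varepsilon}] = E[\text{Var}[\boldsymbol{\varepsilon} \mid \mathbf{X}]] + \text{Var}[E[\boldsymbol{\varepsilon} \mid \mathbf{X}]] = \sigma^2\mathbf{I}.$$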
Once again, we should emphasize that this assumption describes the information about
the variances and covariances among the disturbances that is provided by the
independent variables. For the present, we assume that there is none. We will also drop this
assumption later when we enrich the regression model. We are also assuming that the
disturbances themselves provide no information about the variances and covariances.
Although a minor issue at this point, it will become crucial in our treatment of
time-series applications. Models such as $\text{Var}[\varepsilon_t \mid \varepsilon_{t-1}] = \sigma^2 + \alpha\varepsilon_{t-1}^2$, a “GARCH” model (see
Section 11.8), do not violate our conditional variance assumption, but do assume that
$\text{Var}[\varepsilon_t \mid \varepsilon_{t-1}] \neq \text{Var}[\varepsilon_t]$.
Disturbances that meet the twin assumptions of homoscedasticity and
nonautocorrelation are sometimes called spherical disturbances.²
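The distinction between the variance conditioned on the previous disturbance and the unconditional variance can be seen in a short simulation of the recursion quoted above. This is a sketch only: the values of $\sigma^2$ and $\alpha$ are hypothetical, the shocks are assumed to be standard normal, and the comparison value $\sigma^2/(1-\alpha)$ is the standard unconditional variance for this recursion when $0 \le \alpha < 1$, a result not derived in this section.

```python
import numpy as np

rng = np.random.default_rng(5)

T = 200_000
sigma2, alpha = 1.0, 0.3          # hypothetical values, with 0 <= alpha < 1

eps = np.zeros(T)
cond_var = np.zeros(T)
cond_var[0] = sigma2 / (1.0 - alpha)               # start at the unconditional variance
eps[0] = np.sqrt(cond_var[0]) * rng.standard_normal()
for t in range(1, T):
    cond_var[t] = sigma2 + alpha * eps[t - 1] ** 2   # Var[eps_t | eps_{t-1}]
    eps[t] = np.sqrt(cond_var[t]) * rng.standard_normal()

uncond_var = eps.var()                               # sample analog of Var[eps_t]
lag1_corr = np.corrcoef(eps[1:], eps[:-1])[0, 1]     # disturbances remain uncorrelated

print(uncond_var, cond_var.min(), cond_var.max(), lag1_corr)
```

The conditional variance moves from period to period with the size of the last disturbance, while the sample variance of the whole series settles near a single constant and the first-order autocorrelation stays near zero.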
2.3.5 DATA GENERATING PROCESS FOR THE REGRESSORS
It is common to assume that $\mathbf{x}_i$ is nonstochastic, as it would be in an experimental
situation. Here the analyst chooses the values of the regressors and then observes $y_i$.
This process might apply, for example, in an agricultural experiment in which $y_i$ is yield
and $\mathbf{x}_i$ is fertilizer concentration and water applied. The assumption of nonstochastic
regressors at this point would be a mathematical convenience. With it, we could use the
results of elementary statistics to obtain our results by treating the vector $\mathbf{x}_i$ simply as a
known constant in the probability distribution of $y_i$. With this simplification,
Assumptions A3 and A4 would be made unconditional and the counterparts would now simply
state that the probability distribution of $\varepsilon_i$ involves none of the constants in $\mathbf{X}$.
Social scientists are almost never able to analyze experimental data, and relatively
few of their models are built around nonrandom regressors. Clearly, for example, in
²The term will describe the multivariate normal distribution; see (B-95). If $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}$ in the multivariate normal
density, then the equation $f(\mathbf{x}) = c$ is the formula for a “ball” centered at $\boldsymbol{\mu}$ with radius $\sigma$ in $n$-dimensional
space. The name spherical is used whether or not the normal distribution is assumed; sometimes the “spherical
normal” distribution is assumed explicitly.
any model of the macroeconomy, it would be difficult to defend such an asymmetric
treatment of aggregate data. Realistically, we have to allow the data on $\mathbf{x}_i$ to be random
the same as $y_i$, so an alternative formulation is to assume that $\mathbf{x}_i$ is a random vector and
our formal assumption concerns the nature of the random process that produces $\mathbf{x}_i$. If $\mathbf{x}_i$
is taken to be a random vector, then Assumptions 1 through 4 become a statement about
the joint distribution of $y_i$ and $\mathbf{x}_i$. The precise nature of the regressor and how we view
the sampling process will be a major determinant of our derivation of the statistical
properties of our estimators and test statistics. In the end, the crucial assumption is
Assumption 3, the uncorrelatedness of $\mathbf{X}$ and $\boldsymbol{\varepsilon}$. Now, we do note that this alternative
is not completely satisfactory either, since $\mathbf{X}$ may well contain nonstochastic elements,
including a constant, a time trend, and dummy variables that mark specific episodes
in time. This makes for an ambiguous conclusion, but there is a straightforward and
economically useful way out of it. We will assume that $\mathbf{X}$ can be a mixture of constants
and random variables, but the important assumption is that the ultimate source of the
data in $\mathbf{X}$ is unrelated (statistically and economically) to the source of $\boldsymbol{\varepsilon}$.
ASSUMPTION: $\mathbf{X}$ may be fixed or random, but it is generated by a
mechanism that is unrelated to $\boldsymbol{\varepsilon}$. (2-10)
2.3.6 NORMALITY
It is convenient to assume that the disturbances are normally distributed, with zero mean
and constant variance. That is, we add normality of the distribution to Assumptions 3
and 4.
ASSUMPTION: $\boldsymbol{\varepsilon} \mid \mathbf{X} \sim N[\mathbf{0}, \sigma^2\mathbf{I}]$. (2-11)
In view of our description of the source of $\boldsymbol{\varepsilon}$, the conditions of the central limit
theorem will generally apply, at least approximately, and the normality assumption will be
reasonable in most settings. A useful implication of Assumption 6 is that
observations on $\varepsilon_i$ are statistically independent as well as uncorrelated. [See the third
point in Section B.8, (B-97) and (B-99).] Normality is often viewed as an unnecessary and
possibly inappropriate addition to the regression model. Except in those cases in which
some alternative distribution is explicitly assumed, as in the stochastic frontier model
discussed in Section 17.6.3, the normality assumption is probably quite reasonable.
Normality is not necessary to obtain many of the results we use in multiple regression
analysis, although it will enable us to obtain several exact statistical results. It does prove
useful in constructing test statistics, as shown in Section 4.7. Later, it will be possible
to relax this assumption and retain most of the statistical results we obtain here. (See
Sections 5.3, 5.4 and 6.4.)
2.4 SUMMARY AND CONCLUSIONS
This chapter has framed the linear regression model, the basic platform for model
building in econometrics. The assumptions of the classical regression model are summarized
in Figure 2.2, which shows the two-variable case.
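[Figure 2.2 (image not reproduced). Recoverable labels: the regression line $\alpha + \beta x$; the conditional means $E(y \mid x = x_0)$, $E(y \mid x = x_1)$, and $E(y \mid x = x_2)$; and the conditional distribution $N(\alpha + \beta x_2, \sigma^2)$.]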
Key Terms and Concepts
* Autocorrelation
* Constant elasticity
* Covariate
* Dependent variable
* Deterministic relationship
* Disturbance
* Exogeneity
* Explained variable
* Explanatory variable
* Flexible functional form
* Full rank
* Heteroscedasticity
* Homoscedasticity
* Identification condition
* Independent variable
* Linear regression model
* Loglinear model
* Multiple linear regression model
* Nonautocorrelation
* Nonstochastic regressors
* Normality
* Normally distributed
* Population regression equation
* Regressand
* Regression
* Regressor
* Second-order effects
* Semilog
* Spherical disturbances
* Translog model