Econometric Analysis (Greene, 6th edition), Chapter 2

2 THE CLASSICAL MULTIPLE LINEAR REGRESSION MODEL

2.1 INTRODUCTION

An econometric study begins with a set of propositions about some aspect of the economy. The theory specifies a set of precise, deterministic relationships among variables. Familiar examples are demand equations, production functions, and macroeconomic models. The empirical investigation provides estimates of unknown parameters in the model, such as elasticities or the effects of monetary policy, and usually attempts to measure the validity of the theory against the behavior of observable data. Once suitably constructed, the model might then be used for prediction or analysis of behavior. This book will develop a large number of models and techniques used in this framework.

The linear regression model is the single most useful tool in the econometrician's kit. Although in the contemporary literature it is increasingly only the departure point for the full analysis, it remains the device used to begin almost all empirical research. This chapter will develop the model. The next several chapters will discuss more elaborate specifications and complications that arise in the application of techniques that are based on the simple models presented here.

2.2 THE LINEAR REGRESSION MODEL

The multiple linear regression model is used to study the relationship between a dependent variable and one or more independent variables. The generic form of the linear regression model is

$y = f(x_1, x_2, \ldots, x_K) + \varepsilon = x_1\beta_1 + x_2\beta_2 + \cdots + x_K\beta_K + \varepsilon$

where $y$ is the dependent or explained variable and $x_1, \ldots, x_K$ are the independent or explanatory variables. One's theory will specify $f(x_1, x_2, \ldots, x_K)$. This function is commonly called the population regression equation of $y$ on $x_1, \ldots, x_K$. In this setting, $y$ is the regressand and $x_k$, $k = 1, \ldots,$
$K$, are the regressors or covariates. The underlying theory will specify the dependent and independent variables in the model. It is not always obvious which is appropriately defined as each of these. For example, a demand equation, quantity $= \beta_1 + \text{price} \times \beta_2 + \text{income} \times \beta_3 + \varepsilon$, and an inverse demand equation, price $= \gamma_1 + \text{quantity} \times \gamma_2 + \text{income} \times \gamma_3 + u$, are equally valid representations of a market. For modeling purposes, it will often prove useful to think in terms of "autonomous variation." One can conceive of movement of the independent variables outside the relationships defined by the model while movement of the dependent variable is considered in response to some independent or exogenous stimulus.¹

The term $\varepsilon$ is a random disturbance, so named because it "disturbs" an otherwise stable relationship. The disturbance arises for several reasons, primarily because we cannot hope to capture every influence on an economic variable in a model, no matter how elaborate. The net effect, which can be positive or negative, of these omitted factors is captured in the disturbance. There are many other contributors to the disturbance in an empirical model. Probably the most significant is errors of measurement. It is easy to theorize about the relationships among precisely defined variables; it is quite another to obtain accurate measures of these variables. For example, the difficulty of obtaining reasonable measures of profits, interest rates, capital stocks, or, worse yet, flows of services from capital stocks is a recurrent theme in the empirical literature. At the extreme, there may be no observable counterpart to the theoretical variable. The literature on the permanent income model of consumption [e.g., Friedman (1957)] provides an interesting example.

We assume that each observation in a sample $(y_i, x_{i1}, x_{i2}, \ldots, x_{iK})$
is generated by an underlying process described by

$y_i = x_{i1}\beta_1 + x_{i2}\beta_2 + \cdots + x_{iK}\beta_K + \varepsilon_i.$

The observed value of $y_i$ is the sum of two parts, a deterministic part and the random part, $\varepsilon_i$. Our objective is to estimate the unknown parameters of the model, use the data to study the validity of the theoretical propositions, and perhaps use the model to predict the variable $y$. How we proceed from here depends crucially on what we assume about the stochastic process that has led to our observations of the data in hand.

Example 2.1 Keynes's Consumption Function

Example 1.1 discussed a model of consumption proposed by Keynes in his General Theory (1936). The theory that consumption, $C$, and income, $X$, are related certainly seems consistent with the observed "facts" in Figures 1.1 and 2.1. (These data are in Data Table F2.1.) Of course, the linear function is only approximate. Even ignoring the anomalous wartime years, consumption and income cannot be connected by any simple deterministic relationship. The linear model, $C = \alpha + \beta X$, is intended only to represent the salient features of this part of the economy. It is hopeless to attempt to capture every influence in the relationship. The next step is to incorporate the inherent randomness in its real world counterpart. Thus, we write $C = f(X, \varepsilon)$, where $\varepsilon$ is a stochastic element. It is important not to view $\varepsilon$ as a catchall for the inadequacies of the model. The model including $\varepsilon$ appears adequate for the data not including the war years, but for 1942-1945, something systematic clearly seems to be missing. Consumption in these years could not rise to rates historically consistent with these levels of income because of wartime rationing. A model meant to describe consumption in this period would have to accommodate this influence.

It remains to establish how the stochastic element will be incorporated in the equation. The most frequent approach is to assume that it is additive. Thus, we recast the equation in stochastic terms: $C = \alpha + \beta X + \varepsilon$.
This equation is an empirical counterpart to Keynes's theoretical model. But what of those anomalous years of rationing? If we were to ignore our intuition and attempt to "fit" a line to all these data (the next chapter will discuss at length how we should do that), we might arrive at the dotted line in the figure as our best guess. This line, however, is obviously being distorted by the rationing. A more appropriate [...]

¹ By this definition, it would seem that in our demand relationship, only income would be an independent variable while both price and quantity would be dependent. That makes sense: in a market, price and quantity are determined at the same time, and do change only when something outside the market changes. We will return to this specific case in Chapter 15.

The model in (2-1) as it applies to all $n$ observations can now be written

$\mathbf{y} = \mathbf{x}_1\beta_1 + \cdots + \mathbf{x}_K\beta_K + \boldsymbol{\varepsilon}$,   (2-2)

or in the form of Assumption 1,

ASSUMPTION: $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$.   (2-3)

A NOTATIONAL CONVENTION. Henceforth, to avoid a possibly confusing and cumbersome notation, we will use a boldface $\mathbf{x}$ to denote a column or a row of $\mathbf{X}$. Which applies will be clear from the context. In (2-2), $\mathbf{x}_k$ is the $k$th column of $\mathbf{X}$. Subscripts $j$ and $k$ will be used to denote columns (variables). It will often be convenient to refer to a single observation in (2-3), which we would write

$y_i = \mathbf{x}_i'\boldsymbol{\beta} + \varepsilon_i$.   (2-4)

Subscripts $i$ and $t$ will generally be used to denote rows (observations) of $\mathbf{X}$. In (2-4), $\mathbf{x}_i$ is a column vector that is the transpose of the $i$th $1 \times K$ row of $\mathbf{X}$.

Our primary interest is in estimation and inference about the parameter vector $\boldsymbol{\beta}$. Note that the simple regression model in Example 2.1 is a special case in which $\mathbf{X}$ has only two columns, the first of which is a column of 1s. The assumption of linearity of the regression model includes the additive disturbance.
For the regression to be linear in the sense described here, it must be of the form in (2-1) either in the original variables or after some suitable transformation. For example, the model $y = Ax^{\beta}e^{\varepsilon}$ is linear (after taking logs on both sides of the equation), whereas $y = Ax^{\beta} + \varepsilon$ is not. The observed dependent variable is thus the sum of two components, a deterministic element $\alpha + \beta x$ and a random variable $\varepsilon$. It is worth emphasizing that neither of the two parts is directly observed because $\alpha$ and $\beta$ are unknown.

The linearity assumption is not so narrow as it might first appear. In the regression context, linearity refers to the manner in which the parameters and the disturbance enter the equation, not necessarily to the relationship among the variables. For example, the equations $y = \alpha + \beta x + \varepsilon$, $y = \alpha + \beta\cos(x) + \varepsilon$, $y = \alpha + \beta/x + \varepsilon$, and $y = \alpha + \beta\ln x + \varepsilon$ are all linear in some function of $x$ by the definition we have used here. In the examples, only $x$ has been transformed, but $y$ could have been as well, as in $y = Ax^{\beta}e^{\varepsilon}$, which is a linear relationship in the logs of $x$ and $y$: $\ln y = \alpha + \beta\ln x + \varepsilon$. The variety of functions is unlimited. This aspect of the model is used in a number of commonly used functional forms. For example, the loglinear model is

$\ln y = \beta_1 + \beta_2\ln x_2 + \beta_3\ln x_3 + \cdots + \beta_K\ln x_K + \varepsilon.$

This equation is also known as the constant elasticity form, because in this equation the elasticity of $y$ with respect to changes in $x_k$ is $\partial\ln y/\partial\ln x_k = \beta_k$, which does not vary with $x_k$. The loglinear form is often used in models of demand and production. Different values of $\beta$ produce widely varying functions.

Example 2.3 The U.S. Gasoline Market

Data on the U.S. gasoline market for the years 1960-1995 are given in Table F2.2 in Appendix F. We will use these data to obtain, among other things, estimates of the income, own price, and cross-price elasticities of demand in this market.
These data also present an interesting question on the issue of holding "all other things constant," that was suggested in Example 2.2. In particular, consider a somewhat abbreviated model of per capita gasoline consumption:

$\ln(G/\text{pop}) = \beta_1 + \beta_2\ln\text{income} + \beta_3\ln\text{price}_G + \beta_4\ln P_{\text{newcars}} + \beta_5\ln P_{\text{usedcars}} + \varepsilon.$

This model will provide estimates of the income and price elasticities of demand for gasoline and an estimate of the elasticity of demand with respect to the prices of new and used cars. What should we expect for the sign of $\beta_4$? Cars and gasoline are complementary goods, so if the prices of new cars rise, ceteris paribus, gasoline consumption should fall. Or should it? If the prices of new cars rise, then consumers will buy fewer of them; they will keep their used cars longer and buy fewer new cars. If older cars use more gasoline than newer ones, then the rise in the prices of new cars would lead to higher gasoline consumption than otherwise, not lower. We can use the multiple regression model and the gasoline data to attempt to answer the question.

A semilog model is often used to model growth rates:

$\ln y_t = \mathbf{x}_t'\boldsymbol{\beta} + \delta t + \varepsilon_t.$

In this model, the autonomous (at least not explained by the model itself) proportional, per period growth rate is $d\ln y/dt = \delta$. Other variations of the general form

$f(y_t) = g(\mathbf{x}_t'\boldsymbol{\beta} + \varepsilon_t)$

will allow a tremendous variety of functional forms, all of which fit into our definition of a linear model.

The linear regression model is sometimes interpreted as an approximation to some unknown, underlying function. (See Section A.8.1 for discussion.) By this interpretation, however, the linear model, even with quadratic terms, is fairly limited in that such an approximation is likely to be useful only over a small range of variation of the independent variables. The translog model discussed in Example 2.4, in contrast, has proved far more effective as an approximating function.
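The loglinear and semilog forms above can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the parameter values (`A`, `beta`, `alpha`, `delta`), the sample sizes, and the helper name `fit_line` are hypothetical and not taken from the text, and the fit uses ordinary least squares through NumPy, a method the text defers to the next chapter. It shows that taking logs turns $y = Ax^{\beta}e^{\varepsilon}$ into a relationship that is linear in the parameters, with slope equal to the elasticity, and that the time coefficient in a semilog model is the per period growth rate.

```python
import numpy as np

def fit_line(u, v):
    """Least squares fit of v on a constant and u; returns (intercept, slope)."""
    X = np.column_stack([np.ones_like(u), u])
    coef, *_ = np.linalg.lstsq(X, v, rcond=None)
    return coef[0], coef[1]

rng = np.random.default_rng(0)  # fixed seed, illustrative only

# Loglinear (constant elasticity) form: y = A * x**beta * exp(eps).
A, beta = 2.0, 0.7              # hypothetical parameter values
x = rng.uniform(1.0, 10.0, size=5000)
y = A * x**beta * np.exp(rng.normal(0.0, 0.1, size=x.size))

# After taking logs, ln y = ln A + beta * ln x + eps is linear in the parameters.
intercept, slope = fit_line(np.log(x), np.log(y))
print(f"elasticity: true {beta:.3f}, fitted {slope:.3f}")

# Semilog form: ln y_t = alpha + delta * t + eps_t (no other regressors here).
alpha, delta = 1.0, 0.02        # hypothetical parameter values
t = np.arange(200, dtype=float)
ln_y = alpha + delta * t + rng.normal(0.0, 0.05, size=t.size)

# The coefficient on t estimates the proportional per period growth rate.
_, growth = fit_line(t, ln_y)
print(f"growth rate: true {delta:.3f}, fitted {growth:.3f}")
```

The fitted slope in the first regression does not depend on the level of `x`, which is the constant elasticity property described above.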
Example 2.4 The Translog Model

Modern studies of demand and production are usually done in the context of a flexible functional form. Flexible functional forms are used in econometrics because they allow analysts to model second-order effects such as elasticities of substitution, which are functions of the second derivatives of production, cost, or utility functions. The linear model restricts these to equal zero, whereas the loglinear model (e.g., the Cobb-Douglas model) restricts the interesting elasticities to the uninteresting values of -1 or +1. The most popular flexible functional form is the translog model, which is often interpreted as a second-order approximation to an unknown functional form. [See Berndt and Christensen (1973).] One way to derive it is as follows. We first write $y = g(x_1, \ldots, x_K)$. Then, $\ln y = \ln g(\ldots) = f(\ldots)$. Since by a trivial transformation $x_k = \exp(\ln x_k)$, we interpret the function as a function of the logarithms of the $x$'s. Thus, $\ln y = f(\ln x_1, \ldots, \ln x_K)$.

Now, expand this function in a second-order Taylor series around the point $\mathbf{x} = [1, 1, \ldots, 1]'$ so that at the expansion point, the log of each variable is a convenient zero. Then

$\ln y = f(\mathbf{0}) + \sum_{k=1}^{K}\left[\partial f(\cdot)/\partial\ln x_k\right]_{\ln\mathbf{x}=\mathbf{0}}\ln x_k + \frac{1}{2}\sum_{k=1}^{K}\sum_{l=1}^{K}\left[\partial^2 f(\cdot)/\partial\ln x_k\,\partial\ln x_l\right]_{\ln\mathbf{x}=\mathbf{0}}\ln x_k\ln x_l + \varepsilon.$

The disturbance in this model is assumed to embody the familiar factors and the error of approximation to the unknown function. Since the function and its derivatives evaluated at the fixed value $\mathbf{0}$ are constants, we interpret them as the coefficients and write

$\ln y = \beta_0 + \sum_{k=1}^{K}\beta_k\ln x_k + \frac{1}{2}\sum_{k=1}^{K}\sum_{l=1}^{K}\gamma_{kl}\ln x_k\ln x_l + \varepsilon.$

This model is linear by our definition but can, in fact, mimic an impressive amount of curvature when it is used to approximate another function. An interesting feature of this formulation is that the loglinear model is a special case, $\gamma_{kl} = 0$.
Also, there is an interesting test of the underlying theory possible because if the underlying function were assumed to be continuous and twice continuously differentiable, then by Young's theorem it must be true that $\gamma_{kl} = \gamma_{lk}$. We will see in Chapter 14 how this feature is studied in practice.

Despite its great flexibility, the linear model does not include all the situations we encounter in practice. For a simple example, there is no transformation that will reduce $y = \alpha + 1/(\beta_1 + \beta_2 x) + \varepsilon$ to linearity. The methods we consider in this chapter are not appropriate for estimating the parameters of such a model. Relatively straightforward techniques have been developed for nonlinear models such as this, however. We shall treat them in detail in Chapter 9.

2.3.2 FULL RANK

Assumption 2 is that there are no exact linear relationships among the variables.

ASSUMPTION: $\mathbf{X}$ is an $n \times K$ matrix with rank $K$.   (2-5)

Hence, $\mathbf{X}$ has full column rank; the columns of $\mathbf{X}$ are linearly independent and there are at least $K$ observations. [See (A-42) and the surrounding text.] This assumption is known as an identification condition. To see the need for this assumption, consider an example.

Example 2.5 Short Rank

Suppose that a cross-section model specifies

$C = \beta_1 + \beta_2\,\text{nonlabor income} + \beta_3\,\text{salary} + \beta_4\,\text{total income} + \varepsilon,$

where total income is exactly equal to salary plus nonlabor income. Clearly, there is an exact linear dependency in the model. Now let $\beta_2' = \beta_2 + a$, $\beta_3' = \beta_3 + a$, and $\beta_4' = \beta_4 - a$, [...]

The two assumptions imply that

$E[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}' \mid \mathbf{X}] = \begin{bmatrix} E[\varepsilon_1\varepsilon_1 \mid \mathbf{X}] & E[\varepsilon_1\varepsilon_2 \mid \mathbf{X}] & \cdots & E[\varepsilon_1\varepsilon_n \mid \mathbf{X}] \\ E[\varepsilon_2\varepsilon_1 \mid \mathbf{X}] & E[\varepsilon_2\varepsilon_2 \mid \mathbf{X}] & \cdots & E[\varepsilon_2\varepsilon_n \mid \mathbf{X}] \\ \vdots & \vdots & \ddots & \vdots \\ E[\varepsilon_n\varepsilon_1 \mid \mathbf{X}] & E[\varepsilon_n\varepsilon_2 \mid \mathbf{X}] & \cdots & E[\varepsilon_n\varepsilon_n \mid \mathbf{X}] \end{bmatrix} = \begin{bmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{bmatrix},$
which we summarize in Assumption 4:

ASSUMPTION: $E[\boldsymbol{\varepsilon}\boldsymbol{\varepsilon}' \mid \mathbf{X}] = \sigma^2\mathbf{I}$.   (2-9)

By using the variance decomposition formula in (B-70), we find

$\mathrm{Var}[\boldsymbol{\varepsilon}] = E[\mathrm{Var}[\boldsymbol{\varepsilon} \mid \mathbf{X}]] + \mathrm{Var}[E[\boldsymbol{\varepsilon} \mid \mathbf{X}]] = \sigma^2\mathbf{I}.$

Once again, we should emphasize that this assumption describes the information about the variances and covariances among the disturbances that is provided by the independent variables. For the present, we assume that there is none. We will also drop this assumption later when we enrich the regression model. We are also assuming that the disturbances themselves provide no information about the variances and covariances. Although a minor issue at this point, it will become crucial in our treatment of time-series applications. Models such as $\mathrm{Var}[\varepsilon_t \mid \varepsilon_{t-1}] = \sigma^2 + \alpha\varepsilon_{t-1}^2$, a "GARCH" model (see Section 11.8), do not violate our conditional variance assumption, but do assume that $\mathrm{Var}[\varepsilon_t \mid \varepsilon_{t-1}] \neq \mathrm{Var}[\varepsilon_t]$.

Disturbances that meet the twin assumptions of homoscedasticity and nonautocorrelation are sometimes called spherical disturbances.²

2.3.5 DATA GENERATING PROCESS FOR THE REGRESSORS

It is common to assume that $\mathbf{x}_i$ is nonstochastic, as it would be in an experimental situation. Here the analyst chooses the values of the regressors and then observes $y_i$. This process might apply, for example, in an agricultural experiment in which $y_i$ is yield and $\mathbf{x}_i$ is fertilizer concentration and water applied. The assumption of nonstochastic regressors at this point would be a mathematical convenience. With it, we could use the results of elementary statistics to obtain our results by treating the vector $\mathbf{x}_i$ simply as a known constant in the probability distribution of $y_i$. With this simplification, Assumptions A3 and A4 would be made unconditional and the counterparts would now simply state that the probability distribution of $\varepsilon_i$ involves none of the constants in $\mathbf{X}$.

Social scientists are almost never able to analyze experimental data, and relatively few of their models are built around nonrandom regressors.
Clearly, for example, in any model of the macroeconomy, it would be difficult to defend such an asymmetric treatment of aggregate data. Realistically, we have to allow the data on $\mathbf{x}_i$ to be random the same as $y_i$, so an alternative formulation is to assume that $\mathbf{x}_i$ is a random vector and our formal assumption concerns the nature of the random process that produces $\mathbf{x}_i$. If $\mathbf{x}_i$ is taken to be a random vector, then Assumptions 1 through 4 become a statement about the joint distribution of $y_i$ and $\mathbf{x}_i$. The precise nature of the regressor and how we view the sampling process will be a major determinant of our derivation of the statistical properties of our estimators and test statistics. In the end, the crucial assumption is Assumption 3, the uncorrelatedness of $\mathbf{X}$ and $\boldsymbol{\varepsilon}$.

Now, we do note that this alternative is not completely satisfactory either, since $\mathbf{X}$ may well contain nonstochastic elements, including a constant, a time trend, and dummy variables that mark specific episodes in time. This makes for an ambiguous conclusion, but there is a straightforward and economically useful way out of it. We will assume that $\mathbf{X}$ can be a mixture of constants and random variables, but the important assumption is that the ultimate source of the data in $\mathbf{X}$ is unrelated (statistically and economically) to the source of $\boldsymbol{\varepsilon}$.

ASSUMPTION: $\mathbf{X}$ may be fixed or random, but it is generated by a mechanism that is unrelated to $\boldsymbol{\varepsilon}$.   (2-10)

² The term will describe the multivariate normal distribution; see (B-95). If $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}$ in the multivariate normal density, then the equation $f(\mathbf{x}) = c$ is the formula for a "ball" centered at $\boldsymbol{\mu}$ with radius $\sigma$ in $n$-dimensional space. The name spherical is used whether or not the normal distribution is assumed; sometimes the "spherical normal" distribution is assumed explicitly.

2.3.6 NORMALITY

It is convenient to assume that the disturbances are normally distributed, with zero mean and constant variance.
That is, we add normality of the distribution to Assumptions 3 and 4.

ASSUMPTION: $\boldsymbol{\varepsilon} \mid \mathbf{X} \sim N[\mathbf{0}, \sigma^2\mathbf{I}]$.   (2-11)

In view of our description of the source of $\boldsymbol{\varepsilon}$, the conditions of the central limit theorem will generally apply, at least approximately, and the normality assumption will be reasonable in most settings. A useful implication of Assumption 6 is that observations on $\varepsilon_i$ are statistically independent as well as uncorrelated. [See the third point in Section B.8, (B-97) and (B-99).] Normality is often viewed as an unnecessary and possibly inappropriate addition to the regression model. Except in those cases in which some alternative distribution is explicitly assumed, as in the stochastic frontier model discussed in Section 17.6.3, the normality assumption is probably quite reasonable.

Normality is not necessary to obtain many of the results we use in multiple regression analysis, although it will enable us to obtain several exact statistical results. It does prove useful in constructing test statistics, as shown in Section 4.7. Later, it will be possible to relax this assumption and retain most of the statistical results we obtain here. (See Sections 5.3, 5.4 and 6.4.)

2.4 SUMMARY AND CONCLUSIONS

This chapter has framed the linear regression model, the basic platform for model building in econometrics. The assumptions of the classical regression model are summarized in Figure 2.2, which shows the two-variable case.

[Figure 2.2: the conditional mean $E(y \mid x) = \alpha + \beta x$, with normal densities $N(\alpha + \beta x, \sigma^2)$ drawn at several values of $x$.]
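As a rough illustration of the assumptions discussed in this chapter, the sketch below generates one artificial sample that satisfies them and then builds a short-rank regressor matrix in the spirit of Example 2.5. Everything specific here is an assumption made for the illustration and does not come from the text: the sample size, the coefficient values, the distributions chosen for the regressors, and the variable names. NumPy's `matrix_rank` is used to check the full rank condition numerically.

```python
import numpy as np

rng = np.random.default_rng(42)  # fixed seed, illustrative only
n, K = 500, 3
sigma = 1.5                               # hypothetical disturbance std. dev.
beta = np.array([1.0, 0.5, -2.0])         # hypothetical parameter vector

# Regressors: a constant plus random columns, drawn independently of eps,
# so X is "a mixture of constants and random variables" unrelated to eps.
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.uniform(0, 5, size=n)])

# Full rank: X is n x K with rank K.
rank_X = np.linalg.matrix_rank(X)

# Spherical, normal disturbances: eps | X ~ N(0, sigma^2 I).
eps = rng.normal(0.0, sigma, size=n)

# Linearity: y = X beta + eps.
y = X @ beta + eps
print("rank of X:", rank_X, "of", K, "columns")

# Short rank: total income is exactly nonlabor income plus salary.
nonlabor = rng.uniform(0, 10, size=n)
salary = rng.uniform(20, 60, size=n)
total = nonlabor + salary
X_short = np.column_stack([np.ones(n), nonlabor, salary, total])
rank_short = np.linalg.matrix_rank(X_short)
print("rank of short-rank X:", rank_short, "of", X_short.shape[1], "columns")

# Shifting the coefficients by any a, as in Example 2.5, leaves X b unchanged,
# so the data cannot distinguish b from b_shift.
b = np.array([1.0, 0.2, 0.3, 0.4])        # hypothetical coefficients
a = 0.25
b_shift = b + np.array([0.0, a, a, -a])
same_fit = np.allclose(X_short @ b, X_short @ b_shift)
print("shifted coefficients give the same X b:", same_fit)
```

The last check is the reason full rank is called an identification condition: with an exact linear dependency among the columns, different parameter vectors produce the same deterministic part of the model.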
Key Terms and Concepts

• Autocorrelation
• Constant elasticity
• Covariate
• Dependent variable
• Deterministic relationship
• Disturbance
• Exogeneity
• Explained variable
• Explanatory variable
• Flexible functional form
• Full rank
• Heteroscedasticity
• Homoscedasticity
• Identification condition
• Independent variable
• Linear regression model
• Loglinear model
• Multiple linear regression model
• Nonautocorrelation
• Nonstochastic regressors
• Normality
• Normally distributed
• Population regression equation
• Regressand
• Regression
• Regressor
• Second-order effects
• Semilog
• Spherical disturbances
• Translog model