Econometric Analysis - Greene, 6th Edition - Chapter 17

17 MAXIMUM LIKELIHOOD ESTIMATION

17.1 INTRODUCTION

The generalized method of moments discussed in Chapter 18 and the semiparametric, nonparametric, and Bayesian estimators discussed in Chapter 16 are becoming widely used by model builders. Nonetheless, the maximum likelihood estimator discussed in this chapter remains the preferred estimator in many more settings than the others listed. As such, we focus our discussion of generally applied estimation methods on this technique. Sections 17.2 through 17.5 present statistical results for estimation and hypothesis testing based on the maximum likelihood principle. After establishing some general results for this method of estimation, we will then extend them to the more familiar setting of econometric models. Some applications are presented in Section 17.6. Finally, three variations on the technique, maximum simulated likelihood, two-step estimation, and pseudo-maximum likelihood estimation, are described in Sections 17.7 through 17.9.

17.2 THE LIKELIHOOD FUNCTION AND IDENTIFICATION OF THE PARAMETERS

The probability density function, or pdf, for a random variable y, conditioned on a set of parameters, θ, is denoted f(y | θ).[1] This function identifies the data generating process that underlies an observed sample of data and, at the same time, provides a mathematical description of the data that the process will produce. The joint density of n independent and identically distributed (iid) observations from this process is the product of the individual densities;

$$f(y_1,\ldots,y_n \mid \boldsymbol\theta) = \prod_{i=1}^n f(y_i \mid \boldsymbol\theta) = L(\boldsymbol\theta \mid \mathbf{y}). \tag{17-1}$$

This joint density is the likelihood function, defined as a function of the unknown parameter vector, θ, where y is used to indicate the collection of sample data. Note that we write the joint density as a function of the data conditioned on the parameters, whereas when we form the likelihood function, we write this function in reverse, as a function of the parameters, conditioned on the data. Though the two functions are the same, it is to be emphasized that the likelihood function is written in this fashion to highlight our interest in the parameters and the information about them that is contained in the observed data. However, it is understood that the likelihood function is not meant to represent a probability density for the parameters as it is in Section 16.2.2. In this classical estimation framework, the parameters are assumed to be fixed constants which we hope to learn about from the data.

It is usually simpler to work with the log of the likelihood function:

$$\ln L(\boldsymbol\theta \mid \mathbf{y}) = \sum_{i=1}^n \ln f(y_i \mid \boldsymbol\theta). \tag{17-2}$$

Again, to emphasize our interest in the parameters, given the observed data, we denote this function L(θ | data) = L(θ | y). The likelihood function and its logarithm, evaluated at θ̂, are sometimes denoted simply L(θ̂) and ln L(θ̂), respectively or, where no ambiguity can arise, just L or ln L.

[1] Later we will extend this to the case of a random vector, y, with a multivariate density, but at this point, that would complicate the notation without adding anything of substance to the discussion.
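To make (17-1) and (17-2) concrete, here is a minimal sketch, using a small artificial sample and an exponential density chosen purely for illustration (neither appears in the text), of evaluating the likelihood as a product of densities and the log-likelihood as the corresponding sum for one candidate parameter value.

```python
import numpy as np

# Hypothetical iid sample and a candidate parameter value (illustrative only).
y = np.array([0.8, 1.6, 0.4, 2.3, 1.1])
theta = 1.2  # candidate mean of an exponential density f(y|theta) = exp(-y/theta)/theta

density = np.exp(-y / theta) / theta        # f(y_i | theta) for each observation
L = np.prod(density)                        # likelihood (17-1): product of densities
lnL = np.sum(np.log(density))               # log-likelihood (17-2): sum of log densities

print(L, lnL, np.isclose(np.log(L), lnL))   # log of the product equals the sum
```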
It will usually be necessary to generalize the concept of the likelihood function to allow the density to depend on other conditioning variables. To jump immediately to one of our central applications, suppose the disturbance in the classical linear regression model is normally distributed. Then, conditioned on its specific x_i, y_i is normally distributed with mean μ_i = x_i'β and variance σ². That means that the observed random variables are not iid; they have different means. Nonetheless, the observations are independent, and as we will examine in closer detail,

$$\ln L(\boldsymbol\theta \mid \mathbf{y}, \mathbf{X}) = \sum_{i=1}^n \ln f(y_i \mid \mathbf{x}_i, \boldsymbol\theta) = -\frac{1}{2}\sum_{i=1}^n \left[\ln\sigma^2 + \ln(2\pi) + \frac{(y_i - \mathbf{x}_i'\boldsymbol\beta)^2}{\sigma^2}\right], \tag{17-3}$$

where X is the n × K matrix of data with ith row equal to x_i'.

The rest of this chapter will be concerned with obtaining estimates of the parameters, θ, and with testing hypotheses about them and about the data generating process. Before we begin that study, we consider the question of whether estimation of the parameters is possible at all, the question of identification. Identification is an issue related to the formulation of the model. The issue of identification must be resolved before estimation can even be considered. The question posed is essentially this: Suppose we had an infinitely large sample, that is, for current purposes, all the information there is to be had about the parameters. Could we uniquely determine the values of θ from such a sample? As will be clear shortly, the answer is sometimes no.

DEFINITION 17.1 Identification
The parameter vector θ is identified (estimable) if for any other parameter vector, θ* ≠ θ, for some data y, L(θ* | y) ≠ L(θ | y).

This result will be crucial at several points in what follows. We consider two examples, the first of which will be very familiar to you by now.

Example 17.1 Identification of Parameters
For the regression model specified in (17-3), suppose that there is a nonzero vector a such that x_i'a = 0 for every x_i. Then there is another "parameter" vector, γ = β + a ≠ β, such that x_i'β = x_i'γ for every x_i. ...

... and ∂² ln L(θ | y)/∂θ² < 0, so this is a maximum. The solution is the same as before. Figure 17.1 also plots the log of L(θ | y) to illustrate the result.

The reference to the probability of observing the given sample is not exact in a continuous distribution, since a particular sample has probability zero. Nonetheless, the principle is the same. The values of the parameters that maximize L(θ | data) or its log are the maximum likelihood estimates, denoted θ̂. Since the logarithm is a monotonic function, the values that maximize L(θ | data) are the same as those that maximize ln L(θ | data). The necessary condition for maximizing ln L(θ | data) is

$$\frac{\partial \ln L(\boldsymbol\theta \mid \text{data})}{\partial \boldsymbol\theta} = \mathbf{0}. \tag{17-4}$$

This is called the likelihood equation. The general result then is that the MLE is a root of the likelihood equation. The application to the parameters of the dgp for a discrete random variable is suggestive that maximum likelihood is a "good" use of the data. It remains to establish this as a general principle. We turn to that issue in the next section.

Example 17.2 Log-Likelihood Function and Likelihood Equations for the Normal Distribution
In sampling from a normal distribution with mean μ and variance σ², the log-likelihood function and the likelihood equations for μ and σ² are

$$\ln L(\mu,\sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2}\sum_{i=1}^n \frac{(y_i-\mu)^2}{\sigma^2}, \tag{17-5}$$

$$\frac{\partial \ln L}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^n (y_i-\mu) = 0, \tag{17-6}$$

$$\frac{\partial \ln L}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (y_i-\mu)^2 = 0. \tag{17-7}$$

To solve the likelihood equations, multiply (17-6) by σ² and solve for μ̂, then insert this solution in (17-7) and solve for σ². The solutions are

$$\hat\mu_{ML} = \frac{1}{n}\sum_{i=1}^n y_i = \bar{y}_n \quad\text{and}\quad \hat\sigma^2_{ML} = \frac{1}{n}\sum_{i=1}^n (y_i-\bar{y}_n)^2. \tag{17-8}$$
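As a numerical illustration of Example 17.2, the following sketch (the sample is artificial) computes the closed-form solutions in (17-8) and verifies that they satisfy the likelihood equations (17-6) and (17-7).

```python
import numpy as np

y = np.array([2.1, 3.4, 1.8, 2.9, 3.3, 2.5])   # hypothetical sample

# Closed-form MLEs from (17-8): note the divisor n, not n - 1.
mu_hat = y.mean()
sig2_hat = np.mean((y - mu_hat) ** 2)

# Likelihood equations (17-6) and (17-7) evaluated at the MLEs; both should be ~0.
d_mu = np.sum(y - mu_hat) / sig2_hat
d_sig2 = -len(y) / (2 * sig2_hat) + np.sum((y - mu_hat) ** 2) / (2 * sig2_hat ** 2)

print(mu_hat, sig2_hat)
print(np.isclose(d_mu, 0.0), np.isclose(d_sig2, 0.0))
```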
17.4 PROPERTIES OF MAXIMUM LIKELIHOOD ESTIMATORS

Maximum likelihood estimators (MLEs) are most attractive because of their large-sample or asymptotic properties.

DEFINITION 17.2 Asymptotic Efficiency
An estimator is asymptotically efficient if it is consistent, asymptotically normally distributed (CAN), and has an asymptotic covariance matrix that is not larger than the asymptotic covariance matrix of any other consistent, asymptotically normally distributed estimator.[2]

If certain regularity conditions are met, the MLE will have these properties. The finite sample properties are sometimes less than optimal. For example, the MLE may be biased; the MLE of σ² in Example 17.2 is biased downward. The occasional statement that the properties of the MLE are only optimal in large samples is not true, however. It can be shown that when sampling is from an exponential family of distributions (see Definition 18.1), there will exist sufficient statistics. If so, MLEs will be functions of them, which means that when minimum variance unbiased estimators exist, they will be MLEs. [See Stuart and Ord (1989).] Most applications in econometrics do not involve exponential families, so the appeal of the MLE remains primarily its asymptotic properties.

We use the following notation: θ̂ is the maximum likelihood estimator; θ₀ denotes the true value of the parameter vector; θ denotes another possible value of the parameter vector, not the MLE and not necessarily the true value. Expectation based on the true values of the parameters is denoted E₀[·]. If we assume that the regularity conditions discussed below are met by f(x, θ₀), then we have the following theorem.

THEOREM 17.1 Properties of an MLE
Under regularity, the maximum likelihood estimator (MLE) has the following asymptotic properties:
M1. Consistency: plim θ̂ = θ₀.
M2. Asymptotic normality: θ̂ ~a N[θ₀, {I(θ₀)}⁻¹], where I(θ₀) = −E₀[∂² ln L/∂θ₀∂θ₀'].
M3. Asymptotic efficiency: θ̂ is asymptotically efficient and achieves the Cramér-Rao lower bound for consistent estimators, given in M2 and Theorem C.2.
M4. Invariance: The maximum likelihood estimator of γ₀ = c(θ₀) is c(θ̂) if c(θ₀) is a continuous and continuously differentiable function.

17.4.1 REGULARITY CONDITIONS

To sketch proofs of these results, we first obtain some useful properties of probability density functions. We assume that (y₁, ..., y_n) is a random sample from the population with density function f(y_i | θ₀) and that the following regularity conditions hold. [Our statement of these is informal. A more rigorous treatment may be found in Stuart and Ord (1989) or Davidson and MacKinnon (1993).]

DEFINITION 17.3 Regularity Conditions
R1. The first three derivatives of ln f(y_i | θ) with respect to θ are continuous and finite for almost all y_i and for all θ. This condition ensures the existence of a certain Taylor series approximation and the finite variance of the derivatives of ln L.
R2. The conditions necessary to obtain the expectations of the first and second derivatives of ln f(y_i | θ) are met.
R3. For all values of θ, |∂³ ln f(y_i | θ)/∂θ_j ∂θ_k ∂θ_l| is less than a function that has a finite expectation. This condition will allow us to truncate the Taylor series.

[2] Not larger is defined in the sense of (A-118): the covariance matrix of the less efficient estimator equals that of the efficient estimator plus a nonnegative definite matrix.
With these regularity conditions, we will obtain the following fundamental characteristics of f(y_i | θ): D1 is simply a consequence of the definition of the likelihood function. D2 leads to the moment condition which defines the maximum likelihood estimator. On the one hand, the MLE is found as the maximizer of a function, which mandates finding the vector which equates the gradient to zero. On the other, D2 is a more fundamental relationship which places the MLE in the class of generalized method of moments estimators. D3 produces what is known as the information matrix equality. This relationship shows how to obtain the asymptotic covariance matrix of the MLE.

17.4.2 PROPERTIES OF REGULAR DENSITIES

Densities that are "regular" by Definition 17.3 have three properties which are used in establishing the properties of maximum likelihood estimators:

THEOREM 17.2 Moments of the Derivatives of the Log-Likelihood
D1. ln f(y_i | θ), g_i = ∂ ln f(y_i | θ)/∂θ, and H_i = ∂² ln f(y_i | θ)/∂θ∂θ', i = 1, ..., n, are all random samples of random variables. This statement follows from our assumption of random sampling. The notation g_i(θ₀) and H_i(θ₀) indicates the derivative evaluated at θ₀.
D2. E₀[g_i(θ₀)] = 0.
D3. Var₀[g_i(θ₀)] = −E₀[H_i(θ₀)].
Condition D1 is simply a consequence of the definition of the density.

For the moment, we allow the range of y_i to depend on the parameters; A(θ₀) ≤ y_i ≤ B(θ₀). (Consider, for example, finding the maximum likelihood estimator of θ ...)

... we will sketch an analysis provided by Stuart and Ord (1989) for a simple case, and indicate where it will be necessary to extend the derivation if it were to be fully general.

17.4.5.a CONSISTENCY

We assume that f(y_i | θ₀) is a possibly multivariate density which at this point does not depend on covariates, x_i. Thus, this is the iid, random sampling case. Since θ̂ is the MLE, in any finite sample, for any θ ≠ θ̂ (including the true θ₀) it must be true that

$$\ln L(\hat{\boldsymbol\theta}) \ge \ln L(\boldsymbol\theta). \tag{17-12}$$

Consider, then, the random variable L(θ)/L(θ₀). Since the log function is strictly concave, from Jensen's Inequality (Theorem D.8), we have

$$E_0\left[\log \frac{L(\boldsymbol\theta)}{L(\boldsymbol\theta_0)}\right] < \log E_0\left[\frac{L(\boldsymbol\theta)}{L(\boldsymbol\theta_0)}\right]. \tag{17-13}$$

The expectation on the right-hand side is exactly equal to one, as

$$E_0\left[\frac{L(\boldsymbol\theta)}{L(\boldsymbol\theta_0)}\right] = \int \frac{L(\boldsymbol\theta)}{L(\boldsymbol\theta_0)}\,L(\boldsymbol\theta_0)\,d\mathbf{y} = 1 \tag{17-14}$$

is simply the integral of a joint density. Now, take logs on both sides of (17-13), insert the result of (17-14), then divide by n to produce

$$E_0[(1/n)\ln L(\boldsymbol\theta)] - E_0[(1/n)\ln L(\boldsymbol\theta_0)] < 0. \tag{17-15}$$

This produces a central result:

THEOREM 17.3 Likelihood Inequality
E₀[(1/n) ln L(θ₀)] > E₀[(1/n) ln L(θ)] for any θ ≠ θ₀ (including θ̂). This result is (17-15).

In words, the expected value of the log-likelihood is maximized at the true value of the parameters. For any θ, including θ̂,

$$(1/n)\ln L(\boldsymbol\theta) = (1/n)\sum_{i=1}^n \ln f(y_i \mid \boldsymbol\theta)$$

is the sample mean of n iid random variables, with expectation E₀[(1/n) ln L(θ)]. Since the sampling is iid by the regularity conditions, we can invoke the Khinchine Theorem, D.5; the sample mean converges in probability to the population mean. Using θ = θ̂, it follows from Theorem 17.3 that as n → ∞,

lim Prob{[(1/n) ln L(θ̂)] < [(1/n) ln L(θ₀)]} = 1 if θ̂ ≠ θ₀.

But θ̂ is the MLE, so for every n, (1/n) ln L(θ̂) ≥ (1/n) ln L(θ₀). The only way these can both be true is if (1/n) times the sample log-likelihood evaluated at the MLE converges to the population expectation of (1/n) times the log-likelihood evaluated at the true parameters. There remains one final step.
Does (1/n) ln L(θ̂) → (1/n) ln L(θ₀) imply that θ̂ → θ₀? If there is a single parameter and the likelihood function is one to one, then clearly so. For more general cases, this requires a further characterization of the likelihood function. If the likelihood is strictly continuous and twice differentiable, which we assumed in the regularity conditions, and if the parameters of the model are identified, which we assumed at the beginning of this discussion, then yes, it does, so we have the result.

This is a heuristic proof. As noted, formal presentations appear in more advanced treatises than this one. We should also note that we have assumed at several points that sample means converge to the population expectations. This is likely to be true for the sorts of applications usually encountered in econometrics, but a fully general set of results would look more closely at this condition. Second, we have assumed iid sampling in the preceding, that is, the density for y_i does not depend on any other variables, x_i. This will almost never be true in practice. Assumptions about the behavior of these variables will enter the proofs as well. For example, in assessing the large sample behavior of the least squares estimator, we have invoked an assumption that the data are "well behaved." The same sort of consideration will apply here as well. We will return to this issue shortly. With all this in place, we have property M1, plim θ̂ = θ₀.

17.4.5.b ASYMPTOTIC NORMALITY

At the maximum likelihood estimator, the gradient of the log-likelihood equals zero (by definition), so

$$\mathbf{g}(\hat{\boldsymbol\theta}) = \mathbf{0}.$$

(This is the sample statistic, not the expectation.) Expand this set of equations in a second-order Taylor series around the true parameters θ₀. We will use the mean value theorem to truncate the Taylor series at the second term,

$$\mathbf{g}(\hat{\boldsymbol\theta}) = \mathbf{g}(\boldsymbol\theta_0) + \mathbf{H}(\bar{\boldsymbol\theta})(\hat{\boldsymbol\theta} - \boldsymbol\theta_0) = \mathbf{0}.$$

The Hessian is evaluated at a point θ̄ that is between θ̂ and θ₀ (θ̄ = wθ̂ + (1 − w)θ₀ for some 0 < w < 1). We then rearrange this function and multiply the result by √n to obtain

$$\sqrt n(\hat{\boldsymbol\theta} - \boldsymbol\theta_0) = [-\mathbf{H}(\bar{\boldsymbol\theta})]^{-1}[\sqrt n\,\mathbf{g}(\boldsymbol\theta_0)].$$

Because plim(θ̂ − θ₀) = 0, plim(θ̂ − θ̄) = 0 as well. The second derivatives are continuous functions. Therefore, if the limiting distribution exists, then

$$\sqrt n(\hat{\boldsymbol\theta} - \boldsymbol\theta_0) \xrightarrow{d} [-\mathbf{H}(\boldsymbol\theta_0)]^{-1}[\sqrt n\,\mathbf{g}(\boldsymbol\theta_0)].$$

By dividing H(θ₀) and g(θ₀) by n, we obtain

$$\sqrt n(\hat{\boldsymbol\theta} - \boldsymbol\theta_0) \xrightarrow{d} \left[-\tfrac1n\mathbf{H}(\boldsymbol\theta_0)\right]^{-1}[\sqrt n\,\bar{\mathbf{g}}(\boldsymbol\theta_0)].$$

We may apply the Lindeberg-Levy central limit theorem (D.18) to [√n ḡ(θ₀)], since it is √n times the mean of a random sample; we have invoked D1 again. The limiting variance of [√n ḡ(θ₀)] is −E₀[(1/n)H(θ₀)], so

$$\sqrt n\,\bar{\mathbf{g}}(\boldsymbol\theta_0) \xrightarrow{d} N\!\left\{\mathbf{0},\ -E_0\!\left[\tfrac1n\mathbf{H}(\boldsymbol\theta_0)\right]\right\}.$$

By virtue of Theorem D.2, plim[−(1/n)H(θ₀)] = −E₀[(1/n)H(θ₀)]. Since this result is a constant matrix, we can combine results to obtain

$$\left[-\tfrac1n\mathbf{H}(\boldsymbol\theta_0)\right]^{-1}\sqrt n\,\bar{\mathbf{g}}(\boldsymbol\theta_0) \xrightarrow{d} N\!\left[\mathbf{0},\ \left\{-E_0\!\left[\tfrac1n\mathbf{H}(\boldsymbol\theta_0)\right]\right\}^{-1}\right],$$

or

$$\sqrt n(\hat{\boldsymbol\theta} - \boldsymbol\theta_0) \xrightarrow{d} N\!\left[\mathbf{0},\ \left\{-E_0\!\left[\tfrac1n\mathbf{H}(\boldsymbol\theta_0)\right]\right\}^{-1}\right],$$

which gives the asymptotic distribution of the MLE:

$$\hat{\boldsymbol\theta} \stackrel{a}{\sim} N[\boldsymbol\theta_0,\ \{\mathbf{I}(\boldsymbol\theta_0)\}^{-1}].$$

This last step completes M2.

Example 17.3 Information Matrix for the Normal Distribution
For the likelihood function in Example 17.2, the second derivatives are

$$\frac{\partial^2 \ln L}{\partial \mu^2} = \frac{-n}{\sigma^2},\qquad
\frac{\partial^2 \ln L}{\partial (\sigma^2)^2} = \frac{n}{2\sigma^4} - \frac{1}{\sigma^6}\sum_{i=1}^n (y_i-\mu)^2,\qquad
\frac{\partial^2 \ln L}{\partial \mu\,\partial\sigma^2} = \frac{-1}{\sigma^4}\sum_{i=1}^n (y_i-\mu).$$

For the asymptotic variance of the maximum likelihood estimator, we need the expectations of these derivatives. The first is nonstochastic, and the third has expectation 0, as E[y_i] = μ. That leaves the second, which you can verify has expectation −n/(2σ⁴) because each of the n terms (y_i − μ)² has expected value σ². Collecting these in the information matrix, reversing the sign, and inverting the matrix gives the asymptotic covariance matrix for the maximum likelihood estimators:

$$\left\{-E_0\left[\frac{\partial^2 \ln L}{\partial\boldsymbol\theta_0\,\partial\boldsymbol\theta_0'}\right]\right\}^{-1}
= \begin{bmatrix} \sigma^2/n & 0 \\ 0 & 2\sigma^4/n \end{bmatrix}.$$
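The asymptotic covariance matrix derived in Example 17.3 can be checked by simulation. The sketch below (sample size, parameter values, and replication count are arbitrary choices, not from the text) draws repeated normal samples, computes the MLEs in (17-8), and compares their sampling variances with σ²/n and 2σ⁴/n.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n, reps = 1.0, 4.0, 200, 5000   # arbitrary illustrative values

mu_hats = np.empty(reps)
sig2_hats = np.empty(reps)
for r in range(reps):
    y = rng.normal(mu, np.sqrt(sigma2), size=n)
    mu_hats[r] = y.mean()                      # MLE of mu, (17-8)
    sig2_hats[r] = np.mean((y - y.mean())**2)  # MLE of sigma^2, divisor n

# Simulated sampling variances vs. the diagonal of the inverse information matrix.
print(mu_hats.var(), sigma2 / n)               # approximately sigma^2 / n
print(sig2_hats.var(), 2 * sigma2**2 / n)      # approximately 2 sigma^4 / n
```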
17.4.5.c ASYMPTOTIC EFFICIENCY

Theorem C.2 provides the lower bound for the variance of an unbiased estimator. Since the asymptotic variance of the MLE achieves this bound, it seems natural to extend the result directly. There is, however, a loose end in that the MLE is almost never unbiased. As such, we need an asymptotic version of the bound, which was provided by Cramér (1948) and Rao (1945) (hence the name):

THEOREM 17.4 Cramér-Rao Lower Bound
Assuming that the density of y_i satisfies the regularity conditions R1-R3, the asymptotic variance of a consistent and asymptotically normally distributed estimator of the parameter vector θ₀ will always be at least as large as

$$[\mathbf{I}(\boldsymbol\theta_0)]^{-1} = \left\{-E_0\left[\frac{\partial^2 \ln L(\boldsymbol\theta_0)}{\partial\boldsymbol\theta_0\,\partial\boldsymbol\theta_0'}\right]\right\}^{-1}
= \left\{E_0\left[\left(\frac{\partial \ln L(\boldsymbol\theta_0)}{\partial\boldsymbol\theta_0}\right)\left(\frac{\partial \ln L(\boldsymbol\theta_0)}{\partial\boldsymbol\theta_0}\right)'\right]\right\}^{-1}.$$

Example 17.4 Variance Estimators for an MLE
The sample data in Example C.1 are generated by a model of the form

$$f(y_i \mid x_i, \beta) = \frac{1}{\beta + x_i}\,e^{-y_i/(\beta + x_i)},$$

where y = income and x = education. To find the maximum likelihood estimate of β, we maximize

$$\ln L(\beta) = -\sum_{i=1}^n \ln(\beta + x_i) - \sum_{i=1}^n \frac{y_i}{\beta + x_i}.$$

The likelihood equation is

$$\frac{\partial \ln L}{\partial\beta} = \sum_{i=1}^n\left[\frac{-1}{\beta+x_i} + \frac{y_i}{(\beta+x_i)^2}\right] = 0, \tag{17-19}$$

which has the solution β̂ = 15.602727. To compute the asymptotic variance of the MLE, we require

$$\frac{\partial^2 \ln L}{\partial\beta^2} = \sum_{i=1}^n\left[\frac{1}{(\beta+x_i)^2} - \frac{2y_i}{(\beta+x_i)^3}\right]. \tag{17-20}$$

Since the function E[y_i | x_i] = β + x_i is known, the exact form of the expected value in (17-20) is known. Inserting β̂ + x_i for y_i in (17-20) and taking the negative of the reciprocal yields the first variance estimate, 44.2546. Simply inserting β̂ = 15.602727 in (17-20) and taking the negative of the reciprocal gives the second estimate, 46.16337. Finally, by computing the reciprocal of the sum of squares of the first derivatives of the log-densities evaluated at β̂,

$$[\hat{\mathbf{I}}(\hat\beta)]^{-1}_{BHHH} = \left\{\sum_{i=1}^n\left[\frac{-1}{\hat\beta+x_i} + \frac{y_i}{(\hat\beta+x_i)^2}\right]^2\right\}^{-1},$$

we obtain the BHHH estimate, 100.5116.
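The three variance estimators compared in Example 17.4 are easy to compute. The following sketch uses simulated data in place of the Example C.1 data (which are not reproduced in this preview), so the numerical results will differ from those quoted above; it computes the expected-Hessian, observed-Hessian, and BHHH estimates of the asymptotic variance of β̂ for the model f(y_i | x_i, β) = e^{-y_i/(β+x_i)}/(β + x_i).

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Hypothetical data standing in for Example C.1 (income y, education x); illustrative only.
rng = np.random.default_rng(1)
x = rng.integers(8, 21, size=20).astype(float)
y = rng.exponential(scale=15.0 + x)            # E[y|x] = beta + x, as in Example 17.4

def score(b):   # per-observation first derivatives of ln f(y_i | x_i, b), see (17-19)
    return -1.0 / (b + x) + y / (b + x) ** 2

def hess(b):    # second derivative of ln L(b), see (17-20)
    return np.sum(1.0 / (b + x) ** 2 - 2.0 * y / (b + x) ** 3)

# Maximize ln L(b) numerically over a bounded interval (a simple, robust choice here).
negll = lambda b: np.sum(np.log(b + x) + y / (b + x))
b_hat = minimize_scalar(negll, bounds=(0.1, 200.0), method="bounded").x

# Three estimators of the asymptotic variance discussed in the example:
v_expected = 1.0 / np.sum(1.0 / (b_hat + x) ** 2)   # uses E[y|x] = beta + x in (17-20)
v_observed = -1.0 / hess(b_hat)                     # negative inverse of the actual Hessian
v_bhhh = 1.0 / np.sum(score(b_hat) ** 2)            # BHHH / outer product of gradients

print(b_hat, v_expected, v_observed, v_bhhh)
```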
17.4.7 CONDITIONAL LIKELIHOODS AND ECONOMETRIC MODELS

All of the preceding results form the statistical underpinnings of the technique of maximum likelihood estimation. But, for our purposes, a crucial element is missing. We have done the analysis in terms of the density of an observed random variable and a vector of parameters, f(y_i | α). But econometric models will involve exogenous or predetermined variables, x_i, so the results must be extended. A workable approach is to treat this modeling framework the same as the one in Chapter 5, where we considered the large sample properties of the linear regression model. Thus, we will allow x_i to denote a mix of random variables and constants that enter the conditional density of y_i. By partitioning the joint density of y_i and x_i into the product of the conditional and the marginal, the log-likelihood function may be written

$$\ln L(\boldsymbol\alpha \mid \text{data}) = \sum_{i=1}^n \ln f(y_i, \mathbf{x}_i \mid \boldsymbol\alpha) = \sum_{i=1}^n \ln f(y_i \mid \mathbf{x}_i, \boldsymbol\alpha) + \sum_{i=1}^n \ln g(\mathbf{x}_i \mid \boldsymbol\alpha),$$

where any nonstochastic elements in x_i, such as a time trend or dummy variable, are being carried as constants. In order to proceed, we will assume as we did before that the process generating x_i takes place outside the model of interest. For present purposes, that means that the parameters that appear in g(x_i | α) do not overlap with those that appear in f(y_i | x_i, α). Thus, we partition α into [θ, δ] so that the log-likelihood function may be written

$$\ln L(\boldsymbol\theta, \boldsymbol\delta \mid \text{data}) = \sum_{i=1}^n \ln f(y_i, \mathbf{x}_i \mid \boldsymbol\alpha) = \sum_{i=1}^n \ln f(y_i \mid \mathbf{x}_i, \boldsymbol\theta) + \sum_{i=1}^n \ln g(\mathbf{x}_i \mid \boldsymbol\delta).$$

As long as θ and δ have no elements in common and no restrictions connect them (such as θ + δ = 1), then the two parts of the log-likelihood may be analyzed separately. In most cases, the marginal distribution of x_i will be of secondary (or no) interest.

Asymptotic results for the maximum conditional likelihood estimator must now account for the presence of x_i in the functions and derivatives of ln f(y_i | x_i, θ). We will proceed under the assumption of well behaved data so that sample averages such as

$$(1/n)\ln L(\boldsymbol\theta \mid \mathbf{y}, \mathbf{X}) = \frac{1}{n}\sum_{i=1}^n \ln f(y_i \mid \mathbf{x}_i, \boldsymbol\theta)$$

and its gradient with respect to θ will converge in probability to their population expectations. We will also need to invoke central limit theorems to establish the asymptotic normality of the gradient of the log-likelihood, so as to be able to characterize the MLE itself. We will leave it to more advanced treatises such as Amemiya (1985) and Newey and McFadden (1994) to establish specific conditions and fine points that must be assumed to claim the "usual" properties for maximum likelihood estimators. For present purposes (and the vast bulk of empirical applications), the following minimal assumptions should suffice:

* Parameter space. Parameter spaces that have gaps and nonconvexities in them will generally disable these procedures. An estimation problem that produces this failure is that of "estimating" a parameter that can take only one among a discrete set of values. For example, this set of procedures does not include "estimating" the timing of a structural change in a model. (See Section 7.4.) The likelihood function must be a continuous function of a convex parameter space. We allow unbounded parameter spaces, such as σ > 0 in the regression model, for example.
* Identifiability. Estimation must be feasible. This is the subject of Definition 17.1 concerning identification and the surrounding discussion.
* Well behaved data. Laws of large numbers apply to sample means involving the data and some form of central limit theorem (generally Lyapounov) can be applied to the gradient. Ergodic stationarity is broad enough to encompass any situation that is likely to arise in practice, though it is probably more general than we need for most applications, since we will not encounter dependent observations specifically until later in the book. The definitions in Chapter 5 are assumed to hold generally.

With these in place, analysis is essentially the same in character as that we used in the linear regression model in Chapter 5 and follows precisely along the lines of Section 16.5.

17.5 THREE ASYMPTOTICALLY EQUIVALENT TEST PROCEDURES

The next several sections will discuss the most commonly used test procedures: the likelihood ratio, Wald, and Lagrange multiplier tests. [Extensive discussion of these procedures is given in Godfrey (1988).] We consider maximum likelihood estimation of a parameter θ and a test of the hypothesis H₀: c(θ) = 0. The logic of the tests can be seen in Figure 17.2.[3] The figure plots the log-likelihood function ln L(θ), its derivative with respect to θ, d ln L(θ)/dθ, and the constraint c(θ). There are three approaches to testing the hypothesis suggested in the figure:

[3] See Buse (1982). Note that the scale of the vertical axis would be different for each curve. As such, the points of intersection have no significance.

* Likelihood ratio test.
If the restriction c(θ) = 0 is valid, then imposing it should not lead to a large reduction in the log-likelihood function. Therefore, we base the test on the difference, ln L_U − ln L_R, where L_U is the value of the likelihood function at the unconstrained value of θ and L_R is the value of the likelihood function at the restricted estimate.
* Wald test. If the restriction is valid, then c(θ̂_MLE) should be close to zero since the MLE is consistent. Therefore, the test is based on c(θ̂_MLE). We reject the hypothesis if this value is significantly different from zero.
* Lagrange multiplier test. If the restriction is valid, then the restricted estimator should be near the point that maximizes the log-likelihood. Therefore, the slope of the log-likelihood function should be near zero at the restricted estimator. The test is based on the slope of the log-likelihood at the point where the function is maximized subject to the restriction.

These three tests are asymptotically equivalent under the null hypothesis, but they can behave rather differently in a small sample. Unfortunately, their small-sample properties are unknown, except in a few special cases. As a consequence, the choice among them is typically made on the basis of ease of computation. The likelihood ratio test requires calculation of both restricted and unrestricted estimators. If both are simple to compute, then this way to proceed is convenient. The Wald test requires only the unrestricted estimator, and the Lagrange multiplier test requires only the restricted estimator. In some problems, one of these estimators may be much easier to compute than the other. For example, a linear model is simple to estimate but becomes nonlinear and cumbersome if a nonlinear constraint is imposed. In this case, the Wald statistic might be preferable. Alternatively, restrictions sometimes amount to the removal of nonlinearities, which would make the Lagrange multiplier test the simpler procedure.

17.5.1 THE LIKELIHOOD RATIO TEST

Let θ be a vector of parameters to be estimated, and let H₀ specify some sort of restriction on these parameters. Let θ̂_U be the maximum likelihood estimator of θ obtained without regard to the constraints, and let θ̂_R be the constrained maximum likelihood estimator. If L̂_U and L̂_R are the likelihood functions evaluated at these two estimates, then the likelihood ratio is λ = L̂_R/L̂_U. ...

These two tests are based on the distribution of the full rank quadratic form considered in Section B.11.6. Specifically, if x ~ N_J[μ, Σ], then

$$(\mathbf{x} - \boldsymbol\mu)'\boldsymbol\Sigma^{-1}(\mathbf{x} - \boldsymbol\mu) \sim \text{chi-squared}[J]. \tag{17-22}$$

In the setting of a hypothesis test, under the hypothesis that E(x) = μ, the quadratic form has the chi-squared distribution. If the hypothesis that E(x) = μ is false, however, then the quadratic form just given will, on average, be larger than it would be if the hypothesis were true.[4] This condition forms the basis for the test statistics discussed in this and the next section.

Let θ̂ be the vector of parameter estimates obtained without restrictions. We hypothesize a set of restrictions

H₀: c(θ) = q.

If the restrictions are valid, then at least approximately θ̂ should satisfy them. If the hypothesis is erroneous, however, then c(θ̂) − q should be farther from 0 than would be explained by sampling variability alone. The device we use to formalize this idea is the Wald test.
THEOREM 17.6 Limiting Distribution of the Wald Test Statistic
The Wald statistic is

$$W = [\mathbf{c}(\hat{\boldsymbol\theta}) - \mathbf{q}]'\{\text{Asy.Var}[\mathbf{c}(\hat{\boldsymbol\theta}) - \mathbf{q}]\}^{-1}[\mathbf{c}(\hat{\boldsymbol\theta}) - \mathbf{q}].$$

Under H₀, in large samples, W has a chi-squared distribution with degrees of freedom equal to the number of restrictions [i.e., the number of equations in c(θ) − q = 0]. A derivation of the limiting distribution of the Wald statistic appears in Theorem 6.

This test is analogous to the chi-squared statistic in (17-22) if c(θ̂) − q is normally distributed with the hypothesized mean of 0. A large value of W leads to rejection of the hypothesis. Note, finally, that W only requires computation of the unrestricted model. One must still compute the covariance matrix appearing in the preceding quadratic form. This result is the variance of a possibly nonlinear function, which we treated earlier,

$$\text{Est.Asy.Var}[\mathbf{c}(\hat{\boldsymbol\theta}) - \mathbf{q}] = \hat{\mathbf{C}}\,\text{Est.Asy.Var}[\hat{\boldsymbol\theta}]\,\hat{\mathbf{C}}',\qquad
\hat{\mathbf{C}} = \frac{\partial \mathbf{c}(\hat{\boldsymbol\theta})}{\partial\hat{\boldsymbol\theta}'}. \tag{17-23}$$

That is, C is the J × K matrix whose jth row is the derivatives of the jth constraint with respect to the K elements of θ. A common application occurs in testing a set of linear restrictions.

[4] If the mean is not μ, then the statistic in (17-22) will have a noncentral chi-squared distribution. This distribution has the same basic shape as the central chi-squared distribution, with the same degrees of freedom, but lies to the right of it. Thus, a random draw from the noncentral distribution will tend, on average, to be larger than a random observation from the central distribution.

For testing a set of linear restrictions Rθ = q, the Wald test would be based on

$$H_0: \mathbf{c}(\boldsymbol\theta) - \mathbf{q} = \mathbf{R}\boldsymbol\theta - \mathbf{q} = \mathbf{0}, \tag{17-24}$$

$$\text{Est.Asy.Var}[\mathbf{c}(\hat{\boldsymbol\theta}) - \mathbf{q}] = \mathbf{R}\,\text{Est.Asy.Var}[\hat{\boldsymbol\theta}]\,\mathbf{R}',$$

and

$$W = [\mathbf{R}\hat{\boldsymbol\theta} - \mathbf{q}]'\,[\mathbf{R}\,\text{Est.Asy.Var}(\hat{\boldsymbol\theta})\,\mathbf{R}']^{-1}\,[\mathbf{R}\hat{\boldsymbol\theta} - \mathbf{q}].$$

The degrees of freedom is the number of rows in R.

If c(θ) − q is a single restriction, then the Wald test will be the same as the test based on the confidence interval developed previously. If the test is

H₀: θ = θ₀ versus H₁: θ ≠ θ₀,

then the earlier test is based on

$$z = \frac{|\hat\theta - \theta_0|}{s(\hat\theta)}, \tag{17-25}$$

where s(θ̂) is the estimated asymptotic standard error. The test statistic is compared to the appropriate value from the standard normal table. The Wald test will be based on

$$W = [(\hat\theta - \theta_0) - 0]\{\text{Asy.Var}[(\hat\theta - \theta_0) - 0]\}^{-1}[(\hat\theta - \theta_0) - 0] = \frac{(\hat\theta - \theta_0)^2}{\text{Asy.Var}[\hat\theta]} = z^2. \tag{17-26}$$

Here W has a chi-squared distribution with one degree of freedom, which is the distribution of the square of the standard normal test statistic in (17-25).

To summarize, the Wald test is based on measuring the extent to which the unrestricted estimates fail to satisfy the hypothesized restrictions. There are two shortcomings of the Wald test. First, it is a pure significance test against the null hypothesis, not necessarily for a specific alternative hypothesis. As such, its power may be limited in some settings. In fact, the test statistic tends to be rather large in applications. The second shortcoming is not shared by either of the other test statistics discussed here. The Wald statistic is not invariant to the formulation of the restrictions. For example, for a test of the hypothesis that a function θ = β/(1 − γ) equals a specific value q, there are two approaches one might choose. A Wald test based directly on θ − q = 0 would use a statistic based on the variance of this nonlinear function. An alternative approach would be to analyze the linear restriction β − q(1 − γ) = 0, which is an equivalent, but linear, restriction. The Wald statistics for these two tests could be different and might lead to different inferences.
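A minimal sketch of the Wald computation for linear restrictions follows. The estimates, covariance matrix, and restrictions are hypothetical; the statistic is the quadratic form in (17-24) referred to the chi-squared table.

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical unrestricted estimates and estimated asymptotic covariance matrix.
theta_hat = np.array([1.20, 0.45, -0.30])
V_hat = np.array([[0.040, 0.002, 0.001],
                  [0.002, 0.010, 0.000],
                  [0.001, 0.000, 0.025]])

# Linear restrictions R theta = q: theta_2 = 0.5 and theta_3 = 0.
R = np.array([[0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
q = np.array([0.5, 0.0])

d = R @ theta_hat - q
W = d @ np.linalg.inv(R @ V_hat @ R.T) @ d     # Wald statistic, (17-24)
df = R.shape[0]                                # number of rows in R
p_value = chi2.sf(W, df)                       # refer W to the chi-squared table
print(W, df, p_value)
```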
These two shortcomings have been widely viewed as compelling arguments against use of the Wald test. But, in its favor, the Wald test does not rely on a strong distributional assumption, as do the likelihood ratio and Lagrange multiplier tests. The recent econometrics literature is replete with applications that are based on distribution free estimation procedures, such as the GMM method. As such, in recent years, the Wald test has enjoyed a redemption of sorts.

17.5.3 THE LAGRANGE MULTIPLIER TEST

The third test procedure is the Lagrange multiplier (LM) or efficient score (or just score) test. It is based on the restricted model instead of the unrestricted model. Suppose that we maximize the log-likelihood subject to the set of constraints c(θ) − q = 0. Let λ be a vector of Lagrange multipliers and define the Lagrangean function

$$\ln L^*(\boldsymbol\theta) = \ln L(\boldsymbol\theta) + \boldsymbol\lambda'[\mathbf{c}(\boldsymbol\theta) - \mathbf{q}].$$

The solution to the constrained maximization problem is the root of

$$\frac{\partial \ln L^*}{\partial\boldsymbol\theta} = \frac{\partial \ln L(\boldsymbol\theta)}{\partial\boldsymbol\theta} + \mathbf{C}'\boldsymbol\lambda = \mathbf{0}, \qquad
\frac{\partial \ln L^*}{\partial\boldsymbol\lambda} = \mathbf{c}(\boldsymbol\theta) - \mathbf{q} = \mathbf{0}, \tag{17-27}$$

where C' is the transpose of the derivatives matrix in the second line of (17-23). If the restrictions are valid, then imposing them will not lead to a significant difference in the maximized value of the likelihood function. In the first-order conditions, the meaning is that the second term in the derivative vector will be small. In particular, λ will be small. We could test this directly, that is, test H₀: λ = 0, which leads to the Lagrange multiplier test. There is an equivalent simpler formulation, however. At the restricted maximum, the derivatives of the log-likelihood function are

$$\frac{\partial \ln L(\hat{\boldsymbol\theta}_R)}{\partial\hat{\boldsymbol\theta}_R} = -\hat{\mathbf{C}}'\hat{\boldsymbol\lambda} = \hat{\mathbf{g}}_R. \tag{17-28}$$

If the restrictions are valid, at least within the range of sampling variability, then ĝ_R = 0. That is, the derivatives of the log-likelihood evaluated at the restricted parameter vector will be approximately zero. The vector of first derivatives of the log-likelihood is the vector of efficient scores. Since the test is based on this vector, it is called the score test as well as the Lagrange multiplier test. The variance of the first derivative vector is the information matrix, which we have used to compute the asymptotic covariance matrix of the MLE. The test statistic is based on reasoning analogous to that underlying the Wald test statistic.

THEOREM 17.7 Limiting Distribution of the Lagrange Multiplier Statistic
The Lagrange multiplier test statistic is

$$LM = \left(\frac{\partial \ln L(\hat{\boldsymbol\theta}_R)}{\partial\hat{\boldsymbol\theta}_R}\right)'[\mathbf{I}(\hat{\boldsymbol\theta}_R)]^{-1}\left(\frac{\partial \ln L(\hat{\boldsymbol\theta}_R)}{\partial\hat{\boldsymbol\theta}_R}\right).$$

Under the null hypothesis, LM has a limiting chi-squared distribution with degrees of freedom equal to the number of restrictions. All terms are computed at the restricted estimator.

* Lagrange Multiplier Test: The Lagrange multiplier test is based on the restricted estimators. The estimated asymptotic covariance matrix of the derivatives used to compute the statistic can be any of the three estimators discussed earlier. The BHHH estimator, V_B, is the empirical estimator of the variance of the gradient and is the one usually used in practice. This computation produces

$$LM = \begin{bmatrix}0.0000 & 7.9162\end{bmatrix}\begin{bmatrix}0.0099438 & 0.26762\\ 0.26762 & 11.197\end{bmatrix}^{-1}\begin{bmatrix}0.0000\\ 7.9162\end{bmatrix} = 15.687.$$

The conclusion is the same as before. Note that the same computation done using V rather than V_B produces a value of 5.1182. As before, we observe substantial small sample variation produced by the different estimators.
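The BHHH form of the LM statistic in Theorem 17.7 can be sketched in a few lines. The example below is not the one computed above; it tests H₀: μ = 0 in the normal model of Example 17.2 with artificial data, using the outer product of the per-observation scores as the estimate of the information matrix.

```python
import numpy as np
from scipy.stats import chi2

# Illustrative LM (score) test of H0: mu = 0 in a normal model, BHHH form; artificial data.
rng = np.random.default_rng(2)
y = rng.normal(0.3, 1.0, size=100)

# Restricted MLEs under H0: mu = 0; sigma^2 is estimated freely.
mu_r = 0.0
sig2_r = np.mean(y ** 2)

# Per-observation scores evaluated at the restricted estimates (rows of G are g_i').
g_mu = (y - mu_r) / sig2_r
g_s2 = -0.5 / sig2_r + 0.5 * (y - mu_r) ** 2 / sig2_r ** 2
G = np.column_stack([g_mu, g_s2])

g_bar = G.sum(axis=0)                        # gradient of ln L at the restricted estimates
LM = g_bar @ np.linalg.inv(G.T @ G) @ g_bar  # BHHH form of the LM statistic
print(LM, chi2.sf(LM, df=1))                 # one restriction
```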
The latter three test statistics have substantially different values. It is possible to reach different conclusions, depending on which one is used. For example, if the test had been carried out at the 1 percent level of significance instead of 5 percent and LM had been computed using V, then the critical value from the chi-squared table would have been 6.635 and the hypothesis would not have been rejected by the LM test. Asymptotically, all three tests are equivalent. But, in a finite sample such as this one, differences are to be expected.[10] Unfortunately, there is no clear rule for how to proceed in such a case, which highlights the problem of relying on a particular significance level and drawing a firm reject or accept conclusion based on sample evidence.

[10] For further discussion of this problem, see Berndt and Savin (1977).

17.6 APPLICATIONS OF MAXIMUM LIKELIHOOD ESTIMATION

We now examine three applications of the maximum likelihood estimator. The first extends the results of Chapters 2 through 5 to the linear regression model with normally distributed disturbances. In the second application, we fit a nonlinear regression model by maximum likelihood. This application illustrates the effect of transformation of the dependent variable. The third application is a relatively straightforward use of the maximum likelihood technique in a nonlinear model that does not involve the normal distribution. This application illustrates the sorts of extensions of the MLE into settings that depart from the linear model of the preceding chapters and that are typical in econometric analysis.

17.6.1 THE NORMAL LINEAR REGRESSION MODEL

The linear regression model is

$$y_i = \mathbf{x}_i'\boldsymbol\beta + \varepsilon_i.$$

The likelihood function for a sample of n independent, identically and normally distributed disturbances is

$$L = (2\pi\sigma^2)^{-n/2}\,e^{-\boldsymbol\varepsilon'\boldsymbol\varepsilon/(2\sigma^2)}. \tag{17-32}$$

The transformation from ε_i to y_i is ε_i = y_i − x_i'β, so the Jacobian for each observation, |∂ε_i/∂y_i|, is one.[11] Making the transformation, we find that the likelihood function for the n observations on the observed random variable is

$$L = (2\pi\sigma^2)^{-n/2}\,e^{(-1/(2\sigma^2))(\mathbf{y} - \mathbf{X}\boldsymbol\beta)'(\mathbf{y} - \mathbf{X}\boldsymbol\beta)}. \tag{17-33}$$

To maximize this function with respect to β, it will be necessary to maximize the exponent or minimize the familiar sum of squares. Taking logs, we obtain the log-likelihood function for the classical regression model:

$$\ln L = -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\sigma^2 - \frac{(\mathbf{y} - \mathbf{X}\boldsymbol\beta)'(\mathbf{y} - \mathbf{X}\boldsymbol\beta)}{2\sigma^2}. \tag{17-34}$$

The necessary conditions for maximizing this log-likelihood are

$$\begin{bmatrix} \dfrac{\partial \ln L}{\partial\boldsymbol\beta} \\[1.5ex] \dfrac{\partial \ln L}{\partial\sigma^2} \end{bmatrix}
= \begin{bmatrix} \dfrac{\mathbf{X}'(\mathbf{y} - \mathbf{X}\boldsymbol\beta)}{\sigma^2} \\[1.5ex] -\dfrac{n}{2\sigma^2} + \dfrac{(\mathbf{y} - \mathbf{X}\boldsymbol\beta)'(\mathbf{y} - \mathbf{X}\boldsymbol\beta)}{2\sigma^4} \end{bmatrix}
= \begin{bmatrix}\mathbf{0} \\ 0\end{bmatrix}. \tag{17-35}$$

The values that satisfy these equations are

$$\hat{\boldsymbol\beta}_{ML} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \mathbf{b} \quad\text{and}\quad \hat\sigma^2_{ML} = \frac{\mathbf{e}'\mathbf{e}}{n}. \tag{17-36}$$

The slope estimator is the familiar one, whereas the variance estimator differs from the least squares value by the divisor of n instead of n − K.[12]

The Cramér-Rao bound for the variance of an unbiased estimator is the negative inverse of the expectation of

$$\begin{bmatrix} \dfrac{\partial^2 \ln L}{\partial\boldsymbol\beta\,\partial\boldsymbol\beta'} & \dfrac{\partial^2 \ln L}{\partial\boldsymbol\beta\,\partial\sigma^2} \\[1.5ex] \dfrac{\partial^2 \ln L}{\partial\sigma^2\,\partial\boldsymbol\beta'} & \dfrac{\partial^2 \ln L}{\partial(\sigma^2)^2} \end{bmatrix}
= \begin{bmatrix} -\dfrac{\mathbf{X}'\mathbf{X}}{\sigma^2} & -\dfrac{\mathbf{X}'\boldsymbol\varepsilon}{\sigma^4} \\[1.5ex] -\dfrac{\boldsymbol\varepsilon'\mathbf{X}}{\sigma^4} & \dfrac{n}{2\sigma^4} - \dfrac{\boldsymbol\varepsilon'\boldsymbol\varepsilon}{\sigma^6} \end{bmatrix}. \tag{17-37}$$

In taking expected values, the off-diagonal term vanishes, leaving

$$[\mathbf{I}(\boldsymbol\beta,\sigma^2)]^{-1} = \begin{bmatrix} \sigma^2(\mathbf{X}'\mathbf{X})^{-1} & \mathbf{0} \\ \mathbf{0}' & 2\sigma^4/n \end{bmatrix}. \tag{17-38}$$

The least squares slope estimator is the maximum likelihood estimator for this model. Therefore, it inherits all the desirable asymptotic properties of maximum likelihood estimators. We showed earlier that s² = e'e/(n − K) is an unbiased estimator of σ². Therefore, the maximum likelihood estimator is biased toward zero:

$$E[\hat\sigma^2_{ML}] = \frac{n-K}{n}\sigma^2 = \left(1 - \frac{K}{n}\right)\sigma^2 < \sigma^2. \tag{17-39}$$

[11] See (B-41) in Section B.5. The analysis to follow is conditioned on X. To avoid cluttering the notation, we will leave this aspect of the model implicit in the results. As noted earlier, we assume that the data generating process for X does not involve β or σ² and that the data are well behaved as discussed in Chapter 5.
[12] As a general rule, maximum likelihood estimators do not make corrections for degrees of freedom.
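A short sketch of (17-36) and (17-38) with artificial data follows; it shows that the ML slope vector is the least squares b and that the ML variance estimator uses the divisor n rather than n − K.

```python
import numpy as np

# Artificial data for the classical normal regression model (illustrative only).
rng = np.random.default_rng(3)
n, K = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, K - 1))])
beta_true = np.array([1.0, 0.5, -2.0])
y = X @ beta_true + rng.normal(scale=1.5, size=n)

# Maximum likelihood estimates from (17-36): b is the OLS slope vector.
b = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ b
sig2_ml = e @ e / n                 # ML estimator, divisor n (biased downward)
s2 = e @ e / (n - K)                # unbiased least squares estimator

# Estimated asymptotic covariance of b from (17-38): sigma^2 (X'X)^{-1}.
avar_b = sig2_ml * np.linalg.inv(X.T @ X)
print(b, sig2_ml, s2)
print(np.sqrt(np.diag(avar_b)))     # asymptotic standard errors
```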
Despite its small-sample bias, the maximum likelihood estimator of σ² has the same desirable asymptotic properties. We see in (17-39) that s² and σ̂² differ only by a factor −K/n, which vanishes in large samples. It is instructive to formalize the asymptotic equivalence of the two. From (17-38), we know that √n(σ̂²_ML − σ²) →d N[0, 2σ⁴]. It follows that

$$z_n = \sqrt n(s^2 - \sigma^2) = \sqrt n(\hat\sigma^2_{ML} - \sigma^2) + \frac{K}{\sqrt n}\,\frac{n}{n-K}\,\hat\sigma^2_{ML}.$$

But K/√n and K/n vanish as n → ∞, so the limiting distribution of z_n is also N[0, 2σ⁴]. Since z_n = √n(s² − σ²), we have shown that the asymptotic distribution of s² is the same as that of the maximum likelihood estimator.

The standard test statistic for assessing the validity of a set of linear restrictions in the linear model, Rβ − q = 0, is the F ratio,

$$F[J, n-K] = \frac{(\mathbf{e}_*'\mathbf{e}_* - \mathbf{e}'\mathbf{e})/J}{\mathbf{e}'\mathbf{e}/(n-K)} = \frac{(\mathbf{R}\mathbf{b} - \mathbf{q})'[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\mathbf{b} - \mathbf{q})/J}{\mathbf{e}'\mathbf{e}/(n-K)}.$$

With normally distributed disturbances, the F test is valid in any sample size. There remains a problem with nonlinear restrictions of the form c(β) = 0, since the counterpart to F, which we will examine here, has validity only asymptotically even with normally distributed disturbances. In this section, we will reconsider the Wald statistic and examine two related statistics, the likelihood ratio statistic and the Lagrange multiplier statistic. These statistics are both based on the likelihood function and, like the Wald statistic, are generally valid only asymptotically.

No simplicity is gained by restricting ourselves to linear restrictions at this point, so we will consider general hypotheses of the form

H₀: c(β) = 0, H₁: c(β) ≠ 0.

The Wald statistic for testing this hypothesis and its limiting distribution under H₀ would be

$$W = \mathbf{c}(\mathbf{b})'\{\mathbf{C}(\mathbf{b})[\hat\sigma^2(\mathbf{X}'\mathbf{X})^{-1}]\mathbf{C}(\mathbf{b})'\}^{-1}\mathbf{c}(\mathbf{b}) \xrightarrow{d} \chi^2[J], \tag{17-40}$$

where

$$\mathbf{C}(\mathbf{b}) = [\partial \mathbf{c}(\mathbf{b})/\partial\mathbf{b}']. \tag{17-41}$$

The likelihood ratio (LR) test is carried out by comparing the values of the log-likelihood function with and without the restrictions imposed. We leave aside for the present how the restricted estimator b_r is computed (except for the linear model, which we saw earlier). The test statistic and its limiting distribution under H₀ are

$$LR = -2[\ln L_* - \ln L] \xrightarrow{d} \chi^2[J]. \tag{17-42}$$

The log-likelihood for the regression model is given in (17-34). The first-order conditions imply that regardless of how the slopes are computed, the estimator of σ² without ...

... β = 0 and θ → −∞, which produces a sum of squares of zero. "Estimation" becomes a nonissue. For this type of regression model, however, maximum likelihood estimation is consistent, efficient, and generally not appreciably more difficult than least squares. For normally distributed disturbances, the density of y_i is

$$f(y_i) = \left|\frac{\partial\varepsilon_i}{\partial y_i}\right|(2\pi\sigma^2)^{-1/2}\exp\left[-\frac{[g(y_i,\theta) - h(\mathbf{x}_i,\boldsymbol\beta)]^2}{2\sigma^2}\right].$$

After collecting terms, the log-likelihood function will be

$$\ln L = \sum_{i=1}^n\left[-\frac12\bigl(\ln 2\pi + \ln\sigma^2\bigr) + \ln J_i(y_i,\theta)\right] - \frac12\sum_{i=1}^n\frac{[g(y_i,\theta) - h(\mathbf{x}_i,\boldsymbol\beta)]^2}{\sigma^2}, \tag{17-48}$$

where J_i(y_i, θ) = |∂ε_i/∂y_i| is the Jacobian of the transformation. In many cases, including the applications considered here, there is an inconsistency in the model in that the transformation of the dependent variable may rule out some values. Hence, the assumed normality of the disturbances cannot be strictly correct. In the generalized production function, there is a singularity at y_i = 0 where the Jacobian becomes infinite.
Some research has been done on specific modifications of the model to accommodate the restriction [e.g., Poirier (1978) and Poirier and Melino (1978)], but in practice, the typical application involves data for which the constraint is inconsequential.

But for the Jacobians, nonlinear least squares would be maximum likelihood. If the Jacobian terms involve θ, however, then least squares is not maximum likelihood. As regards σ², this likelihood function is essentially the same as that for the simpler nonlinear regression model. The maximum likelihood estimator of σ² will be

$$\hat\sigma^2 = \frac{1}{n}\sum_{i=1}^n\,[g(y_i,\hat\theta) - h(\mathbf{x}_i,\hat{\boldsymbol\beta})]^2 = \frac{1}{n}\sum_{i=1}^n e_i^2. \tag{17-49}$$

The likelihood equations for the unknown parameters are

$$\begin{bmatrix} \dfrac{\partial \ln L}{\partial\boldsymbol\beta} \\[1.5ex] \dfrac{\partial \ln L}{\partial\theta} \\[1.5ex] \dfrac{\partial \ln L}{\partial\sigma^2}\end{bmatrix}
= \begin{bmatrix} \dfrac{1}{\sigma^2}\displaystyle\sum_{i=1}^n \dfrac{\partial h(\mathbf{x}_i,\boldsymbol\beta)}{\partial\boldsymbol\beta}\,e_i \\[1.5ex]
\displaystyle\sum_{i=1}^n \left(\dfrac{1}{J_i}\dfrac{\partial J_i}{\partial\theta}\right) - \dfrac{1}{\sigma^2}\displaystyle\sum_{i=1}^n e_i\,\dfrac{\partial g(y_i,\theta)}{\partial\theta} \\[1.5ex]
\dfrac{-n}{2\sigma^2} + \dfrac{1}{2\sigma^4}\displaystyle\sum_{i=1}^n e_i^2 \end{bmatrix}
= \begin{bmatrix}\mathbf{0}\\ 0\\ 0\end{bmatrix}. \tag{17-50}$$

These equations will usually be nonlinear, so a solution must be obtained iteratively. One special case that is common is a model in which θ is a single parameter. Given a particular value of θ, we would maximize ln L with respect to β by using nonlinear least squares. [It would be simpler yet if, in addition, h(x_i, β) were linear so that we could use linear least squares. See the following application.] Therefore, a way to maximize L for all the parameters is to scan over values of θ for the one that, with the associated least squares estimates of β and σ², gives the highest value of ln L. (Of course, this requires that we know roughly what values of θ to examine.)

If θ is a vector of parameters, then direct maximization of L with respect to the full set of parameters may be preferable. (Methods of maximization are discussed in Appendix E.) There is an additional simplification that may be useful. Whatever values are ultimately obtained for the estimates of θ and β, the estimate of σ² will be given by (17-49). If we insert this solution in (17-48), then we obtain the concentrated log-likelihood,

$$\ln L_c = \sum_{i=1}^n \ln J_i(y_i,\theta) - \frac{n}{2}\bigl[1 + \ln(2\pi) + \ln\hat\sigma^2(\boldsymbol\beta,\theta)\bigr]. \tag{17-51}$$

This equation is a function only of θ and β. We can maximize it with respect to θ and β and obtain the estimate of σ² as a by-product. (See Section E.6.3 for details.)

An estimate of the asymptotic covariance matrix of the maximum likelihood estimators can be obtained by inverting the estimated information matrix. It is quite likely, however, that the Berndt et al. (1974) estimator will be much easier to compute. The log of the density for the ith observation is the ith term in (17-48). The derivatives of ln L_i with respect to the unknown parameters are

$$\mathbf{g}_i = \begin{bmatrix} \partial\ln L_i/\partial\boldsymbol\beta \\ \partial\ln L_i/\partial\theta \\ \partial\ln L_i/\partial\sigma^2 \end{bmatrix}
= \begin{bmatrix} (e_i/\sigma^2)[\partial h(\mathbf{x}_i,\boldsymbol\beta)/\partial\boldsymbol\beta] \\ (1/J_i)[\partial J_i/\partial\theta] - (e_i/\sigma^2)[\partial g(y_i,\theta)/\partial\theta] \\ \dfrac{1}{2\sigma^2}\left[\dfrac{e_i^2}{\sigma^2} - 1\right] \end{bmatrix}. \tag{17-52}$$

The asymptotic covariance matrix for the maximum likelihood estimators is estimated using

$$\text{Est.Asy.Var}[\text{MLE}] = \left[\sum_{i=1}^n \hat{\mathbf{g}}_i\hat{\mathbf{g}}_i'\right]^{-1} = [\hat{\mathbf{G}}'\hat{\mathbf{G}}]^{-1}. \tag{17-53}$$

Note that the preceding includes a row and a column for σ² in the covariance matrix. In a model that transforms y as well as x, the Hessian of the log-likelihood is generally not block diagonal with respect to β and σ². When y is transformed, the maximum likelihood estimators of β and σ² are positively correlated, because both parameters reflect the scaling of the dependent variable in the model. This result may seem counterintuitive. Consider the difference in the variance estimators that arises when a linear and a loglinear model are estimated. The variance of ln y around its mean is obviously different from that of y around its mean. By contrast, consider what happens when only the independent variables are transformed, for example, by the Box-Cox transformation. The slope estimators vary accordingly, but in such a way that the variance of y around its conditional mean will stay constant.[16]

[16] See Seaks and Layson (1983).
consider what happens when only the independent variables are transformed, for example. by the Box-Cox transformation. The slope estimators vary accordingly, but in.such a way that the variance of y around its conditional mean will stay constant. !é Example 17.5 A Generalized Production Function The Cobb-Douglas function has often been used to study production and cost. Among the assumptions of this model is that the average cost of production increases or decreases monotonically with increases in output. This assumption is in direct contrast to the standard textbook treatment of a U-shaped average cost curve as well as to a large amount of empirical evidence. (See Example 7.3 for a well-known application.) To relax this assumption, Zellner logce Seaks and Layson (1983). CHAPTER 17 + Maximum Likelihood Estimation 499 TABLET: alized Production; — Maximum ikelihood Estimate SE() SEC) Nonlinear Least Squares Br 2.914822 0.44912 0.12534 2.1008925 Bz 0.350068 0.10019 0094354 0.257900 Bs 1.092275 O.16070 0.11498 - 01.878388 8 0106666 0.078702 — 0031634 o? 0,0427427 0,0151167 e'e 1.068567 0.7655490 InZ —8.939094 —13.621256 and Revankar (1970) proposed a generalization of the Cobb-Douglas production function.” Their model allows economies of scale to vary with output and to increase and then decrease as output rises: Iny+ey=Iny+a(1-9) INK +eôlnt +e. Note that the right-hand side of their model is intrinsically linear according to the results of Section 7.3.3. The model as a whole, however, is intrinsically nonlinear due to the parametric transformation of y appearing on the left. For Zeliner and Revankar's production function, the Jacobian of the transformation from estoy is de;/2y = (8 + 1/y;). Some simplification is achieved by writing this as (1 +9y)/y;. The log-likelihood is then n a " n n 1 2a +6y) Dm — a in(2x) - ano? 32 Do — Int where 4; = (Iny; + 6% — 84 — Bo In capital, — fa in labor;). Estimation of this model is straight- forward. For a given value of 2, 8 and 9? are estimated by linear least squares. Therefore, to estimate the full set of parameters, we could scan over the range of zero to one for 6. The value of 6 that, with its associated least squares estimates of £ and o?, maximizes the log-likelihood function provides the maximum likelihood estimate. This procedure was used by Zeilner and Revankar. The results given in Table 17.2 were obtained by maximizing the log-likelihood function directly, instead. The statewide data on output, capital, labor, and number of establishments in the transportation industry used in Zellner and Revankar's study are given in Appendix Table F9.2 and Example 16.6. For this application, y = value added per firm, X = capital per firm, and £ = labor per firm. Maximum likelihood and nonlinear least squares estimates are shown in Table 17.2, The asymptotic standard errors for the maximum likelihood estimates are labeled SE(1). These are computed using the BHHH form of the asymptotic covariance matrix. The second set, SE(2), are computed treating the estimate of à as fixed; they are the usual linear least squares results using (In y +9y) as the dependent variable in a linear regression. Clearly, these results would be very misleading. The final column of Table 10.2 lists the simple nonlinear least squares estimates. No standard errors are given, because there is no appropriate formula for computing the asymptotic covariance matrix. 
... production; any nonzero disturbance must be interpreted as the result of inefficiency. A strictly orthodox interpretation embedded in a Cobb-Douglas production model might produce an empirical frontier production model such as

$$\ln y = \beta_1 + \textstyle\sum_k \beta_k \ln x_k - u,\qquad u \ge 0.$$

The gamma model described in Example 5.1 was an application. One-sided disturbances such as this one present a particularly difficult estimation problem. The primary theoretical problem is that any measurement error in ln y must be embedded in the disturbance. The practical problem is that the entire estimated function becomes a slave to any single errantly measured data point.

Aigner, Lovell, and Schmidt proposed instead a formulation within which observed deviations from the production function could arise from two sources: (1) productive inefficiency, as we have defined it above, that would necessarily be negative; and (2) idiosyncratic effects that are specific to the firm and that could enter the model with either sign. The end result was what they labeled the "stochastic frontier":

$$\ln y = \beta_1 + \textstyle\sum_k \beta_k \ln x_k - u + v,\qquad u \ge 0,\ v \sim N[0, \sigma_v^2]$$
$$\phantom{\ln y} = \beta_1 + \textstyle\sum_k \beta_k \ln x_k + \varepsilon.$$

The frontier for any particular firm is h(x, β) + v, hence the name stochastic frontier. The inefficiency term is u, a random variable of particular interest in this setting. Since the data are in log terms, u is a measure of the percentage by which the particular observation fails to achieve the frontier, ideal production rate.

To complete the specification, they suggested two possible distributions for the inefficiency term, the absolute value of a normally distributed variable and an exponentially distributed variable. The density functions for these two compound distributions are given by Aigner, Lovell, and Schmidt; let ε = v − u, λ = σ_u/σ_v, σ = (σ_u² + σ_v²)^{1/2}, and Φ(z) = the probability to the left of z in the standard normal distribution [see Sections B.4.1 and E.5.6]. For the "half-normal" model,

$$\ln h(\varepsilon_i \mid \boldsymbol\beta, \lambda, \sigma) = -\ln\sigma + \frac12\ln\frac{2}{\pi} - \frac12\left(\frac{\varepsilon_i}{\sigma}\right)^2 + \ln\Phi\left(\frac{-\varepsilon_i\lambda}{\sigma}\right),$$

whereas for the exponential model,

$$\ln h(\varepsilon_i \mid \boldsymbol\beta, \theta, \sigma_v) = \ln\theta + \frac12\theta^2\sigma_v^2 + \theta\varepsilon_i + \ln\Phi\left(\frac{-\varepsilon_i}{\sigma_v} - \theta\sigma_v\right).$$

Both these distributions are asymmetric. We thus have a regression model with a nonnormal distribution specified for the disturbance. The disturbance, ε, has a nonzero mean as well; E[ε] = −σ_u(2/π)^{1/2} for the half-normal model and −1/θ for the exponential model. Figure 17.3 illustrates the density for the half-normal model with σ = 1 and λ = 2. By writing β₀ = β₁ + E[ε] and ε* = ε − E[ε], we obtain a more conventional formulation

$$\ln y = \beta_0 + \textstyle\sum_k \beta_k \ln x_k + \varepsilon^*,$$

which does have a disturbance with a zero mean but an asymmetric, nonnormal distribution. The asymmetry of the distribution of ε* does not negate our basic results for least squares in this classical regression model. This model satisfies the assumptions of the Gauss-Markov theorem, so least squares is unbiased and consistent (save for the constant term), and efficient among linear unbiased estimators.
This model satisfies the assumptions of the CHAPTER 17 + Maximum Likelihood Estimation 503 Probability Density for the Stochastic Fronticr To «56 42 Density 38 per «00, —40 29 Disturbande: Gauss-Markov theorem, so least squares is unbiased and consistent (save for the con- stant term), and efficient among linear unbiased estimators. In this model, however, the maximum likelihood estimator is not linear. and it is more efficient than least squares. We will work through maximum likelihood estimation of the half-normal model in detail to illustrate the technique, The log likelihood is —E;à o ) . laL=-nlno - Zn? Is a “3 > =-nlho- >In>— > — n 2 qm 2 o 1 This is not a particularly difficult log-likelihood to maximize numerically. Nonetheless, it is instructive to makc use of a convenience that wc notcd earlier. Recall that maximum likelihood estimators are invariant to one-to-one transformation. If we let 8 = 1/c and y = (1/0), the log-likelihood function becomes n 2 1E n nL=nlno-5Ino 5 Ly —y'x) + Lm D[-1(0y — px). As you could verify by trying the derivations, this transformation brings a dramatic simplification in the manipulation of the log-likelihood and its derivativos. We will make repeated use of the Iunctions q =e/0 =0y—p'x. ól-da] bia] A; = —ô(—ho + dj) 804,8, 4,0,9) = j 504 CHAPTER 17 + Maximum Likelihood Estimation (The second of these is the derivative of the function in the final term in Jog £. The third is the derivative of 3; with respect to its argument; A; < O for all values ol Aa;.) It will also be convenient to define the (K + 1) x 1 columns vectors z; = (x;, —y;) and t; = (07, 1/9). The likelihood equations are +SDen+AS 8m =, i=1 il 9inZ A Em ==288=0 n alnb o ” aprox i=l and the second derivatives are fia nar (G6-imApe] |tit O no [o ed oia; v olf' iz The estimator of the asymptotic covariance matrix lor tac directly estimated parameters is Est.Asy. Var[9',8,] = (-HIp,0.)5”. Therc are two sets of transformations of the parameters in our formulation. In order to recover estimates of the original structural parameters o = 1/6 and 8 = y/8 we need only transtorm the MILES. Sincc these transformations are one to one, the MLES of o and £ arc 1/ô and P /2. To compute an asymptotic covariance matrix for these estimators we will use the delta method, which will use the derivalive matrix aB/op' aB/oô aB/oi G/ôM —1/09p O G=|96/9)' 96/00 aejaa|=| 0 -(1/8º) O vÃop' 0h/0B aÃ/oà Y 0 1 Then, for the recovered parameters, we Est. Asy. VarlÊ',6.1]' = 6 x [-H[p, 8, 3] x 6º. For the half-normal model, we would also rcly on the invariance of maximum likelihood estimators to recover estimates of the deeper variance parameters, o? = o2/(1 +22) ando) = c2x2/1 +22). The stochastic frontier model is a bit different from those we have analyzed previ- ously in that the disturbance is the central focus of the analysis rather than the catchall for the unknown and unknowablc factors omitted from the equation. Ideally, we would like to estimate «; for each firm in (he sample to compare them on the basis of Lheir pro- ductive efficiency. (The parameters of Lhe production function are usually of secondary interest in these studies.) Unfortunately, the data do not permit a direct estimate, since with estimates of $ in hand, we are only able to compute a direct estimate ofe = y —x'B. Jondrow et al. (1982), however, have derived a useful approximation that is now the standard mcasure in these settings, Eluld= 12a| ci] ex Io “| To! 
... for some covariance matrix Σ that we have yet to estimate, it follows that the Wald statistic,

$$C = n\,\bar{\mathbf{r}}'\,\hat{\boldsymbol\Sigma}^{-1}\,\bar{\mathbf{r}}, \tag{17-59}$$

has a limiting chi-squared distribution, where the degrees of freedom J is the number of moment restrictions being tested and Σ̂ is an estimate of Σ. Thus, the statistic can be referred to the chi-squared table.

It remains to determine the estimator of Σ. The full derivation of Σ is fairly complicated. [See Pagan and Vella (1989, pp. S32-S33).] But when the vector of parameter estimators is a maximum likelihood estimator, as it would be for the least squares estimator with normally distributed disturbances and for most of the other estimators we consider, a surprisingly simple estimator can be used. Suppose that the parameter vector used to compute the moments above is obtained by solving the equations

$$\frac{1}{n}\sum_{i=1}^n \mathbf{g}(y_i, \mathbf{x}_i, \hat{\boldsymbol\theta}) = \frac{1}{n}\sum_{i=1}^n \hat{\mathbf{g}}_i = \mathbf{0},$$

where θ̂ is the estimated parameter vector [e.g., (b, s²) in the linear model]. For the linear regression model, that would be the normal equations

$$\frac{1}{n}\sum_{i=1}^n \hat{\mathbf{g}}_i = \frac{1}{n}\sum_{i=1}^n \mathbf{x}_i(y_i - \mathbf{x}_i'\mathbf{b}) = \mathbf{0}.$$

Let the matrix G be the n × K matrix with ith row equal to ĝ_i'. In a maximum likelihood problem, G is the matrix of derivatives of the individual terms in the log-likelihood function with respect to the parameters. This is the G used to compute the BHHH estimator of the information matrix. [See (17-18).] Let R be the n × J matrix whose ith row is r_i'. Pagan and Vella show that for maximum likelihood estimators, Σ can be estimated using

$$\mathbf{S} = \frac{1}{n}\left[\mathbf{R}'\mathbf{R} - \mathbf{R}'\mathbf{G}(\mathbf{G}'\mathbf{G})^{-1}\mathbf{G}'\mathbf{R}\right]. \tag{17-60}$$

This equation looks like an involved matrix computation, but it is not. Each element of S is the mean square or cross-product of the least squares residuals in a linear regression of a column of R on the variables in G. Therefore, the operational version of the statistic is

$$C = n\,\bar{\mathbf{r}}'\,\mathbf{S}^{-1}\,\bar{\mathbf{r}} = \mathbf{i}'\mathbf{R}\left[\mathbf{R}'\mathbf{R} - \mathbf{R}'\mathbf{G}(\mathbf{G}'\mathbf{G})^{-1}\mathbf{G}'\mathbf{R}\right]^{-1}\mathbf{R}'\mathbf{i}, \tag{17-62}$$

where i is an n × 1 column of ones, which, once again, is referred to the appropriate critical value in the chi-squared table. This result provides a joint test that all the moment conditions are satisfied simultaneously. An individual test of just one of the moment restrictions in isolation can be computed even more easily than a joint test. For testing one of the J conditions, say the ℓth one, the test can be carried out by a simple t test of whether the constant term is zero in a linear regression of the ℓth column of R on a constant term and all the columns of G. In fact, the test statistic in (17-62) could also be obtained by stacking the J columns of R and treating the J equations as a seemingly unrelated regressions model with (i, G) as the (identical) regressors in each equation and then testing the joint hypothesis that all the constant terms are zero. (See Section 14.2.3.)

[19] It might be tempting just to use (1/n)R'R. This idea would be incorrect, because S accounts for R being a function of the estimated parameter vector that is converging to its probability limit at the same rate as the sample moments are converging to theirs.
[20] If the estimator is not an MLE, then estimation of Σ is more involved but also straightforward using basic matrix algebra. The advantage of (17-62) is that it involves simple sums of variables that have already been computed to obtain θ̂ and r̄. Note, as well, that if θ̂ has been estimated by maximum likelihood, then the term (G'G)⁻¹ is the BHHH estimator of the asymptotic covariance matrix of θ̂. If it were more convenient, then this estimator could be replaced with any other appropriate estimator of Asy.Var[θ̂].
Example 17.8  Testing for Heteroscedasticity in the Linear Regression Model
Suppose that the linear model is specified as

    y_i = β₁ + β₂x_i + β₃z_i + ε_i.

To test whether

    E[z_i²(ε_i² − σ²)] = 0,

we linearly regress z_i²(e_i² − s²) on a constant, e_i, x_i e_i, and z_i e_i. A standard t test of whether the constant term in this regression is zero carries out the test. To test the joint hypothesis that there is no heteroscedasticity with respect to both x and z, we would regress both x_i²(e_i² − s²) and z_i²(e_i² − s²) on [1, e_i, x_i e_i, z_i e_i] and collect the two columns of residuals in V. Then S = (1/n)V'V. The moment vector would be

    m̄ = (1/n) Σ_{i=1}^n [x_i², z_i²]'(e_i² − s²).

The test statistic would now be

    C = n m̄'S⁻¹m̄ = n m̄'[(1/n)V'V]⁻¹m̄.

We will examine other conditional moment tests using this method in Section 22.3.4, where we study the specification of the censored regression model.

17.7  TWO-STEP MAXIMUM LIKELIHOOD ESTIMATION

The applied literature contains a large and increasing number of models in which one model is embedded in another, which produces what are broadly known as "two-step" estimation problems. Consider an (admittedly contrived) example in which we have the following.

Model 1. Expected number of children = E[y₁ | x₁, θ₁].
Model 2. Decision to enroll in job training = y₂, a function of (x₂, θ₂, E[y₁ | x₁, θ₁]).

There are two parameter vectors, θ₁ and θ₂. The first appears in the second model, although not the reverse. In such a situation, there are two ways to proceed. Full information maximum likelihood (FIML) estimation would involve forming the joint distribution f(y₁, y₂ | x₁, x₂, θ₁, θ₂) of the two random variables and then maximizing the full log-likelihood function,

    ln L = Σ_{i=1}^n ln f(y_{i1}, y_{i2} | x_{i1}, x_{i2}, θ₁, θ₂).

A second, or two-step, limited information maximum likelihood (LIML) procedure for this kind of model could be done by estimating the parameters of Model 1, since it does not involve θ₂, and then maximizing a conditional log-likelihood function using the estimates from step 1:

    ln L̂ = Σ_{i=1}^n ln f(y_{i2} | x_{i2}, θ₂, y_{i1}, x_{i1}, θ̂₁).

There are at least two reasons one might proceed in this fashion. First, it may be straightforward to formulate the two separate log-likelihoods, but very complicated to derive the joint distribution. This situation frequently arises when the two variables being modeled are from different kinds of populations, such as one discrete and one continuous (which is a very common case in this framework). The second reason is that maximizing the separate log-likelihoods may be fairly straightforward, but maximizing the joint log-likelihood may be numerically complicated or difficult. We will consider a few examples. Although we will encounter FIML problems at various points later in the book, for now we will present some basic results for two-step estimation. Proofs of the results given here can be found in an important reference on the subject, Murphy and Topel (1985).

Suppose, then, that our model consists of the two marginal distributions, f₁(y₁ | x₁, θ₁) and f₂(y₂ | x₁, x₂, θ₁, θ₂). Estimation proceeds in two steps.

1. Estimate θ₁ by maximum likelihood in Model 1. Let (1/n)V₁ be n times any of the estimators of the asymptotic covariance matrix of this estimator that were discussed in Section 17.4.6.
2. Estimate θ₂ by maximum likelihood in Model 2, with θ̂₁ inserted in place of θ₁ as if it were known. Let (1/n)V₂ be n times any appropriate estimator of the asymptotic covariance matrix of θ̂₂.
The argument for consistency of θ̂₂ is essentially that if θ₁ were known, then all our results for MLEs would apply for estimation of θ₂, and since plim θ̂₁ = θ₁, asymptotically, this line of reasoning is correct. But the same line of reasoning is not sufficient to justify using (1/n)V₂ as the estimator of the asymptotic covariance matrix of θ̂₂. Some correction is necessary to account for an estimate of θ₁ being used in estimation of θ₂. The essential result is the following.

There is a third possible motivation. If either model is misspecified, then the FIML estimates of both models will be inconsistent. But if only the second is misspecified, at least the first will be estimated consistently. Of course, this result is only "half a loaf," but it may be better than none.

Chapter 21.) After a bit of manipulation, we find the convenient result that the derivatives needed for the correction reduce to simple sums over the observations. Once again, any of the three estimators could be used for estimating the asymptotic covariance matrix, but the BHHH estimator is convenient, so we use

    V̂₂ = [ (1/n) Σ_{i=1}^n (∂ln f_{i2}/∂θ̂₂)(∂ln f_{i2}/∂θ̂₂)' ]⁻¹.

For the final step, we must correct the asymptotic covariance matrix using Ĉ and R̂. What remains to derive — the few lines are left for the reader — are the sample counterparts of C and R. So, using our estimates,

    Ĉ = (1/n) Σ_{i=1}^n (∂ln f_{i2}/∂θ̂₂)(∂ln f_{i2}/∂θ̂₁)'   and   R̂ = (1/n) Σ_{i=1}^n (∂ln f_{i2}/∂θ̂₂)(∂ln f_{i1}/∂θ̂₁)'.

We can now compute the correction. In many applications, the covariance of the two gradients, R, converges to zero. When the first- and second-step estimates are based on different samples, R is exactly zero. For example, in our application above, R̂ is built from products of the two step-wise "residuals," û_i and v̂_i, which may well be uncorrelated. This assumption must be checked on a model-by-model basis, but in such an instance, the third and fourth terms in V₂* vanish asymptotically, and what remains is the simpler alternative,

    V₂* = (1/n)[V₂ + V₂ĈV₁Ĉ'V₂].

We will examine some additional applications of this technique (including an empirical implementation of the preceding example) later in the book. Perhaps the most common application of two-step maximum likelihood estimation in the current literature, especially in regression analysis, involves inserting a prediction of one variable into a function that describes the behavior of another.
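A minimal sketch of the corrected covariance matrix in the simpler case just described (R ≈ 0) follows. The inputs — per-observation score matrices from each step and the cross-moment Ĉ — are assumed to have been computed by the user; the function name and interface are illustrative, not from the text.

```python
import numpy as np

def two_step_covariance(G1, G2, C):
    """Simpler two-step correction: V2* = (1/n)[V2 + V2 C V1 C' V2].

    G1 : (n, K1) rows are d ln f1_i / d theta1' at the first-step estimates
    G2 : (n, K2) rows are d ln f2_i / d theta2' at the second-step estimates
    C  : (K2, K1) sample moment (1/n) sum_i (d ln f2_i/d theta2)(d ln f2_i/d theta1)'
    """
    n = G1.shape[0]
    V1 = np.linalg.inv(G1.T @ G1 / n)    # BHHH estimator, step 1
    V2 = np.linalg.inv(G2.T @ G2 / n)    # BHHH estimator, step 2
    return (V2 + V2 @ C @ V1 @ C.T @ V2) / n
```

When the covariance of the two gradients cannot be assumed to vanish, the additional terms involving R̂ must be restored before the matrix is reported.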
17.8  MAXIMUM SIMULATED LIKELIHOOD ESTIMATION

The technique of maximum simulated likelihood (MSL) is essentially a classical sampling theory counterpart to the hierarchical Bayesian estimator we considered in Section 16.2.4. Since the celebrated paper of Berry, Levinsohn, and Pakes (1995), and a related literature advocated by McFadden and Train (2000), maximum simulated likelihood estimation has been used in a large and growing number of studies based on log-likelihoods that involve integrals that are expectations. In this section, we will lay out some general results for MSL estimation by developing a particular application, the random parameters model. This general modeling framework has been used in the majority of the received applications. We will then continue the application to the discrete choice model for panel data that we began in Section 16.2.4.

22A major reference for this set of techniques is Gourieroux and Monfort (1996).

The density of y_it, when the parameter vector is β_i, is f(y_it | x_it, β_i). The parameter vector β_i is randomly distributed over individuals according to

    β_i = β + Δz_i + v_i,

where β + Δz_i is the mean of the distribution, which depends on time-invariant individual characteristics as well as parameters yet to be estimated, and the random variation comes from the individual heterogeneity, v_i. This random vector is assumed to have mean zero and covariance matrix, Σ. The conditional density of the parameters is denoted

    g(β_i | z_i, β, Δ, Σ) = g(v_i + β + Δz_i, Σ),

where g(·) is the underlying marginal density of the heterogeneity. For the T observations in group i, the joint conditional density is

    f(y_i | X_i, β_i) = Π_{t=1}^T f(y_it | x_it, β_i).

The unconditional density for y_i is obtained by integrating over β_i,

    f(y_i | X_i, z_i, β, Δ, Σ) = E_{β_i}[f(y_i | X_i, β_i)] = ∫_{β_i} f(y_i | X_i, β_i) g(β_i | z_i, β, Δ, Σ) dβ_i.

Collecting terms, and making the transformation from v_i to β_i, the true log-likelihood would be

    ln L = Σ_{i=1}^n ln { ∫_{v_i} [ Π_{t=1}^T f(y_it | x_it, β + Δz_i + v_i) ] g(v_i | Σ) dv_i }
         = Σ_{i=1}^n ln ∫_{v_i} f(y_i | X_i, β + Δz_i + v_i) g(v_i | Σ) dv_i.

Each of the n terms involves an expectation over v_i. The end result of the integration is a function of (β, Δ, Σ), which is then maximized.

As in the previous applications, it will not be possible to maximize the log-likelihood in this form because there is no closed form for the integral. We have considered two approaches to maximizing such a log-likelihood. In the latent class formulation, it is assumed that the parameter vector takes one of a discrete set of values, and the log-likelihood is maximized over this discrete distribution as well as the structural parameters. (See Section 16.2.3.) The hierarchical Bayes procedure used Markov chain Monte Carlo methods to sample from the joint posterior distribution of the underlying parameters and used the empirical mean of the sample of draws as the estimator. We now consider a third approach to estimating the parameters of a model of this form, maximum simulated likelihood estimation.

The terms in the log-likelihood are each of the form

    ln L_i = ln E_{v_i}[ f(y_i | X_i, β + Δz_i + v_i) ].

As noted, we do not have a closed form for this function, so we cannot compute it directly. Suppose we could sample randomly from the distribution of v_i. If an appropriate law of large numbers can be applied, then

    lim_{R→∞} (1/R) Σ_{r=1}^R f(y_i | X_i, β + Δz_i + v_ir) = E_{v_i}[ f(y_i | X_i, β + Δz_i + v_i) ],

where v_ir is the rth random draw from the distribution. This suggests a strategy for computing the log-likelihood. We can substitute this approximation to the expectation into the log-likelihood function. With sufficient random draws, the approximation can be made as close to the true function as desired. [The theory for this approach is discussed in Gourieroux and Monfort (1996), Bhat (1999), and Train (1999, 2002). Practical details on applications of the method are given in Greene (2001).] A detail to add concerns how to sample from the distribution of v_i. There are many possibilities, but for now, we consider the simplest case, the multivariate normal distribution. Write Σ in the Cholesky form Σ = LL', where L is a lower triangular matrix. Now, let u_ir be a vector of K independent draws from the standard normal distribution. Then a draw from the multivariate distribution with covariance matrix Σ is simply v_ir = Lu_ir. The simulated log-likelihood is

    ln L_S = Σ_{i=1}^n ln { (1/R) Σ_{r=1}^R Π_{t=1}^T f(y_it | x_it, β + Δz_i + Lu_ir) }.

The resulting function is maximized with respect to β, Δ, and L. This is obviously not a simple calculation, but it is feasible, and much easier than trying to manipulate the integrals directly.
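As an illustration of how the simulated log-likelihood can be coded, here is a minimal sketch for a random-parameters probit panel model (the binary choice case taken up in the example that follows). The Δz_i term is omitted for brevity, the draws are generated once and held fixed across optimizer iterations, and the names and interface are illustrative assumptions rather than code from the text.

```python
import numpy as np
from scipy.stats import norm

def simulated_loglik(theta, y, X, u):
    """Simulated log-likelihood for a random-parameters probit panel.

    y : (n, T) binary outcomes; X : (n, T, K) regressors
    u : (n, R, K) standard normal draws, generated once and reused every call
    theta packs beta (K,) followed by the lower triangle of L (K*(K+1)//2,)
    """
    n, T, K = X.shape
    beta = theta[:K]
    L = np.zeros((K, K))
    L[np.tril_indices(K)] = theta[K:]
    b = beta + u @ L.T                            # (n, R, K) draws of beta_i = beta + L u_ir
    index = np.einsum('ntk,nrk->nrt', X, b)       # x_it' beta_ir
    q = 2.0 * y[:, None, :] - 1.0                 # +1 / -1 coding of the outcomes
    prob = norm.cdf(q * index)                    # (n, R, T) probabilities per period and draw
    sim = np.prod(prob, axis=2).mean(axis=1)      # average the T-period products over the R draws
    return np.log(sim).sum()

# sketch of use: maximize with, e.g., scipy.optimize.minimize applied to the negative of this value
```

Holding the draws fixed across iterations is what makes the simulated criterion a smooth function of the parameters, so a standard gradient-based optimizer can be applied.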
In fact, for most problems to which this method has been applied, the computations are surprisingly simple. The intricate part is obtaining the function and its derivatives. But the functions are usually index function models that involve x_it'β_i, which greatly simplifies the derivations.

Inference in this setting does not involve any new results. The estimated asymptotic covariance matrix for the estimated parameters is computed by manipulating the derivatives of the simulated log-likelihood. The Wald and likelihood ratio statistics are also computed the way they would usually be. As before, we are interested in estimating person-specific parameters. A prior estimate might simply use β̂ + Δ̂z_i, but this would not use all the information in the sample. A posterior estimate would compute

    Ê[β_i | y_i, X_i, z_i] = Σ_{r=1}^R β̂_ir f(y_i | X_i, β̂_ir) / Σ_{r=1}^R f(y_i | X_i, β̂_ir),   where β̂_ir = β̂ + Δ̂z_i + L̂u_ir.

Mechanical details on computing the MSLE are omitted. The interested reader is referred to Gourieroux and Monfort (1996), Train (2000, 2002), and Greene (2001, 2002) for details.

Example 17.10  Maximum Simulated Likelihood Estimation of a Binary Choice Model
We continue Example 16.5, where estimates of a binary choice model for product innovation are obtained. The model is for Prob[y_it = 1 | x_it, β_i], where

    y_it = 1 if firm i realized a product innovation in year t and 0 if not.

Figure 17.5 shows the kernel density estimate for the firm-specific estimates of the log sales coefficient. The comparison to Figure 16.5 shows some striking differences. The random parameters model produces estimates that are similar in magnitude, but the distributions are actually quite different. Which should be preferred? Only on the basis that the three-point discrete latent class model is an approximation to the continuous variation model, we would prefer the latter.

[Figure 17.5: Kernel density estimates for the firm-specific estimates of the log sales coefficient.]
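A small sketch of the simulation-based posterior estimator Ê[β_i | y_i, X_i, z_i] described above follows, reusing the draws and the group-level likelihood values from the simulated log-likelihood; the interface is an illustrative assumption.

```python
import numpy as np

def posterior_beta_i(b_draws, lik_draws):
    """Posterior (conditional) mean of beta_i from the simulation.

    b_draws   : (R, K) draws beta_ir = beta_hat + Delta_hat z_i + L_hat u_ir
    lik_draws : (R,) values of f(y_i | X_i, beta_ir), the T-period product for group i
    """
    w = lik_draws / lik_draws.sum()        # likelihood weights over the R draws
    return w @ b_draws                     # weighted average of the draws
```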
17.9  PSEUDO-MAXIMUM LIKELIHOOD ESTIMATION AND ROBUST ASYMPTOTIC COVARIANCE MATRICES

Maximum likelihood estimation requires complete specification of the distribution of the observed random variable. If the correct distribution is something other than what we assume, then the likelihood function is misspecified and the desirable properties of the MLE might not hold. This section considers a set of results on an estimation approach that is robust to some kinds of model misspecification. For example, we have found that in a model, if the conditional mean function is E[y | x] = x'β, then certain estimators, such as least squares, are "robust" to specifying the wrong distribution of the disturbances. That is, LS is MLE if the disturbances are normally distributed, but we can still claim some desirable properties for LS, including consistency, even if the disturbances are not normally distributed. This section will discuss some results that relate to what happens if we maximize the "wrong" log-likelihood function and, for those cases in which the estimator is consistent despite this, how to compute an appropriate asymptotic covariance matrix for it.

Let f(y_i | x_i, β) be the true probability density for a random variable y_i given a set of covariates x_i and parameter vector β. The log-likelihood function is (1/n) log L(β | y, X) = (1/n) Σ_i log f(y_i | x_i, β). The MLE, β̂_ML, is the sample statistic that maximizes this function. (The division of log L by n does not affect the solution.) We maximize the log-likelihood function by equating its derivatives to zero, so the MLE is obtained by solving the set of empirical moment equations

    (1/n) Σ_{i=1}^n ∂ log f(y_i | x_i, β̂_ML)/∂β̂_ML = (1/n) Σ_{i=1}^n d_i(β̂_ML) = d̄_n(β̂_ML) = 0.

The population counterpart to the sample moment equation is

    E[ (1/n) ∂ log L/∂β ] = E[ (1/n) Σ_{i=1}^n d_i(β) ] = E[ d̄_n(β) ] = 0.

Using what we know about GMM estimators, if E[d̄_n(β)] = 0, then β̂_ML is consistent and asymptotically normally distributed, with asymptotic covariance matrix equal to

    V_ML = [G(β)'G(β)]⁻¹ G(β)' {Var[d̄_n(β)]} G(β) [G(β)'G(β)]⁻¹,

where G(β) = plim ∂d̄_n(β)/∂β'. Since d̄_n(β) is the derivative vector, G(β) is 1/n times the expected Hessian of log L; that is, (1/n)E[H(β)] = H̄(β). As we saw earlier, Var[∂ log L/∂β] = −E[H(β)]. Collecting all seven appearances of (1/n)E[H(β)], we obtain the familiar result V_ML = {−E[H(β)]}⁻¹. [All the n's cancel and Var[d̄_n(β)] = −(1/n)H̄(β).] Note that this result depends crucially on the result Var[∂ log L/∂β] = −E[H(β)].

The following will sketch a set of results related to this estimation problem. The important references on this subject are White (1982a); Gourieroux, Monfort, and Trognon (1984); Huber (1967); and Amemiya (1985). A recent work with a large amount of discussion on the subject is Mittelhammer et al. (2000). The derivations in these works are complex, and we will only attempt to provide an intuitive introduction to the topic.

The maximum likelihood estimator is obtained by maximizing the function h̄_n(y, X, β) = (1/n) Σ_i log f(y_i, x_i, β). This function converges to its expectation as n → ∞. Since this function is the log-likelihood for the sample, it is also the case (not proven here) that as n → ∞, it attains its unique maximum at the true parameter vector, β. (We used this result in proving the consistency of the maximum likelihood estimator.) Since plim h̄_n(y, X, β) = E[h̄_n(y, X, β)], it follows (by interchanging differentiation and the expectation operation) that plim ∂h̄_n(y, X, β)/∂β = E[∂h̄_n(y, X, β)/∂β]. But, if this function achieves its maximum at β, then it must be the case that plim ∂h̄_n(y, X, β)/∂β = 0.

An estimator that is obtained by maximizing a criterion function is called an M estimator [Huber (1967)] or an extremum estimator [Amemiya (1985)]. Suppose that we obtain an estimator by maximizing some other function, M_n(y, X, β), that, although not the log-likelihood function, also attains its unique maximum at the true β as n → ∞. Then the preceding argument might produce a consistent estimator with a known asymptotic distribution. For example, the log-likelihood for a linear regression model with normally distributed disturbances with different variances, σ²ω_i, is

    h̄_n(y, X, β) = (1/n) Σ_{i=1}^n { −(1/2) [ ln(2πσ²ω_i) + (y_i − x_i'β)²/(σ²ω_i) ] }.

By maximizing this function, we obtain the maximum likelihood estimator. But we also examined another estimator, simple least squares, which maximizes M_n(y, X, β) = −(1/n) Σ_{i=1}^n (y_i − x_i'β)². As we showed earlier, least squares is consistent and asymptotically normally distributed even with this extension, so it qualifies as an M estimator of the sort we are considering here.

Now consider the general case. Suppose that we estimate β by maximizing a criterion function

    M_n(y | X, β) = (1/n) Σ_{i=1}^n log g(y_i | x_i, β).

Suppose as well that plim M_n(y, X, β) = E[M_n(y, X, β)] and that as n → ∞, E[M_n(y, X, β)] attains its unique maximum at β. Then, by the argument we used above for the MLE, plim ∂M_n(y, X, β)/∂β = E[∂M_n(y, X, β)/∂β] = 0.

Once again, we have a set of moment equations for estimation. Let β̂_E be the estimator that maximizes M_n(y, X, β). Then the estimator is defined by

    ∂M_n(y, X, β̂_E)/∂β̂_E = (1/n) Σ_{i=1}^n ∂ log g(y_i | x_i, β̂_E)/∂β̂_E = m̄_n(β̂_E) = 0.

Thus, β̂_E is a GMM estimator. Using the notation of our earlier discussion, G(β̂_E) is the symmetric Hessian of E[M_n(y, X, β̂_E)], which we will denote (1/n)E[H_M(β̂_E)] = H̄_M(β̂_E). Proceeding as we did above to obtain V_ML, we find that the appropriate asymptotic covariance matrix for the extremum estimator would be

    V_E = [H̄_M(β)]⁻¹ [(1/n)Φ] [H̄_M(β)]⁻¹,

where Φ = Var[∂ log g(y_i | x_i, β)/∂β], and, as before, the asymptotic distribution is normal.
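The covariance matrix V_E is the familiar "sandwich" estimator. A minimal sketch of its computation from per-observation scores and an estimate of the Hessian of the criterion follows; the function name and inputs are illustrative assumptions rather than code from the text.

```python
import numpy as np

def sandwich_covariance(scores, hessian_sum):
    """Robust covariance V_E = Hbar^{-1} [(1/n) Phi] Hbar^{-1} for an M estimator.

    scores      : (n, K) rows are d log g(y_i | x_i, b)/db evaluated at the estimate
    hessian_sum : (K, K) sum over observations of the second derivatives of the criterion
    """
    n = scores.shape[0]
    Hbar = hessian_sum / n                 # (1/n) times the Hessian
    Phi = scores.T @ scores / n            # variance of the individual score (mean score is 0 at the estimate)
    Hinv = np.linalg.inv(Hbar)
    return Hinv @ (Phi / n) @ Hinv
```

When the criterion is the true log-likelihood and the information matrix equality holds, this expression collapses to the usual {−E[H(β)]}⁻¹.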
Exercises

1. Assume that the distribution of x is f(x) = 1/θ, 0 ≤ x ≤ θ. In random sampling from this distribution, prove that the sample maximum is a consistent estimator of θ. Note: you can prove that the maximum is the maximum likelihood estimator of θ. But the usual properties do not apply here. Why not? [Hint: Attempt to verify that the expected first derivative of the log-likelihood with respect to θ is zero.]

2. In random sampling from the exponential distribution f(x) = (1/θ)e^{−x/θ}, x ≥ 0, θ > 0, find the maximum likelihood estimator of θ and obtain the asymptotic distribution of this estimator.

3. Mixture distribution. Suppose that the joint distribution of the two random variables x and y is

    f(x, y) = θ e^{−(β+θ)y} (βy)^x / x!,    β, θ > 0, y ≥ 0, x = 0, 1, 2, ….

a. Find the maximum likelihood estimators of β and θ and their asymptotic joint distribution.
b. Find the maximum likelihood estimator of θ/(β+θ) and its asymptotic distribution.
c. Prove that f(x) is of the form f(x) = γ(1−γ)^x, x = 0, 1, 2, …, and find the maximum likelihood estimator of γ and its asymptotic distribution.
d. Prove that f(y | x) is of the form

    f(y | x) = λ e^{−λy} (λy)^x / x!,    y ≥ 0, λ > 0.

Prove that f(y | x) integrates to 1. Find the maximum likelihood estimator of λ and its asymptotic distribution. [Hint: In the conditional distribution, just carry the xs along as constants.]
e. Prove that f(y) = θe^{−θy}, y ≥ 0, θ > 0. Find the maximum likelihood estimator of θ and its asymptotic variance.
f. Prove that

    f(x | y) = e^{−βy} (βy)^x / x!,    x = 0, 1, 2, …, β > 0.

Based on this distribution, what is the maximum likelihood estimator of β?

4. Suppose that x has the Weibull distribution

    f(x) = αβ x^{β−1} e^{−αx^β},    x ≥ 0, α, β > 0.

a. Obtain the log-likelihood function for a random sample of n observations.
b. Obtain the likelihood equations for maximum likelihood estimation of α and β. Note that the first provides an explicit solution for α in terms of the data and β. But, after inserting this in the second, we obtain only an implicit solution for β. How would you obtain the maximum likelihood estimators?
c. Obtain the second derivatives matrix of the log-likelihood with respect to α and β. The exact expectations of the elements involving β involve the derivatives of the gamma function and are quite messy analytically. Of course, your exact result provides an empirical estimator. How would you estimate the asymptotic covariance matrix for your estimators in part b?
d. Prove that αβ Cov[ln x, x^β] = 1. [Hint: The expected first derivatives of the log-likelihood function are zero.]

5. The following data were generated by the Weibull distribution of Exercise 4:

    1.3043  0.49254  1.2742  1.4019  0.32556  0.29965  0.26423  1.0878  1.9461  0.47615
    3.6454  0.15344  1.2357  0.96381  0.33453  1.1227   2.0296  1.2797  0.96080  2.0070

a. Obtain the maximum likelihood estimates of α and β, and estimate the asymptotic covariance matrix for the estimates.
b. Carry out a Wald test of the hypothesis that β = 1.
c. Obtain the maximum likelihood estimate of α under the hypothesis that β = 1.
d. Using the results of parts a and c, carry out a likelihood ratio test of the hypothesis that β = 1.
e. Carry out a Lagrange multiplier test of the hypothesis that β = 1.

6. (Limited Information Maximum Likelihood Estimation.) Consider a bivariate distribution for x and y that is a function of two parameters, α and β. The joint density is f(x, y | α, β). We consider maximum likelihood estimation of the two parameters. The full information maximum likelihood estimator is the now familiar maximum likelihood estimator of the two parameters. Now, suppose that we can factor the joint distribution as done in Exercise 3, but in this case, we have f(x, y | α, β) = f(y | x, α, β) f(x | α). That is, the conditional density for y is a function of both parameters, but the marginal distribution for x involves only α.
a. Write down the general form for the log-likelihood function using the joint density.
b. Since the joint density equals the product of the conditional times the marginal, the log-likelihood function can be written equivalently in terms of the factored density. Write this down, in general terms.
c. The parameter α can be estimated by itself using only the data on x and the log-likelihood formed using the marginal density for x. It can also be estimated with β by using the full log-likelihood function and data on both y and x. Show this.
d. Show that the first estimator in part c has a larger asymptotic variance than the second one. This is the difference between a limited information maximum likelihood estimator and a full information maximum likelihood estimator.
e. Show that if ∂² ln f(y | x, α, β)/∂α∂β = 0, then the result in part d is no longer true.

7. Show that the likelihood inequality in Theorem 17.3 holds for the Poisson distribution used in Section 17.3 by showing that E₀[(1/n) ln L(θ | y)] is uniquely maximized at θ = θ₀. Hint: First show that the expectation is −θ + θ₀ ln θ − E₀[ln y_i!].

8. Show that the likelihood inequality in Theorem 17.3 holds for the normal distribution.

9. For random sampling from the classical regression model in (17-3), reparameterize the likelihood function in terms of γ = 1/σ and δ = (1/σ)β. Find the maximum likelihood estimators of γ and δ and obtain the asymptotic covariance matrix of the estimators of these parameters.

10. Section 14.3.1 presents estimates of a Cobb–Douglas cost function using Nerlove's 1955 data on the U.S. electric power industry. Christensen and Greene's 1976 update of this study used 1970 data for this industry. The Christensen and Greene data are given in Table F5.2. These data have provided a standard test data set for estimating different forms of production and cost functions, including the stochastic frontier model examined in Example 17.5. It has been suggested that one explanation for the apparent finding of economies of scale in these data is that the smaller firms were inefficient for other reasons. The stochastic frontier might allow one to disentangle these effects. Use these data to fit a frontier cost function which includes a quadratic term in log output in addition to the linear term and the factor prices. Then examine the estimated Jondrow et al. residuals to see if they do indeed vary negatively with output, as suggested. (This will require either some programming on your part or specialized software. The stochastic frontier model is provided as an option in TSP and LIMDEP. Or, the likelihood function can be programmed fairly easily for RATS or GAUSS.
Note, for a cost frontier as opposed to a production frontier, it is necessary to reverse the sign on the argument in the Φ function.)

11. Consider sampling from a multivariate normal distribution with mean vector μ = (μ₁, μ₂, …, μ_M) and covariance matrix σ²I. The log-likelihood function is

    ln L = −(nM/2) ln(2π) − (nM/2) ln σ² − (1/(2σ²)) Σ_{i=1}^n (y_i − μ)'(y_i − μ).

Show that the maximum likelihood estimates of the parameters are μ̂_m = ȳ_m and

    σ̂²_ML = (1/(nM)) Σ_{m=1}^M Σ_{i=1}^n (y_im − ȳ_m)² = (1/M) Σ_{m=1}^M [ (1/n) Σ_{i=1}^n (y_im − ȳ_m)² ] = (1/M) Σ_{m=1}^M σ̂²_m.

Derive the second derivatives matrix and show that the asymptotic covariance matrix for the maximum likelihood estimators is

    {−E[∂² ln L/∂θ∂θ']}⁻¹ = [ (σ²/n)I      0      ]
                            [   0'      2σ⁴/(nM)  ].

Suppose that we wished to test the hypothesis that the means of the M distributions were all equal to a particular value μ⁰. Show that the Wald statistic would be

    W = (ȳ − μ⁰i)' ( (s²/n) I )⁻¹ (ȳ − μ⁰i) = (n/s²)(ȳ − μ⁰i)'(ȳ − μ⁰i),

where ȳ is the vector of sample means.