21 MODELS FOR DISCRETE CHOICE

21.1 INTRODUCTION

There are many settings in which the economic outcome we seek to model is a discrete choice among a set of alternatives, rather than a continuous measure of some activity. Consider, for example, modeling labor force participation, the decision of whether or not to make a major purchase, or the decision of which candidate to vote for in an election. For the first of these examples, intuition would suggest that factors such as age, education, marital status, number of children, and some economic data would be relevant in explaining whether an individual chooses to seek work or not in a given period. But something is obviously lacking if this example is treated as the same sort of regression model we used to analyze consumption or the costs of production or the movements of exchange rates. In this chapter, we shall examine a variety of what have come to be known as qualitative response (QR) models. There are numerous different types that apply in different situations. What they have in common is that they are models in which the dependent variable is an indicator of a discrete choice, such as a "yes or no" decision. In general, conventional regression methods are inappropriate in these cases.

This chapter is a lengthy but far from complete survey of topics in estimating QR models. Almost none of these models can be consistently estimated with linear regression methods. Therefore, readers interested in the mechanics of estimation may want to review the material in Appendices D and E before continuing. In most cases, the method of estimation is maximum likelihood. The various properties of maximum likelihood estimators are discussed in Chapter 17. We shall assume throughout this chapter that the necessary conditions behind the optimality properties of maximum likelihood estimators are met and, therefore, we will not derive or establish these properties specifically for the QR models. Detailed proofs for most of these models can be found in surveys by Amemiya (1981), McFadden (1984), Maddala (1983), and Dhrymes (1984). Additional commentary on some of the issues of interest in the contemporary literature is given by Maddala and Flores-Lagunes (2001).

21.2 DISCRETE CHOICE MODELS

The general class of models we shall consider are those for which the dependent variable takes values 0, 1, 2, .... In a few cases, the values will themselves be meaningful, as in the following:

1. Number of patents: y = 0, 1, 2, .... These are count data.

In most of the cases we shall study, the values taken by the dependent variable are merely a coding for some qualitative outcome. Some examples are as follows:

2. Labor force participation: We equate "no" with 0 and "yes" with 1. These decisions are qualitative choices. The 0/1 coding is a mere convenience.

3. Opinions of a certain type of legislation: Let 0 represent "strongly opposed," 1 "opposed," 2 "neutral," 3 "support," and 4 "strongly support." These numbers are rankings, and the values chosen are not quantitative but merely an ordering. The difference between the outcomes represented by 1 and 0 is not necessarily the same as that between 2 and 1.

4. The occupational field chosen by an individual: Let 0 be clerk, 1 engineer, 2 lawyer, 3 politician, and so on.
These data are merely categories, giving neither a ranking nor a count.

5. Consumer choice among alternative shopping areas: This case has the same characteristics as example 4, but the appropriate model is a bit different. These two examples will differ in the extent to which the choice is based on characteristics of the individual, which are probably dominant in the occupational choice, as opposed to attributes of the choices, which is likely the more important consideration in the choice of shopping venue.

None of these situations lends itself readily to our familiar type of regression analysis. Nonetheless, in each case, we can construct models that link the decision or outcome to a set of factors, at least in the spirit of regression. Our approach will be to analyze each of them in the general framework of probability models:

Prob(event j occurs) = Prob(Y = j) = F[relevant effects, parameters].    (21-1)

The study of qualitative choice focuses on appropriate specification, estimation, and use of models for the probabilities of events, where in most cases, the "event" is an individual's choice among a set of alternatives.

Example 21.1 Labor Force Participation Model
In Example 4.3 we estimated an earnings equation for the subsample of 428 married women who participated in the formal labor market, taken from a full sample of 753 observations. The semilog earnings equation is of the form

ln earnings = β1 + β2 age + β3 age² + β4 education + β5 kids + ε,

where earnings is hourly wage times hours worked, education is measured in years of schooling, and kids is a binary variable which equals one if there are children under 18 in the household. What of the other 325 individuals? The underlying labor supply model described a market in which labor force participation was the outcome of a market process whereby the demanders of labor services were willing to offer a wage based on expected marginal product and individuals themselves made a decision whether or not to accept the offer depending on whether it exceeded their own reservation wage. The first of these depends on, among other things, education, while the second (we assume) depends on such variables as age, the presence of children in the household, other sources of income (husband's), and marginal tax rates on labor income. The sample we used to fit the earnings equation contains data on all these other variables. The models considered in this chapter would be appropriate for modeling the outcome y_i = 1 if in the labor force, and 0 if not.

Partly because of its mathematical convenience, the logistic distribution,

Prob(Y = 1 | x) = exp(x′β) / [1 + exp(x′β)] = Λ(x′β),    (21-7)

has also been used in many applications. We shall use the notation Λ(.) to indicate the logistic cumulative distribution function. This model is called the logit model for reasons we shall discuss in the next section. Both of these distributions have the familiar bell shape of symmetric distributions. Other models which do not assume symmetry, such as the Weibull model,

Prob(Y = 1 | x) = exp[−exp(x′β)],

and the complementary log log model,

Prob(Y = 1 | x) = 1 − exp[−exp(x′β)],

have also been employed. Still other distributions have been suggested,7 but the probit and logit models are still the most common frameworks used in econometric applications.

7 See, for example, Maddala (1983, pp. 27-32), Aldrich and Nelson (1984), and Greene (2001).

The question of which distribution to use is a natural one. The logistic distribution is similar to the normal except in the tails, which are considerably heavier. (It more closely resembles a t distribution with seven degrees of freedom.)
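As a quick check on that comparison, the short sketch below (a minimal illustration in Python, not part of the text) evaluates the normal and logistic CDFs at a few arbitrary values of the index. The specific grid of values is chosen only for illustration.

```python
import math
from scipy.stats import norm

def logistic_cdf(z):
    """Logistic CDF Lambda(z) = exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + math.exp(-z))

# The two CDFs are close for moderate index values and diverge in the tails,
# where the logistic distribution carries more mass.
for z in [-3.0, -1.2, -0.5, 0.0, 0.5, 1.2, 3.0]:
    print(f"z = {z:5.1f}   Phi(z) = {norm.cdf(z):.4f}   Lambda(z) = {logistic_cdf(z):.4f}")
```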
Therefore, for intermediate values of x′β (say, between −1.2 and +1.2), the two distributions tend to give similar probabilities. The logistic distribution tends to give larger probabilities to Y = 0 when x′β is extremely small (and smaller probabilities to Y = 0 when x′β is very large) than the normal distribution. It is difficult to provide practical generalities on this basis, however, since they would require knowledge of β. We should expect different predictions from the two models, however, if the sample contains (1) very few responses (Ys equal to 1) or very few nonresponses (Ys equal to 0) and (2) very wide variation in an important independent variable, particularly if (1) is also true. There are practical reasons for favoring one or the other in some cases for mathematical convenience, but it is difficult to justify the choice of one distribution or another on theoretical grounds. Amemiya (1981) discusses a number of related issues, but as a general proposition, the question is unresolved. In most applications, the choice between these two seems not to make much difference. However, as seen in the example below, the symmetric and asymmetric distributions can give substantively different results, and here, the guidance on how to choose is unfortunately sparse.

The probability model is a regression:

E[y | x] = 0[1 − F(x′β)] + 1[F(x′β)] = F(x′β).    (21-8)

Whatever distribution is used, it is important to note that the parameters of the model, like those of any nonlinear regression model, are not necessarily the marginal effects we are accustomed to analyzing. In general,

∂E[y | x]/∂x = {dF(x′β)/d(x′β)} β = f(x′β) β,    (21-9)

where f(.) is the density function that corresponds to the cumulative distribution, F(.). For the normal distribution, this result is

∂E[y | x]/∂x = φ(x′β) β,    (21-10)

where φ(t) is the standard normal density. For the logistic distribution,

dΛ(x′β)/d(x′β) = exp(x′β)/[1 + exp(x′β)]² = Λ(x′β)[1 − Λ(x′β)].    (21-11)

Thus, in the logit model,

∂E[y | x]/∂x = Λ(x′β)[1 − Λ(x′β)] β.    (21-12)

It is obvious that these values will vary with the values of x. In interpreting the estimated model, it will be useful to calculate this value at, say, the means of the regressors and, where necessary, other pertinent values. For convenience, it is worth noting that the same scale factor applies to all the slopes in the model. For computing marginal effects, one can evaluate the expressions at the sample means of the data or evaluate the marginal effects at every observation and use the sample average of the individual marginal effects. The functions are continuous with continuous first derivatives, so Theorem D.12 (the Slutsky theorem) and, assuming that the data are "well behaved," a law of large numbers (Theorems D.4 and D.5) apply; in large samples these will give the same answer. But that is not so in small or moderate-sized samples. Current practice favors averaging the individual marginal effects when it is possible to do so.
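To make the distinction concrete, here is a minimal sketch (not from the text) that computes logit marginal effects both ways, at the sample means and as the average of the individual effects, using a hypothetical data set and made-up coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a constant, one continuous regressor, one dummy.
n = 1000
x = np.column_stack([np.ones(n), rng.normal(size=n), rng.integers(0, 2, size=n)])
beta = np.array([-0.5, 1.0, 0.75])           # illustrative "estimated" coefficients

def lam(z):                                  # logistic CDF
    return 1.0 / (1.0 + np.exp(-z))

# Marginal effects per (21-12): dE[y|x]/dx = Lambda(x'b)[1 - Lambda(x'b)] * b.
xb = x @ beta
scale_i = lam(xb) * (1.0 - lam(xb))          # observation-specific scale factor

ame = scale_i.mean() * beta                  # average of the individual marginal effects
x_bar = x.mean(axis=0)
mem = lam(x_bar @ beta) * (1.0 - lam(x_bar @ beta)) * beta   # effects at the means

print("marginal effects at the means:", np.round(mem, 4))
print("average marginal effects:     ", np.round(ame, 4))
```

In large samples the two sets of numbers converge, but in small samples they can differ noticeably, which is the point made in the paragraph above.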
Another complication for computing marginal effects in a binary choice model arises because x will often include dummy variables; for example, a labor force participation equation will often contain a dummy variable for marital status. Since the derivative is with respect to a small change, it is not appropriate to apply (21-10) for the effect of a change in a dummy variable, or change of state. The appropriate marginal effect for a binary independent variable, say d, would be

Marginal effect = Prob[Y = 1 | x̄(d), d = 1] − Prob[Y = 1 | x̄(d), d = 0],

where x̄(d) denotes the means of all the other variables in the model. Simply taking the derivative with respect to the binary variable as if it were continuous provides an approximation that is often surprisingly accurate. In Example 21.3, the difference in the two probabilities for the probit model is (0.5702 − 0.1057) = 0.4645, whereas the derivative approximation reported below is 0.468. Nonetheless, it might be optimistic to rely on this outcome. We will revisit this computation in the examples and discussion to follow.

21.3.2 LATENT REGRESSION - INDEX FUNCTION MODELS

Discrete dependent-variable models are often cast in the form of index function models. We view the outcome of a discrete choice as a reflection of an underlying regression. As an often-cited example, consider the decision to make a large purchase. The theory states that the consumer makes a marginal benefit-marginal cost calculation based on the utilities achieved by making the purchase and by not making the purchase and by using the money for something else. We model the difference between benefit and cost as an unobserved variable y* such that

y* = x′β + ε.

We assume that ε has mean zero and has either a standardized logistic distribution with (known) variance π²/3 [see (21-7)] or a standard normal distribution with variance one [see (21-6)]. We do not observe the net benefit of the purchase, only whether it is made or not. Therefore, our observation is

y = 1 if y* > 0,
y = 0 otherwise.

In this formulation, x′β is called the index function. Two aspects of this construction merit our attention. First, the assumption of known variance of ε is an innocent normalization. Suppose the variance of ε is scaled by an unrestricted parameter σ². The latent regression will be y* = x′β + σε. But (y*/σ) = x′(β/σ) + ε is the same model with the same data. The observed data will be unchanged; y is still 0 or 1, depending only on the sign of y*, not on its scale. This means that there is no information about σ in the data, so it cannot be estimated. Second, the assumption of zero for the threshold is likewise innocent if the model contains a constant term (and not if it does not).4 Let a be the supposed nonzero threshold and α be an unknown constant term and, for the present, let x and β contain the rest of the index not including the constant term. Then, the probability that y equals one is

Prob(y* > a | x) = Prob(α + x′β + ε > a | x) = Prob[(α − a) + x′β + ε > 0 | x].

Since α is unknown, the difference (α − a) remains an unknown parameter. With the two normalizations,

Prob(y* > 0 | x) = Prob(ε > −x′β | x).

If the distribution is symmetric, as are the normal and logistic, then

Prob(y* > 0 | x) = Prob(ε < x′β | x) = F(x′β),

which provides an underlying structural model for the probability.

Example 21.2 Structural Equations for a Probit Model
Nakosteen and Zimmer (1980) analyze a model of migration based on the following structure:5 For individual i, the market wage that can be earned at the present location is

y*_p = x′_p β_p + ε_p.

Variables in the equation include age, sex, race, growth in employment, and growth in per capita income. If the individual migrates to a new location, then his or her market wage

4 Unless there is some compelling reason, binomial probability models should not be estimated without constant terms.
5 A number of other studies have also used variants of this basic formulation.
Some importaal examples are Willis and Rosen (1979) and Robinson and Tomes (1982). The study by Tunali (1986) examined in Example 21.5 is another example. The now standard approach, in which “participation” equals one if wage offer GU, By +Ew) minus reservation wage (x; 8, + é) is positive, is also used in Fernandez and Rodriguez-Poo (1997). Brock and Durlaut (2000) describe à number oí models and situations involving individual behavior that give rise to binary choice models. 672 CHAPTER 21 + Models for Discrete Choice Using the device suggested in footnote 6, wc can reduce this to n dlogL aeb) =, = Oi ly = x 0 21-21 a | Dia) | Dedo (2 where g =2y—1. The actual second derivatives for the logit model are quite simple: nt — dgop' Since the second derivatives do not involve the random variable y;, Newton's method is also the method of scoring for the logit model. Note that the Hessian is always negative definite, so the log-likchihood is globally concave, Newton's method will usually converge to thc maximum of the log-likelihood in just a few iterations untess the data arc especially badly conditioncd. The compuration is slighily more involved for the probit model. A useful simplification is obtained by using the variable 2(y;, B'xp) = À; that is defined in (21-21). The second derivatives can be obtained using the result that for any 7, do (z)/dz = —24(2). Then, for the probit model, =" SD Al — Adu. (21-22) 92InL — dgop' This matrix is also negative definite for al values of 8. The proof is Jess obvious than for the logit model. Itsuffices to note that the scalar part in the summation is Varfe |s < 8'x] — I when y = 1 and Var[e|e > —8'x] — 1 when y = 0. The unconditional variance is one. Since truncation always reduces variance—see Thcorem 22.3—in both cascs, the variance is between zero and one, so the value is negative.!º The asymptotic covariance matrix for the maximum likclihood estimator can be estimated by using lhe inverse of the Hessian evaluated at the maximum likelihood estimates. There are also two other cstimators available. The Berndt, Hall, Hall, and Hausman cstimator [see (17-18) and Example 17.4] would be o B=So ei, i=l where g; = (% — Ap) for (he logit model [see (21-19)] and g; = à; Lor the probit model [see (21-21)]. The third estimator would bc based on Lhe expected value of the H. As we saw earlier, lhe Hessian for the logit model does not involve y. so H = E[H]. But because à; is a function of y; [sec (21-21)], this result is not true for the probit model. Amemiya (1981) showed that for the probit model, 2InL 2 eligis | = Ahn. (21-24) 98 9P' | oi Dotada Once again, the scalar part of the expression is always negative |see (21-23) and note that Aq; is always negative and À, is always positive]. The estimator of the asymptotic a =55 aã; + alba. (21-23) tal Psce, for example, Amemiya (1985, pp. 273-274) and Maddala (1983, p. 63). Jogec Johnson and Kotz (1993) and Heckman (1979). We will make repeated use of this result in Chapter 22 CHAPTER 21 + Models for Discrete Choice 673 covariance matrix for the maximum likclihood estimator is then the negative inverse of whatever matrix is used to estimate the expected Hcssian. Since the actual Hessian is generally uscd for the iterations, this option is (he usual choice, As we shall see below, though, for certain hypothesis tests, the BHHII estimator is a more convenient choice. 
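As an illustration of these alternatives, the following sketch (hypothetical data, not the text's example) fits a logit model by maximizing the log-likelihood numerically and then computes standard errors from the actual Hessian, from the BHHH outer product of the gradients, and from the sandwich combination of the two that is discussed in the next section.

```python
import numpy as np
from scipy.optimize import minimize

def lam(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_loglik(beta, y, x):
    p = lam(x @ beta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical data.
rng = np.random.default_rng(1)
n = 500
x = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = (x @ np.array([0.2, 1.0, -0.5]) + rng.logistic(size=n) > 0).astype(float)

res = minimize(neg_loglik, np.zeros(3), args=(y, x), method="BFGS")
b = res.x

p = lam(x @ b)
g_i = (y - p)[:, None] * x                    # individual scores, (y_i - Lambda_i) x_i
H = -(x * (p * (1 - p))[:, None]).T @ x       # Hessian: -sum Lambda_i (1 - Lambda_i) x_i x_i'
B = g_i.T @ g_i                               # BHHH outer-product-of-gradients matrix

cov_hessian = np.linalg.inv(-H)               # negative inverse of the Hessian
cov_bhhh = np.linalg.inv(B)                   # BHHH-based estimate
cov_sandwich = cov_hessian @ B @ cov_hessian  # "sandwich" combination H^{-1} B H^{-1}

for name, v in [("Hessian ", cov_hessian), ("BHHH    ", cov_bhhh), ("sandwich", cov_sandwich)]:
    print(name, "std. errors:", np.round(np.sqrt(np.diag(v)), 4))
```

For the logit model the Hessian does not involve y, so the Hessian-based and expected-Hessian estimators coincide; for the probit model they differ, as noted above.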
In some studics [c.g., Boyes, Hoffman, and Low (1989), Greene (1992)], the mix of ones and zeros in thc observed sample of the dependent variable is deliberately skewed in favor of onc outcome or the other to achieve a more balanced sample than random sampling would produce. The sampling is said to be choice based. In the studies noted, the dependent variable measured the occurrence of loan default, which is a relatively uncommon occurrence. To enrich the sample, observations with y = 1 (default) were oversampled. Intuition should suggest (correctly) that the bias in the sample should bc transmitted to the parameter estimates, which will bc cstimated so as to mimic the sample, not the population. which is known to be different. Manski and Lerman (1977) derived the weightcd endogenous sampling maximum likelihood (WESML.) estimator for Lhis situation. The cstimator requires that the truc population proportions. « and «og, be known. Let p, and py be the sample proportions of ones and zeros. Then the estimator is obtained by maximizing a weighted log-likelihood, InL=5 win lg B'xo, i=1 where w; = yi(mn/py) + (1 — y)(wo/ po). Note that w; takes only Lwo different values. Thc derivatives and the Hessian are likewise weighted. A final correction is needed after estimation; thc appropriate estimator of the asymptotic covariance matrix is the sandwich estimator discussed in (he next section, H'BH”" (with weighted B and H), instead of B or H alone. (The weights are not squared in computing B.)! 21.4.1 ROBUST COVARIANCE MATRIX ESTIMATION The probit maximum hkelihood estimator is often labelcd a quasi-maximum likeli- hood estimator (OMLE) in view of the possibility that the normal probability model might be misspecified. White's (19822) robust “sandwich” estimator for the asymptotic covariance matrix of the QMLE (see Section 17.9 for discussion), Est.Asy. Var[8] = [AJ "BIA, has been used in a number of recent studies based on the probit model [e.g., Fernandez and Rodriguez-Poo (1997), Horowitz (1993), and Blundell, Laisney, and Lechner (1983)]. IL the probit model is correctly specified, then plim(1/7)B = plim(!/ny(—H) and either single matrix will suffice, so the robustness issue is moot (of course). On the other hand, the probit (Q-) maximum likelihood estimator is not consistent in the pres- enec ol any form of heteroscedasticity, unmeasured heterogeneity, omitted variables (even if they arc orthogonal to the included ones), nonlinearity of the functional form of the index, or an error in the distributional assumption [with some narrow exceptions “NWESML and the choice-based sampling estimator are not the free lunch they may appear to be. That which the biused sampling does, (he weighting undoes, IL is common for the end result to be very large standard errors. which might be viewed as untortunale, insofar as the purpose of the biased sampling was to balance the data precisely to avoid this problem. 674 CHAPTER 21 + Models for Discrete Choice as described by Ruud (1986)). Thus, in almost any case, the sandwich estimator pro- vides an appropriate asymptotic covariance matrix for an estimator that is biased in an unknown direction. White raises this issuc explicitly, although it scems to reccive little attention in the literature: “it is (he consisteney cf the OMLE [or the parameters of interest in a wide range of situations which insures its usefulness as the basis for robust estimation techniques” (1982a, p. 4). 
His very useful result is that if the quasi-maximum likelihood estimator converges to a probability limit, then the sandwich estimator can. under certain circumstances, be used to estimate the asymptotic covariance matrix of that estimator. But there is no guarantee that the OMLE wil! converge to anything interesting or useful. Simply computing a robust covariance matrix Jor an otherwise inconsistent estimator does not give it redemption, ConsequentIy, the virtue of a robust covariance matrix in this setting is unclear. 21.4.2 MARGINAL EFFECTS The predicted probabilities, F6r À) = É and the estimated marginal effects f(x B)xÊ = fB are nonlincar functions of the parameter estimates. To compute standard errors, we can use the lincar approximation approach (delta method) discussed in Section 5.2.4. For the predicted probabilities. Asy. Varl?] = [af /9B] v(0 2/08), where V=Asy. Var[Ê). The estimated asymptotic covariance matrix of 2 can be any of the three described carlicr. Let z = x'8. Then the derivative vector is [92/38] = [dPydaloz/08] = fx. Combining terms gives Asy. Varfl= fx Vx. which depends. of course, on the particular x vector used. This results is useful when à marginal effect is computed Lor a dummy variable. In that case, the estimated cffect is =Pjd=1-Ê|d=0. (21-25) The asymptotic variance would be Asy. Var[A É] = [94 Pj9ÊJV]oA É 08], where (21-26) [asFpafl= n( Ê For the other marginal effects, let p = fÊ. Asy. Var[P] = CHAPTER 21 + Models for Discrete Choice 677 10 08 [esse femme With PSI osn 88 04 Prob(Grade = 1) E TABt Logistic Probis Variable Coefficient Ratio Slope + Ratio Coefficient tRatio Slope tRatio Constant —13.021 —2.641 — — —7.452 —2.931 — — (4.931) (2.542) GPA 2.826 2.238 0.53 2.252 1.626 2343 0.533 2.294 (1.263) (0.237) (0.694) (0.232) TUCE 0095 067 008 0.685 0057 0617 0017 0626 (0.142) (0.026) (0.084) (0.027) PST 2.379 2234 (0456 2.52 1.426 2397 0.464 2.727 (1.065) (0.181) (0.595) (0.170) log likelihood 12.890 —12819 restrictions R$ = q, the statisticis W= (RÊ — q'(R(EstAsy. Var BPRYTURÊ — q). For example, for testing the hypothesis that a subset of the coefficients, say the last M, are zero, the Wald statistic uses R = [0] [x] and q = 4. Collecting terms, we find that the test statistic for this hypothesis is W=Puviiên (21:27) where the subscript M indicates the subvector or submatrix corresponding to the M variables and V is the estimated asymptotic covariance matrix of 8. 678 CHAPTER 21 + Models for Discrete Choice Likelihood ratio and Lagrange multiplier statistics can also be computed. The like- lihood ratio statistic is LR=—2[n br Into). where Lp and Íy are lhe log-likelihood functions evaluated at the restricted and unte- stricted estimates, respectively. A common test, which is similar to the F'test that all the slopes in a regression are zero, is the likelihood ratio test that all the slope coclfficients in the probit or logit model are zero. For this test. the constant term remains unrestricted. In this case, the restricted log-likelihood is the same [or both probit and logit models, InLo=n[PnP+GoPInd-P), (21-28) where P is the proportion of the observations that havc dependent variable cqual to 1. Jt might be tempting to use the likelihood ratio test to choose between the probit and logit models. But there is no restriction involved, and the test is not valid for this purpose. To underscore the point, there is nothing in its construction to prevent the chi-squared statistic for this “test” from being negative. 
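The sketch below (hypothetical data and coefficients, not Example 21.4) computes both the likelihood ratio statistic based on the restricted log-likelihood in (21-28) and the corresponding Wald statistic for the hypothesis that all the slopes in a logit model are zero.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import chi2

def lam(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_loglik(beta, y, x):
    p = lam(x @ beta)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Hypothetical data: a constant plus two slopes.
rng = np.random.default_rng(6)
n = 400
x = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = (x @ np.array([0.3, 0.9, 0.0]) + rng.logistic(size=n) > 0).astype(float)

res = minimize(neg_loglik, np.zeros(3), args=(y, x), method="BFGS")
b, lnL = res.x, -res.fun

# Restricted log-likelihood with only a constant, per (21-28).
P = y.mean()
lnL0 = n * (P * np.log(P) + (1 - P) * np.log(1 - P))
LR = 2.0 * (lnL - lnL0)

# Wald statistic for the same hypothesis (both slopes zero),
# using the Hessian-based covariance matrix of the unrestricted estimates.
p_hat = lam(x @ b)
H = -(x * (p_hat * (1 - p_hat))[:, None]).T @ x
V = np.linalg.inv(-H)
b_s, V_s = b[1:], V[1:, 1:]
W = b_s @ np.linalg.solve(V_s, b_s)

df = 2
print(f"LR = {LR:.3f}  (p = {chi2.sf(LR, df):.4f})")
print(f"W  = {W:.3f}  (p = {chi2.sf(W, df):.4f})")
```

The two statistics are asymptotically equivalent under the null hypothesis, but they will generally differ in a finite sample.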
The Lagrange multiplier test statistic is LM = g'Vg. where g is the first derivativos of the unrestricted model evaluated at the restricted parameter vector and V is any of the three estimators of the asymptotic covariance matrix of lhe maximum likelihood es- timator, once again computed using the restricted estimates. Davidson and MacKinnon (1984) find cvidence that E[H] is the best of the three estimators to use, which gives H t n = " LM = (» ex) >» El-hia (> e) , (21:29) i=1 i=1 i=1 wherc E[—A;) is defined in (21-22) for the logit modei and in (21-24) for the probit model. For the logit model, when the hypothesis is that all the slopes are zero, LM =nR, where R2 is the uncentered coefficient of determination in the regression of (3; — 7) on x; and j is the proportion of Is in the sample. An alternative [ormulation bascd on the BHHH estimator, which we developed in Section 17.5.3 is also convenient. For any of the models (probit, logit, Weibull, etc.). the first derivative vector can be written as ami É me 38 =3 gm =X6Gi, i=d where G(n x n) = diaglgi, g2..... 8h] andiis an nx 1 column of Is, The BHHH csti- mator ol the Hessian is (X'G'GX), so lhe LM statistic bascd on this estimator is IM=n [piemereeo” ore =nR. (21-30) where R$ is the uncentered coefficient of determination in a regression of a column of ones on the first derivativos of the logs of the individual probabilities. All the statistics listed here are asymptotically equivalent and under the nullhypoth- esis of the restricted model have limiting chi-squared distributions with degrees of frec- dom equal to the number of restrictions being tested. We consider some examples below. CHAPTER 21 + Moadels for Discrete Choice 679 21.4.4 SPECIFICATION TESTS FOR BINARY CHOICE MODELS In lhe lincar regression model, we considered two important specification problems, the effect ol omitted variables and the effect of heteroscedasticity. In the classical model Y = Xi84 + Xo8> + e, when least squares estimates b, are computed omitting X>, Edi] = 8 + [XIX XXo8). Unless X, and X» arc orthogonat or 8, = 0. by is biased. Hfwe ignore heteroscedasticity, then although the Icast squares estimator is still unbiased and consistent, it is inefficient and the usual estimate of its sampling covariance matrix is inappropriate. Yatchew and Griliches (1984) have examined these same issues in the setting of the probit and logit models. Their general results are far more pessimistic, In the context of a binary choice model, lhey find the following: 1, Jfx>is omitted from a model containing x; and x». (ie. 8 0) then Plim 2, =ciBy + c282, where cy and «> are complicated functions of the unknown parameters. The implication is that even if the omittcd variable is uncorrelated with the included one, the coefficient on the included variable will be inconsistent, 2. Ifthe disturbances in the underlying regression are heteroscedastic, then the maximum likelihood estimators are inconsistent and the covariance matrix is inappropriate. The second result is particularly troubling because the probit model is most often used with microeconomic data, which are frequently heteroscedastic. Any of the three methods of hypothesis testing discussed above can be used to analyze these specification problems. The Lagrange multiplicr test has thc advantage thatitcan be carried out using the estimates from the restricted model, which sometimes brings a large saving in computational clfort. This situation is especially true for the test for heteroscedasticity. 
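For the special case described above, the LM statistic for the hypothesis that all the slopes are zero in a logit model can be computed with a single auxiliary regression. The sketch below (hypothetical data) uses the LM = nR² result, where R² is the uncentered coefficient of determination from regressing (y_i − ȳ) on x_i.

```python
import numpy as np

def lm_all_slopes_zero_logit(y, x):
    """LM statistic for H0: all slopes are zero in a logit model.
    Uses LM = n * R_u^2, where R_u^2 is the uncentered R^2 from the
    regression of (y_i - ybar) on x_i (x includes the constant)."""
    n = len(y)
    e0 = y - y.mean()                        # "residuals" from the restricted model
    b = np.linalg.lstsq(x, e0, rcond=None)[0]
    fitted = x @ b
    return n * (fitted @ fitted) / (e0 @ e0)

# Hypothetical data generated with no relation to the regressors, so the
# statistic should be small (chi-squared with 2 degrees of freedom under H0).
rng = np.random.default_rng(2)
n = 400
x = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
y = (rng.uniform(size=n) < 0.4).astype(float)
print("LM =", round(lm_all_slopes_zero_logit(y, x), 3))
```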
To reiterate, the 1 agrange multiplier statístic is computed as follows. Let the null hypothesis, Ha. be a specification of the model, and let + be the alternative. Forexample, Ho might specify that only variables x, appear in the model, whereas JA might specify that xp appears in the model as wcll. The statistic is LM= gV5'go, where ga is the vector of derivatives ofthc log-likelihood as specified by Z7, but evaluated at the maximum likclihood estimator of the parameters assuming that 14 is true, and Y5! is any of the three consistent estimators of the asymptotic variance matrix ol the maximum likelihood estimator under Fh, also computed using the maximum likelihood estimators based on H. The statistic is asymptotically distributed as chi-squared with degrees ol freedom equal to the number of restrictions. E The results in (his section are based on Davidson and MacKinnon (1984) and Engle (1984). A symposium on ihe subject of specification tesis in diserete choice models is Blundell (1987). 682 CHAPTER 21 + Models for Discrete Choice Estimated;CoeHi Estimate (StdEr) Mare. Effec: Estimare (SLER) | Marg. Effeci Constant Bro 41571402) — —6.030(2.498) — Age B> Q.185(0.0660) —0,0079(0.0027) 0.264(0.118) —(,0088(0.00251) Ago? Ba —0.0024(0.00077) — —0.0036(0.0014) — Income Ba 0.0458(0.0421) 0.0180(0.0165) 0.424(0.222) 0,0552(0,0240) Education Bs 0,0982(0,0230) 0.0385(0.0090) 0,140(0.0519) 0.0289(0.00869) Kids Be —0.449(0.131) —ONTI(O0480) 0.870.303) —0.167(0.0779) Kids A 0.000 — —0,141(0.324) — Income » 0.000 — 0,313(0.123) — Log 7. —490.8478 —487.6356 Correct Preds. Os: 106, 15:357 Os: 115,15:358 *Marginal eifect and estimated standard error include both mean () and variance (y) cffects. Table 21.3 presents estimates of the probit mode! now with a correction for heteroscedas- ticity of the form Var[e;] = explyrkids + yfamily income). The three tests for homoscedasticity give LR = 2[-487.6356 — (—490.8478)] = 6.424, LM = 2.236 based on the BHHH estimator, Wald = 6.533 (2 restrictions). The 99 percent critical value for two restrictions is 5.99, so the LM statistic conflicts with the other two. 21.4.4.c A Specification Test for Nonnested Models — Testing for the Distribution Whether the logit or probit form, or some third alternative, is the best specification for a discrete choice model is a perennial question. Since the distributions are not nested within some higher level model, testing for an answer is always problematic. Building on the logic ol the P; test discussed in Section 9.4.3, Silva (2001) has suggested a score testwhich may be useful in this regard. The statistic is intended for a varicty of discrete choice models, but is especially convenient for binary choice models which are based on a common single index formulation—the probability model is Probty; = 1|x) = F(g8). Let “1” denote Model 1 based on parameter vector 8 and “2” denote Model 2 with parameter vector y and let Model 1 be the null specification while Model 2 is the alternative, A “super-model” which combines two alternatives would have likelihood function p= ÀALOLOI XP tai | My Po MA OuGIXP+atag X pda (Note that integration is used generically here, since y is discrete.) The two mixing parameters are p and «. Silva derives an LM test in this context for the hypothesis « = 0 for any particular value of p. The case when p = O is of particular interest. As he notes, it is the nonlincar counterpart to the Cox test we examined in Section 8.3.4. 
[For related results, see Pesaran and Pesaran (1993), Davidson and MacKinnon (1984, 1993). CHAPTER 21 + Models for Discrete Choice 683 Orme (1994). and Weeks (1996).] For binary choice models, Silva suggests the following procedure (as one of three computational strategies): Compute the parameters of the compcting models by maximum likelihood and obtain predicted probabilities for y; = 1, P? where “i” denotes lhe obscrvation and “m” =1 or 2 for the two models. The individual observations on the density for the null model, /?:, are also required. The new variable 2(0) = fi PP) isthen computed. Finally, Model 1 is then reestimated with z;(0) added as an additional independent variable. A test of the hypothesis that its coelficient is zero is equivalent to alest of the null hypothesis that « = 1, which favors Model 1. Rejection of the hypothesis favors Model 2. Silva's preferred procedure is the same as this based on a] P; Tr apo Assupgested by the citations above, tests of this sort have a long history in this literature, Silva's simulation study for the Cox test (o = 0) and his score test (o = 1) suggest that the power of the test is quite erratic. 21.4.5 MEASURING GOODNESS OF FIT Therc have been many fit measures suggested for QR models.!é At a minimum, one should report the maximized valuc of the log-likelihood function. Im L. Since the hypothesis that all the slopes in the model are zero is often interesting. the log-likelihood computed with only a constant term, In £o [sce (21-28)], should also be reported, An analog to the R? in a conventional regression is McFadden's (1974) likelihood ratio index, In£ Indo. This measurc has an intuitive appeal in that it is bounded by zero and onc. If all the slope coefficients are zero, then il cquals zero. There is no way to make LRI equal 1, although one can come close. Tf F; is always one when y equals one and zero when y equals zero. then In L equals zero (the log ol one) and LRI equals one. Tt has been suggested that this finding is indicative of a “perfect fit” and that LRI increases as the fit of the model improves. To a degree, this point is truc (sec the analysis in Section 21.6.6). Unfortunately, the values between zero and one have no natural interpretation. HE P6x; 8) is a proper pdí, then even with many regressors the model cannot fit perfecily unless x/8 goes to +00 or —oc. As a practical matter, it doces happen. But when it does. it indicates a law in the madel, not a good fit. If the range of one of the independent variables contains a value, say x*, such that the sign of (x — x*) predicts y perfectly LRI=1- 1SHis conjecture about the computational burden is probably overstated given that modern software offers a variety ol binary choice models essentially in push-burton fashion. 1óSco, for example. Cragg and Uhler (1970). Amemiya (1981), Maddala (1983). McFadden (1974), Ben-Akiva and Lerman (1985), Kay and Little (1986). Veall and Zimmerman (1992), Zavoina and Mekelvey (1975). Efron (1978), and Cramer (1999), A survey of techniques appears in Windmeijer (1995). 684 CHAPTER 21 + Models for Discrete Choice and vice versa, then Lhe mode! will become a perfect predictor. This result also holds in general if the sign ol x'B gives a perfect predictor for some vector 8.7 For example, one might mistakenly include as a regressor a dummy variables that is identical, or nearly so, lo Lhe dependent variable. In this case, thc maximization procedure will break down precisely because x'f is diverging during the iterations. 
|Sce McKenzie (1998) for an application and diseussion.] Of course, this situation is not at all what we had in mind for a good fit. Other fit measures have been suggested. Ben-Akiva and Lerman (1985) and Kay and Little (1986) suggested a fit measure that is keyed to the prediction rule, 1É a Ri ui DM E), i=1 which is the average probability of correct prediction by thc prediction rule. The diffi- culty in this computation is that in unbalanced samples, the less [requent outcome will usually be predicted vary badly by the standard procedure, and this measure does not pick that point up. Cramer (1999) has suggested an alternative measure that directly measures Lhis failure, » = (average? | yy = 1) — (average? |y = 0) = (average(l — É) | y; = 0) — (average(l — É) | 3; = 1). Cramer's measure heavily penalizes the incorrect predictions, and because each propor- lion is taken within the subsample, it is not unduly influenced by the large proportionate size of the group of more frequent outcomes. Some of the other proposed fit measures are Elron's (1978) Duo d? Di O" R$ =1 Vcall and Zimmermann's (1992) Ra = (riso! YZ" ÀS-IRI "O Dlog Lo and Zavoina and MeKclvey's (1975) Dia (Bt — é? n+ 7 (Bm — Bi)? The last oí these measures corresponds to the regression variation divided by the total variation in the latent index function model, where the disturbance variance iso? = 1, The values of'scveral of these statistics are given with the model results in Example 21.4 for illustration. A useful summary of the predictivo ability of the model is a 2 x 2 table of the hits and misses of a prediction rule such as 2 Ryz = P=1 ifÊ>F“andoO otherwise. (21-36) VSee McFadden (1984) and Amemiya (1985). If this condition holds, then gradient methods wil find that £ CHAPTER 21 + Models for Discrete Choice 687 where Els]=0, Varfo] = FUDmo (21-38) This heteroscedastic regression format suggests that the parameters could be estimated by a nonlincar weighted least squares regression. But there is a simpler way to proceed. Since the Iunction P(x;8) is strictly monotonie, it has an inverse. (Sce Figure 21.1.) Consider, then, a Taylor series approximation to this function around the points; =0, that is, around the point P = x;, À, Pap = E time Pa) + [Ep um) dm; But F-!(x;) = x! and de) 1 a dm FEM) fm so FB) me xçf + A PoE Ra This equation produçes a heteroscedastic linear regression, FMP=4=XB8+u, where Fixo — Fm; Elulx]=0 and Varju|x]= A (21-39) The inverse function for the logistic model is particularly easy to obtain. If = exp(xi8) CO A rexpogp) then m( Zi J-xa. 1-7; This function is called the logit of x;, hence the name “logit” model. For the normal distribution, the inversc function &" !(x;), called the normitof x;. must be approximated. The usual approach is a ratio of polynomials.? Woeighted least squares regression bascd on (21-39) produces the minimum chi- squared estimator (MCSE) of 8. Since the weights are functions of the unknown pa- rameters, a twu-step procedure is called for. As always, simple least squares at the first step produces consistent but inefficient estimates. Then the estimated variances E = mó; Bee Abramovitz and Stegun (1971) and Section E.5.2. The function normit +5 is callcd the probit of £. The term dates from the early days of this analysis, when the avoidance of negative numbers was a simplification with considerable payoff. 
688 CHAPTER 21 + Models for Discrete Choice for the probit model or o 1 mÁL- À) for the logit model bascd on the first-step estimates can be used for weighted least squares?! An iteration can then be set up, pro = DA x = pi where “(k)” indicales the Ath iteration and “>” indicates computation of the quantity at the current (Ath) estimate of 8. The MCSE has the same asymptotic properties as the maximum likelihood estimator at every step after the first, so, in fact, iteration is not necessary. Although they have the same probability limit, the MCSE is not algebraically ihe same as the MLE, and in a finite sample, they will diftcr numerically. The log-likelihood function for a binary choice model with grouped data is An Int= Dude In F(x;8) + — P)In[L > FexB)]). = The likelihood equation that defines the maximum likclihood estimator is n tr fi ç F(xiB) = E “Oto roça) This equation closely resembles the solution for the individual data case, which makes sense if wc view lhe grouped observation as 1; replications of an individual obser- vation. On the other hand, it às clear on inspection lhat the solution to this set of equations will not be the same as the gencralized (weighted) lcast squares estimator suggested in the previous paragraph. For convenience, define ; = F(xiB). fi= f(x). and f=[f()lz=x8]=[df(2)/dz]|2=x;f. The Hessian of the log-likelihood is susto (8) (8) came) (to) os To evaluate the expectation of the Hessian, we need only insert the expectation of the only stochastic element, 2, which is E[P;|x;] = Fi. Then PlogL] É , | mf? , Elgg) = 24 K sÉgIe= 5 [ae + The asymptotic covariance matrix for the maximum likclihood estimator is the negative inverse of this matrix. From (21-39), we sec that il is exactly equal to 1 =0. Asy. Var[minimum xº estimator] = [X'2-1x)-+ “Simply using p; and fF-4P9] might seem to be a simple expedient in computing the weights. But this method would be analogous to using x? instead of an estimate of o? in a heteroscedastie regression. Fitted probabilities and, tor the probit model, densities should be based on à consistent estimator ol the parameters. CHAPTER 21 + Models for Discrete Choice 689 since the diagonal elements of 2”! are precisely the values in brackcts in the expression for the expected Hessian above. We conclude that although the MCSE and the MLE for this model are numerically diffcrent, they have the same asymptotic properties, consistent and asymptotically normal (the MCS estimator by virtue of the results of Chapter 10, the MLE by those in Chapter 17), and with asymptotic covariance matrix as previously given. There is a complication in using the MCS estimator. The FGLS estimator breaks down if any of the sample proportions equals one or zero. A number of ad hoc patches have been suggested; the one that seems to be most widely used is to add or subtract a small constant, say 0,001, to or from the observed proportion when it is zero or one. The familiar results in (21-38) also suggest that when the proportion is based on à large population, the varianec of the estimator can be exceedingly low. This issue will Tesurface in surprisingly low standard errors and high + ratiosin the weighted regression. Unfortunately, that is a consequence of the model? The same result will emerge in maximum likclihood estimation with grouped data. 21.5 EXTENSIONS OF THE BINARY CHOICE MODEL Qualitativc response models have been a growth industry in econometrics. Thc recent literaturc, particularly in the area of panel data analysis. 
has produced a number of new techniques. 21.5.1 RANDOM AND FIXED EFFECTS MODELS FOR PANEL DATA The availability of high quality pancl data sets on microeconomic behavior has main- taincd an interest in extending the models of Chapter 13 to binary (and other discrete choice) models, In this section, we will survey a few results from this rapidly growing literature. The structural model for a possibly unbalanced panel of data would bc wriften Mi =MB+te i=Lcnt=1,...,T, Yu =1 ifyj > 0, and O otherwise. Thc second line of this definition is often written dir = Mai + eu > 0) to indicate a variable which equals one when the condition in parentheses is truc and zero when it is not. Ideally, we would like to specify that £;, and s; are frecly correlated within a group. but uncorrelated across groups. But doing so will involve computing * Whether the proportion should, in fact, be considered as a single observation from a distribution of pro- portions is a questiun that arises in all these cases. It is unambiguous in the bioassay cases noted earlier. But the issue is less clear with election data, especially since in these cases, the 7; will represent most ofif not all the potential respondents in location i rather than a random sample of respondents. 692 CHAPTER 21 + Models for Discrete Choice Conditioned on the common &;, the «'s are independent, so the term in square brackcts is just the product of the individual probabilities. Wc can write this as too [E / ph Yi 1X] -[ II (4 fetos] Fui) dus. TS L=1 ' Now, consider the individual densitics im Lhe product. Conditioncd on u;, these arc the now familiar probabilíties for the individual obscrvations, computcd now at x;,8 + tt. This produccs a general model for random effects for the binary choice model. Collecting all the terms, we have reduced it to li = Pla. soc [7 Li= Pbnsom lo f Ki ProbCk, = mu 8 + o) fun du e Li=1 It remains to specify the distributions, but the important result thus far is that the entire computation requires only onc dimensional integration, The inner prohabilitics maybe any cf the models wc have considered so far. such as probit, logit, Weibull. and so om. The intricate part remaining is to determine how to do the outer integration. Butler and Moffitt's method assuming that «; is normally distributed is straightforward, so we will consider it first. We will then consider some other possibilities. For the probit model, the individual probabilities inside the product would be O[g; (xj,B + u;)), where 9[.] is lhe standard normal CDF and g; = 2yi — 1. For the logit model, &[.] would be replaced with the logistic probability, A[.]. For the present. treat the entire function as a function of u;, g(:;). The integral is, then o 1 L= Ê. ou2a Letr; = m/(0,n/2). Then, u = (0uv/2)r; = 9r; and du; = 6dr;. Making the change of variable produces EE g(u;) dus. 1/9 s L= =f. g(Uri) dr;. (Several constants cancel out of the fractions.) Returning to our probit (or logit model), we now have 1 qt L= =| em VT Jo The payoff to all this manipulation is that this likelihood function involves only one- dimensional integrals. The inner integrals are the CDF of the standard normal distri- bution or the logistic or extreme value distributions, which are simple to obtain. The function is amenable to Gauss-Hermite quadrature for computation. (Gauss-Hermite quadrature is discussed in Section E.5.4,) Assembling all the pieces, we obtain lhe ap- proximation to the log-likelihood; n 1 HI In Ly = Lp fr E »X Due (a: (x,8 + vao) | h=1 = E [ais + er t=1 dr. 
CHAPTER 21 + Models for Discrete Choice 693 where H is the number of points for the quadrature, and w, and z, are the weights and nodes for the quadrature. Maximizing this function remains a complex problem. But, it is made quite feasible by the transtormations which reduce the integration to one dimension. This technique for the probit model has been incorporated in most contemporary econometric software and can be casily extended to other models. The first and second derivativos are likewisc complex but still computable by quadrature. An estimate of o, is obtained from the result o, = g//2 and a standard error can be obtained by dividing that for ô by «/2. The model may be adapted to the logit or any other tormulation just by changing the CDF in the preceding equation from $[.] to the logistic CDF, A[.] or the other appropriate CDF. The hypothesis of no cross-period correlation can be tested, in principle. using any of the three classical testing procedures we have discussed to examine the statistical significance of the estimated o. A number of authors have found the Butler and Moffitt formulation to be a satis- factory compromise between a fully unrestricted model and the cross-scelional variant that ignores the correlation altogether. A recent application that includes both group and time effects is Tauchen, Witte, and Griesinger's (1994) study of arrests and criminal behavior. The Butler and Moffitt approach has been criticized for the restriction of equal correlation across periods. But it does have a compelling virtuc that the model can bc efficiently estimated even with fairly large T; using conventional computational methods. [See Greene (1995a, pp. 425-431).] A remaining problem with the Butler and Moffitt specification is its assumption of normality. In general, other distributions are problematic because of the difficulty of finding cither a closed form for the integral or a satisfactory method of approximating the integral. An alternative approach which allows some flexibility is the method of maximum simulated likclihood (MSL) which was discussed in Section 17.8. The trans- formed tikelihood we derived above is an expectation; T dao = [ II ProbX: = ya [24,8 + v) Fan) du = L=1 = Eu T% [rode = pa lxt,8 + o) . 2-1 This expeetation can bc approximated by simulation rather than quadrature. First. let 6 now denote the scalc parameter in the distribution of 1;. This would be o, for a normal distribution, for example, or somc other scaling Lor the logistic or uniform distribution. Then, write the term in the likclihood function as * Tr Li = Eu pi FO X,B + Om; ] = Eulh(u)). 1 The function is smooth, continuous, and continuously differentiable. TE this expectation is finite, then lhe conditions of the law of large numbers should apply, which would mcan that for a sample of observations u;, ..., ks R plim & tau) = Edh(u). r=1 694 CHAPTER 21 + Models for Discrete Choice This suggests, based on the results in Chapter 17, an alternative method of maximizing the log-likclihood for the random effects model. A sample of person specific draws from the population u; can be generated with a random number generator. For the Butler and Moffitt model with normally distributed u;, the simulated log-likelihood function is a RI E in Esimutaad = 9] In ( 53 poi F lg (NB + 7) | is = Lit This function is maximized with respect 8 and o. Note that in the preceding, as in the quadrature approximated log-likelihood, the model can be based on a probit, logit. or any other functional form desired. 
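The following sketch illustrates the maximum simulated likelihood idea for a random effects probit model on a small simulated panel. The data, the number of draws R, and the starting values are arbitrary choices made only for the illustration; it is a minimal sketch of the simulated log-likelihood described above, not a production estimator.

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(3)

# Hypothetical balanced panel: n groups, T periods, a constant and one regressor.
n, T, R = 200, 5, 100
x = np.column_stack([np.ones(n * T), rng.normal(size=n * T)]).reshape(n, T, 2)
u_true = rng.normal(size=n)
y = (x @ np.array([-0.3, 0.8]) + 0.7 * u_true[:, None]
     + rng.normal(size=(n, T)) > 0).astype(float)

# One fixed set of simulation draws per group, held fixed across iterations.
draws = rng.normal(size=(n, R))

def neg_simulated_loglik(theta):
    beta, sigma_u = theta[:-1], np.abs(theta[-1])
    idx = x @ beta                                   # (n, T) index x_it'beta
    q = 2.0 * y - 1.0                                # +1 / -1 sign variable
    # Probability of the observed outcome for each (i, t, r):
    # Phi[ q_it * (x_it'beta + sigma_u * u_ir) ]
    arg = q[:, :, None] * (idx[:, :, None] + sigma_u * draws[:, None, :])
    prob_itr = norm.cdf(arg)
    # Product over t, average over the R draws, then log and sum over groups.
    sim_prob_i = np.prod(prob_itr, axis=1).mean(axis=1)
    return -np.sum(np.log(sim_prob_i + 1e-300))

res = minimize(neg_simulated_loglik, np.array([0.0, 0.0, 0.5]), method="BFGS")
print("beta_hat =", np.round(res.x[:2], 3), " sigma_u_hat =", round(abs(res.x[2]), 3))
```

Replacing the normal draws with draws from another distribution (for example, the logistic draws described above) changes only the line that generates `draws`, which is the flexibility the simulation approach offers over quadrature.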
There is an additional degree of flexibility in this approach.'The Hermite quadrature approach is essentially limited by its functional form to the normal distribution. But, in the simulation approach, 4; can come from some other distribution. For example, it might be believed that the dispersion of the hetero- geneity is greater than implicd by a normal distribution. The logistic distribution might bepreferable. A random sample from the logistic distribution can be created by sampling Gra. w; 8) [rom the standard uniform [0, 1 distribution, then É; = In(w;, (1 wi). Other distributions, such as the uniform itself, arc also possible, We have cxamincd two approaches to estimation of a probit model with random ef- fects. GMM estimation is another possibility. Avery, Hansen, and Hotz (1983), Bertschek and Lechner (1998), and Inkmann (2000) examine this approach: the latter two offer some comparison with the quadrature and simulation based estimators considered here. (Our applications in the following Examples 16.5, 17.10, and 21.6 use Lhe Bertschek and Lechner data.) The preceding opens another possibility. The random cffects model can be cast as a model with a random constant term; M=0+Xpabyre i= nt= E. Yu =1 ifyi > 0, and O otherwise where q; = « +o,it;. This is simply a reinterpretation of the model we just analyzed. We might, however, now extend this formulation to the [ull parameter vector. The resulting structure is = %Bitem i ye =1 ify; > 0, andO otherwise where 8,=$ + Tu; wherc T is a nonnegative definite diagonal matrix—some of its diagonal elements could be zero for nonrandom parameters. The method of estimation is essentially the same as beforc. The simulated log likelihood is now " R[K In Lider => Im [á ” jo Flgu (xi (8 + Tu »] ! . i=1 = Lo The simulation now involves R draws from (he multivariate distribution of u. Since the draws arc uncorrelated—T is diagonal—-this is essentialiy the same cstimation problem as the random cffects model considered previously. This model is estimated in Exam- ple 17.10. Example 16.5 presents a similar model that assumes that the distribution of B; is discrete rather than continuous. CHAPTER 21 + Models for Discrete Choice 697 which uses only the K x K matrix computed above and a few K x 1 vectors: n n dem E peso po- (5 [Dum a “sh o» po Gg soh 1=1 i=i Lt=l i=1 = ts) (5) =80+ Ag and AD = 8 — gu/hi) + Ap]? This is a large amount of computation involving many summations, but it is linear in the number of parameters and does not involve any n x n matrices. The problems with the fixed effects estimator are statistical, not practical? The estimator relics on 7; increasing for the constant terms to be consistent—in essence, each a; is estimated with 7; observations. But, in this setting. not only is F fixed, itis kely to be quite small. As such. the estimators of the constant terms are not consistent (not because they converge to something other than what (hey are trying to estimate, but because they do not converge at all). The estimator of 8 is a function of the estimators of «, which means that the MLE of £ is not consistent either. This is the incidental parameters problem. [Sec Neyman and Scott (1948) and Lancaster (2000),] There is, as well, a small sample (small 7;) bias in the cstimators. How serious this bias is remains a question in the literature, Two picces of received wisdom are Hsiao's (1986) results for a binary logit model and Heckman and MaCurdy's (1980) results for the probit model. 
Hsiao found that for 4; = 2, the bias in the MLE of 8 is 100 percent, which is extremely pessimistic. Heckman and MaCurdy found in a Monte Carlo study that in samples of a = 100 and ” = 8, thc bias appeared to be on the order of 10 percent, which is substantive, but certainly less severe than Hsiao's results suggest. The fixed effects approach does have some appeal in that it does not require an assumption of orthogonality of the independent variables and the heterogeneity. An ongoing pursuit in the literature is concerned with the severity of thc tradeoff of this virtue against the incidental parameters problem. Some commentary on this issue appears in Arellano (2001). Why did the incidental parameters problem arise here and not in the lincar regres- sion model? Recall that estimation in the regression model was based on the deviations from group mcans, not the original data as it is here. The result we exploited there was that although f (yr | X;) is a function of e;, f(y | Xy, 5;) is nota function ol o;, and we used the latter in estimation of 8. In that setting, Y; is a minimal sufficient statistie for a. Sufficient statistics are available for a few distributions that we will examine, but not for the probit model. They are available for the logit model, as we now examine. Similar results appear in Prentice and Gloeckler (1978) who attribute it to Rao (1973), and Chamberlain (1983). Wgee Vyllacil, Akvik and Heckman (2002), Chamberlain (1980, 1984), Newey (1994), Bover and Arellano (1997) and Chen (1998) for some extensions of parametric forms of the binary choice models with fixed effects. 698 CHAPTER 21 + Models for Discrete Choice A fixed effects binary logit model is EAE Probly = 1x) = Tres çd The unconditional likelihood for the n'7 independent observations is L= [eira - pot o Chamberlain (1980) [[ollowing Rasch (1960) and Anderson (1970)] observed that the conditional likelihood function, n [fre =» Ya i=1 L Vias Bin K Do) . t=1 isfrec of the incidental parameters, a;. The joint likelihood for each seto! 7; observations conditioned on the number of ones in Lhe set is Pe = Va= yo Ku = in ” cs) 1 O so(Sivad) Dra-s exp(Nr E) ] The function in lhe denominator is summed over the set of all (8) different sequences of 7; zeros and ones lhat have the same sum as S, = z2, pu! Consider the example of 7; = 2. The unconditional likelihood is L=[|[Prob(y; : sinDProbCXa = yi2). For cach pair of observations, we have these possibilities: 1. yi=Oand yo =0. Prob(0,0|sum =0) = 1. 2 yn=landy=1Prob(l,lsum=29=1. The ith term in Zº for either of these is just one, so they contribute nothing to the con- ditional likelihood function.” When we take logs, these terms (and these observations) will drop out. But supposc that y = O and yo = 1. Then Prob(0, landsum = 1) a Probcô, 1) 3. Prob(O, I|sum=D= = — Prod) TOb(O, 1 |sum = 1) Probtum = 1) Prob(O, 1) + ProbQ, 0) “The enumeration of all these computations stands to be quite a burden—see Arellano (2000, p. 47) or Baltagi (1995, p. 180) who [citing Greene (1993)] suggests that 7; > 10 would be excessive, In fact. using à recursion suggested by Krailo and Pike (1984), (he computation even with 7; up to 100 is routine, “Recall in (he probit model when we encountered this situation. (he individual constant term could not be estimated and the group was removed [rom lhe sample. The same offect is at work here. 
CHAPTER 21 + Modeis for Discrete Choice 699 Therefore, for this pair of observations, the conditional probability is 1 PRE 1 es KT ess; o eia o eu saA eu+si8 1 “fre Tp eaãS Ipes * Tp ad T pen By conditioning on the sum of the two observations, we have removed lhe heterogeneity. Therefore. we can construct the conditional likclihood function as the product of these terms for the pairs of observations for which the two obscrvations are (0, 1). Pairs of observations with one and zero arc included analogously. The product of the terms such as the preceding, for those observation sets for which the sum is not zero or T;, constitutes the conditional likclihood. Maximization of the resulting function is straightforward and may be done by conventional methods. As in the linear regression model, il is of some interest to test whether there is indecd heterogeneity. With homogeneily (o; = o). there is no unusual problem, and the model can bc estimated, as usual, as a logit model. It is not possible to test the hypothesis using the likelihood ratio test, however, because the two likelihoods are not compara- ble. (The conditional fikelihood is based on a restricted data set.) None of the usual tests of restrictions can be used because the individual effects are never actually estimated Hausman's (1978) specification test is a natural one to usc here, however. Under the null hypothesis of homogencity, both Chamberiain's conditional maximum likelihood estimator (CMLE) and the usual maximum likelihood estimator are consistent, but Chamberlain's is incfficient. (It fails to usc the information that a; =«, and it may not use all the data.) Under the alternative hypothesis, the unconditional maximum like- lihood cstimator is inconsistent* whercas Chamberlain's estimator is consistent and efficient. The Hausman test can be based on the chi-squared statistic x? = (Bem — Êmi)'(Var[CML) — Var[ML) "(Bem — Bm). “The estimated covariance matrices are those computed for the two maximum likelihood estimators. For the unconditional maximum likclihood cstimator, the row and column corresponding to the constant term are dropped. A large value will cast doubt on the hypothesis of homogeneity. (There are K degrees of freedom for the test. ) Tt is possible that the covariance matrix for tie maximum likelihood estimator will be larger than that for the conditional maximum likelihood estimator. 1í so, then the difference matrix in brackets is assumed to be a zero matrix, and the chi-squared statistic is therefore zero, 25This produces a difficulty for this estimator that is shared by the semiparametric cstimators discussed in the next section. Since the fixed effects are not estimated, it is not possible to compute probabilities or marginal effccis with these estimated coelficients, and ít is a bit ambiguous what one can do with the results of lhe compultations. The brute force estimator that actually computes the individual cficets might he preferable. *Hsaio (1996) derives the result explicitly for some particular cases. 702 CHAPTER 21 + Models for Discrete Choice vector.” Greene's result is useful only for the same purpose as Amemiya's quick correction of OLS. Multivariate normality is obviously inconsistent with most appli- cations. For example, ncarly all applications include at least one dummy variable. Ruud (1982) and Cheung and Goldberger (1984), however, have shown that much weaker conditions than joint normality will produce the same proportionality result. 
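For the two-period case just derived, the conditional estimator reduces to an ordinary logit of the period-two outcome on the differenced regressors, using only the groups whose outcomes sum to one. The sketch below (simulated data, a single regressor) illustrates that computation; it is a minimal demonstration, not the general recursion needed for larger T_i.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)

# Hypothetical two-period panel with individual effects alpha_i.
n = 1000
alpha = rng.normal(size=n)
x1, x2 = rng.normal(size=n), rng.normal(size=n)       # one regressor, two periods
beta_true = 1.0
y1 = (alpha + beta_true * x1 + rng.logistic(size=n) > 0).astype(int)
y2 = (alpha + beta_true * x2 + rng.logistic(size=n) > 0).astype(int)

# Only pairs with y1 + y2 = 1 carry information; (0,0) and (1,1) pairs drop out.
keep = (y1 + y2) == 1
d = y2[keep]                          # 1 if the "one" occurred in period 2
dx = (x2 - x1)[keep]                  # differenced regressor

def neg_cond_loglik(b):
    # Prob(0,1 | sum = 1) = Lambda((x2 - x1)'b); no constant term appears.
    z = dx * b[0]
    p = 1.0 / (1.0 + np.exp(-z))
    return -np.sum(d * np.log(p) + (1 - d) * np.log(1 - p))

res = minimize(neg_cond_loglik, np.array([0.0]), method="BFGS")
print("conditional MLE of beta:", round(res.x[0], 3), " pairs used:", int(keep.sum()))
```

Note that the individual effects never appear in the conditional likelihood, which is exactly why the incidental parameters problem is avoided, and also why no probabilities or marginal effects for specific individuals can be computed afterward.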
For a pro- bit model, Cheung and Goldberger require only that E[x| y*] be linear in y*. Several authors have built on these observations to pursue the issue of what circumstances will Jead to proportionality results such as these. Ruud (1986) and Stoker (1986) have cx- tended them to a very wide class ot models that goes well beyond those ol Cheung and Goldberger. Curiously enough, Stoker's results rule out dummy variables, but it is those for which the proportionality result seems to be most robust .O 21.5.3 THE MAXIMUM SCORE ESTIMATOR (MSCORE) In Section 21.45, we discussed the issue of prediction rules for the probit and logit models. In contrast to the linear regression model, estimation of these binary choice models is not based on a fitting rule, such as the sum of squared residuals, which is related to the fit of the model to the data. The maximum score estimator is based on a fitting rule, Maximizeg Sa(B) = n tz — (1 2o)]sgn(xi 8) * ini The parameter « is a preset quantile, and 7 = 2y — 1. (Soz=-Lilv=0)lais set to 4, lhen the maximum score estimator chooses the 8 to maximize the number of times that the prediction has the same sign as z. This result matches our prediction rule in (21-36) with F* = 0,5. So for a = 0.5, maximum score attempts to maximize the number of correct predictions. Since the sign of x'f is Lhe same for all positive multiples of £, the estimator is computed subject to the constraint that 8'8 = 1. Since there is no log-likclihood function underlying the firting criterion, there is no information matrix to provide a method of obtaining standard errors for the estimates. Bootstrapping can uscd to provide at Icast somc idea of the sampling variability of the estimator. (See Section E.4.) The method proceeds as follows. After the set of coelficients b, is computed, R randomly drawn samples of 1»: observations are drawn from the original data set with replacement. The bootstrap sample size m may be less than or equalto n. the sample size. With each such sample, the maximum score estimator is recomputed, giving by (*). Then the mean-squared deviation matrix 1 R MSD(b) = 7 5 bm?) — ba]lbm(?) — by] b=1 scale factor is estimable with the sample data, so under lhese assumptiuns, a method of moments mator is available. “See Greene (1983). “Soc Manski (1973, 1985, 1986) and Manski and Thompson (1986). For extensions ol his model, see Horowitz (1992). Charlier, Melenberg and van Soest (1995), Kyriazidou (1997) and Loe (1996). e CHAPTER 21 + Models for Discrete Choice 703 Maximum Score Probis Estimate Mean Square Dev. Estimate Standard Error Constant fi —0.9317 0.1066 —74S22 2.5420 GPA & 0.3582 0.2152 16260 0.6939 TUCE &: —0.01513 0.02800 0.05173 0.08389 PSI Bs 005902 0.2749 14264 FRitted 01 Actual 021 0 Actual 147 iscomputed, The authors of the technique emphasize that this matrix is not a covariançe matrix. *? Example 21.7 The Maximum Score Estimator Table 21.5 presents maximum score estimates for Spector and Mazzeo's GRADE model using « = 0.5. Note that they are quite far removed from the probit estimates. (The estimates are extremely sensitive to the choice of «.) Of course, there is no meaningfui comparison of the coefficients, since the maximum score estimates are not the slopes of a conditional mean function. The prediction performance of the model is also quite sensitive to «, but that is to be expected. As expected, the maximum score estimator performs better than the probit estimator. 
The score is precisely the number of correct predictions in the 2 x 2 table, so the best that the probit model could possibly do is obtain the “maximum score.” In this example, it does not quite attain that maximum. [The literature awaits a comparison of the prediction performance of the probit/logit (parametric) approaches and this semiparametric model] The relevant scores for the two estimators are also given in the table. Semiparametric approaches such as (his one have the virtue that they do not make a possibly erroneous assumption about the underlying distribution. On the other hand, as seen in the example, there is no guarantee that the estimator will outpertorm the fully parametric cstimator. Onc additional practical consideration is that semiparametric estimalors such as this one are very computation intensive, At present, the maximum score estimator is not usable for more than roughly 15 coefficients and perhaps 1.500 to 2,000 observations.* A third shortcoming of the approach is, unfortunately, inherent in “Note that wc are not yel agrecd thai b, even converges to a meaningful vector, since no underlying proba- bility distribution as such has been assumed. Once it is agrecd that Ihere is an underlying regression function au work, then à meaninglul set of asymptotic results, including consistency, can be developed. Manski and Thompson (1986) and Kim and Pollard (1990) present a number of results. Even so, it has been shown that the bootstrap MSD matrix is useful for little more than descriptive purposes, Horowilz's (1993) smoothed maximum secure estimator replaces the discontinuous sgn (8'x;) in the MSCORE eriterion with a continuous weighting Lunction. P(8'x;/ h), where A is a bandwidth proportional to 2715. Fe argues that this estimator isan improvement over Manski's MSCORE estimator. (“Tts asymptotie distribution is very complicated and not useful for making inferences in applications.” Later in the same paragraph he argues, “There has been no Lheoretical investigation of Lhe properties of the bootstrap in maximum scoro estimation.”) “The criterion function [or choosing b is not continuous, and il has more than one optimum. M. E. Bissey reported finding that the score function varies significantly between the local optima as well. [Personal correspondence to Lhe author, University of York (1995).] *Communication from C. Manski to the author. The maximum scorc estimator has bcen impiemented by Manski and Thompson (1986) and Greene (1995a). 704 CHAPTER 21 + Models for Discrete Choice its design. The parametric assumptions of the probit or logit produce a large amount of information about the relationship between the response variable and the covariates. In the final analysis, the marginal elfects discussed earlier might well have been the primary objective of the study. That information is lost here. 21.54 SEMIPARAMETRIC ESTIMATION The fully parametric probit and logit models remain by far the mainstays of empirical research on binary choice, Fully nonparametric discrete choice models are fairly exotic and have made only limited inroads in the literature, and much of that literature is theoretical [e.g.. Matykin (1993)]. The primary obstacle to application is their paucity of interpretable results. (Sec Example 21.9.) Of course, one could argue on this basis that (he firm results produced by the fully parametric models are merely fragilc artifacts of thc detailed specification, not genuine reflections of some underlying truth. [In this connection. see Manski (1995).] 
But that orthodox view raises the question of what molivates the study to begin wilb and what one hopes to learn by embarking upon it. The intent of model building to approximate reality so as Lo draw useful conclusions is hardiy limited to the analysis of binary chois Semiparametric estimators represent a middle ground between these extreme vie The single index model of Klcin and Spady (1993) has been used in several applications. including Gerfin (1996), Horowitz (1993), and Fernandez and Rodriguez-Poo (1997). “The single index formulation departs from a linear “regression” formulation, Ely x]= Ely lxB) Then Probiy = 1x) = F(x8 |x) = G(xi8), where G is an unknown continuous distribution [unction whose range is [0, 1]. The function G is not specified a priori; il is estimated along with (he parameters. (Since G as well as 8 is to be estimated, a constant term is not identified; essentially, G provides the location for Lhe index that would otherwise be provided by a constant.) The criterion function for estimation, in which subscripts » denote estimators of their unsubscripted counterparts, is 1& ni =05 [yin GuGiBo) + (1 3) In[t — Got). i=1 The estimator of the probability function, Gy, is computed at cach iteration using a nonparametric kerncl estimator of the density of x B,; we did this calculation in Section 16.4. For the Klein and Spady estimator, the nonparametrie regression $Recent proposals for semiparametric estimators in addition to the one developed here include Lewbel (1997, 2000). Lewbel and Honore (2001). and Altonji and Matzkin (2001). In spite cf nearly 10 years of development, this is a nascent literature, The theoretical development tends to focus on root-n consistent coefficient estimation in models which provide no means of computation of probabilities or marginal effects. 46 A symposium on the subject is Hardle and Manski (1993). CHAPTER 21 + Models for Discrete Choice 707 and Ptro) = [L+ expl-croj!. The constante = (7/3)7! = 0.55133is used to standatdize the logistic distribution that is used for the kernel function. (See Section 16.4.1.) The parameter à is the smoothing (bandwidth) parameter. Large values will flatten the estimated function through 7. whereas values close to zero will allow greater variation in the function but might cause it to be unstable. There is no good theory for the choice, but some suggestions have been made based on descriptive statistics. [See Wong (1983) and Manski (1986).] Finally, the function value is estimated with re) a et SP mito) O Example 21.9 Nonparametric Regression Figure 21.3 shows a plot of two estimates of the regression function for £ [GRADE |7]. The coefficients are the MSCORE estimates given in Table 21.5. The plot is produced by com- puting fitted values for 100 equally spaced points in the range of x'b,, which for these data and coefficients is [-0.66229, 0.05505]. The function is estimated with two values of the smoothing parameter, 1.0 and 0.3. As expected, the function based on À = 1.0 is much flatter than that based on À = 0.3. Clearly, the results of the analysis are crucially dependent on the value assumed. The nonparametric estimator displays a relationship between x'8 and £[y;]. AL first biush, this relationship might suggest that we could deduce the marginal effects, but untfortunately, that is not the case, The coefficients im this setting are not meaningful, so al] we can deduce is an estimate of the density. f(2), by using first differences of the estimated regression function. 
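For readers who want to see the mechanics, the following is a minimal sketch, not taken from the text, of the kind of kernel-weighted regression just described: the fitted value at a point z* is a weighted average of the y_i, with weights given by a smooth kernel in (z* − b'x_i)/λ. The logistic density is used here as a stand-in for the weighting function discussed above, and the coefficient vector b (for example, the MSCORE estimates) and the bandwidth λ (such as the values 0.3 and 1.0 used in the example) are taken as given. All function and variable names are illustrative.

```python
import numpy as np

def kernel_regression(y, X, b, z_grid, lam):
    """Kernel-weighted estimate of E[y | b'x = z*] with a logistic kernel.
    y: (n,) responses; X: (n, K) covariates; b: (K,) index coefficients;
    z_grid: evaluation points; lam: the smoothing (bandwidth) parameter."""
    z = X @ b                                  # the single index for each observation
    fitted = np.empty(len(z_grid))
    for j, zstar in enumerate(z_grid):
        u = (zstar - z) / lam
        w = 0.25 / np.cosh(u / 2.0) ** 2       # logistic density, written stably
        fitted[j] = np.sum(w * y) / np.sum(w)  # weighted average of the responses
    return fitted

# e.g., evaluate over 100 equally spaced points in the range of b'x:
# grid = np.linspace((X @ b).min(), (X @ b).max(), 100)
# f_03 = kernel_regression(y, X, b, grid, lam=0.3)   # more variable
# f_10 = kernel_regression(y, X, b, grid, lam=1.0)   # much flatter
```

As the example in the text shows, the choice of λ governs how much the estimated function is smoothed; there is no substantive change in the underlying estimator, only in the weights.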
It might seem, therefore, that the analysis has produced relatively little payoff for the effort. But that should come as no surprise if we reconsider the assumptions we have made to reach this point. The only assumptions made thus far are that for a given vector of covariates x_i and coefficient vector β (that is, any β), there exists a smooth function F(x'β) = E[y_i | z_i]. We have also assumed, at least implicitly, that the coefficients carry some information about the covariation of x'β and the response variable. The technique will approximate any such function [see Manski (1986)].

There is a large and burgeoning literature on kernel estimation and nonparametric estimation in econometrics. [A recent application is Melenberg and van Soest (1996).] As this simple example suggests, with the radically different forms of the specified model, the information that is culled from the data changes radically as well. The general principle now made evident is that the fewer assumptions one makes about the population, the less precise the information that can be deduced by statistical techniques. That tradeoff is inherent in the methodology.

21.5.6 DYNAMIC BINARY CHOICE MODELS

A random or fixed effects model which explicitly allows for lagged effects would be

y_it = 1(x_it'β + α_i + γ y_i,t−1 + ε_it > 0).

Lagged effects, or persistence, in a binary choice setting can arise from three sources: serial correlation in ε_it, the heterogeneity, α_i, or true state dependence through the term γ y_i,t−1. Chiappori (1998) [and see Arellano (2001)] suggests an application to the French automobile insurance market in which the incentives built into the pricing system are such that having an accident in one period should lower the probability of having one in the next (state dependence), but some drivers remain more likely to have accidents than others in every period, which would reflect the heterogeneity instead. State dependence is likely to be particularly important in the typical panel which has only a few observations for each individual. Heckman (1981a) examined this issue at length. Among his findings were that the somewhat muted small sample bias in fixed effects models with T = 8 was made much worse when there was state dependence. A related problem is that with a relatively short panel, the initial conditions, y_i0, have a crucial impact on the entire path of outcomes. Modeling dynamic effects and initial conditions in binary choice models is more complex than in the linear model, and by comparison there are relatively fewer firm results in the applied literature.

Much of the contemporary literature has focused on methods of avoiding the strong parametric assumptions of the probit and logit models. Manski (1987) and Honore and Kyriazidou (2000) show that Manski's (1986) maximum score estimator can be applied to the differences of unequal pairs of observations in a two period panel with fixed effects. However, the limitations of the maximum score estimator noted earlier have motivated research on other approaches. An extension of lagged effects to a parametric model appears in Chamberlain (1985), Jones and Landwehr (1988) and Magnac (1997), who added state dependence to Chamberlain's fixed effects logit estimator. Unfortunately, once the identification issues are settled, the model is only operational if there are no other exogenous variables in it, which limits its usefulness for practical application.
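To make the preceding point concrete, the following sketch is an illustration (not part of the original discussion, and not the estimator with regressors developed by Honore and Kyriazidou) of the sort of conditional probability such models exploit. With four periods (t = 0, 1, 2, 3), pure state dependence and no exogenous variables, conditioning on y_i0, y_i3 and y_i1 + y_i2 = 1 eliminates the fixed effect: Prob[(y_i1, y_i2) = (0, 1) | y_i0, y_i3, y_i1 + y_i2 = 1] = 1/[1 + exp(γ(y_i0 − y_i3))], so γ can be estimated from the "discordant" middle periods alone. The function name below is hypothetical.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def dynamic_fe_logit_gamma(y):
    """Conditional MLE of the state-dependence parameter gamma in a pure
    dynamic fixed-effects logit, y_it = 1[alpha_i + gamma*y_{i,t-1} + eps_it > 0],
    observed for t = 0, 1, 2, 3.  y is an (n, 4) array of zeros and ones.
    Conditioning on (y0, y3, y1 + y2 = 1) removes alpha_i, leaving
    P[(y1, y2) = (0, 1) | .] = 1 / (1 + exp(gamma * (y0 - y3)))."""
    keep = (y[:, 1] + y[:, 2]) == 1          # only discordant middle pairs contribute
    y0, y1, y3 = y[keep, 0], y[keep, 1], y[keep, 3]
    d = 1 - y1                               # 1 if the middle pair is (0, 1)

    def neg_loglik(g):
        p01 = 1.0 / (1.0 + np.exp(g * (y0 - y3)))
        return -np.sum(d * np.log(p01) + (1 - d) * np.log(1 - p01))

    return minimize_scalar(neg_loglik).x
```

The sketch also makes the text's caveat visible: as soon as time-varying regressors enter the index, the conditioning argument no longer removes them, which is what motivates the kernel-weighting device described next.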
Lewbel (2000) has extended his fixed effects estimator to dynamic models as well. In this framework, the narrow assumptions about the independent variables somewhat limit its practical applicability. Honore and Kyriazidou (2000) have combined the logic of the conditional logit model and Manski's maximum score estimator. They specify

Prob(y_i0 = 1 | x_i, α_i) = p_0(x_i, α_i), where x_i = (x_i1, x_i2, ..., x_iT),
Prob(y_it = 1 | x_i, α_i, y_i0, y_i1, ..., y_i,t−1) = F(x_it'β + α_i + γ y_i,t−1), t = 1, ..., T.

The analysis assumes a single regressor and focuses on the case of T = 3. The resulting estimator resembles Chamberlain's but relies on observations for which x_it = x_i,t−1, which rules out direct time effects as well as, for practical purposes, any continuous variable. The restriction to a single regressor limits the generality of the technique as well. The need for observations with equal values of x_it is a considerable restriction, and the authors propose a kernel density estimator for the difference, x_it − x_i,t−1, instead, which does relax that restriction a bit. The end result is an estimator which converges (they conjecture), but to a nonnormal distribution and at a rate slower than n^(−1/2). Semiparametric estimators for dynamic models at this point in the development are still primarily of theoretical interest. Models that extend the parametric formulations to include state dependence have a much longer history, including Heckman (1978, 1981a, 1981b), Heckman and MaCurdy (1980), Jakubson (1988), Keane (1993) and Beck et al. (2001), to name a few. In general, even without heterogeneity, dynamic models ultimately involve modeling the joint outcome (y_i0, ..., y_iT), which necessitates some treatment involving multivariate integration. Example 21.10 describes a recent application.

Example 21.10 An Intertemporal Labor Force Participation Equation
Hyslop (1999) presents a model of the labor force participation of married women. The focus of the study is the high degree of persistence in the participation decision. Data used in the study were the years 1979-1985 of the Panel Study of Income Dynamics. A sample of 1812 continuously married couples were studied. Exogenous variables which appeared in the model were measures of permanent and transitory income and fertility captured in yearly counts of the number of children from 0-2, 3-5 and 6-17 years old. Hyslop's formulation, in general terms, is

(initial condition) y_i0 = 1(x_i0'β_0 + v_i0 > 0),
(dynamic model) y_it = 1(x_it'β + γ y_i,t−1 + α_i + v_it > 0),
(heterogeneity correlated with participation) α_i = z_i'δ + η_i,
(stochastic specification)
η_i | X_i ~ N[0, σ_η²],
v_i0 | X_i ~ N[0, σ_0²],
w_it | X_i ~ N[0, σ_w²],
v_it = ρ v_i,t−1 + w_it, σ_0² = 1,
Corr[v_it, v_i0] = ρ^t, t = 1, ..., T − 1.

Beck et al. (2001) is a bit different from the others mentioned in that in their study of "state failure," they observe a large sample of countries (147) over a fairly large number of years, 40. As such, they are able to formulate their models in a way that makes the asymptotics with respect to T appropriate. They can analyze the data essentially in a time series framework. Sepanski (2000) is another application which combines state dependence and the random coefficient specification of Akin, Guilkey, and Sickles (1979).
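Estimation of models like the one in Example 21.10 requires the joint probability of each individual's observed T-period sequence, which is a multivariate normal integral. The following sketch shows the basic simulation idea in its simplest form: condition on a draw of the heterogeneity, multiply the per-period probit probabilities for the observed sequence, and average over draws. It is illustrative only; the function and variable names are hypothetical, and it keeps only the random effect, ignoring the AR(1) error component and the initial-conditions equation of Hyslop's full specification.

```python
import numpy as np
from scipy.stats import norm

def simulated_sequence_prob(X, y, beta, gamma, sigma_alpha, y0=0, R=500, seed=123):
    """Simulated probability of one individual's outcome sequence in a dynamic
    random-effects probit: y_t = 1[x_t'beta + gamma*y_{t-1} + alpha + eps_t > 0],
    with alpha ~ N(0, sigma_alpha^2) and eps_t iid N(0, 1).  The random effect
    is integrated out by Monte Carlo averaging over R draws.
    X: (T, K) regressors; y: length-T sequence of zeros and ones."""
    rng = np.random.default_rng(seed)
    alphas = sigma_alpha * rng.standard_normal(R)
    probs = np.zeros(R)
    for r in range(R):
        p, ylag = 1.0, y0
        for t in range(len(y)):
            index = X[t] @ beta + gamma * ylag + alphas[r]
            p *= norm.cdf(index) if y[t] == 1 else norm.cdf(-index)
            ylag = y[t]
        probs[r] = p
    return probs.mean()
```

Summing the logs of such simulated probabilities over individuals gives a simulated log-likelihood that can be maximized over (β, γ, σ_α); richer error structures simply require simulating the additional components of the disturbance as well.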
The complexity of the second derivatives for this model makes it an excellent candidate for the Berndt et al. estimator of the variance matrix of the maximum likclihood estimator. 21.6.2 TESTING FOR ZERO CORRELATION The Lagrange multiplier statistic is a convenient device for testing for the absence of correlation in this model. Under the null hypothesis that p equals zero, the mode] consists of independent probit equations, which can be estimated separately. Moreover. in the multivariate model, all the bivariate (or multivariate) densities and probabilitics factor into the products of the marginals if the correlations arc zero, which makes construction of the test statistic a simple matter of manipulating the results of the independent probits. The Lagrange multiplier statistic for testing Ho: p =Oina bivariate probit model is n EXCAAUZ A imo E om) = [ócio (uia) ]P E Pira) POr) P(wa) As usual, the advantage of the LM statistic is that it obviates computing the bivariate probit model. But, the full unrestricted model is now fairly common in commercial software. so that advantage is minor. The likelihood ratio or Wald test can often be used with cqual casc. 21.6.3 MARGINAL EFFECTS Therc are several “marginal efícets” one might want to evaluate in a bivariate probit model? For convenience in evaluating them, we will definc a vector x = x; Ux, and let SLThis is derived in Kicfer (1982). 2See Greene (1996). CHAPTER 27 + Models for Discrete Choice 713 x/81 = xy. Thus, y, contains all the nonzero clements of 84 and possibly somc zeros in the positions of variables in x that appear only in the other equation; y2 is defined likcwise. The bivariate probability is Probly = 1,30 =1]x]= Do[xXy1.x/72.0]. Signs are changed appropriately if the probability of the zero outcome is desired in either case. (See 21-41.) The marginal effects of changes in x on this probability are given by JP “ao eiyi+ sopa where g; and g> arc defined in (21-43). The familiar univariate cases will arise if 9 =0, and effects specific to ont equation or the other will be produced by zeros in the corre- sponding position in one or the other parameter vector. There are also some condilional mean functions to consider. The unconditional mean funciions are given by the univari- ate probabilities: Elvlxl= 0w'pp, j=1,2, so the analysis of (21-9) and (21-10) applies. One pair of conditional mcan functions that might be of interest are Probln = 1,9 =1[x Elyly=1.x]=Probfy=1|p=1,x)= ln == Probly = 1 |x] = Pri xy p) Dx) and similarly for E [y> | y1 = 1. x]. The marginal effects for this function are given by BElylm= 1 $(x'ya) Dh2coti.( lg —d . 3x 2) Biyi (82 (>) y2 Finally, once might construct the nonlinear conditional mean function Dilxpy, (230 — Dx' po. Qyo — Do E (93.0) = OAXPilZr — Dra Cr Del P[Q3 — Dx'p5] The derivatives ol this function are the same as those above, with sign changesin several places if y; = O is the argument. 21.6.4 SAMPLE SELECTION There are situations in which the observed variables in the bivariate probit model are censored in one way or another. For example, in an evaluation of credit scoring models, Boyes, Hoffman, and Low (1989) analyzed data generated by the following rule: Yi =1 ifindividuali defaults on a loan, O otherwise, »=2 ifthe individual is granted a loan, O otherwise. Greene (1992) applicd the same model to y, = default on credit card loans, in which y, denotes whether an application for the card was accepted or not. 
For a given individual, 714 CHAPTER 21 4 Modeis for Discrete Choice »i is not observed unless y equals one. Thus, there are threc types of observations in the sample, with unconditional probabilities:? n=0 Prob(w = 0 ixo) = 1— D(x589), 91=0,3 =! Prob(y=0, yp=1[x,x2)=D2][-248,.x585.—0], n=1lyw=1: Problyy=1,3y=1xx2) = D>[x,84.x58>, 0). The log-likelihood function is based on lhese probabilities.** 21.6.5 A MULTIVARIATE PROBIT MODEL In principle, a multivariate model would extend (21-41) to more than two outcome variables just by adding cquations. The practical obstacle to such an extension is pri- marily the evaluation of higher-order multivariate normal integrals. Some progress has been made on using quadrature for trivariate integration, but existing results are not sufficient to allow accurate and efficient evaluation for more than two variables in a sam- ple of even moderate size. An altogether different approach has been used in recent applications. Lerman and Manski (1981) suggested that one might approximate multi- variate normal probabilitics by random sampling. For example, to approximate Probíy > 1,92 <3,y3<—D | x1.x2. 012, 13, 23), WC would simply draw random ob- servations from Lhis trivariate normal distribution (sec Section E.5.6.) and count the number of obscrvations that satisfy the incquality. To obtain an accurate estimate of the probability, quite a large number of draws is required. Also. the substantive possibility of getting zero such draws in a finite number of draws is problematic. Nonetheless, the logic of the Lerman-Manski approach is sound. As discussed in Section E.5.6 recent developments have produced methods of producing quite accurate estimates of multi- variate normal integrals based on this principle. The evaluation of multivariate normal integral is generally a much less formidable obstacle to the estimation of models based on the multivariate normal distribution. McFadden (1989) pointed out that for purposes of maximum likclihood estimation. accurate evaluation of probabilities is not necessarily the problem that needs to be solved. One can view thc compulation of the log-likclihood and its derivalives as a problem of esumating a mean. That is, in (21-41) and (21-42), the same problem arises il we divide by n. The idea is thal even though the individual terms in the average might be in error, if the error has mean zero, then it will average out in the summation. The important insight, then, is that if wc can obtain probability estimates that only err randomly both positively and negatively, then it may be possible to obtain an estimate of the log-likelihood and its derivatives that is reasonably close to the one that would $*The model was first proposed by Wynand and van Praag (1981) “Extensions of the bivariate probit model to other types of censoring are discussed in Poirier (1980) and Abowd and Farber (1982). Papers that propose improved methods of simulating probabilítics include Pakes and Pollard (1989) and especially Bórsch-Supan and Ilajivassilou (1990), Geweke (1989), and Keane (1994). A symposium in Lhe November 1994 issue of Review 0f Economics and Statistics presents discussion of numerous issues in speci- fication and estimation of models based on simulation of probabilities. Applications that employ simulation techniques for evaluation ol'multivariate normal integrals are now fairly numerous, See, for example, Hyslop (1999) (Example 21.10) who applies the technique to a pancl data application with 7 = 7. 
CHAPTER 21 + Models for Discrete Choice 717 ABLE 21.7 -Estimates óta Recursivo Simultaneous:Bivariate:Probit Model : fEstimated Standard-Errors if:Parentheses) Single Equation Bivariate Probit Variable Coefficient Standard Error Coefficient Standard Error Gender Economics Equation Constant =14176 (0.8069) =1.1941 (22155) AcRep —0.01143 (0.004081) —0.01233 (0007937) WomStud 1.1095 (0.5674) 0.835 (22603) EconFac 0.06730 (0.06874) 0.06769 (0.06952) PetWecon 2.5391 (0.9869) 2.5636 (1.0144) Relig —(.3482 (0.4984) (0.5265) Women's Studies Equation AcRep —0,01957 (0.005524) —0,01939 (0005704) PetWiac 1.9429 (0.8435) 18914 (0.8714) Relig —0.4494 (0.3331) —0.A584 (0.3403) South 13597 (0.6594) 1.3471 (0.6897) West 2.3386 (08104) 2.3376 (08611) North 1.8867 (0.8204) 1.9009 (0.8495) Midwest 1.8248 (0.8723) 1.8070 (0.8952) p (0.0000) 0.1359 (12539) Log L —85.6317 2(-85.6317 — (—85.6458)] = 0.0282, which Icads to the same conclusion. The Lagrange multiplier sta! 0.003807, which is consistent. This result might seem counterintu- itive. given the setting. Surely “gender economics” and “women's studies” are highly correlated, bu this finding does not contradict that proposition. The corrclation coeffi- cient measures the correlation between thc disturbances in the equations. Lhe omitted factors. That is, p measures (roughIy) the correlation between the outcomes after the influence of the included factors is accounted for. Thus, the valve 0.13 measures the elfcet atter the influenec of women's studies is already accounted for. As discussed in the next paragraph, the proposition turns out to be right. The single most important determinant (at least wiLhin this model) of whether a gender economics course will be offered is indecd whether the college offers a women's studies program. Table 21.8 presents the cstimates of the marginal effects and some descriptive statis- tics for the data. The calculations were simplified slightly by using the restricted model with p = 0. Computations of the marginal effects still require the decomposition above, but they are simplified slightly by the result that if p equals zero, then the bivariate probabilities factor into the products of the marginals. Numerically, the strongest effect appcars to be exerted by the representation of women on the faculty: its cocificient of +0.4491 is by far the largest. This variable. however, cannot change by a full unit becanse it is a proportion. An increase of 1 percent in the presence of women on the faculty raises the probability by only +0.004, which is comparable in scale to the effect of academic reputation. The effect ot women on the faculty is likcwise fairly small, onty 0.0013 per 1 percent change. As might have been expected, the single most important influence às the presence ol a women's studies program, which incrcases the likelihood of a gender cconomics course by a full 0.1863. OL course, the raw data would have anticipated this result; o! the 31 schools that offer a gender economics course, 29 also 718 CHAPTER 21 4 Models for Discrete Choice Direct Indirect Total (Std Error) (Type of Variable, Mean) Gender Economics Equation AcRep —O002022 — —0,001453 —0,003476 (0.00126) (Continuous, 119.242) PetWecon — +0.4491 +0,4491 (0.1568) (Continuous, 0.24787) EconFac +0.01190 +0,1190 (001292) (Continuous, 6.74242) Relig 007049 003227 o Ins (0.1055) (Binary. 
0.57576) WomStud | +0,1863 +0.1863 (00868) (Endogenous, — 0.43939) PetWilac +0.13951 +0.13951 (008916) (Continuous, 0.35772) Women's Studies Equation AcRep —0,00754 —0,00754 (0002187) (Continuou, 119.242) PetWfac +0.13789 +0.13789 (0.01002) (Continuous, 0.35772) Relig —0.13265 —0.13266 (018803) (Binary, 0.57576) have a women's studies program and only two do not. Notc finally that the ctfcet of religious affiliation (whatever it is) is mostly direct. Before closing this application, we can use this opportunity to examine the fit mea- sures listedin Section 21.4.5. We computed the various fit measures using seven different specifications of the gender economics equation: Single-equation probit estimates, z1, 72. 23, 74, 25. )» Bivariate probit modcl estimates, z,. (3, 7d, 75, 2 Single-equation probit estimates, zy, Z2, 23, Z4, 25 Single-cquation probit estimates, 7, 23 25.) Single-equation probitestimates, 23, 2, 2 . Single-equalion probit estimates, 21, Zs +. Single-equation probit estimates z, (constant only). msn a The specifications are in descending “quality” because we removed the most statistically significant variables from the model at cach step. The values are listed in Vable 21.9. The matrix below each column is the table of “hits” and “misses” of the predietion rule $ = 11 Ê > 0,5,0 otherwise. [Note that by construction, model (7) must predict all ones or all zeros.| The column is the actual count and the row is the prediction. Thus. for model (1), 92 of 101 zeros were predicted correctly, whorcas five of 31 ones were predicted incorrectly. As one would hope, the fit measures decline as the more significant y ps a Measure [E (2 e a so (6) o LRI 0.573 0.535 0.495 0.407 0.279 0206 0.000 Rr 0.844 0844 083 0757 074 078 0.641 í 0.565 0560 0526 0444 0319 0216 0.000 Ru 0.561 0558 0530 0475 0343 0216 0000 Rs 0708 077 0672 0589 0447 0352 0000 Ra 0.687 0.679 0628 0.567 0.545 0329 0.000 Predictá 92 9] [98 8] foz 97 [04 71 f98 3] fo o) [oo Tedictions 5 % 5 26 8 23 8 23) 6 15 3 0 31 0 CHAPTER 21 + Models for Discrete Choice 719 variables are removed from lhe model. The Ben-Akiva measure has an obvious flaw in that with only a constant term, the model still obtains a “fit” of 0.641. From the prediction mafrices. it is clear that thc explanatory power of the model, such as it is, comes from its ability to predict the ones correctly. The poorer is the model, the greater thc number of correct predictions of y = O. But as this number rises, the number of incorrect predictions riscs and the number of correct predictions of y = 1 declines. All the fit measures appcar to react to this feature to some degree. The Etron and Cramer measures, which arc nearly identical, and McFadden's LRI appear to be most sensitive to this, with the remaining two only slightly less consistent. 21.7 LOGIT MODELS FOR MULTIPLE CHOICES Some studies of multiple-choice settings include the following: 1. Hensher (1986), McFadden (1974), and many others have analyzed the travel mode of urban commutcrs. 2. Schmidt and Strauss (1975a.b) and Boskin (1974) have analyzed occupational choice among multiple alternatives, 3. Terza (1985) has studied the assigament of bond ratings to corporate bonds as a choice among multiple alternatives. Thesc are all distinct from the multivariate probit model we examined earlier. In that setting. there were several decisions, each between two alternatives. Here there is a single decision among two or more alternatives. We will examine two broad types of choice sets, ordered and unordered. 
The choice among mcans of getting to work—by car, bus, train, or bicyclc—is clearly unordered. A bond rating is. by design, a ranking: that is its purpose. As we shall see, quite different techniques are used for the two types of models. Models for unordered choice sets are considered in this section. A model for ordercd choices is described m Section 21.8. Unordered-choice models can be motivated by a random utility model. For the ith consumer faced with / choices, suppose that the utility of choice j is Uj=2B + ey If the consumer makes choice j in particular, then we assume that U;; is lhe maximum among the J utilities. Hencc, the statistical model is driven by the probability that choice j is made, which is Prob(U;; > Um for allother k £ j. The modelis made operational by a particular choice ol distribution for the disturbances. As before, two models have bcen considered. logit and probit. Because of the need to evaluate multiple integrals of the normal distribution, the probit model has found rather limited use in this setting. The logit model, in contrast, has been widely used in many fields. including economics, market research, and transportation engineering. Let Y; be a random variable that indicates the choice made. McHadden (1973) has shown that il (and only if) the J disturbances are independent and identically distributed with 722 CHAPTER 21 + Models for Discrete Choice The exact second derivatives matrix has J2K x K blocks, 82 ln £ oB;õB; where I(j = ?) equals 1 if ; cquals / and O if not. Since the Hessian docs not involve di;, these are the expected values, and Newton's method is equivalent to the method of scoring. Elis worth noting that the number ol parameters in this model proliferates with the number of choices, which is unfortunate because the typical cross section sometimes involves a fairly large number of regressors. The coefficients in this model are difficult to interpret. lt is tempting to associate 8, with the jth outcome, but that would be misleading. By dillerentiating (21-46), we find that the marginal cffects of the characteristics on the probabilities are a == Blti= 0 Poe? i=l 4 4 s=-p, | -5 fi = 8; Bl (21-47) x k=0 Therefore, every subvector of 8 enters every marginal effect, both through the prob- abilities and through the weighted average that appears im 6, These values can be computed from thc parameter estimates. Although the usual focus is on the coefficient estimates, equation (21-47) suggests that there is at least some potential tor confusion. Note, for example, that for any particular xp, à P;/dxy need not have the same sign as Bj. Standard errors can be estimated using the delta method. (Sec Section 5.2.4.) For purposes of the computation, let 8 = [0, 84, 85...., 85]. We include the fixed 0 vector for outcome O because although Bo 0,7, = — WB, which is not 0. Note as well that Asy. Covo, Ê d=0for;=0,. - Then Asy. Varlê;]= 3 (im ias Covlêr f(s55) 50 m=0 = [ly =D-BIPI+ sx] + Pj[6x]. Finding adequate fit measures in this setting presents the same difficulties as in the binomial models. As before, it is uscful to report the log-likelihood. If the model contains no covariates and no constant term, then the log-likclihood will be 1 ln Le = Eun(G) = where x, is the number of individuals who choose outcome y. If the regressor vector includes only a constant term, then the restricted log-likelihood is 1 / n£g= nin(2)=5n mp, i a mp; 4-0 =0 SSH the data were in Me lorm of proportions. 
such as market shares, then the appropriate log-likelihood and derivatives are 5, 35; np and 37,55, tutpy — Byp)xs, respectively. Tho terms in the Hessian are muhiplied by me. CHAPTER 21 + Models for Discrete Choice 723 where p; is the sample proportion of obscrvations that make choice j. If desired, the likelihood ratio index can also be reported. A useful table will give a listing of hits and misses of (he prediction rule “predict Y% = j if Ê; is the maximum of the predicted probabilities 21.7.2 THE CONDITIONAL LOGIT MODEL When the data consist of choice-specific attributes instead of individual-specífic char- acteristics, the appropriate modelis o Ê e Prob(h = lauro. u)= sm” (21-48) ef Here. in accordance with the convention in the literature, we let j = 1,2..... Jlora total of Y alternatives. The model is otherwise essentially the same as the multinomial logit. Even more care will be required in interpreting the parameters, however. Once again, an example will help to focus ideas. In this model, the coefficients arc not directly Lied to the marginal cffects. The marginal cffects for continuous variables can be obtained by differentiating (21-48) with respect to x to obtain E (pag=B- BIB k=1...4 dx (To avoid cluttering the notation, wc have dropped the observation subscript.) Ft is clear that through its presence in P, and P,, every attribute set x; altects all the probabilities. Hensher suggests that one might prefer to report elasticities of the probabilities, The effect of attribute 71 ol choice k on P; would be dog P; dog %km = kn =) — PilBm- Since there is no ambiguity about the scale of the probability itself, whether one should report the derivatives or the elasticities is largely a matter of taste. Some of Hensher's elasticity estimates arc given in Table 21.16 later on in this chapter. Estimation of the conditional Logit model is simplest by Newton's method or the method of scoring. The log-likclihood is the same as for the multinomial fogit model. Once again, we define d; = 1if Y = j and O otherwise. Then n + logL=5" 5] dy log Prob(y, = 9). do Jet Market share and frequency data are common in this setting. If the data are in this [orm, then the only change needed is, once again, to define d;; as the proportion or frequency. PlUnfortunately, it is common lor this rule to predict all observation with (he same vatuc im an unbalanccd sample or a model with litile explanatory power. 724 CHAPTER 21 + Models for Discrete Choice Because ol the simple [orm of L, the gradient and Hessian have particularly convenient forms: Let; = Sia P;x;;. Then, dog 1 ro ET =5 5 dy; — Rj), ja Flog 1 Es o o 890 E B067 — 8) Qu — 8, 7 17 Qi; j 9808 = The usual problems of fitmeasures appear here. The log-likelihood ratio and tabula- tion of actual versus predicted choices will be useful. There are two possible constrained log-likelihoods. Since the model cannot contain a constant term. the constraint 8 = O renders all probabilities equal to 1/./. The constrained log-likelihood tor this constraint is then Le = —uln 7. Of course, it is unlikely that this hypothesis would fail to be re- jected. Alternatively. we could fit the model with only the .! —1 choice-specific constants, which makes the constrained log-likelihood the same as in the multinomial logit model, In Lj = »ny;lnp; where, as before, n; is the number of individuals who choose alternative j. 
21.7.3 THE INDEPENDENCE FROM IRRELEVANT ALTERNATIVES We noted carlicr that the odds ratios in the multinomial logit or conditional logit mod- els are independent of the other alternatives. This property is convenient as regards estimation, but it is not a particularly appcaling restriction to place on consumer be- havior. The property of the logit mode! whereby P;/ Pk is independent of the remaining probabilities is cailed the independence from irrelevant alternatives (HA). The independence assumption follows from the initial assumption that the distur- bances are independent and homoscedastic. Later we will discuss several models that have been developed to relax this assumption. Before doing so, we consider a test that has been developed for testing the validity of the assumption. Hausman and McFadden (1984) suggest that if a subset of the choice set truly is irrelevant, omitting it from the model altogether will not change parameter estimates systematically. Exclusion of these choices will be inefficient but will not Icad to inconsistency. But if the remaining odds ratios arc not truly independent from these alternatives, then the parameter estimates obtaincd when thesc choices are included will be inconsistent. This observalion is the usual basis for Hausman's specification test. The statistic is * = Bo VIA Bo, where s indicates the estimators based on the restricted subsct, / indicates the estimator based on the full set of choices, and V, and V are the respectivo estimates ot the asymptotic covariancc matrices. The statistic has a limiting chi-squared distribution with K degrees of freedom º “MeFadden (1987) shows how this hypothesis can also be tested using a Lagrange multiplier test. CHAPTER 21 + Models for Discrete Choice 727 Since this approach is a two-step estimator, the estimate ol the asymptotic covariance matrix of the estimates at the second step must be corrected. [See Section 4.6, McFadden (1984), and Greene (1995a, Chapter 25).] For full information maximum likelihood (FIML) estimation of the model, the log-likelihood is InL= ” In[Probttwig | branch)] x Prob(branch));. = The information matrix is not block diagonal in 8 and (y, 7), so FIML. estimation will be more efficient than two-step estimation. “To specify the nested logit model, it is necessary to partition the choice set into branches. Sometimes there will bc a natural partition, such as in the example given by Maddala (1983) when the choice of residence is made first by community, then by dwelling type within the community. In other instances, however. the partitioning of the choice set is ad hoc and leads to the troubling possibility that the results might bc depen- dent on the branches so defincd. (Many studies in this literature present several sets of results based on different specifications of the tree struelure,) There is no well-defincd testing procedure for discriminating among tree structures, which is a problematic as- pect af the model. 21.7.5 A HETEROSCEDASTIC LOGIT MODEL Bhat (1995) and Allenby and Ginter (1995) have developed an extension of the con- ditional logit model that works around the difficulty of specifying the tree for a nested model. Their model is based on the same random utility structure as before, . Ui = 84; + e. The logit model arises from the assumption that £;; has a homoscedastic extreme value (HEV) distribution with common variance x? /6. The authors” proposcd model simply relaxes the assumption of egual variances. 
Since the comparisons are all pairwise, one ofthe variances is set to 1.0; the same comparisons of utilities will result if all equations are multiplied by the samc constant, so the indeterminacy is removed by setting one of the variances to one. The model that remains, then, is exactly as before, with the additional assumption that Var[g;;] = 0,, with oy = 1.0. 21.7.6 MULTINOMIAL MODELS BASED ON THE NORMAL DISTRIBUTION A natural alternative model that relaxes the independence restrictions built into the multinomial logit (MNL) model is the multinomial probit (MNP) model. The structural equations of the MNP model are U,=x;B;tep j=1,..,4, [e1,82,.. 0,89] NO, E]. The term in the log-likelihood that corresponds to the choice of alternative g is Problchoice q] = Prob[U, > Us j=1,.. o) j%9) The probability for this occurrence is Probjchoice q] = Prob[e, — £, > (x; — 21) B... ces — e; > (44 — 478] 728 CHAPTER 21 + Models for Discrete Choice for the J — 1 other choices, which is a cumulative probability from a (J — 1)-variate normal distribution. As in the HEV model, since we are only making comparisons, one of the variances im this 4 — 1 variate structure—that is, one of the diagonal clements in the reduced L—must be normalized to 1.0. Since only comparisons are ever observable in this model, tor identification. 4 — 1 of the covariances must also be normalized, to zero. The MNP model allows an unrestricted (4 — 1) x (4 — 1) correlation structure and 4 — 2 free standard deviations for the disturbances in the model. (Thus, a two choice model returns to the univariate probit model of Section 21.2.) For more than two choices, this specification is far more general than the MNL model, which assumes that E = IL (The scaling is absorbed im the coctficient vector in the MNL model.) The main obstacle to implementation of the MNP model has been the difficulty in computing the multivariate normal probabilities for any dimensionality higher than 2. Recent results on accurate simulation of multinormal integrals, however, have made estimation of the MNP model feasible. (See Section E.5.6 and a symposium in the November 1994 issue vf lhe Review of Economics and Statistics.) Yet some practical problems remain. Computation is exceedingly time consuming. It is also necessary to ensure that X remain a positive definite matrix. One way often suggested is to construct the Cholesky decomposition of E, LL”, where L is a lower triangular matrix, and esti- mate the elements of L. Maintaining the normalizations and zero restrictions will still be cumbersome, however. An alternative is estimate the correlations, R, and a diagonal matrix of standard deviations, S = diag(o g4-2,1,1) separately. The normaliza- tions, R;; = 1, and exclusions, Rs; = 0, are simple to impose, and Z is just SRS. R is otherwise restricted only in that —1 < Rj; < +1. The resulting matrix must be positive definite. Identification appears to be a serious problem with the MNP model. Although the unrestricted MNP model is fully identificd in principle, convergence to satisfactory results in applications with more than three choices appears to require many additional restrictions on the standard deviations and correlations, such as zero restrictions or equality restrictions in the case of the standard deviations. 21.7.7 A RANDOM PARAMETERS MODEL Another variant of the multinomial logit model is the random parameters logit (RPL) model (also called the “mixed logit model”). 
[See Reveltand Train (1996); Bhat (1996): Berry, Levinsohn, and Pakes (1995); and Jain, Vilcassim, and Chintagunta (1994). Train's formulation of the RPL model (which cncompasses the others) is a modification of the MNT. model. The model is a random cocfficients formulation, The change to the basic MNL model is the parameter specification in the distribution of the parameters across individuals, é; Bia = Pk + 20% + Out, where 1; is normally distributed with correlation matrix R, ox is the standard deviation of the distribution, 8 + 2;84 is the mean of the distribution, and z; is a vector of person specific characteristics (such as age and income) that do not vary across choices. This formulation contains all the earlier models. For example, if 8; = 0 for all the coefficients and ok = O [or all the coefficienis except for choice specific constants, lhen the original MNL model with a normal-logistic mixture for the random part of the MNL model arises (hence the name). CHAPTER 21 + Models for Discrete Choice 729 The authors propose estimation ot the model by simulating the log-likelihood fune- tion rather than direct integration to compute the probabilities, which would be intca- sible because the mixture distribution composcd of the original &;; and the random part of the coefficient is unknown. For any individual, Problchoice q |u;] = MNL probability | 8; (u;), with all restrictions imposed on the coefficients. The appropriate probability is Eu[Probtchoice q |u)] = / Problchoice g |u] f(u) du, tecido which can be estimated by simulation, using 1 R 35 Probfchoice q | B;(e;)] r=t Est. E[Prob(choice q |w)] = R where e; is the rth ol R draws for observation i. (There are nkR draws in total. The draws for observation i must bc the same from one computation to the next, which can be accomplished by assigning to cach individual their own secd for the random number generator and restarting it cach time the probability is to be computed.) By this method, the log-likelihood and its derivativos with respect to (Br, Oro k= 1... K and R are simulated to find the values that maximize the simulated log-likelihood. This is precisely the approach wc used in Example 17.10. The RPL model enjoys a considerable advantage not available in any of the other forms suggested. In a panel data setting, one can formulate a random etfecis model simply by making the variation in the coelficients time invariant. Thus, the model is changed to Uia =X temo del oamj=Lo Biju = Br + 20 + Out, The time variation in the coefficients is provided by the choice invariant variables which may change through time. Habit persistence is carried by Lhe time invariant random eftceL, te. If only the constant terms vary and they are assumed to be uncorrelated, then this is logically equivalent to the familiar random cfects model. But, much greater gencrality can be achieved by allowing the other coeificients to vary randomly across individuals and by allowing correlation of these effects.º2 21.7.8 APPLICATION: CONDITIONAL LOGIT MODEL FOR TRAVEL MODE CHOICE Hensher and Greenc [Greene (1995a)] report estimates of a model of trave] mode choice for travcl between Sydney and Melbourne, Australia. The data set contains 210 observations on choice among four travel modes, air, train, bus, and car. (See Ap- pendix Table 21.2.) 
The attributes used for their example werc: choice-speeific con- stants; two choice-specific continuous measures: GC, a measure of the gencralized cost of the travel that is equal to the sum of in-vehicle cost, INVC and a wagelike measure “See Hensher (2001) for an application to transportation modo choice in which cach individual is observed in several choice situations, 732 CHAPTER 21 + Models for Discrete Choice in Parentheses). : Parameter FIML Estimate LIML Estimate Unconditional Cair 6.042 (1.199) —0,0647 (2.1485) 5.207 (0.779) as 4.096 (0.615) 3105 (0.609) 3.163 (0.450) Cain 5065 (0.662) 4464 40.641) 3869 (0443) Buc —0.03159 (0.00816) —0.06368 (0.0100) —0.1550 (0.00441) Brrue —0.1126 (0.0341) —(.0699 (0,0149) —0.,09612 (00104) vn - 001533 (0.00938) 0.02079 (001128) 001329 (00103) Tia 0.5860 (0.141) 0.2266 (0.296) 1.0000 (0.000) Teronud 03890 (0.124) 0.1587 1.0000 (0.000) Shy 21886 (0.525) 5.675 1.2825 (0.000) Saron 32974 (1.048) 8.081 (4.219) 12825 (0.000) log E. —193.6561 —115.3354 + (879382) —199.1284 Note that onc of the branches has only a single choice, so the conditional probabil- ity, Pjjyy = Poirjfiy = 1. The modelis fit by both FIML and LIML methods. Three sets of estimates are shown in Table 21.14. The setmarked “unconditional” are the simple con- ditional (multinomial) logit (MNL) model for choice among the four alternatives that was reported earlier. Both inclusive value parameters are constrained (by construction) to equal 1.0000. The FIML estimates arc obtained by maximizing the full log likclihood for the nested logit model. In this model, Probcchoice | branch) = P(oairdair + Cerraindirain + Ghustltus + Bo GC + BrTTME), Probkbranch) = Ply da HINC + tayI Vip + Tgrouna | Veround), Prob(choice, branch) — Prob(choice | branch) x Prob(branch). Finally, the limited information estimator is estimated in two steps. At the first step, a choice model is estimated for the three choices in the ground branch: Probíchnice | ground) = Pliteraindirain + Ghustbus + BoGC + BrTTME) This model uses only the observations that chose one of the three ground modes: for these data, this subset was 152 of the 210 observations. Using the estimates from this model, wc compute, for all 210 observations, 7Vys = loglexp(z/,,8)] for air and O lor ground, and 1Voound = log|5 ;-pround eXP(Z;8)] for ground modes and O for air. Then, the choice model Prob(branch) = P(ctairdtair + yirdar HINC + tar] Vhs + Teround E Verona) is fit separately. Since the Hessian is not block diagonal, the FIML estimator is more efficient. To obtain appropriate standard errors, we must make the Murphy and Topel correction [or two-step estimation; see Section 17.7 and Thcorem 17.8. It is simplified a bit here because diffcrent samples arc used for the two steps. As such, the matrix R in the theorem is not computed. To compute C, we require the matrix ol derivatives of log Prob(branch) with respect to the direct parameters, aqi YZ7s Tio Tgrownd, And with respect to the choice parameters, 8. Since this model is a simple binomial (two choiec) logit model, these arc casy to compute, using (21-19). Then lhe corrected asymptotic covariance matrix is computed using Theorem 17.8 with R = 0. 
CHAPTER 21 + Models for Discrete Choice 733 TÁBLE:21 145 Estimatesof a Heteroscedastit Ex inParentheses) Parameter FEV Estimate Nested Logit Estimate Restricted HEV air 78326 (10.951) 6.062 (1.199) 2.973 (0.995) Aus 71718 (9.135) 4.096 (0.615) 4.050 (0.494) rrain 6.8655 (8.829) 5.065 (0.662) 3.042 (0.429) Boc —0.05156 (0.0694) —0.03159 (0.00816) —0.0289 (0.00580) Brrme —0.1968 (0.288) —0.1126 (0.0141) —0.0828 (0.00576) y 004024 (0.0607) 0.,01533 (0.00938) 0.0238 (0.0186) Thy — 0.5860 (0.141) — Tarouna — 0.3890 (0.124) — Fair 0.2485 (0.369) 04959 (0.124) Brain 0,2595 (0.418) 1.0000 (0.000) Bru 0.6065 (1.040) 1.0000 (0.000) Cear 1.0000 (0.000) 10000 (0.000) Implied Standard Deviations Cair 5.161 (7.667) Crrain 4.942 (7.978) has 2.115 (3.623) ea 1.283 (0.000) mL —195.6605 —193.6561 —200.3791 The likclihood ratio statistic for the nesting (hetcroscedasticity) against the null hy- pothesis of homoscedasticity is —2[—199.1284 — (—193.6561)] = 10.945. The 95 percent eritiçal value from the chi-squared distribution wilh two degrees of frecdom is 5.99, so the hypothesis is rejected. We can also carry out a Waid test. The asymptotic covariance matrix for the two inclusive valnc parameters is [0,01977/0,009621,0.01529]. The Wald statistie for the joim test of the hypothesis that tg, = tarowmd = 1, is E! e W=(0586-10 0389-110) 0.1977 “90159 (e3ão -10 0.009621 001529) (0.389 — 19) = 24415 The hypothesis is rejected. once again. The nested logit model was reestimated under assumptions of the heteroscedastic extreme value model. The results are shown in Table 21.15. This model is less restrictive than the nested logit model. To make them comparable. we note that wc found that Sair = 7/(Tair/6) = 2.1886 and Opain = Obus = Secar = T/ (ground (6) = 3.2974, The het- eroscedastic extreme value (HEV) model thus relaxes one variance restriction, because it has three frec variance parameters instead of two. On the other hand, the important degree of freedom here is (hat the HEV model does not impose the ILA assumption anywhere in the choice set, whereas the nested logit does, within each branch. A primary virtue of the HEV model, the nested logit model, and other alternative models is thal they relax the TIA assumption. This assumption has implications [or the cross elasticities between attributes in the different probabilities. Table 21,16 lists the estimated elasticities of the estimated probabilities with respect to changes in the generalized cost variable. Elaslicities are computed by averaging the individual sample values rather than computing them once at the sample means. The implication of the HA 734 CHAPTER 21 + Models for Discrete Choice engralized Cóst...... Cost Is That of Alternative Effect on Air Train Bus Car Multinomial Logir o Air —1.136 0.498 0.238 0418 - Train 0.456 —1.520 0.238 0418 Bus 0.456 0.498 —1.549 0418 Cur 0.456 0.498 0.238 —1.061 Nested Logit Air —0.858 0.179 Train 0.314 0887 Bus 0314 —4.132 Car 0314 0.887 Heteroscedastic Extreme Value Air —1.040 0.221 0,441 Train 0.272 0.250 0,553 Bus 0.688 —6.562 3.384 Car 0.690 1.254 —2717 assumplion can he seen in the table entries. Thus, in the estimates for the multinomial logit (MNL) model, the cross elasticities for each attribute arc all equal. In the nested logit model, the IIA property only holds within the branch. Thus, in the first column, the effect of GC of air aftects all ground modes equally, whereas the ctfect of GC for train is lhe same for bus and car but different from these two for air. 
All these elasticities vary freely in lhe HEV model. Table 21.17 lists the estimates of the parameters of the multinomial probit and random parameters logit models. For the multinomial probit model, we fit Lhrec spcc- ifications: (1) free correlations among the choices, which implies an unrestricted 3 x 3 correlation matrix and two free standard deviations; (2) uncorrelated disturbances. but free standard deviations, a medel that parallels the heteroscedastic extreme value model; and (3) uncorrelated disturbances and equal standard deviations, a mode) that is the same as the original conditional logit model save for the normal distribution of the disturbances instead of the extreme value assumed in the logit model. In this case. the scaling of the utility Iunctions is different by a factor of (x?/6)!2 = 1,283, as the probit model assumes « ; has a standard deviation of 1.0. We also fit three variants of the random parameters logit. In these cases, the choice specific variance for each utility function is o? + 67 where o7 is the contribution of the logit model, which is 72/6= 1.645, and 67 is the estimated constant specific vari- ance estimated in the random parameters model. The combined estimated standard deviations are given in (he table. The estimates of the specific parameters, 8; arc given in the footnotes. The estimated models are: (1) unrestricted variation and correlation among the three intcrecpt parameters—this parallels the general specification of the multinomial probit modcl; (2) only the constant terms randomly distributed but uncor- related, a model that is parallel to the multinomial probit model with no cross equa tion correlation and to the heteroscedastic extreme value model shown in Table 21.15: CHAPTER 21 + Models for Discrete Choice 737 04 03 em 01 - 1 ! ! ! ' ! ! 1 de ! ! ' , ' ' i x =0!p=11 -B'wr mB'x mB usB'x As before, we assume that é is normally distributed across obscrvations.* For the same reasons as in the binomial probit model (which is the special case of / = 1), we normalize the mean and variance of e to zero and onc. We then have the following probabilities: Prob(y = 0]x) = &(-x'8), Probçy = 1x) = E(guy — x/B) — Di-x'8), Prob(y=2]%) = (jo —x'B) — Pl — x), Prob(y=4|w)=1— Dus —xB). For all the probabilities to be positive, we must have OQem<m<c <a Figure 21.4 shows the implications of the structure. This is an extension of the univariate probit model we examined earlicr. The log-likelihood function and its derivatives can be obtained rcadily, and optimization can be done by the usual means. Às usual, the marginal effcets of the regressors x on thc probabilítics are not equal to the coefficients. It is helpful to consider a simple example. Suppose there arc three categories. Thc model thus has only one unknown threshold parameter. The three “Other distributions, particularly the logistie, could be used just as casily. We assumo the normal purely for convenience, The logistic and normal distributions generally give similar results im practice. 738 CHAPTER 21 + Models for Discrete Choice probabilities are Prob(y =0|x) =1-— O(x'B), Prob(y = 1 |x) = P(u — 8) — P(-x'B), Proby=2|)=1—D(u—x'B). For the thres probabilities, the marginal effects of changes in the regressors are à Prob(y = 0|x) = 5 = —p(xB)B. x à Probíy=1 Pod gx) ow-xBp. à Prob(y =2|x) dProtO = 2 = du —xB)B. Figure 21.5 iNustrates the ellect. The probability distributions of y and y* are shown in the solid curve. 
As usual, the marginal effects of the regressors x on the probabilities are not equal to the coefficients. It is helpful to consider a simple example. Suppose there are three categories. The model thus has only one unknown threshold parameter. The three probabilities are

Prob(y = 0 | x) = 1 − Φ(x'β),
Prob(y = 1 | x) = Φ(μ − x'β) − Φ(−x'β),
Prob(y = 2 | x) = 1 − Φ(μ − x'β).

For the three probabilities, the marginal effects of changes in the regressors are

∂Prob(y = 0 | x)/∂x = −φ(x'β)β,
∂Prob(y = 1 | x)/∂x = [φ(−x'β) − φ(μ − x'β)]β,
∂Prob(y = 2 | x)/∂x = φ(μ − x'β)β.

Figure 21.5 illustrates the effect. The probability distributions of y and y* are shown in the solid curve. Increasing one of the x's while holding β and μ constant is equivalent to shifting the distribution slightly to the right, which is shown as the dashed curve. The effect of the shift is unambiguously to shift some mass out of the leftmost cell. Assuming that β is positive (for this x), Prob(y = 0 | x) must decline. Alternatively, from the previous expression, it is obvious that the derivative of Prob(y = 0 | x) has the opposite sign from β. By a similar logic, the change in Prob(y = 2 | x) [or Prob(y = 4 | x) in the general case] must have the same sign as β. Assuming that the particular β is positive, we are shifting some probability into the rightmost cell. But what happens to the middle cell is ambiguous. It depends on the two densities. In the general case, relative to the signs of the coefficients, only the signs of the changes in Prob(y = 0 | x) and Prob(y = 4 | x) are unambiguous! The upshot is that we must be very careful in interpreting the coefficients in this model. (This point seems uniformly to be overlooked in the received literature. Authors often report coefficients, occasionally with some commentary about significant effects, but rarely suggest upon what or in what direction those effects are exerted.)

TABLE 21.18  Estimated Rating Assignment Equation

Variable    Estimate    t Ratio    Mean of Variable
Constant     −4.34        —            —
ENSPA         0.057       1.7          0.66
EDMA          0.007       0.8         12.1
AFQT          0.039      39.9         71.2
EDYRS         0.190       8.7         12.1
MARR         −0.48       −9.0          0.08
AGEAT         0.0015      0.1         18.8
μ             1.79       80.8          —

Example 21.11  Rating Assignments
Marcus and Greene (1985) estimated an ordered probit model for the job assignments of new Navy recruits. The Navy attempts to direct recruits into job classifications in which they will be most productive. The broad classifications the authors analyzed were technical jobs with three clearly ranked skill ratings: "medium skilled," "highly skilled," and "nuclear qualified/highly skilled." Since the assignment is partly based on the Navy's own assessment and needs and partly on factors specific to the individual, an ordered probit model was used with the following determinants: (1) ENSPE = a dummy variable indicating that the individual entered the Navy with an "A school" (technical training) guarantee, (2) EDMA = educational level of the entrant's mother, (3) AFQT = score on the Air Force Qualifying Test, (4) EDYRS = years of education completed by the trainee, (5) MARR = a dummy variable indicating that the individual was married at the time of enlistment, and (6) AGEAT = trainee's age at the time of enlistment. The sample size was 5,641. The results are reported in Table 21.18. The extremely large t ratio on the AFQT score is to be expected, since it is a primary sorting device used to assign job classifications.

To obtain the marginal effects of the continuous variables, we require the standard normal density evaluated at −x̄'β = −0.8479 and μ̂ − x̄'β = 0.9421. The predicted probabilities are Φ(−0.8479) = 0.198, Φ(0.9421) − Φ(−0.8479) = 0.628, and 1 − Φ(0.9421) = 0.174. (The actual frequencies were 0.25, 0.52, and 0.23.) The two densities are φ(−0.8479) = 0.278 and φ(0.9421) = 0.255. Therefore, the derivatives of the three probabilities with respect to AFQT, for example, are

∂P₀/∂AFQT = (−0.278)(0.039) = −0.01084,
∂P₁/∂AFQT = (0.278 − 0.255)(0.039) = 0.0009,
∂P₂/∂AFQT = (0.255)(0.039) = 0.00995.
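These calculations can be reproduced in a few lines of code. The sketch below is ours; it uses the index value, threshold, and AFQT coefficient reported in Example 21.11 and applies the three-outcome marginal effect formulas given above.

```python
from scipy.stats import norm

xb = 0.8479        # xbar'beta reported in the example
mu = 1.79          # estimated threshold
b_afqt = 0.039     # coefficient on AFQT

phi0 = norm.pdf(-xb)        # approximately 0.278
phi1 = norm.pdf(mu - xb)    # approximately 0.255

dP0 = -phi0 * b_afqt            # about -0.0108
dP1 = (phi0 - phi1) * b_afqt    # about  0.0009
dP2 = phi1 * b_afqt             # about  0.0099
print(round(dP0, 5), round(dP1, 5), round(dP2, 5))
```

Note that the three marginal effects sum to zero, as they must, since the three probabilities sum to one for every x.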
A measure based on the standardized residuals is

R²_p = 1 − Σᵢ [(yᵢ − λ̂ᵢ)/√λ̂ᵢ]² / Σᵢ [(yᵢ − ȳ)/√ȳ]².

This measure has the virtue that it compares the fit of the model with that provided by a model with only a constant term. But it can be negative, and it can rise when a variable is dropped from the model. For an individual observation, the deviance is

dᵢ = 2[yᵢ ln(yᵢ/λ̂ᵢ) − (yᵢ − λ̂ᵢ)] = 2[yᵢ ln(yᵢ/λ̂ᵢ) − eᵢ],

where, by convention, 0 ln(0) = 0. If the model contains a constant term, then Σᵢ eᵢ = 0. The sum of the deviances,

G² = Σᵢ dᵢ = 2 Σᵢ yᵢ ln(yᵢ/λ̂ᵢ),

is reported as an alternative fit measure by some computer programs. This statistic will equal 0.0 for a model that produces a perfect fit. (Note that since yᵢ is an integer while the prediction is continuous, that could not happen.)

Cameron and Windmeijer (1993) suggest that the fit measure based on the deviances has a number of desirable properties. First, denote the log-likelihood function for the model in which qᵢ is used as the prediction (e.g., the mean) of yᵢ as ℓ(yᵢ, qᵢ). The Poisson model fit by MLE is, then, ℓ(yᵢ, λ̂ᵢ), the model with only a constant term is ℓ(yᵢ, ȳ), and a model that achieves a perfect fit (by predicting yᵢ with itself) is ℓ(yᵢ, yᵢ). Then

R²_d = [ℓ(yᵢ, λ̂ᵢ) − ℓ(yᵢ, ȳ)] / [ℓ(yᵢ, yᵢ) − ℓ(yᵢ, ȳ)].

Both numerator and denominator measure the improvement of the model over one with only a constant term. The denominator measures the maximum improvement, since one cannot improve on a perfect fit. Hence, the measure is bounded by zero and one and increases as regressors are added to the model. We note, finally, the passing resemblance of R²_d to the "pseudo-R²," or "likelihood ratio index," reported by some statistical packages (e.g., Stata),

R²_LRI = 1 − ℓ(yᵢ, λ̂ᵢ)/ℓ(yᵢ, ȳ).

Note that multiplying both numerator and denominator by 2 produces the ratio of two likelihood ratio statistics, each of which is distributed as chi-squared.

Many modifications of the Poisson model have been analyzed by economists. (There have been numerous surveys of models for count data, including Cameron and Trivedi (1986) and Gurmu and Trivedi (1994).) In this and the next few sections, we briefly examine a few of them.

21.9.2  TESTING FOR OVERDISPERSION

The Poisson model has been criticized because of its implicit assumption that the variance of y equals its mean. Many extensions of the Poisson model that relax this assumption have been proposed by Hausman, Hall, and Griliches (1984), McCullagh and Nelder (1983), and Cameron and Trivedi (1986), to name but a few.

The first step in this extended analysis is usually a test for overdispersion in the context of the simple model. A number of authors have devised tests for "overdispersion" within the context of the Poisson model. [See Cameron and Trivedi (1990), Gurmu (1991), and Lee (1986).] We will consider three of the common tests, one based on a regression approach, one a conditional moment test, and a third, a Lagrange multiplier test, based on an alternative model. Conditional moment tests are developed in Section 17.6.4.

Cameron and Trivedi (1990) offer several different tests for overdispersion. A simple regression based procedure used for testing the hypothesis

H₀: Var[yᵢ] = E[yᵢ],
H₁: Var[yᵢ] = E[yᵢ] + αg(E[yᵢ]),

is carried out by regressing

zᵢ = [(yᵢ − λ̂ᵢ)² − yᵢ] / (λ̂ᵢ√2),

where λ̂ᵢ is the predicted value from the regression, on either a constant term or λ̂ᵢ without a constant term. A simple t test of whether the coefficient is significantly different from zero tests H₀ versus H₁.

Cameron and Trivedi's regression based test for overdispersion is formulated around the alternative Var[yᵢ] = E[yᵢ] + αg(E[yᵢ]). This is a very specific type of overdispersion.
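A minimal sketch of the regression based procedure follows. It is our illustration, not part of the original text: the data are simulated under the null of equidispersion, the regressand zᵢ is constructed as shown above, and the function and variable names are ours.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 1000
x = sm.add_constant(rng.normal(size=(n, 2)))
beta = np.array([0.5, 0.3, -0.2])
y = rng.poisson(np.exp(x @ beta))                       # generated under H0: Var = E

# Fit the Poisson regression and form the Cameron-Trivedi regressand
lam = sm.GLM(y, x, family=sm.families.Poisson()).fit().mu
z = ((y - lam) ** 2 - y) / (lam * np.sqrt(2.0))

# Regress z on lambda-hat without a constant; the t statistic tests H0 against H1
aux = sm.OLS(z, lam).fit()
print(aux.tvalues, aux.pvalues)    # should be insignificant for equidispersed data
```

Replacing the Poisson draws with, say, negative binomial draws would typically produce a large positive t statistic, signaling overdispersion.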
Consider the more general hypothesis that Var[yᵢ] is completely given by E[yᵢ]. The alternative is that the variance is systematically related to the regressors in a way that is not completely accounted for by E[yᵢ]. Formally, we have E[yᵢ] = exp(β'xᵢ) = λᵢ. The null hypothesis is that Var[yᵢ] = λᵢ as well. We can test the hypothesis using the conditional moment test described in Section 17.6.4. The expected first derivatives and the moment restriction are

E[xᵢ(yᵢ − λᵢ)] = 0   and   E[zᵢ((yᵢ − λᵢ)² − λᵢ)] = 0.

To carry out the test, we do the following. Let eᵢ = yᵢ − λ̂ᵢ and zᵢ = xᵢ without the constant term.

1. Compute the Poisson regression by maximum likelihood.
2. Compute r = Σᵢ zᵢ(eᵢ² − λ̂ᵢ) based on the maximum likelihood estimates.
3. Compute M'M = Σᵢ zᵢzᵢ'(eᵢ² − λ̂ᵢ)², D'D = Σᵢ xᵢxᵢ'eᵢ², and M'D = Σᵢ zᵢxᵢ'(eᵢ² − λ̂ᵢ)eᵢ.
4. Compute S = M'M − M'D(D'D)⁻¹D'M.
5. C = r'S⁻¹r is the chi-squared statistic. It has K degrees of freedom.

The next section presents the negative binomial model. This model relaxes the Poisson assumption that the mean equals the variance. The Poisson model is obtained as a parametric restriction on the negative binomial model, so a Lagrange multiplier test can be computed. In general, if an alternative distribution for which the Poisson model is obtained as a parametric restriction, such as the negative binomial model, can be specified, then a Lagrange multiplier statistic can be computed. [See Cameron and Trivedi (1986, p. 41).] The LM statistic is

LM = [Σᵢ ŵᵢ((yᵢ − λ̂ᵢ)² − yᵢ)]² / (2 Σᵢ ŵᵢ²λ̂ᵢ²).

The weight, wᵢ, depends on the assumed alternative distribution. For the negative binomial model discussed later, wᵢ equals 1.0. Thus, under this alternative, the statistic is particularly simple to compute:

LM = [Σᵢ((yᵢ − λ̂ᵢ)² − yᵢ)]² / (2 Σᵢ λ̂ᵢ²).

The main advantage of this test statistic is that one need only estimate the Poisson model to compute it. Under the hypothesis of the Poisson model, the limiting distribution of the LM statistic is chi-squared with one degree of freedom.

21.9.3  HETEROGENEITY AND THE NEGATIVE BINOMIAL REGRESSION MODEL

The assumed equality of the conditional mean and variance functions is typically taken to be the major shortcoming of the Poisson regression model. Many alternatives have been suggested [see Hausman, Hall, and Griliches (1984), Cameron and Trivedi (1986, 1998), Gurmu and Trivedi (1994), Johnson and Kotz (1993), and Winkelmann (1997) for discussion]. The most common is the negative binomial model, which arises from a natural formulation of cross-section heterogeneity. We generalize the Poisson model by introducing an individual, unobserved effect into the conditional mean,

ln μᵢ = x'ᵢβ + εᵢ = ln λᵢ + ln uᵢ,

where the disturbance εᵢ reflects either specification error as in the classical regression model or the kind of cross-sectional heterogeneity that normally characterizes microeconomic data. Then, the distribution of yᵢ conditioned on xᵢ and uᵢ (i.e., εᵢ) remains Poisson with conditional mean and variance μᵢ:

f(yᵢ | xᵢ, uᵢ) = e^(−λᵢuᵢ) (λᵢuᵢ)^yᵢ / yᵢ!.

The unconditional distribution f(yᵢ | xᵢ) is the expected value (over uᵢ) of f(yᵢ | xᵢ, uᵢ),

f(yᵢ | xᵢ) = ∫₀^∞ [e^(−λᵢuᵢ) (λᵢuᵢ)^yᵢ / yᵢ!] g(uᵢ) duᵢ.
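The mixture integral can be checked numerically. The sketch below is ours; it assumes a gamma heterogeneity density with mean 1.0 and variance 1/θ (the standard choice that leads to the negative binomial model), evaluates the integral by quadrature for a few values of y, and compares the result with the closed-form negative binomial probabilities. The parameter values are hypothetical.

```python
import numpy as np
from scipy import integrate, stats

lam, theta = 1.5, 2.0     # hypothetical conditional mean and gamma parameter

def mixed_pmf(y):
    # integrate the Poisson probability over gamma(theta, scale=1/theta) heterogeneity
    f = lambda u: stats.poisson.pmf(y, lam * u) * stats.gamma.pdf(u, a=theta, scale=1.0 / theta)
    val, _ = integrate.quad(f, 0.0, np.inf)
    return val

for y in range(5):
    nb = stats.nbinom.pmf(y, theta, theta / (theta + lam))   # negative binomial form
    print(y, round(mixed_pmf(y), 6), round(nb, 6))
```

The two columns of probabilities agree, which is the numerical counterpart of the analytical result that mixing the Poisson over a gamma distributed heterogeneity term produces the negative binomial distribution.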
The 5 percent critical value from the chi-squared table is 15.507, so the hypothesis is now rejected. This test is much more general, sincc the form of overdispersion is noL specified, which may explain the difference. Note that this result affizms McCullagh and Nelder's conjecture. 21.9.5 POISSON MODELS FOR PANEL DATA The familiar approaches to accommodating heterogeneity in panel data have fairly straightforward extensions in the count data setting. [Hausman, Hall, and Griliches (1984) give full details for these models.] We will examine them for the Poissen model. The authors [and Allison (2000)] also give results for the negative binomial model. Consider first a fixed effects approach. The Poisson distribution is assumcd to have conditional mean . log Ar = 8x + a. wherc now, x; has been redefincd to exclude the constant term. The approach used in the linear model of transforming »; to group mcan deviations does not remove the heterogeneity, nor does it Icave a Poisson distribution lor the transformed variable. However, the Poisson model with fixed effects can be fit using the methods described for the probit model in Section 21.5,1b, The extension to the Poisson model requires only the minor modificatons, gy = (Yu — Ay) and A =— Air. Everything clsc in that derivation applies with only a simple change in the notation. The first order conditions tor maximizing the log-likelihood function for the Poisson model will include t an £ , =5 Qu em) =0 where pj =X. tl da; This implies an explicit solution for «; in terms of 8 in this model, é =ln (47 Eate) =In (2) mn Ds fia hs Unlike the regression or the probit model, this does not requirc that there be within group variation in y;,-—all the values can be the same. It docs require that at least one observation for individual 7 be nonzero, however. The rest of the solution for the fixed effects estimator follows the same lines as that for the probit model, An alternative approach. albeit with little practical gain. would be to concentrate the log likelihood function by inscrting this solution for «; back into the original log likclihood, then maximizing Lhe resulting function of 8. While logically this makes sensc, the approach suggested earlier for thc probit model is simpler to implement. An estimator that is not a function of the fixed effects is found by obtaining the joint distribution of (ya, ..., Y;7) conditional on their sum, For the Poisson model, a 748 CHAPTER 21 + Models for Discrete Choice close cousin to the logit model discussed earlier is produced: om) «E po se cm) o where eSBra ESB Pi=SI va E Deuéta O, The contribution of group í to the conditional log-likelihood is So T nL;=5 ln pa. 1=1 Note, once again, that the contribution to In Z of a group in which y; =0 in every period is zero. Cameron and Trivedi (1998) have shown that these two approaches give identical results, The fixed effects approach has the same flaws and virtues in this setting as in the probit case. Ttis not necessary to assume that the heterogeneity is uncorrelated with the included, exogenous variables. If tho uncorrelatedness of the regressors and the hetero- geneity can be maintained, then the random effects model is an attractive alternative model. Once again, the approach used in the linear regression model, partial deviations from the group means followed by generalized least squares (see Chapter 13), is not usable here. 
The approach used is to formulate the joint probability conditioncd upon the heterogencity, then integrate it out of the joint distribution. Thus, we form 4 POr esem tu) = | | pOu lu). t=1 Then the random cficet is swept out by obtaining Plyas = f pOn corn ds 4 = [ PO» = BuloOn sin ud]. This is exactly the approach used earlier to condition the heterogencity out of the Poisson model to produce the negative binomial model. I£, as before, wc take p(y |ui) to be Poisson with mean A; = exp(x;,8 + 1) in which exp(u;) is distributed as gamma with mean 1.0 and variance 1/e, then the preceding steps produce the negative binomial distribution, Yin | ui) ge) du [rorE, Ei nilEta CHAPTER 21 + Models for Discrete Choice 749 where 9 O; 75 For estimation purposes, we have a negative binomial distribution for Y; = 35, yi with mean A; = 5 Ai There is a mild preference in the received literature tor the fixed effects estimators over the random effcets estimators. The virtue of dispensing with the assumption of uncorrclatedness of the regressors and the group specific effects is substantial. On the other hand, the assumption does come at a cost. In order to compute the probabilitics or lhe marginal effects it is necessarily to estimate the constants, o;. The unscalcd coefficients in these models are of limited usefulness because of the nonlinearity of the conditional mean functions. Other approaches to the random effects model have been proposed. Greene (1994, 1995a) and Terza (1995) specify a normally distributed heterogeneity, on the assumption that this is a more natural distribution [or the aggregate of small independent effects. Brannas and Johanssen (1994) have suggested a semiparametric approach based on the GMM estimator by superimposing a very gencral lorm of heterogeneity on the Poisson modcl. They assume that conditioned on a random effect e;. Yi: is distributed as Poisson with mean ;A;. The covariance structure of &;, is allowed to be fully gen- cral. Forts =1,..., T, Var[eu] = 07. Covlen, e;:] = yiy(lt — s|). For long time series, this model is likely to have far too many parameters to bc identified without some re- strictions. such as first-order homogencity (8; = 8 Yi), uncorrelatedness across groups, (476) = O for i £ ;]. groupwise homoscedasticity (o? = o? Yi), and nonautocorrelated- ness [7 (1) = 0Vr 0]. With these assumptions, the estimation procedure they propose is similar to (he procedures suggested earlier. 1f the model imposes enough restrictions, then the parameters can be estimated by the method o! moments. The authors discuss estimation of the model in its ull generality. Finally, the latent class model discussed in Section 16.2.3 and the random parameters model in Section 17.8 extend naturally to the Poisson model. Indeed, most of the received applications of the latent class structure have been in the Poisson regression framework. [See Greene (2001) for a survey.] 21.9.6 HURDLE AND ZERO-ALTERED POISSON MODELS In some settings, the zero outcome of the data generating process is qualitatively differ- ent from the positive ones. Mullahy (1986) argues that this fact constitutes a shortcoming of the Poisson (or negative binomial) model and suggests a “hurdle” mode] as an alter- native.” In his formulation, a binary probability model determines whether à zero or a nonzero outcome occurs, then, in Lhe laiter case, a (truncated) Poisson distribution describes the positive outcomes. The model is ProbGg = 0|x,) = e”? 
TABLE 21.19  Estimates of a Split Population Model

                    Poisson and Logit Models                     Split Population Model
Variable            Poisson for y         Logit for y > 0        Poisson for y         Logit for y > 0
Constant            −0.8196   (0.1453)    −2.2442   (0.2515)      1.0010   (0.1267)     2.1540   (0.2900)
Age                  0.007181 (0.003978)   0.02245  (0.007313)   −0.005073 (0.003218)  −0.02469  (0.008451)
Income               0.07790  (0.02394)    0.06931  (0.04198)     0.01332  (0.02249)   −0.1167   (0.04941)
Expend              −0.004102 (0.0003740)                         −0.002359 (0.0001948)
Own/Rent                                  −0.3766   (0.1578)                             0.3865   (0.1709)
ln L               −1396.719             −645.5649              −1093.0280
Predicted zeros, nP̂(0)   938.6                                    1061.5

intuition that the Poisson model does not adequately describe the data; the value is 6.9788. Using the model parameters to compute a prediction of the number of zeros, it is clear that the splitting model does perform better than the basic Poisson regression.

21.10  SUMMARY AND CONCLUSIONS

This chapter has surveyed techniques for modeling discrete choice. We examined four classes of models: binary choice, ordered choice, multinomial choice, and models for counts. The first three of these are quite far removed from the regression models (linear and nonlinear) that have been the focus of the preceding 20 chapters. The most important difference concerns the modeling approach. Up to this point, we have been primarily interested in modeling the conditional mean function for outcomes that vary continuously. In this chapter, we have shifted our approach to one of modeling the conditional probabilities of events.

Modeling binary choice, the decision between two alternatives, is a growth area in the applied econometrics literature. Maximum likelihood estimation of fully parameterized models remains the mainstay of the literature. But we also considered semiparametric and nonparametric forms of the model and examined models for time series and panel data. The ordered choice model is a natural extension of the binary choice setting and also a convenient bridge between models of choice between two alternatives and more complex models of choice among multiple alternatives. Multinomial choice modeling is likewise a large field, both within economics and, especially, in many other fields, such as marketing, transportation, political science, and so on. The multinomial logit model and many variations of it provide an especially rich framework within which modelers have carefully matched behavioral modeling to empirical specification and estimation. Finally, models of count data are closer to regression models than the other three fields. The Poisson regression model is essentially a nonlinear regression, but, as in the other cases, it is more fruitful to do the modeling in terms of the probabilities of discrete choice rather than as a form of regression analysis.
Key Terms and Concepts

* Attributes  * Binary choice model  * Bivariate probit  * Bootstrapping
* Butler and Moffitt method  * Choice based sampling  * Chow test
* Conditional likelihood function  * Conditional logit  * Count data
* Fixed effects model  * Full information ML  * Generalized residual
* Goodness of fit measure  * Grouped data  * Heterogeneity  * Heteroscedasticity
* Incidental parameters problem  * Inclusive value
* Independence from irrelevant alternatives  * Index function model
* Individual data  * Initial conditions  * Kernel density estimator
* Kernel function  * Lagrange multiplier test  * Latent regression
* Likelihood equations  * Likelihood ratio test  * Limited information ML
* Linear probability model  * Logit  * Marginal effects  * Maximum likelihood
* Maximum score estimator  * Maximum simulated likelihood
* Mean-squared deviation  * Minimal sufficient statistic
* Minimum chi-squared estimator  * Multinomial logit  * Multinomial probit
* Multivariate probit  * Negative binomial model  * Nested logit
* Nonnested models  * Normalization  * Ordered choice model  * Overdispersion
* Persistence  * Poisson model  * Probit  * Proportions data  * Quadrature
* Qualitative choice  * Qualitative response  * Quasi-MLE  * Random coefficients
* Random effects model  * Random parameters model  * Random utility model
* Ranking  * Recursive model  * Robust covariance estimation
* Sample selection  * Scoring method  * Semiparametric estimation
* State dependence  * Unbalanced sample  * Unordered  * Weibull model

Exercises

1. A binomial probability model is to be based on the following index function model:

   y* = α + βd + ε,
   y = 1 if y* > 0,
   y = 0 otherwise.

   The only regressor, d, is a dummy variable. The data consist of 100 observations that have the following:

             d
             0    1
   y   0    24   28
       1    32   16

   Obtain the maximum likelihood estimators of α and β, and estimate the asymptotic standard errors of your estimates. Test the hypothesis that β equals zero by using a Wald test (asymptotic t test) and a likelihood ratio test. Use the probit model and then repeat, using the logit model. Do your results change? [Hint: Formulate the log-likelihood in terms of α and δ = α + β.]

2. Suppose that a linear probability model is to be fit to a set of observations on a dependent variable y that takes values zero and one, and a single regressor x that varies continuously across observations. Obtain the exact expressions for the least squares slope in the regression in terms of the mean(s) and variance of x, and interpret the result.

3. Given the data set

   y   1  0  0  1  1  0  0  1  1  1
   x   9  2  5  4  6  7  3  5  2  6

   estimate a probit model and test the hypothesis that x is not influential in determining the probability that y equals one.

4. Construct the Lagrange multiplier statistic for testing the hypothesis that all the slopes (but not the constant term) equal zero in the binomial logit model. Prove that the Lagrange multiplier statistic is nR² in the regression of (yᵢ − p̄) on the x's, where p̄ is the sample proportion of 1's.

5. We are interested in the ordered probit model. Our data consist of 250 observations, of which the responses are

   y   0   1   2   3   4
   n  50  40  45  80  35

   Using the preceding data, obtain maximum likelihood estimates of the unknown parameters of the model. [Hint: Consider the probabilities as the unknown parameters.]
6. The following hypothetical data give the participation rates in a particular type of recycling program and the number of trucks purchased for collection by 10 towns in a small mid-Atlantic state:

   Town             1    2    3    4    5    6    7    8    9    10
   Trucks          160  250  170  365  210  206  203  305  270  340
   Participation%   ua   74    8   87   62   83   as   84    a   7%

   The town of Eleven is contemplating initiating a recycling program but wishes to achieve a 95 percent rate of participation. Using a probit model for your analysis,
   a. How many trucks would the town expect to have to purchase in order to achieve its goal? [Hint: See Section 21.4.6.] Note that you will use nᵢ = 1.
   b. If trucks cost $20,000 each, then is a goal of 90 percent reachable within a budget of $6.5 million? (That is, should they expect to reach the goal?)
   c. According to your model, what is the marginal value of the 301st truck in terms of the increase in the percentage participation?

7. A data set consists of n = n₁ + n₂ + n₃ observations on y and x. For the first n₁ observations, y = 1 and x = 1. For the next n₂ observations, y = 0 and x = 1. For the last n₃ observations, y = 0 and x = 0. Prove that neither (21-19) nor (21-21) has a solution.