APPENDIX A
MATRIX ALGEBRA
A.1 TERMINOLOGY
A matrix is a rectangular array of numbers, denoted
    A = [a_ik] = [A]_ik =  [ a_11  a_12  ···  a_1K ]
                           [ a_21  a_22  ···  a_2K ]
                           [   ⋮      ⋮           ⋮ ]
                           [ a_n1  a_n2  ···  a_nK ].    (A-1)
The typical element is used to denote the matrix. A subscripted element of a matrix is always read as a_{row,column}. An example is given in Table A.1. In these data, the rows are identified with years and the columns with particular variables.
A vector is an ordered set of numbers arranged either in a row or a column. In view of the preceding, a row vector is also a matrix with one row, whereas a column vector is a matrix with one column. Thus, in Table A.1, the five variables observed for 1972 (including the date) constitute a row vector, whereas the time series of nine values for consumption is a column vector.
A matrix can also be viewed as a set of column vectors or as a set of row vectors.¹ The dimensions of a matrix are the numbers of rows and columns it contains. "A is an n × K matrix" (read "n by K") will always mean that A has n rows and K columns. If n equals K, then A is a square matrix. Several particular types of square matrices occur frequently in econometrics.
• A symmetric matrix is one in which a_ik = a_ki for all i and k.
• A diagonal matrix is a square matrix whose only nonzero elements appear on the main diagonal, that is, moving from upper left to lower right.
• A scalar matrix is a diagonal matrix with the same value in all diagonal elements.
• An identity matrix is a scalar matrix with ones on the diagonal. This matrix is always denoted I. A subscript is sometimes included to indicate its size, or order. For example, I_n denotes an n × n identity matrix.
• A triangular matrix is one that has only zeros either above or below the main diagonal. If the zeros are above the diagonal, the matrix is lower triangular.
A.2 ALGEBRAIC MANIPULATION OF MATRICES
A.2.1 EQUALITY OF MATRICES
Matrices (or vectors) A and B are equal if and only if they have the same dimensions and each element of A equals the corresponding element of B. That is,
    A = B if and only if a_ik = b_ik for all i and k.    (A-2)
¹ Henceforth, we shall denote a matrix by a boldfaced capital letter, as is A in (A-1), and a vector as a boldfaced lowercase letter, as in a. Unless otherwise noted, a vector will always be assumed to be a column vector.
TABLE A.1  Matrix of macroeconomic data

                    Column
       1       2                      3                      4              5
Row    Year    Consumption            GNP                    GNP Deflator   Discount Rate
               (billions of dollars)  (billions of dollars)                 (N.Y. Fed., avg.)
1      1972     737.1                 1185.9                 1.0000          4.50
2      1973     812.0                 1326.4                 1.0575          6.44
3      1974     808.1                 1434.2                 1.1508          7.83
4      1975     976.4                 1549.2                 1.2579          6.25
5      1976    1084.3                 1718.0                 1.3234          5.50
6      1977    1204.4                 1918.3                 1.4005          5.46
7      1978    1346.5                 2163.9                 1.5042          7.46
8      1979    1507.2                 2417.8                 1.6342         10.28
9      1980    1667.2                 2633.1                 1.7864         11.77

Source: Data from the Economic Report of the President (Washington, D.C.: U.S. Government Printing Office, 1983).
A.2.2 TRANSPOSITION
The transpose of a matrix A, denoted A', is obtained by creating the matrix whose ith row is the ith column of the original matrix. Thus, if B = A', then each column of A will appear as the corresponding row of B. If A is n × K, then A' is K × n.
An equivalent definition of the transpose of a matrix is
    B = A'  ⇔  b_ik = a_ki for all i and k.    (A-3)
The definition of a symmetric matrix implies that
    if (and only if) A is symmetric, then A = A'.    (A-4)
It also follows from the definition that for any A,
    (A')' = A.    (A-5)
Finally, the transpose of a column vector, a, is a row vector:
    a' = [a_1  a_2  ···  a_n].
A.2.3 MATRIX ADDITION
The operations of addition and subtraction are extended to matrices by defining
    C = A + B = [a_ik + b_ik],    (A-6)
    A − B = [a_ik − b_ik].    (A-7)
Matrices cannot be added unless they have the same dimensions, in which case they are said to be conformable for addition. A zero matrix or null matrix is one whose elements are all zero. In the addition of matrices, the zero matrix plays the same role as the scalar 0 in scalar addition; that is,
    A + 0 = A.    (A-8)
It follows from (A-6) that matrix addition is commutative,
    A + B = B + A.    (A-9)
• Transpose of a product: (AB)' = B'A'.    (A-23)
• Transpose of an extended product: (ABC)' = C'B'A'.    (A-24)
A.2.7 SUMS OF VALUES
Denote by i a vector that contains a column of ones. Then,
    Σ_{i=1}^n x_i = x_1 + x_2 + ··· + x_n = i'x.    (A-25)
If all elements in x are equal to the same constant a, then x = ai and
    Σ_{i=1}^n x_i = i'(ai) = a(i'i) = na.    (A-26)
For any constant a and vector x,
    Σ_{i=1}^n ax_i = a Σ_{i=1}^n x_i = a i'x.    (A-27)
If a = 1/n, then we obtain the arithmetic mean,
    x̄ = (1/n) Σ_{i=1}^n x_i = (1/n) i'x,    (A-28)
from which it follows that
    Σ_{i=1}^n x_i = i'x = n x̄.
The sum of squares of the elements in a vector x is
    Σ_{i=1}^n x_i² = x'x,    (A-29)
while the sum of the products of the n elements in vectors x and y is
    Σ_{i=1}^n x_i y_i = x'y.    (A-30)
By the definition of matrix multiplication,
    [X'X]_kl = Σ_{i=1}^n x_ik x_il    (A-31)
is the inner product of the kth and lth columns of X. For example, for the data set given in Table A.1, if we define X as the 9 × 3 matrix containing (year, consumption, GNP), then
    [X'X]_23 = Σ_{t=1972}^{1980} consumption_t × GNP_t = 737.1(1185.9) + ··· + 1667.2(2633.1)
             = 19,743,711.34.
If X is n × K, then [again using (A-14)]
    X'X = Σ_{i=1}^n x_i x_i'.
This form shows that the K × K matrix X'X is the sum of n K × K matrices, each formed from a single row (year) of X. For the example given earlier, this sum is of nine 3 × 3 matrices, each formed from one row (year) of the original data matrix.
A.2.8 A USEFUL IDEMPOTENT MATRIX
A fundamental matrix in statistics is the one that is used to transform data to deviations from their mean. First,
    x̄ i = [x̄  x̄  ···  x̄]' = (1/n) i i'x.    (A-32)
The matrix (1/n)ii' is an n × n matrix with every element equal to 1/n. The set of values in deviations form is
    [x_1 − x̄, x_2 − x̄, ..., x_n − x̄]' = [x − x̄i] = [x − (1/n)ii'x].    (A-33)
Since x = Ix,
    [x − (1/n)ii'x] = [Ix − (1/n)ii'x] = [I − (1/n)ii']x = M⁰x.    (A-34)
Henceforth, the symbol M⁰ will be used only for this matrix. Its diagonal elements are all (1 − 1/n), and its off-diagonal elements are −1/n. The matrix M⁰ is primarily useful in computing sums of squared deviations. Some computations are simplified by the result
    M⁰i = [I − (1/n)ii']i = i − (1/n)i(i'i) = 0,    (A-35)
which implies that i'M⁰ = 0'. The sum of deviations about the mean is then
    Σ_{i=1}^n (x_i − x̄) = i'[M⁰x] = 0'x = 0.
For a single variable x, the sum of squared deviations about the mean is
    Σ_{i=1}^n (x_i − x̄)² = (Σ_{i=1}^n x_i²) − n x̄².    (A-36)
In matrix terms,
    Σ_{i=1}^n (x_i − x̄)² = (x − x̄i)'(x − x̄i) = (M⁰x)'(M⁰x) = x'M⁰'M⁰x.
Two properties of M⁰ are useful at this point. First, since all off-diagonal elements of M⁰ equal −1/n, M⁰ is symmetric. Second, as can easily be verified by multiplication, M⁰ is equal to its square; M⁰M⁰ = M⁰.
DEFINITION A.1 Idempotent Matrix
An idempotent matrix, M, is one that is equal to its square, that is, M² = MM = M. If M is a symmetric idempotent matrix (all of the idempotent matrices we shall encounter are symmetric), then M'M = M.

Thus, M⁰ is a symmetric idempotent matrix. Combining results, we obtain
    Σ_{i=1}^n (x_i − x̄)² = x'M⁰x.    (A-37)
Consider constructing a matrix of sums of squares and cross products in deviations from the column means. For two vectors x and y,
    Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) = (M⁰x)'(M⁰y),    (A-38)
so
    [ Σ_i (x_i − x̄)²             Σ_i (x_i − x̄)(y_i − ȳ) ]     [ x'M⁰x   x'M⁰y ]
    [ Σ_i (y_i − ȳ)(x_i − x̄)     Σ_i (y_i − ȳ)²          ]  =  [ y'M⁰x   y'M⁰y ].    (A-39)
If we put the two column vectors x and y in an n × 2 matrix Z = [x, y], then M⁰Z is the n × 2 matrix in which the two columns of data are in mean deviation form. Then
    (M⁰Z)'(M⁰Z) = Z'M⁰'M⁰Z = Z'M⁰Z.
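The following minimal NumPy sketch (not from the text) constructs M⁰ for a short data vector and verifies the properties just listed: symmetry, idempotency, M⁰i = 0, and x'M⁰x = Σ(x_i − x̄)².

```python
import numpy as np

x = np.array([737.1, 812.0, 808.1, 976.4, 1084.3])   # any data vector will do
n = len(x)
i = np.ones(n)
M0 = np.eye(n) - np.outer(i, i) / n                   # M0 = I - (1/n) i i'

print(np.allclose(M0, M0.T))                          # symmetric
print(np.allclose(M0 @ M0, M0))                       # idempotent: M0 M0 = M0
print(np.allclose(M0 @ i, 0))                         # M0 i = 0, (A-35)
print(np.allclose(x @ M0 @ x, np.sum((x - x.mean())**2)))   # (A-37)
```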
A.3 GEOMETRY OF MATRICES
A.3.1 VECTOR SPACES
The K elements of a column vector
    a = [a_1  a_2  ···  a_K]'
can be viewed as the coordinates of a point in a K-dimensional space, as shown in Figure A.1 for two dimensions, or as the definition of the line segment connecting the origin and the point defined by a.
Two basic arithmetic operations are defined for vectors, scalar multiplication and addition. A scalar multiple of a vector, a, is another vector, say a*, whose coordinates are the scalar multiple of a's coordinates. Thus, in Figure A.1,
[Figure: vectors plotted in two dimensions; horizontal axis: first coordinate, vertical axis: second coordinate.]
Since a* is a multiple of a, a and a* are linearly dependent. For another example, if
    a = [1  2]',   b = [3  3]',   and   c = [10  14]',
then
    2a + b − ½c = 0,
so a, b, and c are linearly dependent. Any of the three possible pairs of them, however, are linearly independent.

DEFINITION A.5 Linear Independence
A set of vectors is linearly independent if and only if the only solution to
    α_1 a_1 + α_2 a_2 + ··· + α_K a_K = 0
is
    α_1 = α_2 = ··· = α_K = 0.
The preceding implies the following equivalent definition of a basis.
DEFINITION A.6 Basis for a Vector Space
A basis for a vector space of K dimensions is any set of K linearly independent vectors in that vector space.

Since any (K + 1)st vector can be written as a linear combination of the K basis vectors, it follows that any set of more than K vectors in R^K must be linearly dependent.
A.3.4 SUBSPACES
DEFINITION A.7 Spanning Vectors
The set of all linear combinations of a set of vectors is the vector space that is spanned by those vectors.

For example, by definition, the space spanned by a basis for R^K is R^K. An implication of this is that if a and b are a basis for R² and c is another vector in R², the space spanned by [a, b, c] is, again, R². Of course, c is superfluous. Nonetheless, any vector in R² can be expressed as a linear combination of a, b, and c. (The linear combination will not be unique. Suppose, for example, that a and c are also a basis for R².)
Consider the set of three coordinate vectors whose third element is zero. In particular,
    a' = [a_1  a_2  0]   and   b' = [b_1  b_2  0].
Vectors a and b do not span the three-dimensional space R³. Every linear combination of a and b has a third coordinate equal to zero; thus, for instance, c' = [1  2  3] could not be written as a linear combination of a and b. If (a_1 b_2 − a_2 b_1) is not equal to zero [see (A-41)], however, then any vector whose third element is zero can be expressed as a linear combination of a and b. So, although a and b do not span R³, they do span something, the set of vectors in R³ whose third element is zero. This area is a plane (the "floor" of the box in a three-dimensional figure). This plane in R³ is a subspace, in this instance, a two-dimensional subspace. Note that it is not R²; it is the set of vectors in R³ whose third coordinate is 0. Any plane in R³, regardless of how it is oriented, forms a two-dimensional subspace. Any two independent vectors that lie in that subspace will span it. But without a third vector that points in some other direction, we cannot span any more of R³ than this two-dimensional part of it. By the same logic, any line in R³ is a one-dimensional subspace, in this case, the set of all vectors in R³ whose coordinates are multiples of those of the vector that defines the line. A subspace is a vector space in all the respects in which we have defined it. We emphasize that it is not a vector space of lower dimension. For example, R² is not a subspace of R³. The essential difference is the number of dimensions in the vectors. The vectors in R³ that form a two-dimensional subspace are still three-element vectors; they all just happen to lie in the same plane.
The space spanned by a set of vectors in R^K has at most K dimensions. If this space has fewer than K dimensions, it is a subspace, or hyperplane. But the important point in the preceding discussion is that every set of vectors spans some space; it may be the entire space in which the vectors reside, or it may be some subspace of it.
A.3.5 RANK OF A MATRIX
We view a matrix as a set of column vectors. The number of columns in the matrix equals the number of vectors in the set, and the number of rows equals the number of coordinates in each column vector.
DEFINITION A.8 Column Space
The column space of a matrix is the vector space that is spanned by its column
vectors.
If the matrix contains K rows, its column space might have K dimensions. But, as we have seen, it might have fewer dimensions; the column vectors might be linearly dependent, or there might be fewer than K of them. Consider the matrix
    A = [ 1  5  6 ]
        [ 2  6  8 ]
        [ 7  1  8 ].
It contains three vectors from R³, but the third is the sum of the first two, so the column space of this matrix cannot have three dimensions. Nor does it have only one, since the three columns are not all scalar multiples of one another. Hence, it has two, and the column space of this matrix is a two-dimensional subspace of R³.
DEFINITION A.9 Column Rank
The column rank of a matrix is the dimension of the vector space that is spanned by its column vectors.

It follows that the column rank of a matrix is equal to the largest number of linearly independent column vectors it contains. The column rank of A is 2. For another specific example, consider
    B = [ 1  2  3 ]
        [ 5  1  5 ]
        [ 6  4  5 ]
        [ 3  1  4 ].
It can be shown (we shall see how later) that this matrix has a column rank equal to 3. Since each column of B is a vector in R⁴, the column space of B is a three-dimensional subspace of R⁴.
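As a numerical illustration (a NumPy sketch, not part of the text), the ranks of the two example matrices can be computed directly; the last line also previews the equality of row rank and column rank discussed below.

```python
import numpy as np

A = np.array([[1.0, 5.0, 6.0],
              [2.0, 6.0, 8.0],
              [7.0, 1.0, 8.0]])      # third column = first column + second column
B = np.array([[1.0, 2.0, 3.0],
              [5.0, 1.0, 5.0],
              [6.0, 4.0, 5.0],
              [3.0, 1.0, 4.0]])

print(np.linalg.matrix_rank(A))      # 2
print(np.linalg.matrix_rank(B))      # 3
print(np.linalg.matrix_rank(B.T))    # 3: row rank equals column rank
```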
Consider, instead, the set of vectors obtained by using the rows of B instead of the columns. The new matrix would be
    C = [ 1  5  6  3 ]
        [ 2  1  4  1 ]
        [ 3  5  5  4 ].
This matrix is composed of four column vectors from R³. (Note that C is B'.) The column space of C is at most R³, since four vectors in R³ must be linearly dependent. In fact, the column space of C is R³.
For 2 × 2 matrices, the computation of the determinant is
    | a  c |
    | b  d |  =  ad − bc.    (A-50)
Notice that it is a function of all the elements of the matrix. This statement will be true, in general. For more than two dimensions, the determinant can be obtained by using an expansion by cofactors. Using any row, say i, we obtain
    |A| = Σ_{k=1}^K a_ik (−1)^{i+k} |A_(ik)|,   i = 1, ..., K,    (A-51)
where A_(ik) is the matrix obtained from A by deleting row i and column k. The determinant of A_(ik) is called a minor of A. When the correct sign, (−1)^{i+k}, is added, it becomes a cofactor. This operation can be done using any column as well. For example, a 4 × 4 determinant becomes a sum of four 3 × 3s, whereas a 5 × 5 is a sum of five 4 × 4s, each of which is a sum of four 3 × 3s, and so on. Obviously, it is a good idea to base (A-51) on a row or column with many zeros in it, if possible. In practice, this rapidly becomes a heavy burden. It is unlikely, though, that you will ever calculate any determinants over 3 × 3 without a computer. A 3 × 3, however, might be computed on occasion; if so, the following shortcut will prove useful:
    | a11  a12  a13 |
    | a21  a22  a23 |  =  a11 a22 a33 + a12 a23 a31 + a13 a21 a32 − a31 a22 a13 − a21 a12 a33 − a11 a23 a32.
    | a31  a32  a33 |
Although (A-48) and (A-49) were given for diagonal matrices, they hold for general matrices C and D. One special case of (A-48) to note is that of c = −1. Multiplying a matrix by −1 does not necessarily change the sign of its determinant. It does so only if the order of the matrix is odd. By using the expansion by cofactors formula, an additional result can be shown:
    |A| = |A'|.    (A-52)
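A small sketch of the cofactor expansion (A-51), written in NumPy and checked against the library determinant, may help; the 3 × 3 matrix used here is arbitrary.

```python
import numpy as np

def det_cofactor(A):
    """Determinant by cofactor expansion along the first row, (A-51)."""
    K = A.shape[0]
    if K == 1:
        return A[0, 0]
    total = 0.0
    for k in range(K):
        minor = np.delete(np.delete(A, 0, axis=0), k, axis=1)   # delete row 1 and column k+1
        total += A[0, k] * (-1) ** k * det_cofactor(minor)      # sign (-1)^(1 + k+1)
    return total

A = np.array([[1.0, 5.0, 2.0],
              [2.0, 6.0, 8.0],
              [7.0, 1.0, 8.0]])
print(det_cofactor(A), np.linalg.det(A))                    # equal, up to rounding
print(np.isclose(np.linalg.det(A), np.linalg.det(A.T)))     # |A| = |A'|, (A-52)
```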
A.3.7 A LEAST SQUARES PROBLEM
Given a vector y and a matrix X, we are interested in expressing y as a linear combination of the columns of X. There are two possibilities. If y lies in the column space of X, then we shall be able to find a vector b such that
    y = Xb.    (A-53)
Figure A.3 illustrates such a case for three dimensions in which the two columns of X both have a third coordinate equal to zero. Only ys whose third coordinate is zero, such as y⁰ in the figure, can be expressed as Xb for some b. For the general case, assuming that y is, indeed, in the column space of X, we can find the coefficients b by solving the set of equations in (A-53). The solution is discussed in the next section.
Suppose, however, that y is not in the column space of X. In the context of this example, suppose that y's third component is not zero. Then there is no b such that (A-53) holds. We can, however, write
    y = Xb + e,    (A-54)
where e is the difference between y and Xb. By this construction, we find an Xb that is in the column space of X, and e is the difference, or "residual." Figure A.3 shows two examples, y and
If i equals k, then the determinant is a principal minor.
[Figure A.3: the least squares projection; the axes shown are the second and third coordinates.]
y*. For the present, we consider only y. We are interested in finding the b such that y is as close as possible to Xb in the sense that e is as short as possible.

DEFINITION A.10 Length of a Vector
The length, or norm, of a vector e is
    ||e|| = √(e'e).    (A-55)

The problem is to find the b for which
    ||e|| = ||y − Xb||
is as small as possible. The solution is that b that makes e perpendicular, or orthogonal, to Xb.

DEFINITION A.11 Orthogonal Vectors
Two nonzero vectors a and b are orthogonal, written a ⊥ b, if and only if
    a'b = b'a = 0.

Returning once again to our fitting problem, we find that the b we seek is that for which
    e ⊥ Xb.
Expanding this set of equations gives the requirement
    (Xb)'e = 0
           = b'X'y − b'X'Xb
           = b'[X'y − X'Xb],
or, assuming b is not 0, the set of equations
    X'y = X'Xb.
The means of solving such a set of equations is the subject of Section A.4.
In Figure A.3, the linear combination Xb is called the projection of y into the column space of X. The figure is drawn so that, although y and y* are different, they are similar in that the projection of y lies on top of that of y*. The question we wish to pursue here is, Which vector, y or y*, is closer to its projection in the column space of X? Superficially, it would appear that y is closer, because e is shorter than e*. Yet y* is much more nearly parallel to its projection than y, so the only reason that its residual vector is longer is that y* is longer compared with y. A measure of comparison that would be unaffected by the length of the vectors is the angle between the vector and its projection (assuming that angle is not zero). By this measure, θ* is smaller than θ, which would reverse the earlier conclusion.

THEOREM A.2 The Cosine Law
The angle θ between two vectors a and b satisfies
    cos θ = a'b / (||a|| · ||b||).

The two vectors in the calculation would be y or y* and Xb or (Xb)*. A zero cosine implies that the vectors are orthogonal. If the cosine is one, then the angle is zero, which means that the vectors are the same. (They would be if y were in the column space of X.) By dividing by the lengths, we automatically compensate for the length of y. By this measure, we find in Figure A.3 that y* is closer to its projection, (Xb)*, than y is to its projection, Xb.
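The sketch below (illustrative NumPy code, with vectors chosen only for this example and not those of Figure A.3) finds b from the normal equations, confirms that the residual is orthogonal to the columns of X, and evaluates the cosine in Theorem A.2 for y and its projection.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 1.0],
              [0.0, 0.0]])             # both columns have third coordinate zero
y = np.array([2.0, 4.0, 3.0])          # third coordinate nonzero: y is not in the column space

b = np.linalg.solve(X.T @ X, X.T @ y)  # normal equations X'y = X'X b
e = y - X @ b                          # residual

print(np.allclose(X.T @ e, 0))         # e is orthogonal to every column of X, hence to Xb
cos_theta = (y @ (X @ b)) / (np.linalg.norm(y) * np.linalg.norm(X @ b))
print(cos_theta)                       # cosine of the angle between y and its projection
```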
A.4 SOLUTION OF A SYSTEM OF LINEAR
EQUATIONS
Consider the set of n linear equations
    Ax = b,    (A-56)
in which the K elements of x constitute the unknowns, A is a known matrix of coefficients, and b is a specified vector of values. We are interested in knowing whether a solution exists; if so, then how to obtain it; and, finally, if it does exist, then whether it is unique.
A.4.1 SYSTEMS OF LINEAR EQUATIONS
For most of our applications, we shall consider only square systems of equations, that is, those in which A is a square matrix. In what follows, therefore, we take n to equal K. Since the number of rows in A is the number of equations, whereas the number of columns in A is the number of variables, this case is the familiar one of "n equations in n unknowns."
There are two types of systems of equations.
Note the condition preceding (A-64). It may be that AB is a square, nonsingular matrix when neither A nor B is even square. (Consider, for example, A'A.) Extending (A-64), we have
    (ABC)⁻¹ = C⁻¹(AB)⁻¹ = C⁻¹B⁻¹A⁻¹.    (A-65)
Recall that for a data matrix X, X'X is the sum of the outer products of the rows of X. Suppose that we have already computed (X'X)⁻¹ for a number of years of data, such as those given at the beginning of this appendix. The following result, which is called an updating formula, shows how to compute the new (X'X)⁻¹ that would result when a new row is added to X:
    [A ± bb']⁻¹ = A⁻¹ ∓ [1 / (1 ± b'A⁻¹b)] A⁻¹bb'A⁻¹.    (A-66)
Note the reversal of the sign in the inverse. Two more general forms of (A-66) that are occasionally useful are
    [A ± bc']⁻¹ = A⁻¹ ∓ [1 / (1 ± c'A⁻¹b)] A⁻¹bc'A⁻¹,    (A-66a)
    [A + BCB']⁻¹ = A⁻¹ − A⁻¹B[C⁻¹ + B'A⁻¹B]⁻¹B'A⁻¹.    (A-66b)
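A numerical check of the rank-one updating formula (A-66a), using NumPy with an arbitrary positive definite A and arbitrary vectors b and c, is sketched below.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 3))
A = Z.T @ Z                      # a positive definite 3 x 3 matrix
b = rng.normal(size=3)
c = rng.normal(size=3)
Ainv = np.linalg.inv(A)

# (A-66a) with the plus sign: [A + b c']^{-1} = A^{-1} - (1/(1 + c'A^{-1}b)) A^{-1} b c' A^{-1}
lhs = np.linalg.inv(A + np.outer(b, c))
rhs = Ainv - (1.0 / (1.0 + c @ Ainv @ b)) * np.outer(Ainv @ b, c @ Ainv)
print(np.allclose(lhs, rhs))     # True
```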
A.4.3 NONHOMOGENEOUS SYSTEMS OF EQUATIONS
For the nonhomogeneous system
    Ax = b,
if A is nonsingular, then the unique solution is
    x = A⁻¹b.
A.4.4 SOLVING THE LEAST SQUARES PROBLEM
We now have the tool needed to solve the least squares problem posed in Section A.3.7. We found the solution vector, b, to be the solution to the nonhomogeneous system X'y = X'Xb. Let a equal the vector X'y and let A equal the square matrix X'X. The equation system is then
    Ab = a.
By the results above, if A is nonsingular, then
    b = A⁻¹a = (X'X)⁻¹(X'y),
assuming that the matrix to be inverted is nonsingular. We have reached the irreducible minimum. If the columns of X are linearly independent, that is, if X has full rank, then this is the solution to the least squares problem. If the columns of X are linearly dependent, then this system has no unique solution.
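A minimal NumPy sketch (random data, purely illustrative) comparing the normal-equations solution b = (X'X)⁻¹X'y with the library least squares routine:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))                     # full column rank with probability one
y = rng.normal(size=20)

b_normal = np.linalg.solve(X.T @ X, X.T @ y)     # b = (X'X)^{-1} X'y
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # library least squares solution
print(np.allclose(b_normal, b_lstsq))            # True
```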
A.5 PARTITIONED MATRICES
In formulating the elements of a matrix, it is sometimes useful to group some of the elements in submatrices. Let
    A = [ 1  4 | 5 ]
        [ 2  9 | 3 ]     [ A11  A12 ]
        [ ————————— ]  =  [ A21  A22 ].
        [ 8  9 | 6 ]
A is a partitioned matrix. The subscripts of the submatrices are defined in the same fashion as those for the elements of a matrix. A common special case is the block-diagonal matrix:
    A = [ A11    0  ]
        [   0   A22 ],
where A11 and A22 are square matrices.
A.5.1 ADDITION AND MULTIPLICATION
OF PARTITIONED MATRICES
For conformably partitioned matrices A and B,
    A + B = [ A11 + B11   A12 + B12 ]
            [ A21 + B21   A22 + B22 ],    (A-67)
and
    AB = [ A11  A12 ] [ B11  B12 ]  =  [ A11 B11 + A12 B21   A11 B12 + A12 B22 ]
         [ A21  A22 ] [ B21  B22 ]     [ A21 B11 + A22 B21   A21 B12 + A22 B22 ].    (A-68)
In all these, the matrices must be conformable for the operations involved. For addition, the dimensions of A_ij and B_ij must be the same. For multiplication, the number of columns in A_ij must equal the number of rows in B_jk for all pairs i and j. That is, all the necessary matrix products of the submatrices must be defined. Two cases frequently encountered are of the form
    [ A1 ]' [ A1 ]  =  [ A1'  A2' ] [ A1 ]  =  [ A1'A1 + A2'A2 ],    (A-69)
    [ A2 ]  [ A2 ]                  [ A2 ]
and
    [ A11    0  ]' [ A11    0  ]  =  [ A11'A11      0      ]
    [   0   A22 ]  [   0   A22 ]     [    0      A22'A22   ].    (A-70)
A.5.2 DETERMINANTS OF PARTITIONED MATRICES
The determinant of a block-diagonal matrix is obtained analogously to that of a diagonal matrix:
    | A11    0  |
    |   0   A22 |  =  |A11| · |A22|.    (A-71)
The determinant of a general 2 × 2 partitioned matrix is
    | A11  A12 |
    | A21  A22 |  =  |A22| · |A11 − A12 A22⁻¹ A21|  =  |A11| · |A22 − A21 A11⁻¹ A12|.    (A-72)
A.5.3 INVERSES OF PARTITIONED MATRICES
The inverse of a block-diagonal matrix is
    [ A11    0  ]⁻¹     [ A11⁻¹     0    ]
    [   0   A22 ]    =  [   0     A22⁻¹  ],    (A-73)
which can be verified by direct multiplication.
For the general 2 × 2 partitioned matrix, one form of the partitioned inverse is
    [ A11  A12 ]⁻¹     [ A11⁻¹(I + A12 F2 A21 A11⁻¹)    −A11⁻¹ A12 F2 ]
    [ A21  A22 ]    =  [        −F2 A21 A11⁻¹                 F2      ],    (A-74)
where
    F2 = (A22 − A21 A11⁻¹ A12)⁻¹.
The upper left block could also be written as
    F1 = (A11 − A12 A22⁻¹ A21)⁻¹.
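The partitioned inverse (A-74) can be verified numerically; the NumPy sketch below uses an arbitrary positive definite matrix partitioned into 2 × 2 blocks.

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.normal(size=(8, 4))
A = Z.T @ Z                                # 4 x 4 positive definite, partitioned 2 + 2
A11, A12 = A[:2, :2], A[:2, 2:]
A21, A22 = A[2:, :2], A[2:, 2:]

A11i = np.linalg.inv(A11)
F2 = np.linalg.inv(A22 - A21 @ A11i @ A12)
upper_left  = A11i @ (np.eye(2) + A12 @ F2 @ A21 @ A11i)
upper_right = -A11i @ A12 @ F2
lower_left  = -F2 @ A21 @ A11i
block_inv = np.block([[upper_left, upper_right],
                      [lower_left, F2]])
print(np.allclose(block_inv, np.linalg.inv(A)))   # (A-74) verified
```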
A.5.4 DEVIATIONS FROM MEANS
Suppose that we begin with a column vector of n values x and let
    A = [   n       Σᵢ xᵢ  ]     [ i'i   i'x ]
        [ Σᵢ xᵢ    Σᵢ xᵢ²  ]  =  [ x'i   x'x ].
We are interested in the lower right-hand element of A⁻¹. Upon using the definition of F2 in (A-74), this is
    F2 = [x'x − (x'i)(i'i)⁻¹(i'x)]⁻¹ = {x'[I − (1/n)ii']x}⁻¹ = (x'M⁰x)⁻¹.
Therefore, the lower right-hand value in the inverse matrix is
    (x'M⁰x)⁻¹ = 1 / Σᵢ (xᵢ − x̄)².
Now, suppose that we replace x with X, a matrix with several columns. We seek the lower right block of (Z'Z)⁻¹, where Z = [i, X]. The analogous result is
    (Z'Z)²² = [X'X − X'i(i'i)⁻¹i'X]⁻¹ = (X'M⁰X)⁻¹,
which implies that the K × K matrix in the lower right corner of (Z'Z)⁻¹ is the inverse of the K × K matrix whose jkth element is Σᵢ (x_ij − x̄_j)(x_ik − x̄_k). Thus, when a data matrix contains a column of ones, the elements of the inverse of the matrix of sums of squares and cross products will be computed from the original data in the form of deviations from the respective column means.
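A short NumPy check of this result, with arbitrary data (not from Table A.1): the lower right block of (Z'Z)⁻¹, where Z = [i, X], equals (X'M⁰X)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 12, 2
X = rng.normal(size=(n, K))
i = np.ones((n, 1))
Z = np.hstack([i, X])                         # Z = [i, X]

M0 = np.eye(n) - np.ones((n, n)) / n          # M0 = I - (1/n) i i'
lower_right = np.linalg.inv(Z.T @ Z)[1:, 1:]  # K x K lower right block of (Z'Z)^{-1}
print(np.allclose(lower_right, np.linalg.inv(X.T @ M0 @ X)))   # True
```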
A.5.5 KRONECKER PRODUCTS
A calculation that helps to condense the notation when dealing with sets of regression models (see Chapters 14 and 15) is the Kronecker product. For general matrices A and B,
    A ⊗ B = [ a11 B   a12 B   ···   a1K B ]
            [ a21 B   a22 B   ···   a2K B ]
            [   ⋮       ⋮              ⋮  ]
            [ an1 B   an2 B   ···   anK B ].    (A-75)
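NumPy provides the Kronecker product directly as np.kron; the sketch below illustrates (A-75) and, as an aside not stated in the text above, the mixed-product rule.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 5],
              [6, 7]])

print(np.kron(A, B))                     # each a_ik is replaced by the block a_ik * B, (A-75)

# Mixed-product rule (a standard property, stated here as an aside): (A kron B)(C kron D) = (AC) kron (BD).
C = np.array([[1, 1], [0, 2]])
D = np.array([[2, 0], [1, 1]])
print(np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D)))
```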
Since the vectors are orthogonal and cᵢ'cᵢ = 1, we have
    C'C = [ c1'c1   c1'c2   ···   c1'cK ]
          [ c2'c1   c2'c2   ···   c2'cK ]
          [   ⋮       ⋮              ⋮  ]
          [ cK'c1   cK'c2   ···   cK'cK ]  =  I.    (A-81)
Result (A-81) implies that
    C' = C⁻¹.    (A-82)
Consequently,
    CC' = CC⁻¹ = I    (A-83)
as well, so the rows as well as the columns of C are orthogonal.
A.6.4 DIAGONALIZATION AND SPECTRAL DECOMPOSITION OF A MATRIX
By premultiplying (A-80) by C' and using (A-81), we can extract the characteristic roots of A.

DEFINITION A.15 Diagonalization of a Matrix
The diagonalization of a matrix A is
    C'AC = C'CΛ = IΛ = Λ.    (A-84)

Alternatively, by postmultiplying (A-80) by C' and using (A-83), we obtain a useful representation of A.

DEFINITION A.16 Spectral Decomposition of a Matrix
The spectral decomposition of A is
    A = CΛC' = Σ_{k=1}^K λ_k c_k c_k'.    (A-85)

In this representation, the K × K matrix A is written as a sum of K rank one matrices. This sum is also called the eigenvalue (or "own" value) decomposition of A. In this connection, the term signature of the matrix is sometimes used to describe the characteristic roots and vectors. Yet another pair of terms for the parts of this decomposition are the latent roots and latent vectors of A.
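The diagonalization (A-84) and spectral decomposition (A-85) can be checked numerically with NumPy's symmetric eigenvalue routine; the matrix below is an arbitrary symmetric example.

```python
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])           # a symmetric matrix

lam, C = np.linalg.eigh(A)                # characteristic roots and orthonormal vectors
print(np.allclose(C.T @ C, np.eye(3)))    # C'C = I, (A-81)
print(np.allclose(C.T @ A @ C, np.diag(lam)))                   # diagonalization, (A-84)
rank_one_sum = sum(lam[k] * np.outer(C[:, k], C[:, k]) for k in range(3))
print(np.allclose(rank_one_sum, A))       # spectral decomposition, (A-85)
```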
A.6.5 RANK OF A MATRIX
The diagonalization result enables us to obtain the rank of a matrix very easily. To do so, we can use the following result.
THEOREM A.3 Rank of a Product
For any matrix A and nonsingular matrices B and C, the rank of BAC is equal to the rank of A. The proof is simple. By (A-45), rank(BAC) = rank[(BA)C] = rank(BA). By (A-43), rank(BA) = rank(A'B'), and applying (A-45) again, rank(A'B') = rank(A') since B' is nonsingular if B is nonsingular (once again, by A-43). Finally, applying (A-43) again to obtain rank(A') = rank(A) gives the result.

Since C and C' are nonsingular, we can use them to apply this result to (A-84). By an obvious substitution,
    rank(A) = rank(Λ).    (A-86)
Finding the rank of Λ is trivial. Since Λ is a diagonal matrix, its rank is just the number of nonzero values on its diagonal. By extending this result, we can prove the following theorems. (Proofs are brief and are left for the reader.)

THEOREM A.4 Rank of a Symmetric Matrix
The rank of a symmetric matrix is the number of nonzero characteristic roots it contains.

Note how this result enters the spectral decomposition given above. If any of the characteristic roots are zero, then the number of rank one matrices in the sum is reduced correspondingly. It would appear that this simple rule will not be useful if A is not square. But recall that
    rank(A) = rank(A'A).    (A-87)
Since A'A is always square, we can use it instead of A. Indeed, we can use it even if A is square, which leads to a fully general result.

THEOREM A.5 Rank of a Matrix
The rank of any matrix A equals the number of nonzero characteristic roots in A'A.

Since the row rank and column rank of a matrix are equal, we should be able to apply Theorem A.5 to AA' as well. This process, however, requires an additional result.

THEOREM A.6 Roots of an Outer Product Matrix
The nonzero characteristic roots of AA' are the same as those of A'A.
The proof is left as an exercise. A useful special case the reader can examine is the characteristic roots of aa' and a'a, where a is an n × 1 vector.
If a characteristic root of a matrix is zero, then we have Ac = 0. Thus, if the matrix has a zero root, it must be singular. Otherwise, no nonzero c would exist. In general, therefore, a matrix is singular, that is, it does not have full rank, if and only if it has at least one zero root.
A.6.6 CONDITION NUMBER OF A MATRIX
As the preceding might suggest, there is a discrete difference between full rank and short rank matrices. In analyzing data matrices such as the one in Section A.2, however, we shall often encounter cases in which a matrix is not quite short ranked, because it has all nonzero roots, but it is close. That is, by some measure, we can come very close to being able to write one column as a linear combination of the others. This case is important; we shall examine it at length in our discussion of multicollinearity. Our definitions of rank and determinant will fail to indicate this possibility, but an alternative measure, the condition number, is designed for that purpose. Formally, the condition number for a square matrix A is
    condition number = [ maximum root / minimum root ]^{1/2}.    (A-88)
For nonsquare matrices X, such as the data matrix in the example, we use A = X'X. As a further refinement, because the characteristic roots are affected by the scaling of the columns of X, we scale the columns to have length 1 by dividing each column by its norm [see (A-55)]. For the X in Section A.2, the largest characteristic root of A is 4.9255 and the smallest is 0.0001543. Therefore, the condition number is 178.67, which is extremely large. (Values greater than 20 are large.) That the smallest root is close to zero compared with the largest means that this matrix is nearly singular. Matrices with large condition numbers are difficult to invert accurately.
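The following NumPy sketch (with artificially near-collinear data rather than the Table A.1 matrix) reproduces the computation described above: scale the columns of X to unit length, then take the square root of the ratio of the largest to the smallest characteristic root of X'X.

```python
import numpy as np

rng = np.random.default_rng(4)
x1 = rng.normal(size=50)
x2 = x1 + 1e-4 * rng.normal(size=50)          # nearly a copy of x1: near-collinear columns
X = np.column_stack([np.ones(50), x1, x2])

Xs = X / np.linalg.norm(X, axis=0)            # scale each column to unit length
roots = np.linalg.eigvalsh(Xs.T @ Xs)         # characteristic roots of the scaled X'X
gamma = np.sqrt(roots.max() / roots.min())    # condition number, (A-88)
print(gamma)                                  # very large: the matrix is nearly singular
```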
A.6.7 TRACE OF A MATRIX
The trace of a square K × K matrix is the sum of its diagonal elements:
    tr(A) = Σ_{k=1}^K a_kk.
Some easily proven results are
    tr(cA) = c(tr(A)),    (A-89)
    tr(A') = tr(A),    (A-90)
    tr(A + B) = tr(A) + tr(B),    (A-91)
    tr(I_K) = K,    (A-92)
    tr(AB) = tr(BA).    (A-93)
    a'a = tr(a'a) = tr(aa'),
    tr(A'A) = Σ_{k=1}^K a_k'a_k = Σ_{i=1}^n Σ_{k=1}^K a_ik².
The permutation rule can be extended to any cyclic permutation in a product:
    tr(ABCD) = tr(BCDA) = tr(CDAB) = tr(DABC).    (A-94)
The characteristic roots of A^r are the rth power of those of A, and the characteristic vectors are the same.
If A is only nonnegative definite, that is, has roots that are either zero or positive, then (A-105) holds only for nonnegative r.
A.6.10 IDEMPOTENT MATRICES
Idempotent matrices are equal to their squares [see (A-37) to (A-39)]. In view of their importance in econometrics, we collect a few results related to idempotent matrices at this point. First, (A-101) implies that if λ is a characteristic root of an idempotent matrix, then λ = λ^K for all nonnegative integers K. As such, if A is a symmetric idempotent matrix, then all its roots are one or zero. Assume that all the roots of A are one. Then Λ = I, and A = CΛC' = CIC' = CC' = I. If the roots are not all one, then one or more are zero. Consequently, we have the following results for symmetric idempotent matrices:⁹
• The only full rank, symmetric idempotent matrix is the identity matrix I.    (A-106)
• All symmetric idempotent matrices except the identity matrix are singular.    (A-107)
The final result on idempotent matrices is obtained by observing that the count of the nonzero roots of A is also equal to their sum. By combining Theorems A.5 and A.7 with the result that for an idempotent matrix, the roots are all zero or one, we obtain this result:
• The rank of a symmetric idempotent matrix is equal to its trace.    (A-108)
A.6.11 FACTORING A MATRIX
In some applications, we shall require a matrix P such that
    P'P = A⁻¹.
One choice is
    P = Λ^{-1/2} C',
so that
    P'P = (C')'(Λ^{-1/2})' Λ^{-1/2} C' = C Λ⁻¹ C',
as desired.¹⁰ Thus, the spectral decomposition of A⁻¹, A⁻¹ = C Λ⁻¹ C', is a useful result for this kind of computation.
The Cholesky factorization of a symmetric positive definite matrix is an alternative representation that is useful in regression analysis. Any symmetric positive definite matrix A may be written as the product of a lower triangular matrix L and its transpose (which is an upper triangular matrix) L' = U. Thus, A = LU. This result is the Cholesky decomposition of A. The square roots of the diagonal elements of L, d_i, are the Cholesky values of A. By arraying these in a diagonal matrix D, we may also write A = LD⁻¹D²D⁻¹U = L*D²U*, which is similar to the spectral decomposition in (A-85). The usefulness of this formulation arises when the inverse of A is required.
⁹ Not all idempotent matrices are symmetric. We shall not encounter any asymmetric ones in our work, however.
¹⁰ We say that this is "one" choice because if A is symmetric, as it will be in all our applications, there are other candidates. The reader can easily verify that CΛ^{-1/2}C' = A^{-1/2} works as well.
Once L is computed, finding A⁻¹ = U⁻¹L⁻¹ is also straightforward as well as extremely fast and accurate. Most recently developed econometric software packages use this technique for inverting positive definite matrices.
A third type of decomposition of a matrix is useful for numerical analysis when the inverse is difficult to obtain because the columns of A are "nearly" collinear. Any n × K matrix A for which n ≥ K can be written in the form A = UWV', where U is an orthogonal n × K matrix, that is, U'U = I_K; W is a K × K diagonal matrix such that w_i ≥ 0; and V is a K × K matrix such that V'V = I_K. This result is called the singular value decomposition (SVD) of A, and the w_i are the singular values of A.¹¹ (Note that if A is square, then the spectral decomposition is a singular value decomposition.) As with the Cholesky decomposition, the usefulness of the SVD arises in inversion, in this case, of A'A. By multiplying it out, we obtain that (A'A)⁻¹ is simply VW⁻²V'. Once the SVD of A is computed, the inversion is trivial. The other advantage of this format is its numerical stability, which is discussed at length in Press et al. (1986).
Press et al. (1986) recommend the SVD approach as the method of choice for solving least squares problems because of its accuracy and numerical stability. A commonly used alternative method similar to the SVD approach is the QR decomposition. Any n × K matrix, X, with n ≥ K can be written in the form X = QR in which the columns of Q are orthonormal (Q'Q = I) and R is an upper triangular matrix. Decomposing X in this fashion allows an extremely accurate solution to the least squares problem that does not involve inversion or direct solution of the normal equations. Press et al. suggest that this method may have problems with rounding errors in problems when X is nearly of short rank, but based on other published results, this concern seems relatively minor.¹²
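A brief NumPy sketch of the two factorizations discussed above, the Cholesky decomposition A = LL' and a QR-based least squares solution that avoids forming the normal equations (random data, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)

# Cholesky: A = L L' for the positive definite matrix A = X'X.
A = X.T @ X
L = np.linalg.cholesky(A)
print(np.allclose(L @ L.T, A))

# QR least squares: X = QR, then solve R b = Q'y (R is upper triangular;
# a general solver is used here for brevity instead of back-substitution).
Q, R = np.linalg.qr(X)
b_qr = np.linalg.solve(R, Q.T @ y)
b_ne = np.linalg.solve(A, X.T @ y)        # normal-equations solution, for comparison
print(np.allclose(b_qr, b_ne))            # True
```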
A.6.12 THE GENERALIZED INVERSE OF A MATRIX
Inverse matrices are fundamental in econometrics. Although we shall not require them much in our treatment in this book, there are more general forms of inverse matrices than we have considered thus far. A generalized inverse of a matrix A is another matrix A⁺ that satisfies the following requirements:
1. AA⁺A = A.
2. A⁺AA⁺ = A⁺.
3. A⁺A is symmetric.
4. AA⁺ is symmetric.
A unique A⁺ can be found for any matrix, whether A is singular or not, or even if A is not square.¹³ The unique matrix that satisfies all four requirements is called the Moore–Penrose inverse or pseudoinverse of A. If A happens to be square and nonsingular, then the generalized inverse will be the familiar ordinary inverse. But if A⁻¹ does not exist, then A⁺ can still be computed.
An important special case is the overdetermined system of equations
    Ab = y,
¹¹ Discussion of the singular value decomposition (and listings of computer programs for the computations) may be found in Press et al. (1986).
¹² The National Institute of Standards and Technology (NIST) has published a suite of benchmark problems that test the accuracy of least squares computations (http://www.nist.gov/itl/div898/strd). Using these problems, which include some extremely difficult, ill-conditioned data sets, we found that the QR method would reproduce all the NIST certified solutions to 15 digits of accuracy, which suggests that the QR method should be satisfactory for all but the worst problems.
¹³ A proof of uniqueness, with several other results, may be found in Theil (1983).
where A has n rows, K < n columns, and column rank equal to R ≤ K. Suppose that R equals K, so that (A'A)⁻¹ exists. Then the Moore–Penrose inverse of A is
    A⁺ = (A'A)⁻¹A',
which can be verified by multiplication. A "solution" to the system of equations can be written
    b = A⁺y.
This is the vector that minimizes the length of Ab − y. Recall this was the solution to the least squares problem obtained in Section A.4.4. If y lies in the column space of A, this vector will be zero, but otherwise, it will not.
Now suppose that A does not have full rank. The previous solution cannot be computed. An alternative solution can be obtained, however. We continue to use the matrix A'A. In the spectral decomposition of Section A.6.4, if A has rank R, then there are R terms in the summation in (A-85). In (A-102), the spectral decomposition using the reciprocals of the characteristic roots is used to compute the inverse. To compute the Moore–Penrose inverse, we apply this calculation to A'A, using only the nonzero roots, then postmultiply the result by A'. Let C₁ be the R characteristic vectors corresponding to the nonzero roots, and array the nonzero roots in the diagonal matrix Λ₁. Then the Moore–Penrose inverse is
    A⁺ = C₁Λ₁⁻¹C₁'A',
which is very similar to the previous result.
If A is a symmetric matrix with rank R < K, the Moore–Penrose inverse is computed precisely as in the preceding equation without postmultiplying by A'. Thus, for a symmetric matrix A,
    A⁺ = C₁Λ₁⁻¹C₁',
where Λ₁⁻¹ is a diagonal matrix containing the reciprocals of the nonzero roots of A.
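NumPy computes the Moore–Penrose inverse as np.linalg.pinv; the sketch below applies it to a short-rank matrix and verifies the four defining requirements listed above.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])      # third column = first + second, so A has short rank

Ap = np.linalg.pinv(A)               # Moore-Penrose pseudoinverse

print(np.allclose(A @ Ap @ A, A))            # 1. A A+ A = A
print(np.allclose(Ap @ A @ Ap, Ap))          # 2. A+ A A+ = A+
print(np.allclose((Ap @ A).T, Ap @ A))       # 3. A+ A is symmetric
print(np.allclose((A @ Ap).T, A @ Ap))       # 4. A A+ is symmetric
```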
A.7 QUADRATIC FORMS AND DEFINITE MATRICES
Many optimization problems involve double sums of the form
    q = Σ_{i=1}^n Σ_{j=1}^n x_i x_j a_ij.    (A-109)
This quadratic form can be written
    q = x'Ax,
where A is a symmetric matrix. In general, q may be positive, negative, or zero; it depends on A and x. There are some matrices, however, for which q will be positive regardless of x, and others for which q will always be negative (or nonnegative or nonpositive). For a given matrix A,
1. If x'Ax > (<) 0 for all nonzero x, then A is positive (negative) definite.
2. If x'Ax ≥ (≤) 0 for all nonzero x, then A is nonnegative definite or positive semidefinite (nonpositive definite).
It might seem that it would be impossible to check a matrix for definiteness, since x can be chosen arbitrarily. But we have already used the set of results necessary to do so. Recall that a
In order to establish this intuitive result, we would make use of the following, which is proved in Goldberger (1964, Chapter 2):

THEOREM A.12 Ordering for Positive Definite Matrices
If A and B are two positive definite matrices with the same dimensions and if every characteristic root of A is larger than (at least as large as) the corresponding characteristic root of B when both sets of roots are ordered from largest to smallest, then A − B is positive (nonnegative) definite.

The roots of the inverse are the reciprocals of the roots of the original matrix, so the theorem can be applied to the inverse matrices.
A.8 CALCULUS AND MATRIX ALGEBRA¹⁴
A.8.1 DIFFERENTIATION AND THE TAYLOR SERIES
A variable y is a function of another variable x written
    y = f(x),  y = g(x),  y = y(x),
and so on, if each value of x is associated with a single value of y. In this relationship, y and x are sometimes labeled the dependent variable and the independent variable, respectively. Assuming that the function f(x) is continuous and differentiable, we obtain the following derivatives:
    f'(x) = dy/dx,   f''(x) = d²y/dx²,
and so on.
A frequent use of the derivatives of f(x) is in the Taylor series approximation. A Taylor series is a polynomial approximation to f(x). Letting x⁰ be an arbitrarily chosen expansion point,
    f(x) ≈ f(x⁰) + Σ_i (1/i!) [dⁱf(x⁰)/d(x⁰)ⁱ] (x − x⁰)ⁱ.    (A-121)
The choice of the number of terms is arbitrary; the more that are used, the more accurate the approximation will be. The approximation used most frequently in econometrics is the linear approximation,
    f(x) ≈ α + βx,    (A-122)
where, by collecting terms in (A-121), α = [f(x⁰) − f'(x⁰)x⁰] and β = f'(x⁰). The superscript "0" indicates that the function is evaluated at x⁰. The quadratic approximation is
    f(x) ≈ α + βx + γx²,    (A-123)
where
    α = [f⁰ − f'⁰x⁰ + ½f''⁰(x⁰)²],   β = [f'⁰ − f''⁰x⁰],   and   γ = ½f''⁰.
¹⁴ For a complete exposition, see Magnus and Neudecker (1988).
We can regard a function y = f(x₁, x₂, ..., x_n) as a scalar-valued function of a vector; that is, y = f(x). The vector of partial derivatives, or gradient vector, or simply gradient, is
    ∂y/∂x = [ ∂y/∂x₁ ]     [ f₁ ]
            [ ∂y/∂x₂ ]  =  [ f₂ ]
            [    ⋮   ]     [  ⋮ ]
            [ ∂y/∂x_n ]    [ f_n ].    (A-124)
The vector g(x) or g is used to represent the gradient. Notice that it is a column vector. The shape of the derivative is determined by the denominator of the derivative.
A second derivatives matrix or Hessian is computed as
    H = [ ∂²y/∂x₁∂x₁    ∂²y/∂x₁∂x₂    ···   ∂²y/∂x₁∂x_n ]
        [ ∂²y/∂x₂∂x₁    ∂²y/∂x₂∂x₂    ···   ∂²y/∂x₂∂x_n ]
        [      ⋮              ⋮                   ⋮      ]
        [ ∂²y/∂x_n∂x₁   ∂²y/∂x_n∂x₂   ···   ∂²y/∂x_n∂x_n ]  =  [f_ij].    (A-125)
In general, H is a square, symmetric matrix. (The symmetry is obtained for continuous and continuously differentiable functions from Young's theorem.) Each column of H is the derivative of g with respect to the corresponding variable in x'. Therefore,
    H = [ ∂(∂y/∂x)/∂x₁   ∂(∂y/∂x)/∂x₂   ···   ∂(∂y/∂x)/∂x_n ] = ∂(∂y/∂x)/∂x' = ∂²y/∂x∂x'.
The first-order, or linear Taylor series approximation is
    y ≈ f(x⁰) + Σ_{i=1}^n f_i(x⁰)(x_i − x_i⁰).    (A-126)
The right-hand side is
    f(x⁰) + [∂f(x⁰)/∂x⁰]'(x − x⁰) = [f(x⁰) − g(x⁰)'x⁰] + g(x⁰)'x = [f⁰ − g⁰'x⁰] + g⁰'x.
This produces the linear approximation,
    y ≈ α + β'x.
The second-order, or quadratic, approximation adds the second-order terms in the expansion,
    ½ Σ_{i=1}^n Σ_{j=1}^n f_ij⁰ (x_i − x_i⁰)(x_j − x_j⁰) = ½ (x − x⁰)'H⁰(x − x⁰),
to the preceding one. Collecting terms in the same manner as in (A-126), we have
    y ≈ α + β'x + ½ x'Γx,    (A-127)
where
    α = f⁰ − g⁰'x⁰ + ½ x⁰'H⁰x⁰,   β = g⁰ − H⁰x⁰,   and   Γ = H⁰.
A linear function can be written
    y = a'x = x'a = Σ_{i=1}^n a_i x_i,
so
    ∂(a'x)/∂x = a.    (A-128)
Note, in particular, that ∂(a'x)/∂x = a, not a'. In a set of linear functions
    y = Ax,
each element y_i of y is
    y_i = a_i'x,
where a_i' is the ith row of A [see (A-14)]. Therefore,
    ∂y_i/∂x = a_i = transpose of the ith row of A,
and
    [ ∂y₁/∂x' ]     [ a₁' ]
    [ ∂y₂/∂x' ]  =  [ a₂' ]
    [    ⋮    ]     [  ⋮  ]
    [ ∂y_n/∂x' ]    [ a_n' ].
Collecting all terms, we find that ∂Ax/∂x' = A, whereas the more familiar form will be
    ∂Ax/∂x = A'.    (A-129)
A quadratic form is written
    x'Ax = Σ_{i=1}^n Σ_{j=1}^n x_i x_j a_ij.    (A-130)
For example,
    A = [ 1  3 ]
        [ 3  4 ],
so that
    x'Ax = x₁² + 4x₂² + 6x₁x₂.
Then
    ∂(x'Ax)/∂x = [ 2x₁ + 6x₂ ]     [ 2  6 ] [ x₁ ]
                 [ 6x₁ + 8x₂ ]  =  [ 6  8 ] [ x₂ ]  =  2Ax,    (A-131)
which is the general result when A is a symmetric matrix. If A is not symmetric, then
    ∂(x'Ax)/∂x = (A + A')x.    (A-132)
Referring to the preceding double summation, we find that for each term, the coefficient on a_ij is x_i x_j. Therefore,
    ∂(x'Ax)/∂a_ij = x_i x_j.
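The derivative rule (A-131) can be checked numerically; the sketch below uses the symmetric example matrix above and compares a finite-difference gradient of x'Ax with 2Ax.

```python
import numpy as np

A = np.array([[1.0, 3.0],
              [3.0, 4.0]])                 # the symmetric example matrix above
x = np.array([0.7, -1.2])

def q(v):
    return v @ A @ v                       # the quadratic form x'Ax

# Finite-difference gradient versus the analytic result 2Ax, (A-131).
h = 1e-6
grad_fd = np.array([(q(x + h * e) - q(x - h * e)) / (2 * h) for e in np.eye(2)])
print(grad_fd)                             # approximately equal to ...
print(2 * A @ x)
```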
A.8.3 CONSTRAINED OPTIMIZATION
It is often necessary to solve an optimization problem subject to some constraints on the solution. One method is merely to "solve out" the constraints. For example, in the maximization problem considered earlier, suppose that the constraint x₁ = x₂ − x₃ is imposed on the solution. For a single constraint such as this one, it is possible merely to substitute the right-hand side of this equation for x₁ in the objective function and solve the resulting problem as a function of the remaining two variables. For more general constraints, however, or when there is more than one constraint, the method of Lagrange multipliers provides a more straightforward method of solving the problem. We
    maximize_x f(x)   subject to   c₁(x) = 0,
                                   c₂(x) = 0,
                                   ···
                                   c_J(x) = 0.    (A-140)
The Lagrangean approach to this problem is to find the stationary points, that is, the points at which the derivatives are zero, of
    L*(x, λ) = f(x) + Σ_{j=1}^J λ_j c_j(x) = f(x) + λ'c(x).    (A-141)
The solutions satisfy the equations
    ∂L*/∂x = ∂f(x)/∂x + ∂λ'c(x)/∂x = 0   (n × 1),
    ∂L*/∂λ = c(x) = 0   (J × 1).    (A-142)
The second term in ∂L*/∂x is
    ∂λ'c(x)/∂x = ∂c(x)'λ/∂x = [∂c(x)'/∂x] λ = C'λ,    (A-143)
where C is the matrix of derivatives of the constraints with respect to x. The jth row of the J × n matrix C is the vector of derivatives of the jth constraint, c_j(x), with respect to x'. Upon collecting terms, the first-order conditions are
    ∂L*/∂x = ∂f(x)/∂x + C'λ = 0,    (A-144)
    ∂L*/∂λ = c(x) = 0.
There is one very important aspect of the constrained solution to consider. In the unconstrained solution, we have ∂f(x)/∂x = 0. From (A-144), we obtain, for a constrained solution,
    ∂f(x)/∂x = −C'λ,    (A-145)
which will not equal 0 unless λ = 0. This result has two important implications:
• The constrained solution cannot be superior to the unconstrained solution. This is implied by the nonzero gradient at the constrained solution. (That is, unless C = 0, which could happen if the constraints were nonlinear. But, even if so, the solution is still no better than the unconstrained optimum.)
• If the Lagrange multipliers are zero, then the constrained solution will equal the unconstrained solution.
To continue the example begun earlier, suppose that we add the following conditions:
    x₁ − x₂ + x₃ = 0,
    x₁ + x₂ + x₃ = 0.
To put this in the format of the general problem, write the constraints as c(x) = Cx = 0, where
    C = [ 1  −1  1 ]
        [ 1   1  1 ].
The Lagrangean function is
    R*(x, λ) = a'x − x'Ax + λ'Cx.
Note the dimensions and arrangement of the various parts. In particular, C is a 2 × 3 matrix, with one row for each constraint and one column for each variable in the objective function. The vector of Lagrange multipliers thus has two elements, one for each constraint. The necessary conditions are
    a − 2Ax + C'λ = 0   (three equations),    (A-146)
and
    Cx = 0   (two equations).
These may be combined in the single equation
    [ −2A  C' ] [ x ]     [ −a ]
    [  C   0  ] [ λ ]  =  [  0 ].
Using the partitioned inverse of (A-74) produces the solutions
    λ = −[CA⁻¹C']⁻¹CA⁻¹a    (A-147)
and
    x = ½ A⁻¹[I − C'(CA⁻¹C')⁻¹CA⁻¹]a.    (A-148)
The two results, (A-147) and (A-148), yield analytic solutions for λ and x. For the specific matrices and vectors of the example, these are λ = [−0.5  −7.5]' and the constrained solution vector x* = [1.5  0  −1.5]'. Note that in computing the solution to this sort of problem, it is not necessary to use the rather cumbersome form of (A-148). Once λ is obtained from (A-147), the solution can be inserted in (A-146) for a much simpler computation. The solution
    x = ½ A⁻¹a + ½ A⁻¹C'λ
suggests a useful result for the constrained optimum:
    constrained solution = unconstrained solution + [2A]⁻¹C'λ.    (A-149)
Finally, by inserting the two solutions in the original function, we find that R = 24.375 and R* = 2.25, which illustrates again that the constrained solution (in this maximization problem) is inferior to the unconstrained solution.
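The sketch below solves an equality-constrained problem of this form numerically with NumPy. The constraint matrix C is the one given above, but the a and A of the original example appear on an earlier page not reproduced here, so the values used are assumed for illustration only; the point is that the stacked first-order conditions reproduce (A-147) and (A-148).

```python
import numpy as np

# Assumed objective max a'x - x'Ax (illustrative a and A; not the text's values). C is from the example above.
a = np.array([5.0, 4.0, 2.0])
A = np.array([[2.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 3.0]])
C = np.array([[1.0, -1.0, 1.0],
              [1.0,  1.0, 1.0]])

# Stack the first-order conditions (A-146) with the constraints Cx = 0 and solve.
K = np.block([[-2 * A, C.T],
              [C, np.zeros((2, 2))]])
rhs = np.concatenate([-a, np.zeros(2)])
sol = np.linalg.solve(K, rhs)
x_c, lam = sol[:3], sol[3:]

# Compare with the analytic expressions (A-147) and (A-148).
Ai = np.linalg.inv(A)
lam_check = -np.linalg.inv(C @ Ai @ C.T) @ C @ Ai @ a
x_check = 0.5 * Ai @ (np.eye(3) - C.T @ np.linalg.inv(C @ Ai @ C.T) @ C @ Ai) @ a
print(np.allclose(lam, lam_check), np.allclose(x_c, x_check))   # True True
```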
A.8.4 TRANSFORMATIONS
If a function is strictly monotonic, then it is a one-to-one function. Each y is associated with exactly one value of x, and vice versa. In this case, an inverse function exists, which expresses x as a function of y, written
    y = f(x)
and
    x = f⁻¹(y).
An example is the inverse relationship between the log and the exponential functions.
The slope of the inverse function,
    J = dx/dy = df⁻¹(y)/dy = f⁻¹′(y),
is the Jacobian of the transformation from y to x. For example, if
    y = a + bx,
then
    x = −(a/b) + (1/b)y
is the inverse transformation and
    J = dx/dy = 1/b.
Looking ahead to the statistical application of this concept, we observe that if y = f(x) were vertical, then this would no longer be a functional relationship. The same x would be associated with more than one value of y. In this case, at this value of x, we would find that J = 0, indicating a singularity in the function.
If y is a column vector of functions, y = f(x), then
    J = ∂x/∂y' = [ ∂x₁/∂y₁    ∂x₁/∂y₂    ···   ∂x₁/∂y_n ]
                 [ ∂x₂/∂y₁    ∂x₂/∂y₂    ···   ∂x₂/∂y_n ]
                 [     ⋮           ⋮                 ⋮    ]
                 [ ∂x_n/∂y₁   ∂x_n/∂y₂   ···   ∂x_n/∂y_n ].
Consider the set of linear functions y = Ax = f(x). The inverse transformation is x = f⁻¹(y), which will be
    x = A⁻¹y,
if A is nonsingular. If A is singular, then there is no inverse transformation. Let J be the matrix of partial derivatives of the inverse functions:
    J = [ ∂x_i/∂y_j ].
The absolute value of the determinant of J,
    abs(|J|) = abs( det( [∂x/∂y'] ) ),
is the Jacobian determinant of the transformation from y to x. In the nonsingular case,
    abs(|J|) = abs(|A⁻¹|) = 1 / abs(|A|).
B.3 EXPECTATIONS OF A RANDOM VARIABLE
DEFINITION B.1 Mean of a Random Variable
The mean, or expected value, of a random variable is
    E[x] = Σ_x x f(x)         if x is discrete,
    E[x] = ∫_x x f(x) dx      if x is continuous.    (B-11)

The notation Σ_x or ∫_x, used henceforth, means the sum or integral over the entire range of values of x. The mean is usually denoted μ. It is a weighted average of the values taken by x, where the weights are the respective probabilities. It is not necessarily a value actually taken by the random variable. For example, the expected number of heads in one toss of a fair coin is ½.
Other measures of central tendency are the median, which is the value m such that Prob(X ≤ m) ≥ ½ and Prob(X ≥ m) ≥ ½, and the mode, which is the value of x at which f(x) takes its maximum. The first of these measures is more frequently used than the second. Loosely speaking, the median corresponds more closely than the mean to the middle of a distribution. It is unaffected by extreme values. In the discrete case, the modal value of x has the highest probability of occurring.
Let g(x) be a function of x. The function that gives the expected value of g(x) is denoted
    E[g(x)] = Σ_x g(x) Prob(X = x)    if X is discrete,
    E[g(x)] = ∫_x g(x) f(x) dx        if X is continuous.    (B-12)
If g(x) = a + bx for constants a and b, then
    E[a + bx] = a + bE[x].
An important case is the expected value of a constant a, which is just a.

DEFINITION B.2 Variance of a Random Variable
The variance of a random variable is
    Var[x] = E[(x − μ)²]
           = Σ_x (x − μ)² f(x)        if x is discrete,
           = ∫_x (x − μ)² f(x) dx     if x is continuous.    (B-13)

Var[x], which must be positive, is usually denoted σ². This function is a measure of the dispersion of a distribution. Computation of the variance is simplified by using the following important result:
    Var[x] = E[x²] − μ².    (B-14)
A convenient corollary to (B-14) is
    E[x²] = σ² + μ².    (B-15)
By inserting y = a + bx in (B-13) and expanding, we find that
    Var[a + bx] = b² Var[x],    (B-16)
which implies, for any constant a, that
    Var[a] = 0.    (B-17)
To describe a distribution, we usually use σ, the positive square root, which is the standard deviation of x. The standard deviation can be interpreted as having the same units of measurement as x and μ. For any random variable x and any positive constant k, the Chebychev inequality states that
    Prob(μ − kσ ≤ x ≤ μ + kσ) ≥ 1 − 1/k².    (B-18)
Two other measures often used to describe a probability distribution are
    skewness = E[(x − μ)³]
and
    kurtosis = E[(x − μ)⁴].
Skewness is a measure of the asymmetry of a distribution. For symmetric distributions,
    f(μ − x) = f(μ + x),
and
    skewness = 0.
For asymmetric distributions, the skewness will be positive if the "long tail" is in the positive direction. Kurtosis is a measure of the thickness of the tails of the distribution. A shorthand expression for other central moments is
    μ_r = E[(x − μ)^r].
Since μ_r tends to explode as r grows, the normalized measure, μ_r / σ^r, is often used for description. Two common measures are
    skewness coefficient = μ₃ / σ³
and
    degree of excess = μ₄ / σ⁴ − 3.
The second is based on the normal distribution, which has excess of zero.
For any two functions g₁(x) and g₂(x),
    E[g₁(x) + g₂(x)] = E[g₁(x)] + E[g₂(x)].    (B-19)
For the general case of a possibly nonlinear g(x),
    E[g(x)] = ∫_x g(x) f(x) dx    (B-20)
and
    Var[g(x)] = ∫_x (g(x) − E[g(x)])² f(x) dx.    (B-21)
(For convenience, we shall omit the equivalent definitions for discrete variables in the following discussion and use the integral to mean either integration or summation, whichever is appropriate.)
A device used to approximate E[g(x)] and Var[g(x)] is the linear Taylor series approximation:
    g(x) ≈ g*(x) = [g(x⁰) − g'(x⁰)x⁰] + g'(x⁰)x = β₁ + β₂x.    (B-22)
If the approximation is reasonably accurate, then the mean and variance of g*(x) will be approximately equal to the mean and variance of g(x). A natural choice for the expansion point is x⁰ = μ = E[x]. Inserting this value in (B-22) gives
    g(x) ≈ g*(x) = [g(μ) − g'(μ)μ] + g'(μ)x,    (B-23)
so that
    E[g(x)] ≈ g(μ)    (B-24)
and
    Var[g(x)] ≈ [g'(μ)]² Var[x].    (B-25)
A point to note in view of (B-22) to (B-24) is that E[g(x)] will generally not equal g(E[x]). For the special case in which g(x) is concave, that is, where g''(x) < 0, we know from Jensen's inequality that E[g(x)] ≤ g(E[x]). For example, E[log(x)] ≤ log(E[x]).
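A small simulation (NumPy, illustrative values only) of the approximations (B-24) and (B-25), using g(x) = ln x, which also shows the Jensen inequality noted above:

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma = 5.0, 0.3
x = rng.normal(mu, sigma, size=1_000_000)    # a variable fairly concentrated around its mean

g = np.log(x)                                # g(x) = ln x is concave, with g'(x) = 1/x
print(g.mean(), np.log(mu))                  # E[g(x)] is close to, but below, g(E[x]) (Jensen)
print(g.var(), (1.0 / mu) ** 2 * sigma**2)   # Var[g(x)] is close to [g'(mu)]^2 Var[x], (B-25)
```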
B.4 SOME SPECIFIC PROBABILITY
DISTRIBUTIONS
Certain experimental situations naturally give rise to specific probability distributions. In the majority of cases in economics, however, the distributions used are merely models of the observed phenomena. Although the normal distribution, which we shall discuss at length, is the mainstay of econometric research, economists have used a wide variety of other distributions. A few are discussed here.¹
B.4.1 THE NORMAL DISTRIBUTION
The general form of the normal distribution with mean μ and standard deviation σ is
    f(x | μ, σ²) = [1 / (σ√(2π))] e^{−(x − μ)²/(2σ²)}.    (B-26)
This result is usually denoted x ~ N[μ, σ²]. The standard notation x ~ f(x) is used to state that "x has probability distribution f(x)." Among the most useful properties of the normal distribution
¹ A much more complete listing appears in Maddala (1977, Chaps. 3 and 18) and in most mathematical statistics textbooks. See also Poirier (1995) and Stuart and Ord (1989). Another useful reference is Evans, Hastings, and Peacock (1993). Johnson et al. (1974, 1993, 1994, 1995, 1997) is an encyclopedic reference on the subject of statistical distributions.
[Figure: chi-squared[2] density.]
• If z is an N[0, 1] variable and x is χ²[n] and is independent of z, then the ratio
    t[n] = z / √(x/n)    (B-36)
has the t distribution with n degrees of freedom.
The t distribution has the same shape as the normal distribution but has thicker tails. Figure B.3 illustrates the t distributions with 3 and 10 degrees of freedom with the standard normal distribution. Two effects that can be seen in the figure are how the distribution changes as the degrees of freedom increases, and, overall, the similarity of the t distribution to the standard normal. This distribution is tabulated in the same manner as the chi-squared distribution, with several specific cutoff points corresponding to specified tail areas for various values of the degrees of freedom parameter.
Comparing (B-35) with m = 1 and (B-36), we see the useful relationship between the t and F distributions:
• If t ~ t[n], then t² ~ F[1, n].
If the numerator in (B-36) has a nonzero mean, then the random variable in (B-36) has a noncentral t distribution and its square has a noncentral F distribution. These distributions arise in the F tests of linear restrictions [see (6-6)] when the restrictions do not hold as follows:
1. Noncentral chi-squared distribution. If z has a normal distribution with mean μ and standard deviation 1, then the distribution of z² is noncentral chi-squared with parameters 1 and μ²/2.
   a. If z ~ N[μ, Σ] with J elements, then z'Σ⁻¹z has a noncentral chi-squared distribution with J degrees of freedom and noncentrality parameter μ'Σ⁻¹μ/2, which we denote χ²*[J, μ'Σ⁻¹μ/2].
   b. If z ~ N[μ, I] and M is an idempotent matrix with rank J, then z'Mz ~ χ²*[J, μ'Mμ/2].
[Figure B.3: Normal[0,1], t[3], and t[10] densities.]
2. Noncentral F distribution. If X₁ has a noncentral chi-squared distribution with noncentrality parameter λ and degrees of freedom n₁ and X₂ has a central chi-squared distribution with degrees of freedom n₂ and is independent of X₁, then
    F* = (X₁/n₁) / (X₂/n₂)
has a noncentral F distribution with parameters n₁, n₂, and λ.² Note that in each of these cases, the statistic and the distribution are the familiar ones, except that the effect of the nonzero mean, which induces the noncentrality, is to push the distribution to the right.
B.4.3 DISTRIBUTIONS WITH LARGE DEGREES OF FREEDOM
The chi-squared, t, and F distributions usually arise in connection with sums of sample observations. The degrees of freedom parameter in each case grows with the number of observations. We often deal with larger degrees of freedom than are shown in the tables. Thus, the standard tables are often inadequate. In all cases, however, there are limiting distributions that we can use when the degrees of freedom parameter grows large. The simplest case is the t distribution. The t distribution with infinite degrees of freedom is equivalent to the standard normal distribution. Beyond about 100 degrees of freedom, they are almost indistinguishable.
For degrees of freedom greater than 30, a reasonably good approximation for the distribution of the chi-squared variable x is
    z = (2x)^{1/2} − (2n − 1)^{1/2},    (B-37)
² The denominator chi-squared could also be noncentral, but we shall not use any statistics with doubly noncentral distributions.
which is approximately standard normally distributed. Thus,
    Prob[χ²[n] ≤ a] ≈ Φ[(2a)^{1/2} − (2n − 1)^{1/2}].
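A quick simulation check of (B-37) with NumPy (degrees of freedom and cutoff chosen arbitrarily):

```python
import numpy as np
from math import erf

rng = np.random.default_rng(7)
n = 40                                        # degrees of freedom, greater than 30
x = rng.chisquare(n, size=1_000_000)

z = np.sqrt(2 * x) - np.sqrt(2 * n - 1)       # the transformation in (B-37)
print(z.mean(), z.std())                      # close to 0 and 1

a = 45.0                                      # an arbitrary cutoff
normal_cdf = 0.5 * (1 + erf((np.sqrt(2 * a) - np.sqrt(2 * n - 1)) / np.sqrt(2)))
print((x <= a).mean(), normal_cdf)            # simulated probability vs. normal approximation
```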
As used in econometrics, the F distribution with a large denominator degrees of freedom is common. As n₂ becomes infinite, the denominator of F converges identically to one, so we can treat the variable
    x = mF    (B-38)
as a chi-squared variable with m degrees of freedom. Since the numerator degrees of freedom will typically be small, this approximation will suffice for the types of applications we are likely to encounter.³ If not, then the approximation given earlier for the chi-squared distribution can be applied to mF.
B.4.4 SIZE DISTRIBUTIONS: THE LOGNORMAL DISTRIBUTION
In modeling size distributions, such as the distribution of firm sizes in an industry or the distribution of income in a country, the lognormal distribution, denoted LN[μ, σ²], has been particularly useful.⁴
f(x) = [1/(xσ√(2π))] e^(−(1/2)[(ln x − μ)/σ]²),  x > 0.
A lognormal variable x has
E[x] = e^(μ + σ²/2)
and
Var[x] = e^(2μ + σ²)(e^(σ²) − 1).
The relation between the normal and lognormal distributions is
if y ~ LN[μ, σ²], then ln y ~ N[μ, σ²].
A useful result for transformations is given as follows:
If x has a lognormal distribution with mean θ and variance λ², then
ln x ~ N(μ, σ²), where μ = ln θ² − (1/2) ln(θ² + λ²) and σ² = ln(1 + λ²/θ²).
Since the normal distribution is preserved under linear transformation,
if y ~ LN[μ, σ²], then ln(yʳ) ~ N[rμ, r²σ²].
If y₁ and y₂ are independent lognormal variables with y₁ ~ LN[μ₁, σ₁²] and y₂ ~ LN[μ₂, σ₂²], then
y₁y₂ ~ LN[μ₁ + μ₂, σ₁² + σ₂²].
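The moment formulas above are easy to confirm numerically. The sketch below is illustrative only; the parameter values μ = 0.5 and σ = 0.8 are assumptions, and SciPy's lognormal uses s = σ and scale = e^μ.

import numpy as np
from scipy import stats

mu, sigma = 0.5, 0.8
rng = np.random.default_rng(1)
x = np.exp(rng.normal(mu, sigma, size=200_000))    # x = e^y with y ~ N[mu, sigma^2]

mean_formula = np.exp(mu + 0.5 * sigma**2)
var_formula = np.exp(2 * mu + sigma**2) * (np.exp(sigma**2) - 1)

print("E[x]  :", x.mean(), " formula:", mean_formula,
      " scipy:", stats.lognorm(s=sigma, scale=np.exp(mu)).mean())
print("Var[x]:", x.var(), " formula:", var_formula)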
³See Johnson and Kotz (1994) for other approximations.
⁴A study of applications of the lognormal distribution appears in Aitchison and Brown (1969).
[Figure: income distribution; upper panel shows the density, lower panel the relative frequencies of the reported categories.]
The simplest case is the first one. The probabilities associated with the new variable are computed according to the laws of probability. If y is derived from x and the function is one to one, then the probability that Y = y(x) equals the probability that X = x. If several values of x yield the same value of y, then Prob(Y = y) is the sum of the corresponding probabilities for x.
The second type of transformation is illustrated by the way individual data on income are typically obtained in a survey. Income in the population can be expected to be distributed according to some skewed, continuous distribution such as the one shown in the figure above.
Data are often reported categorically, as shown in the lower part of the figure. Thus, the random variable corresponding to observed income is a discrete transformation of the actual underlying continuous random variable. Suppose, for example, that the transformed variable y is the mean income in the respective interval. Then
Prob(Y = μ₁) = P(−∞ < X ≤ a),
Prob(Y = μ₂) = P(a < X ≤ b),
Prob(Y = μ₃) = P(b < X ≤ c),
and so on, which illustrates the general procedure.
If x is a continuous random variable with pdf f_x(x) and if y = g(x) is a continuous monotonic function of x, then the density of y is obtained by using the change of variable technique to find
the cdf of y:
Prob(y ≤ b) = ∫_−∞^b f_x(g⁻¹(y)) |dg⁻¹(y)/dy| dy.
This equation can now be written as
Prob(y ≤ b) = ∫_−∞^b f_y(y) dy.
Hence,
f_y(y) = f_x(g⁻¹(y)) |dg⁻¹(y)/dy|. (B-41)
To avoid the possibility of a negative pdf if g(x) is decreasing, we use the absolute value of the derivative in the previous expression. The term |dg⁻¹(y)/dy| must be nonzero for the density of y to be nonzero. In words, the probabilities associated with intervals in the range of y must be associated with intervals in the range of x. If the derivative is zero, the correspondence y = g(x) is vertical, and hence all values of y in the given range are associated with the same value of x. This single point must have probability zero.
One of the most useful applications of the preceding result is the linear transformation of a normally distributed variable. If x ~ N[μ, σ²], then the distribution of
y = (x − μ)/σ
is found using the result above. First, the derivative is obtained from the inverse transformation
x = σy + μ, so dx/dy = σ.
Therefore,
f_y(y) = [1/(σ√(2π))] e^(−(σy + μ − μ)²/(2σ²)) |σ| = [1/√(2π)] e^(−y²/2).
This is the density of a normally distributed variable with mean zero and standard deviation one, the standard normal. It is this result that makes it unnecessary to have separate tables for the different normal distributions that result from different means and variances.
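The change-of-variable formula (B-41) is easy to check numerically with any monotonic transformation. The sketch below is illustrative only: x is standard normal and y = e^x, so g⁻¹(y) = ln y and |dg⁻¹(y)/dy| = 1/y, and the result should reproduce the standard lognormal density.

import numpy as np
from scipy import stats

y = np.linspace(0.1, 5.0, 6)

# f_y(y) = f_x(g^{-1}(y)) * |d g^{-1}(y)/dy| with g(x) = exp(x)
fy_change_of_var = stats.norm.pdf(np.log(y)) * (1.0 / y)
fy_lognormal = stats.lognorm.pdf(y, s=1.0)        # standard lognormal (mu = 0, sigma = 1)

print(np.allclose(fy_change_of_var, fy_lognormal))   # True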
B.6 REPRESENTATIONS OF A PROBABILITY
DISTRIBUTION
The probability density function (pdf) is a natural and familiar way to formulate the distribution of a random variable. But there are many other functions that are used to identify or characterize a random variable, depending on the setting. In each of these cases, we can identify some other function of the random variable that has a one-to-one relationship with the density. We have already used one of these quite heavily in the preceding discussion. For a random variable which has density function f(x), the distribution function, or cdf, F(x), is an equally informative function that identifies the distribution; the relationship between f(x) and F(x) is defined in (B-6) for a discrete random variable and (B-8) for a continuous one. We now consider several other related functions.
For a continuous random variable, the survival function is S(x) = 1 − F(x) = Prob[X ≥ x]. This function is widely used in epidemiology, where x is time until some transition, such as recovery
from a disease. The hazard function for a random variable is
h(x) = f(x)/S(x) = f(x)/[1 − F(x)].
The hazard function is a conditional probability:
h(x) = lim (δ → 0) Prob(x ≤ X ≤ x + δ | X ≥ x)/δ.
Hazards have been used in econometrics in studying the duration of spells, or conditions, such as unemployment, strikes, time until business failure, and so on. The connection between the hazard and the other functions is h(x) = −d ln S(x)/dx. As an exercise, you might want to verify the interesting special case of h(x) = 1/λ, a constant; the only distribution that has this characteristic is the exponential distribution noted in Section B.4.5.
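The exponential special case is easy to verify numerically. The sketch below is illustrative only; the rate value 2.0 is assumed. It computes h(x) = f(x)/S(x) on a grid and shows that the ratio is the same constant everywhere.

import numpy as np
from scipy import stats

theta = 2.0                                # assumed exponential rate; mean = 1/theta
dist = stats.expon(scale=1.0 / theta)
x = np.linspace(0.1, 4.0, 8)

hazard = dist.pdf(x) / dist.sf(x)          # f(x)/S(x), with S(x) = 1 - F(x)
print(hazard)                              # constant, equal to theta at every x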
For the random variable X, with probability density function f(x), if the function
M(t) = E[e^(tx)]
exists, then it is the moment-generating function. Assuming the function exists, it can be shown that
dʳM(t)/dtʳ evaluated at t = 0 equals E[xʳ].
The moment-generating function, like the survival and the hazard functions, is a unique characterization of a probability distribution. When it exists, the moment-generating function has a one-to-one correspondence with the distribution. Thus, for example, if we begin with some random variable and find that a transformation of it has a particular MGF, then we may infer that the function of the random variable has the distribution associated with that MGF. A convenient application of this result is the MGF for the normal distribution. The MGF for the standard normal distribution is M_z(t) = e^(t²/2).
A useful feature of MGFs is the following:
If x and y are independent, then the MGF of x + y is M_x(t)M_y(t).
This result has been used to establish the contagion property of some distributions, that is, the property that sums of random variables with a given distribution have that same distribution. The normal distribution is a familiar example. This is usually not the case; it is true, however, for Poisson and chi-squared random variables.
One qualification of all of the preceding is that in order for these results to hold, the MGF must exist. It will for the distributions that we will encounter in our work, but in at least one important case, we cannot be sure of this. When computing sums of random variables which may have different distributions and whose specific distributions need not be so well behaved, it is likely that the MGF of the sum does not exist. However, the characteristic function,
φ(t) = E[e^(itx)],
will always exist, at least for relatively small t. The characteristic function is the device used to prove that certain sums of random variables converge to a normally distributed variable; that is, the characteristic function is a fundamental tool in proofs of the central limit theorem.
The sign of the covariance will indicate the direction of covariation of X and Y. Its magnitude depends on the scales of measurement, however. In view of this fact, a preferable measure is the correlation coefficient:
ρ[x, y] = σ_xy/(σ_x σ_y), (B-53)
where σ_x and σ_y are the standard deviations of x and y, respectively. The correlation coefficient has the same sign as the covariance but is always between −1 and 1 and is thus unaffected by any scaling of the variables.
Variables that are uncorrelated are not necessarily independent. For example, in the discrete distribution f(−1, 1) = f(0, 0) = f(1, 1) = 1/3, the correlation is zero, but f(1, 1) does not equal f_x(1)f_y(1) = (1/3)(2/3). An important exception is the joint normal distribution discussed subsequently, in which lack of correlation does imply independence.
Some general results regarding expectations in a joint distribution, which can be verified by applying the appropriate definitions, are
E[ax + by + c] = aE[x] + bE[y] + c, (B-54)
Var[ax + by + c] = a²Var[x] + b²Var[y] + 2ab Cov[x, y]
= Var[ax + by], (B-55)
and
Cov[ax + by, cx + dy] = ac Var[x] + bd Var[y] + (ad + bc)Cov[x, y]. (B-56)
If X and Y are uncorrelated, then
Var[x + y] = Var[x − y] = Var[x] + Var[y]. (B-57)
For any two functions g₁(x) and g₂(y), if x and y are independent, then
E[g₁(x)g₂(y)] = E[g₁(x)]E[g₂(y)]. (B-58)
B.7.4 DISTRIBUTION OF A FUNCTION OF BIVARIATE
RANDOM VARIABLES
The result for a function of a random variable in (B-41) must be modified for a joint distribution. Suppose that x₁ and x₂ have a joint distribution f(x₁, x₂) and that y₁ and y₂ are two monotonic functions of x₁ and x₂:
y₁ = y₁(x₁, x₂),
y₂ = y₂(x₁, x₂).
Since the functions are monotonic, the inverse transformations,
x₁ = x₁(y₁, y₂),
x₂ = x₂(y₁, y₂),
exist. The Jacobian of the transformations is the matrix of partial derivatives,
J = [∂x₁/∂y₁  ∂x₁/∂y₂ ; ∂x₂/∂y₁  ∂x₂/∂y₂] = [∂x/∂y'].
The joint distribution of y₁ and y₂ is
f_y(y₁, y₂) = f_x[x₁(y₁, y₂), x₂(y₁, y₂)] abs(|J|).
The determinant of the Jacobian must be nonzero for the transformation to exist. A zero determinant implies that the two transformations are functionally dependent.
Certainly the most common application of the preceding in econometrics is the linear transformation of a set of random variables. Suppose that x₁ and x₂ are independently distributed N[0, 1], and the transformations are
y₁ = α₁ + β₁₁x₁ + β₁₂x₂,
y₂ = α₂ + β₂₁x₁ + β₂₂x₂.
To obtain the joint distribution of y₁ and y₂, we first write the transformations as
y = a + Bx.
The inverse transformation is
x = B⁻¹(y − a),
so the absolute value of the determinant of the Jacobian is
abs|J| = abs|B⁻¹| = 1/abs|B|.
The joint distribution of x is the product of the marginal distributions since they are independent. Thus,
f_x(x₁, x₂) = (2π)⁻¹ e^(−(x₁² + x₂²)/2) = (2π)⁻¹ e^(−x'x/2).
Inserting the results for x(y) and J into f_x(x₁, x₂) gives
f_y(y₁, y₂) = (2π)⁻¹ [1/abs|B|] e^(−(y − a)'(BB')⁻¹(y − a)/2).
This bivariate normal distribution is the subject of Section B.9. Note that by formulating it as we did above, we can generalize easily to the multivariate case, that is, with an arbitrary number of variables.
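A numerical illustration of this linear transformation, with assumed values of a and B (they are not from the text): since x₁ and x₂ are independent N[0, 1], the transformed vector y = a + Bx should have mean a and covariance matrix BB'.

import numpy as np

rng = np.random.default_rng(2)
a = np.array([1.0, -0.5])                    # assumed constants
B = np.array([[2.0, 0.5],
              [0.3, 1.5]])

x = rng.standard_normal((500_000, 2))        # independent N[0, 1] pairs
y = a + x @ B.T                              # y = a + Bx, row by row

print("sample mean      :", y.mean(axis=0))  # close to a
print("sample covariance:\n", np.cov(y.T))   # close to B B'
print("B B' =\n", B @ B.T)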
Perhaps the more common situation is that in which it is necessary to find the distribution of one function of two (or more) random variables. A strategy that often works in this case is to form the joint distribution of the transformed variable and one of the original variables, then integrate (or sum) the latter out of the joint distribution to obtain the marginal distribution. Thus, to find the distribution of y₁(x₁, x₂), we might formulate
y₁ = y₁(x₁, x₂),
y₂ = x₂.
The absolute value of the determinant of the Jacobian would then be
abs|J| = abs| det [∂x₁/∂y₁  ∂x₁/∂y₂ ; 0  1] | = abs(∂x₁/∂y₁).
The density of y₁ would then be
f_y₁(y₁) = ∫_−∞^∞ f_x[x₁(y₁, y₂), y₂] abs|J| dy₂.
B.8 CONDITIONING IN A BIVARIATE DISTRIBUTION
Conditioning and the use of conditional distributions play a pivotal role in econometric modeling. We consider some general results for a bivariate distribution. (All these results can be extended directly to the multivariate case.)
In a bivariate distribution, there is a conditional distribution over y for each value of x. The conditional densities are
f(y | x) = f(x, y)/f_x(x), (B-59)
and
f(x | y) = f(x, y)/f_y(y).
It follows from (B-46) that:
If x and y are independent, then f(y | x) = f_y(y) and f(x | y) = f_x(x). (B-60)
The interpretation is that if the variables are independent, the probabilities of events relating to one variable are unrelated to the other. The definition of conditional densities implies the important result
f(x, y) = f(y | x) f_x(x) = f(x | y) f_y(y). (B-61)
B.8.1 REGRESSION: THE CONDITIONAL MEAN
A conditional mean is the mean of the conditional distribution and is defined by
E[y | x] = ∫_y y f(y | x) dy if y is continuous, and E[y | x] = Σ_y y f(y | x) if y is discrete. (B-62)
The conditional mean function E[y | x] is called the regression of y on x.
A random variable may always be written as
y = E[y | x] + (y − E[y | x])
= E[y | x] + ε.
THEOREM B.6 Linear Regression and Homoscedasticity
In a bivariate distribution, if E[y | x] = α + βx and if Var[y | x] is a constant, then
Var[y | x] = Var[y](1 − Corr²[y, x]) = σ_y²(1 − ρ²_xy). (B-71)
The proof is straightforward using Theorems B.2 to B.4.
B.8.4 THE ANALYSIS OF VARIANCE
The variance decomposition result implies that in a bivariate distribution, variation in y arises from two sources:
1. Variation because E[y | x] varies with x:
regression variance = Var_x[E[y | x]]. (B-72)
2. Variation because, in each conditional distribution, y varies around the conditional mean:
residual variance = E_x[Var[y | x]]. (B-73)
Thus,
Var[y] = regression variance + residual variance. (B-74)
In analyzing a regression, we shall usually be interested in which of the two parts of the total variance, Var[y], is the larger one. A natural measure is the ratio
coefficient of determination = regression variance / total variance. (B-75)
In the setting of a linear regression, (B-75) arises from another relationship that emphasizes the interpretation of the correlation coefficient.
If E[y | x] = α + βx, then the coefficient of determination = COD = ρ², (B-76)
where ρ² is the squared correlation between x and y. We conclude that the correlation coefficient (squared) is a measure of the proportion of the variance of y accounted for by variation in the mean of y given x. It is in this sense that correlation can be interpreted as a measure of linear association between two variables.
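A simulation sketch of the decomposition (illustrative only; the values of α, β, and the residual standard deviation are assumptions): regression variance plus residual variance should reproduce Var[y], and the coefficient of determination should equal ρ².

import numpy as np

rng = np.random.default_rng(3)
n = 200_000
alpha, beta, sigma_eps = 1.0, 0.8, 0.6       # assumed data-generating values

x = rng.standard_normal(n)
y = alpha + beta * x + sigma_eps * rng.standard_normal(n)   # E[y|x] = alpha + beta*x

regression_variance = np.var(alpha + beta * x)   # Var_x[ E[y|x] ]
residual_variance = sigma_eps**2                 # E_x[ Var[y|x] ], constant here
total = np.var(y)

print("regression + residual:", regression_variance + residual_variance)
print("Var[y]               :", total)
print("COD (ratio)          :", regression_variance / total)
print("rho^2                :", np.corrcoef(x, y)[0, 1] ** 2)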
B.9 THE BIVARIATE NORMAL DISTRIBUTION
A bivariate distribution that embodies many of the features described earlier is the bivariate normal, which is the joint distribution of two normally distributed variables. The density is
f(x, y) = [1/(2πσ_x σ_y √(1 − ρ²))] e^(−(1/2)[(ε_x² + ε_y² − 2ρε_x ε_y)/(1 − ρ²)]),
where ε_x = (x − μ_x)/σ_x and ε_y = (y − μ_y)/σ_y. (B-77)
The parameters μ_x, σ_x, μ_y, and σ_y are the means and standard deviations of the marginal distributions of x and y, respectively. The additional parameter ρ is the correlation between x and y. The covariance is
σ_xy = ρσ_x σ_y. (B-78)
The density is defined only if ρ is not 1 or −1, which in turn requires that the two variables not be linearly related. If x and y have a bivariate normal distribution, denoted
(x, y) ~ N₂[μ_x, μ_y, σ_x², σ_y², ρ],
then
• The marginal distributions are normal:
f_x(x) = N[μ_x, σ_x²],
f_y(y) = N[μ_y, σ_y²]. (B-79)
• The conditional distributions are normal:
f(y | x) = N[α + βx, σ_y²(1 − ρ²)], where α = μ_y − βμ_x and β = σ_xy/σ_x², (B-80)
and likewise for f(x | y).
• x and y are independent if and only if ρ = 0. The density factors into the product of the two marginal normal distributions if ρ = 0.
Two things to note about the conditional distributions beyond their normality are their linear regression functions and their constant conditional variances. The conditional variance is less than the unconditional variance, which is consistent with the results of the previous section.
B.10 MULTIVARIATE DISTRIBUTIONS
The extension of the results for bivariate distributions to more than two variables is direct. It is made much more convenient by using matrices and vectors. The term random vector applies to a vector whose elements are random variables. The joint density is f(x), whereas the cdf is
F(x) = ∫_−∞^(x_n) ··· ∫_−∞^(x_1) f(t) dt_1 ··· dt_n. (B-81)
Note that the cdf is an n-fold integral. The marginal distribution of any one (or more) of the n variables is obtained by integrating or summing over the other variables.
B.10.1 MOMENTS
The expected value of a vector or matrix is the vector or matrix of expected values. A mean vector is defined as
μ = [μ₁, μ₂, ..., μ_n]' = [E[x₁], E[x₂], ..., E[x_n]]' = E[x]. (B-82)
Define the matrix
(x − μ)(x − μ)' = the n × n matrix whose (i, j) element is (x_i − μ_i)(x_j − μ_j).
The expected value of each element in the matrix is the covariance of the two variables in the product. (The covariance of a variable with itself is its variance.) Thus,
E[(x − μ)(x − μ)'] =
[ σ₁₁  σ₁₂  ···  σ₁ₙ
  σ₂₁  σ₂₂  ···  σ₂ₙ
  ···
  σₙ₁  σₙ₂  ···  σₙₙ ] = E[xx'] − μμ', (B-83)
which is the covariance matrix of the random vector x. Henceforth, we shall denote the covariance matrix of a random vector in boldface, as in
Var[x] = Σ.
By dividing σ_ij by σ_i σ_j, we obtain the correlation matrix:
R =
[ 1    ρ₁₂  ρ₁₃  ···  ρ₁ₙ
  ρ₂₁  1    ρ₂₃  ···  ρ₂ₙ
  ···
  ρₙ₁  ρₙ₂  ρₙ₃  ···  1 ].
B.10.2 SETS OF LINEAR FUNCTIONS
Our earlier results for the mean and variance of a linear function can be extended to the multivariate case. For the mean,
E[a₁x₁ + a₂x₂ + ··· + a_n x_n] = E[a'x]
= a₁E[x₁] + a₂E[x₂] + ··· + a_n E[x_n]
= a₁μ₁ + a₂μ₂ + ··· + a_n μ_n
= a'μ. (B-84)
For the variance,
Var[a'x] = E[(a'x − E[a'x])²]
= E[{a'(x − μ)}²]
= E[a'(x − μ)(x − μ)'a],
as E[a'x] = a'μ and a'x − a'μ = a'(x − μ). Since a is a vector of constants,
Var[a'x] = a'E[(x − μ)(x − μ)']a = a'Σa = Σ_i Σ_j a_i a_j σ_ij. (B-85)
THEOREM B.7 (Continued)
and
x₂ ~ N(μ₂, Σ₂₂). (B-101)
The conditional distribution of x₁ given x₂ is normal as well:
x₁ | x₂ ~ N(μ₁.₂, Σ₁₁.₂), (B-102)
where
μ₁.₂ = μ₁ + Σ₁₂Σ₂₂⁻¹(x₂ − μ₂), (B-102a)
Σ₁₁.₂ = Σ₁₁ − Σ₁₂Σ₂₂⁻¹Σ₂₁. (B-102b)
Proof: We partition μ and Σ as shown above and insert the parts in (B-95). To construct the density, we use (A-72) to partition the determinant,
|Σ| = |Σ₂₂| |Σ₁₁ − Σ₁₂Σ₂₂⁻¹Σ₂₁|,
and (A-74) to partition the inverse,
[Σ₁₁  Σ₁₂ ; Σ₂₁  Σ₂₂]⁻¹ = [ Σ₁₁.₂⁻¹   −Σ₁₁.₂⁻¹B ; −B'Σ₁₁.₂⁻¹   Σ₂₂⁻¹ + B'Σ₁₁.₂⁻¹B ],
where, for simplicity, we let
B = Σ₁₂Σ₂₂⁻¹.
Inserting these in (B-95) and collecting terms produces the joint density as a product of two terms:
f(x₁, x₂) = f₁.₂(x₁ | x₂) f₂(x₂).
The first of these is a normal distribution with mean μ₁.₂ and variance Σ₁₁.₂, whereas the second is the marginal distribution of x₂.
The conditional mean vector in the multivariate normal distribution is a linear function of the unconditional mean and the conditioning variables, and the conditional covariance matrix is constant and is smaller (in the sense discussed in Section A.7.3) than the unconditional covariance matrix. Notice that the conditional covariance matrix is the inverse of the upper left block of Σ⁻¹; that is, this matrix is of the form shown in (A-74) for the partitioned inverse of a matrix.
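The conditional-mean and conditional-variance formulas (B-102a) and (B-102b) translate directly into matrix code. The sketch below is illustrative only; the 3-variable mean vector, covariance matrix, and conditioning value are assumptions, with x₁ taken as the first variable and x₂ as the remaining two.

import numpy as np

mu = np.array([1.0, 0.0, 2.0])                     # assumed mean vector
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])                # assumed covariance matrix

mu1, mu2 = mu[:1], mu[1:]
S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]

x2 = np.array([0.5, 1.0])                          # assumed conditioning value

cond_mean = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)    # (B-102a)
cond_var = S11 - S12 @ np.linalg.solve(S22, S21)          # (B-102b)

print("E[x1 | x2]   =", cond_mean)
print("Var[x1 | x2] =", cond_var)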
B.11.2 THE CLASSICAL NORMAL LINEAR REGRESSION MODEL
An important special case of the preceding is that in which x₁ is a single variable, y, and x₂ is K variables, x. Then the conditional distribution is a multivariate version of that in (B-80) with β = Σ_xx⁻¹σ_xy, where σ_xy is the vector of covariances of y with x. Recall that any random variable, y, can be written as its mean plus the deviation from the mean. If we apply this tautology to the multivariate normal, we obtain
y = E[y | x] + (y − E[y | x]) = α + β'x + ε,
where β is given above, α = μ_y − β'μ_x, and ε has a normal distribution. We thus have, in this multivariate normal distribution, the classical normal linear regression model.
B.11.3 LINEAR FUNCTIONS OF A NORMAL VECTOR
Any linear function of a vector of joint normally distributed variables is also normally distributed. The mean vector and covariance matrix of Ax, where x is normally distributed, follow the general pattern given earlier. Thus,
If x ~ N[μ, Σ], then Ax + b ~ N[Aμ + b, AΣA']. (B-103)
If A does not have full rank, then AΣA' is singular and the density does not exist in the full dimensional space of x, though it does exist in the subspace of dimension equal to the rank of Σ. Nonetheless, the individual elements of Ax + b will still be normally distributed, and the joint distribution of the full vector is still a multivariate normal.
B.11.4 QUADRATIC FORMS IN A STANDARD NORMAL VECTOR
The earlier discussion of the chi-squared distribution gives the distribution of x'x if x has a standard normal distribution. It follows from (A-36) that
x'x = Σ_i x_i² = Σ_i (x_i − x̄)² + n x̄². (B-104)
We know from (B-32) that x'x has a chi-squared distribution. It seems natural, therefore, to invoke (B-34) for the two parts on the right-hand side of (B-104). It is not yet obvious, however, that either of the two terms has a chi-squared distribution or that the two terms are independent, as required. To show these conditions, it is necessary to derive the distributions of idempotent quadratic forms and to show when they are independent.
To begin, the second term is the square of √n x̄, which can easily be shown to have a standard normal distribution. Thus, the second term is the square of a standard normal variable and has a chi-squared distribution with one degree of freedom. But the first term is the sum of nonindependent variables, and it remains to be shown that the two terms are independent.
DEFINITION B.3 Orthonormal Quadratic Form
A particular case of (B-103) is the following:
If x ~ N[0, I] and C is a square matrix such that C'C = I, then C'x ~ N[0, I].
Consider, then, a quadratic form in a standard normal vector x with symmetric matrix A:
q = x'Ax. (B-105)
Let the characteristic roots and vectors of A be arranged in a diagonal matrix Λ and an orthogonal matrix C, as in Section A.6.3. Then
q = x'CΛC'x. (B-106)
By definition, C satisfies the requirement that C'C = I. Thus, the vector y = C'x has a standard
normal distribution. Consequently,
q = y'Λy = Σ_i λ_i y_i². (B-107)
If λ_i is always one or zero, then
q = Σ_j y_j², (B-108)
which has a chi-squared distribution. The sum is taken over the j = 1, ..., J elements associated with the roots that are equal to one. A matrix whose characteristic roots are all zero or one is idempotent. Therefore, we have proved the next theorem.
THEOREM B.8 Distribution of an Idempotent Quadratic Form in a Standard Normal Vector
If x ~ N[0, I] and A is idempotent, then x'Ax has a chi-squared distribution with degrees of freedom equal to the number of unit roots of A, which is equal to the rank of A.
The rank of a matrix is equal to the number of nonzero characteristic roots it has. Therefore, the degrees of freedom in the preceding chi-squared distribution equals J, the rank of A.
We can apply this result to the earlier sum of squares. The first term is
Σ_i (x_i − x̄)² = x'M⁰x,
where M⁰ was defined in (A-34) as the matrix that transforms data to mean deviation form:
M⁰ = I − (1/n) ii'.
Since M⁰ is idempotent, the sum of squared deviations from the mean has a chi-squared distribution. The degrees of freedom equals the rank of M⁰, which is not obvious except for the useful result in (A-108), that
• The rank of an idempotent matrix is equal to its trace. (B-109)
Each diagonal element of M⁰ is 1 − (1/n); hence, the trace is n[1 − (1/n)] = n − 1. Therefore, we have an application of Theorem B.8:
If x ~ N[0, I], then Σ_i (x_i − x̄)² ~ χ²[n − 1]. (B-110)
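A Monte Carlo sketch of (B-110) (illustrative only; the sample size n = 10 is assumed): for standard normal samples, the sum of squared deviations from the mean should behave like a chi-squared variable with n − 1 degrees of freedom.

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, reps = 10, 100_000

x = rng.standard_normal((reps, n))
q = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)   # x'M0x for each sample

print("simulated mean / variance:", q.mean(), q.var())       # theory: n-1 and 2(n-1)
print("chi2(n-1) mean / variance:", n - 1, 2 * (n - 1))
print("P(q <= 12) simulated:", (q <= 12).mean(),
      " exact:", stats.chi2.cdf(12, df=n - 1))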
We have already shown that the second term in (B-104) has a chi-squared distribution with one degree of freedom. It is instructive to set this up as a quadratic form as well:
x'[(1/n) ii']x = x'[jj']x, where j = (1/√n) i. (B-111)
The matrix in brackets is the outer product of a nonzero vector, which always has rank one. You can verify that it is idempotent by multiplication. Thus, x'x is the sum of two chi-squared variables,
THEOREM B.12 Independence of a Linear and a Quadratic Form
A linear function Lx and a symmetric idempotent quadratic form x'Ax in a standard normal vector are statistically independent if LA = 0.
The proof follows the same logic as that for two quadratic forms. Write x'Ax as x'A'Ax = (Ax)'(Ax). The covariance matrix of the variables Lx and Ax is LA = 0, which establishes the independence of these two random vectors. The independence of the linear function and the quadratic form follows since functions of independent random vectors are also independent.
The t distribution is defined as the ratio of a standard normal variable to the square root of a chi-squared variable divided by its degrees of freedom:
t[n] = N[0, 1] / {χ²[n]/n}^(1/2).
A particular case is
t[n − 1] = √n x̄ / {[1/(n − 1)] Σ_i (x_i − x̄)²}^(1/2) = √n x̄ / s,
where s is the standard deviation of the values of x. The distribution of the two variables in t[n − 1] was shown earlier; we need only show that they are independent. But
√n x̄ = (1/√n) i'x = j'x
and
s² = x'M⁰x / (n − 1).
It suffices to show that M⁰j = 0, which follows from
M⁰i = [I − (1/n)ii']i = i − (1/n)i(i'i) = 0.
APPENDIX C
ESTIMATION AND INFERENCE
C.1 INTRODUCTION
The probability distributions discussed in Appendix B serve as models for the underlying data-generating processes that produce our observed data. The goal of statistical inference in econometrics is to use the principles of mathematical statistics to combine these theoretical distributions and the observed data into an empirical model of the economy. This analysis takes place in one of two frameworks, classical or Bayesian.¹ The overwhelming majority of empirical study in econometrics
has been done in the classical framework. Our focus, therefore, will be on classical methods of inference. Bayesian methods are discussed in Chapter 16.
C.2 SAMPLES AND RANDOM SAMPLING
The classical theory of statistical inference centers on rules for using the sampled data effectively. These rules, in turn, are based on the properties of samples and sampling distributions.
A sample of n observations on one or more variables, denoted x₁, x₂, ..., x_n, is a random sample if the n observations are drawn independently from the same population, or probability distribution, f(x_i, θ). The sample may be univariate if x_i is a single random variable or multivariate if each observation contains several variables. A random sample of observations, denoted [x₁, x₂, ..., x_n] or {x_i}, i = 1, ..., n, is said to be independent, identically distributed, which we denote i.i.d. The vector θ contains one or more unknown parameters. Data are generally drawn in one of two settings. A cross section is a sample of a number of observational units all drawn at the same point in time. A time series is a set of observations drawn on the same observational unit at a number of (usually evenly spaced) points in time. Many recent studies have been based on time-series cross sections, which generally consist of the same cross-sectional units observed at several points in time. Since the typical data set of this sort consists of a large number of cross-sectional units observed at a few points in time, the common term panel data set is usually more fitting for this sort of study.
C.3 DESCRIPTIVE STATISTICS
Before attempting to estimate parameters of a population or fit models to data, we normally examine the data themselves. In raw form, the sample data are a disorganized mass of information, so we will need some organizing principles to distill the information into something meaningful. Consider, first, examining the data on a single variable. In most cases, and particularly if the number of observations in the sample is large, we shall use some summary statistics to describe the sample data. Of most interest are measures of location, that is, the center of the data, and scale, or the dispersion of the data. A few measures of central tendency are as follows:
median: M = middle ranked observation, (C-1)
sample midrange: midrange = (maximum + minimum)/2.
The dispersion of the sample observations is usually measured by the
standard deviation: s_x = [Σ_i (x_i − x̄)² / (n − 1)]^(1/2). (C-2)
Other measures, such as the average absolute deviation from the sample mean, are also used, although less frequently than the standard deviation. The shape of the distribution of values is
¹An excellent reference is Leamer (1978). A summary of the results as they apply to econometrics is contained in Zellner (1971) and in Judge et al. (1985). See, as well, Poirier (1991). A recent textbook with a heavy Bayesian emphasis is Poirier (1995).
often of interest as well. Samples of income or expenditure data, for example, tend to be highly skewed, while financial data such as asset returns and exchange rate movements are relatively more symmetrically distributed but are also more widely dispersed than other variables that might be observed. Two measures used to quantify these effects are the
skewness = Σ_i (x_i − x̄)³ / [s_x³ (n − 1)]   and   kurtosis = Σ_i (x_i − x̄)⁴ / [s_x⁴ (n − 1)].
Benchmark values for these two measures are zero for a symmetric distribution and three for one which is "normally" dispersed. The skewness coefficient has a bit less of the intuitive appeal of the mean and standard deviation, and the kurtosis measure has very little at all. The box and whisker plot is a graphical device which is often used to capture a large amount of information about the sample in a simple visual display. This plot shows in a figure the median, the range of values contained in the 25th and 75th percentile, some limits that show the normal range of values expected, such as the median plus and minus two standard deviations, and in isolation values that could be viewed as outliers. A box and whisker plot is shown in Figure C.1 for the income variable in Example C.1.
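The following sketch is illustrative only: it computes these summary statistics exactly as defined above for a simulated right-skewed sample (a chi-squared sample is an assumed stand-in for income-like data). Note that the divisors follow the text's (n − 1) and s_x normalization rather than the conventions built into some libraries.

import numpy as np

rng = np.random.default_rng(5)
x = rng.chisquare(df=3, size=1_000)      # assumed right-skewed sample

n = x.size
xbar = x.mean()
s = np.sqrt(((x - xbar) ** 2).sum() / (n - 1))          # standard deviation, (C-2)

skewness = ((x - xbar) ** 3).sum() / (s**3 * (n - 1))
kurtosis = ((x - xbar) ** 4).sum() / (s**4 * (n - 1))

print(f"mean = {xbar:.3f}, s = {s:.3f}, skewness = {skewness:.3f}, kurtosis = {kurtosis:.3f}")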
If the sample contains data on more than one variable, we will also be interested in measures of association among the variables. A scatter diagram is useful in a bivariate sample if the sample contains a reasonable number of observations. Figure C.1 shows an example for a small data set. If the sample is a multivariate one, then the degree of linear association among the variables can be measured by the pairwise measures
covariance: s_xy = Σ_i (x_i − x̄)(y_i − ȳ) / (n − 1),
correlation: r_xy = s_xy / (s_x s_y).
If the sample contains data on several variables, then it is sometimes convenient to arrange the covariances or correlations in a
covariance matrix: S = [s_ij], (C-4)
or
correlation matrix: R = [r_ij].
Some useful algebraic results for any two variables (x_i, y_i), i = 1, ..., n, and constants a and b are given in (C-5) through (C-8).
[Figure: Kernel density estimate for income; horizontal axis, income; vertical axis, density.]
Income Distribution
Range            Relative Frequency   Cumulative Frequency
<$10,000         0.15                 0.15
10,000-25,000    0.30                 0.45
25,000-50,000    0.40                 0.85
>50,000          0.15                 1.00
the population, although not perfectly. The precise manner in which these quantities reflect the population values defines the sampling distribution of a sample statistic.
DEFINITION C.1 Statistic
A statistic is any function computed from the data in a sample.
If another sample were drawn under identical conditions, different values would be obtained for the observations, as each one is a random variable. Any statistic is a function of these random values, so it is also a random variable with a probability distribution called a sampling distribution. For example, the following shows an exact result for the sampling behavior of a widely used statistic.
THEOREM C.1 Sampling Distribution of the Sample Mean
If x₁, ..., x_n are a random sample from a population with mean μ and variance σ², then x̄ is a random variable with mean μ and variance σ²/n.
Proof: x̄ = (1/n)Σ_i x_i. E[x̄] = (1/n)Σ_i μ = μ. The observations are independent, so Var[x̄] = (1/n)² Var[Σ_i x_i] = (1/n²)Σ_i σ² = σ²/n.
Example C.3 illustrates the behavior of the sample mean in samples of four observations drawn from a chi-squared population with one degree of freedom. The crucial concepts illustrated in this example are, first, the mean and variance results in Theorem C.1 and, second, the phenomenon of sampling variability.
Notice that the fundamental result in Theorem C.1 does not assume a distribution for x_i. Indeed, looking back at Section C.3, nothing we have done so far has required any assumption about a particular distribution.
Example C.3 Sampling Distribution of a Sample Mean
Figure C.3 shows a frequency plot of the means of 1,000 random samples of four observations
drawn from a chi-squared distribution with one degree of freedom, which has mean 1 and
variance 2.
We are often interested in how a statistic behaves as the sample size increases. Example C.4 illustrates one such case. Figure C.4 shows two sampling distributions, one based on samples of three and a second, of the same statistic, but based on samples of six. The effect of increasing sample size in this figure is unmistakable. It is easy to visualize the behavior of this statistic if we extrapolate the experiment in Example C.4 to samples of, say, 100.
Example C.4 Sampling Distribution of the Sample Minimum
If x₁, ..., x_n are a random sample from an exponential distribution with f(x) = θe^(−θx), then the sampling distribution of the sample minimum in a sample of n observations, denoted x_(1), is
f(x_(1)) = (nθ)e^(−nθ x_(1)).
Since E[x] = 1/θ and Var[x] = 1/θ², by analogy E[x_(1)] = 1/(nθ) and Var[x_(1)] = 1/(nθ)². Thus, in increasingly larger samples, the minimum will be arbitrarily close to 0. [The Chebychev inequality in Theorem D.2 can be used to prove this intuitively appealing result.]
Figure C.4 shows the results of a simple sampling experiment you can do to demonstrate this effect. It requires software that will allow you to produce pseudorandom numbers uniformly distributed in the range zero to one and that will let you plot a histogram and control the axes. (We used LimDep. This can be done with Stata, Excel, or several other packages.) The experiment consists of drawing 1,000 sets of nine random values, U_ij, i = 1, ..., 1,000, j = 1, ..., 9. To transform these uniform draws to exponential with parameter θ (we used θ = 1.5), use the inverse probability transform; see Section E.2.3. For an exponentially distributed variable, the transformation is z_ij = −(1/θ) log(1 − U_ij). We then created z_(1)|3 from the first three draws and z_(1)|6 from the other six. The two histograms show clearly the effect on the sampling distribution of increasing sample size from just 3 to 6.
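The experiment just described is straightforward to replicate in most languages. The sketch below is illustrative only; it uses θ = 1.5 as in the text, applies the inverse probability transform, and compares the minima of the first three and the remaining six draws with the theoretical values.

import numpy as np

rng = np.random.default_rng(6)
theta = 1.5
reps = 1_000

u = rng.uniform(size=(reps, 9))                # 1,000 sets of nine uniform draws
z = -(1.0 / theta) * np.log(1.0 - u)           # inverse probability transform to exponential

min3 = z[:, :3].min(axis=1)                    # sample minimum of the first three draws
min6 = z[:, 3:].min(axis=1)                    # sample minimum of the other six draws

# Theory: E[x_(1)] = 1/(n*theta), Var[x_(1)] = 1/(n*theta)^2
print("n = 3:", min3.mean(), " theory:", 1 / (3 * theta))
print("n = 6:", min6.mean(), " theory:", 1 / (6 * theta))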
Sampling distributions are used to make inferences about the population. To consider a perhaps obvious example, because the sampling distribution of the mean of a set of normally distributed observations has mean μ, the sample mean is a natural candidate for an estimate of μ. The observation that the sample "mimics" the population is a statement about the sampling
[Figure C.3: Distribution of means of 1,000 samples; mean = 0.9038, variance = 0.5637; horizontal axis, sample mean; vertical axis, frequency.]
DEFINITION C.4 Mean-Squared Error
The mean-squared error of an estimator is
MSE[θ̂ | θ] = E[(θ̂ − θ)²]
= Var[θ̂] + (Bias[θ̂ | θ])² if θ is a scalar, (C-9)
MSE[θ̂ | θ] = Var[θ̂] + Bias[θ̂ | θ]Bias[θ̂ | θ]' if θ is a vector.
Figure C.5 illustrates the effect. On average, the biased estimator will be closer to the true parameter than will the unbiased estimator.
Which of these criteria should be used in a given situation depends on the particulars of that setting and our objectives in the study. Unfortunately, the MSE criterion is rarely operational; minimum mean-squared error estimators, when they exist at all, usually depend on unknown parameters. Thus, we are usually less demanding. A commonly used criterion is minimum variance unbiasedness.
Example C.5 Mean-Squared Error of the Sample Variance
In sampling from a normal distribution, the most frequently used estimator for σ² is
s² = Σ_i (x_i − x̄)² / (n − 1).
It is straightforward to show that s² is unbiased, so
Var[s²] = 2σ⁴/(n − 1) = MSE[s² | σ²].
[Figure C.5: Sampling distributions of an unbiased and a biased estimator; horizontal axis, estimator; vertical axis, density.]
[A proof is based on the distribution of the idempotent quadratic form (x − iμ)'M⁰(x − iμ), which we discussed in Section B.11.4.] A less frequently used estimator is
σ̂² = (1/n) Σ_i (x_i − x̄)² = [(n − 1)/n] s².
This estimator is slightly biased downward:
E[σ̂²] = (n − 1)E[s²]/n = (n − 1)σ²/n,
so its bias is
E[σ̂² − σ²] = Bias[σ̂² | σ²] = −σ²/n.
But it has a smaller variance than s²:
Var[σ̂²] = [(n − 1)/n]² [2σ⁴/(n − 1)] < Var[s²].
To compare the two estimators, we can use the difference in their mean-squared errors:
MSE[σ̂² | σ²] − MSE[s² | σ²] = σ⁴ [(2n − 1)/n² − 2/(n − 1)] < 0.
The biased estimator is a bit more precise. The difference will be negligible in a large sample, but, for example, it is about 1.2 percent in a sample of 16.
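A simulation sketch of this comparison (illustrative only; n = 16 comes from the example, σ² = 1 is assumed): it estimates the MSE of s² and of σ̂² = [(n − 1)/n]s² and compares them with the analytic expressions.

import numpy as np

rng = np.random.default_rng(7)
n, sigma2, reps = 16, 1.0, 200_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
s2 = x.var(axis=1, ddof=1)                 # unbiased estimator s^2
s2_hat = (n - 1) / n * s2                  # biased estimator sigma-hat^2

mse_s2 = np.mean((s2 - sigma2) ** 2)
mse_hat = np.mean((s2_hat - sigma2) ** 2)

print("MSE[s^2]         :", mse_s2, "  theory:", 2 * sigma2**2 / (n - 1))
print("MSE[sigma-hat^2] :", mse_hat, " theory:", (2 * n - 1) * sigma2**2 / n**2)
print("difference (biased - unbiased):", mse_hat - mse_s2)    # negative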
C.5.2 EFFICIENT UNBIASED ESTIMATION
In a random sample of n observations, the density of each observation is f(x_i, θ). Since the n observations are independent, their joint density is
f(x₁, x₂, ..., x_n; θ) = f(x₁; θ) f(x₂; θ) ··· f(x_n; θ) = Π_i f(x_i; θ) = L(θ | x₁, x₂, ..., x_n). (C-10)
This function, denoted L(θ | X), is called the likelihood function for θ given the data X. It is frequently abbreviated to L(θ). Where no ambiguity can arise, we shall abbreviate it further to L.
Example C.6 Likelihood Functions for Exponential
and Normal Distributions
If x₁, ..., x_n are a sample of n observations from an exponential distribution with parameter θ, then
L(θ) = Π_i θe^(−θx_i) = θⁿ e^(−θ Σ_i x_i).
If x₁, ..., x_n are a sample of n observations from a normal distribution with mean μ and standard deviation σ, then
L(μ, σ) = Π_i (2πσ²)^(−1/2) e^(−(x_i − μ)²/(2σ²)) (C-11)
= (2πσ²)^(−n/2) e^(−Σ_i (x_i − μ)²/(2σ²)).
The likelihood function is the cornerstone for most of our theory of parameter estimation. An important result for efficient estimation is the following.
THEOREM C.2 Cramér-Rao Lower Bound
Assuming that the density of x satisfies certain regularity conditions, the variance of an unbiased estimator of a parameter θ will always be at least as large as
[I(θ)]⁻¹ = (−E[∂² ln L(θ)/∂θ²])⁻¹ = (E[(∂ ln L(θ)/∂θ)²])⁻¹. (C-12)
The quantity I(θ) is the information number for the sample. We will prove the result that the negative of the expected second derivative equals the expected square of the first derivative in Chapter 17. Proof of the main result of the theorem is quite involved. See, for example, Stuart and Ord (1989).
The regularity conditions are technical in nature. (See Section 17.4.1.) Loosely, they are conditions imposed on the density of the random variable that appears in the likelihood function; these conditions will ensure that the Lindeberg-Levy central limit theorem will apply to the sample of observations on the random vector y = ∂ ln f(x | θ)/∂θ. Among the conditions are finite moments of x up to order 3. An additional condition normally included in the set is that the range of the random variable be independent of the parameters.
In some cases, the second derivative of the log likelihood is a constant, so the Cramér-Rao bound is simple to obtain. For instance, in sampling from an exponential distribution, from Example C.6,
ln L = n ln θ − θ Σ_i x_i,
∂ ln L/∂θ = n/θ − Σ_i x_i,
so ∂² ln L/∂θ² = −n/θ² and the variance bound is [I(θ)]⁻¹ = θ²/n. In most situations, the second derivative is a random variable with a distribution of its own. The following examples show two such cases.
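A sketch (illustrative only; θ = 2 and n = 50 are assumed values) of the variance bound at work: setting the first derivative above to zero gives the maximum likelihood estimator θ̂ = 1/x̄, whose sampling variance should be close to, and in finite samples somewhat above, the bound θ²/n.

import numpy as np

rng = np.random.default_rng(8)
theta, n, reps = 2.0, 50, 100_000

x = rng.exponential(scale=1.0 / theta, size=(reps, n))
theta_hat = 1.0 / x.mean(axis=1)             # maximum likelihood estimator of theta

print("Var[theta-hat] (simulated):", theta_hat.var())
print("Cramer-Rao bound theta^2/n:", theta**2 / n)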
Example C.7 Variance Bound for the Poisson Distribution
For the Poisson distribution,
ln L = −nθ + ln θ Σ_i x_i − Σ_i ln(x_i!),
∂ ln L/∂θ = −n + (1/θ) Σ_i x_i,
∂² ln L/∂θ² = −(1/θ²) Σ_i x_i.
Example C.9 Estimated Confidence Intervals for a Normal Mean and Variance
In a sample of 25, x̄ = 1.63 and s = 0.51. Construct a 95 percent confidence interval for μ. Assuming that the sample of 25 is from a normal distribution,
Prob(−2.064 ≤ 5(x̄ − μ)/s ≤ 2.064) = 0.95,
where 2.064 is the critical value from a t distribution with 24 degrees of freedom. Thus, the confidence interval is 1.63 ± [2.064(0.51)/5] or [1.4195, 1.8405].
Remark: Had the parent distribution not been specified, it would have been natural to use the
standard normal distribution instead, perhaps relying on the central limit theorem. But a sam-
ple size of 25 is small enough that the more conservative t distribution might still be preferable.
The chi-squared distribution is used to construct a confidence interval for the variance of
a normal distribution. Using the data from Example 4.29, we find that the usual procedure
would use
Prob(12.4 ≤ 24s²/σ² ≤ 39.4) = 0.95,
where 12.4 and 39.4 are the 0.025 and 0.975 cutoff points from the chi-squared (24) distribu-
tion. This procedure leads to the 95 percent confidence interval [0.1581, 0.5032]. By making
use of the asymmetry of the distribution, a narrower interval can be constructed. Allocating 4
percent to the left-hand tail and 1 percent to the right instead of 2.5 percent to each, the two
cutoff points are 13.4 and 42.9, and the resulting 95 percent confidence interval is [0.1455,
0.4659].
Finally, the confidence interval can be manipulated to obtain a confidence interval for a function of a parameter. For example, based on the preceding, a 95 percent confidence interval for σ would be [√0.1581, √0.5032] = [0.3976, 0.7094].
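The critical values used above come straight from the t and chi-squared distributions. The sketch below (illustrative only, using the same x̄ = 1.63, s = 0.51, and n = 25 as the example) reproduces the two symmetric-tail intervals.

import numpy as np
from scipy import stats

n, xbar, s = 25, 1.63, 0.51

# 95 percent interval for the mean, t with n-1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)                       # about 2.064
mean_ci = (xbar - t_crit * s / np.sqrt(n), xbar + t_crit * s / np.sqrt(n))

# 95 percent interval for the variance, chi-squared with n-1 degrees of freedom
lo, hi = stats.chi2.ppf([0.025, 0.975], df=n - 1)           # about 12.4 and 39.4
var_ci = ((n - 1) * s**2 / hi, (n - 1) * s**2 / lo)

print("mean interval     :", mean_ci)      # about [1.4195, 1.8405]
print("variance interval :", var_ci)       # about [0.158, 0.503]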
C.7 HYPOTHESIS TESTING
The second major group of statistical inference procedures is hypothesis tests. The classical testing procedures are based on constructing a statistic from a random sample that will enable the analyst to decide, with reasonable confidence, whether or not the data in the sample would have been generated by a hypothesized population. The formal procedure involves a statement of the hypothesis, usually in terms of a "null" or maintained hypothesis and an "alternative," conventionally denoted H₀ and H₁, respectively. The procedure itself is a rule, stated in terms of the data, that dictates whether the null hypothesis should be rejected or not. For example, the hypothesis might state that a parameter is equal to a specified value. The decision rule might state that the hypothesis should be rejected if a sample estimate of that parameter is too far away from that value (where "far" remains to be defined). The classical, or Neyman-Pearson, methodology involves partitioning the sample space into two regions. If the observed data (i.e., the test statistic) fall in the rejection region (sometimes called the critical region), then the null hypothesis is rejected; if they fall in the acceptance region, then it is not.
C.7.1 CLASSICAL TESTING PROCEDURES
Since the sample is random, the test statistic, however defined, is also random. The same test procedure can lead to different conclusions in different samples. As such, there are two ways such a procedure can be in error:
1. Type I error. The procedure may lead to rejection of the null hypothesis when it is true.
2. Type II error. The procedure may fail to reject the null hypothesis when it is false.
To continue the previous example, there is some probability that the estimate of the parameter will be quite far from the hypothesized value, even if the hypothesis is true. This outcome might cause a type I error.
DEFINITION C.6 Size of a Test
The probability of a type I error is the size of the test. This is conventionally denoted α and is also called the significance level.
The size of the test is under the control of the analyst. It can be changed just by changing the decision rule. Indeed, the type I error could be eliminated altogether just by making the rejection region very small, but this would come at a cost. By eliminating the probability of a type I error, that is, by making it unlikely that the hypothesis is rejected, we must increase the probability of a type II error. Ideally, we would like both probabilities to be as small as possible. It is clear, however, that there is a tradeoff between the two. The best we can hope for is that for a given probability of type I error, the procedure we choose will have as small a probability of type II error as possible.
DEFINITION C.7 Power of a Test
The power of a test is the probability that it will correctly lead to rejection of a false null hypothesis:
power = 1 − β = 1 − Prob(type II error). (C-16)
For a given significance level α, we would like β to be as small as possible. Since β is defined in terms of the alternative hypothesis, it depends on the value of the parameter.
Example C.10 Testing a Hypothesis About a Mean
For testing H₀: μ = μ⁰ in a normal distribution with known variance σ², the decision rule is to reject the hypothesis if the absolute value of the z statistic, √n(x̄ − μ⁰)/σ, exceeds the predetermined critical value. For a test at the 5 percent significance level, we set the critical value at 1.96. The power of the test, therefore, is the probability that the absolute value of the test statistic will exceed 1.96 given that the true value of μ is, in fact, not μ⁰. This value depends on the alternative value of μ, as shown in Figure C.6. Notice that for this test the power is equal to the size at the point where μ equals μ⁰. As might be expected, the test becomes more powerful the farther the true mean is from the hypothesized value.
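The power curve described above can be computed directly. The sketch below is illustrative only; the values n = 25, σ = 1, and μ⁰ = 0 are assumptions, not from the text. Under the alternative, z = √n(x̄ − μ⁰)/σ is normal with mean d = √n(μ − μ⁰)/σ and unit variance, so the power is the probability that |z| exceeds 1.96.

import numpy as np
from scipy import stats

n, sigma, mu0 = 25, 1.0, 0.0        # assumed illustration values
crit = 1.96

def power(mu):
    d = np.sqrt(n) * (mu - mu0) / sigma        # mean of the z statistic under mu
    return stats.norm.sf(crit - d) + stats.norm.cdf(-crit - d)

for mu in [0.0, 0.1, 0.2, 0.4, 0.6]:
    print(f"mu = {mu:.1f}  power = {power(mu):.3f}")   # equals 0.05 (the size) at mu = mu0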
Testing procedures, like estimators, can be compared using a number of criteria.
DEFINITION C.8 Most Powerful Test
A test is most powerful if it has greater power than any other test of the same size.
This requirement is very strong. Since the power depends on the alternative hypothesis, we might require that the test be uniformly most powerful (UMP), that is, have greater power than any other test of the same size for all admissible values of the parameter. There are few situations in which a UMP test is available. We usually must be less stringent in our requirements. Nonetheless, the criteria for comparing hypothesis testing procedures are generally based on their respective power functions. A common and very modest requirement is that the test be unbiased.
DEFINITION C.9 Unbiased Test
A test is unbiased if its power (1 − β) is greater than or equal to its size α for all values of the parameter.
If a test is biased, then, for some values of the parameter, we are more likely to accept the null hypothesis when it is false than when it is true.
The use of the term unbiased here is unrelated to the concept of an unbiased estimator. Fortunately, there is little chance of confusion. Tests and estimators are clearly connected, however. The following criterion derives, in general, from the corresponding attribute of a parameter estimate.
DEFINITION C.10 Consistent Test
A test is consistent if its power goes to one as the sample size grows to infinity.
and results needed for this analysis. A few additional results will be developed in the discussion
of time series analysis later in the book.
D.2 LARGE-SAMPLE DISTRIBUTION THEORY¹
In most cases, whether an estimator is exactly unbiased or what its exact sampling variance is in samples of a given size will be unknown. But we may be able to obtain approximate results about the behavior of the distribution of an estimator as the sample becomes large. For example, it is well known that the distribution of the mean of a sample tends to approximate normality as the sample size grows, regardless of the distribution of the individual observations. Knowledge about the limiting behavior of the distribution of an estimator can be used to infer an approximate distribution for the estimator in a finite sample. To describe how this is done, it is necessary, first, to present some results on convergence of random variables.
D.2.1 CONVERGENCE IN PROBABILITY
Limiting arguments in this discussion will be with respect to the sample size n. Let x_n be a sequence random variable indexed by the sample size.
DEFINITION D.1 Convergence in Probability
The random variable x_n converges in probability to a constant c if
lim (n → ∞) Prob(|x_n − c| > ε) = 0 for any positive ε.
Convergence in probability implies that the values that the variable may take that are not close to c become increasingly unlikely as n increases. To consider one example, suppose that the random variable x_n takes two values, zero and n, with probabilities 1 − (1/n) and (1/n), respectively. As n increases, the second point will become ever more remote from any constant but, at the same time, will become increasingly less probable. In this example, x_n converges in probability to zero. The crux of this form of convergence is that all the mass of the probability distribution becomes concentrated at points close to c. If x_n converges in probability to c, then we write
plim x_n = c. (D-1)
We will make frequent use of a special case of convergence in probability, convergence in mean square or convergence in quadratic mean.
THEOREM D.1 Convergence in Quadratic Mean
If x_n has mean μ_n and variance σ_n² such that the ordinary limits of μ_n and σ_n² are c and 0, respectively, then x_n converges in mean square to c, and plim x_n = c.
¹A comprehensive summary of many results in large-sample theory appears in White (2001). The results discussed here will apply to samples of independent observations. Time-series cases in which observations are correlated are analyzed in Chapters 19 and 20.
A proof of Theorem D.1 can be based on another useful theorem.
THEOREM D.2 Chebychev's Inequality
If x_n is a random variable and c and ε are constants, then Prob(|x_n − c| > ε) ≤ E[(x_n − c)²]/ε².
To establish the Chebychev inequality, we use another result [see Goldberger (1991, p. 31)].
THEOREM D.3 Markov's Inequality
If y_n is a nonnegative random variable and δ is a positive constant, then
Prob[y_n ≥ δ] ≤ E[y_n]/δ.
Proof: E[y_n] = Prob[y_n < δ] E[y_n | y_n < δ] + Prob[y_n ≥ δ] E[y_n | y_n ≥ δ]. Since y_n is nonnegative, both terms must be nonnegative, so E[y_n] ≥ Prob[y_n ≥ δ] E[y_n | y_n ≥ δ]. Since E[y_n | y_n ≥ δ] must be greater than or equal to δ, E[y_n] ≥ Prob[y_n ≥ δ]δ, which is the result.
Now, to prove Theorem D.1, let y_n be (x_n − c)² and δ be ε² in Theorem D.3. Then, (x_n − c)² > δ implies that |x_n − c| > ε. Finally, we will use a special case of the Chebychev inequality, where c = μ_n, so that we have
Prob(|x_n − μ_n| > ε) ≤ σ_n²/ε². (D-2)
Taking the limits of μ_n and σ_n² in (D-2), we see that if
lim (n → ∞) E[x_n] = c and lim (n → ∞) Var[x_n] = 0, (D-3)
then
plim x_n = c.
We have shown that convergence in mean square implies convergence in probability. Mean-square convergence implies that the distribution of x_n collapses to a spike at plim x_n, as shown in Figure D.1.
Example D.1 Mean Square Convergence of the Sample Minimum
in Exponential Sampling
As noted in Example C.4, in sampling of n observations from an exponential distribution, for the sample minimum x_(1),
lim (n → ∞) E[x_(1)] = lim (n → ∞) 1/(nθ) = 0
and
lim (n → ∞) Var[x_(1)] = lim (n → ∞) 1/(nθ)² = 0.
Therefore,
plim x_(1) = 0.
Note, in particular, that the variance is divided by n². Thus, this estimator converges very rapidly to 0.
[Figure D.1: The distribution of x_n collapsing to a spike at plim x_n; horizontal axis, estimator; vertical axis, density.]
Convergence in probability does not imply convergence in mean square. Consider the simple example given earlier in which x_n equals either zero or n with probabilities 1 − (1/n) and (1/n). The exact expected value of x_n is 1 for all n, which is not the probability limit. Indeed, if we let Prob(x_n = n²) = (1/n) instead, the mean of the distribution explodes, but the probability limit is still zero. Again, the point x_n = n² becomes ever more extreme but, at the same time, becomes ever less likely.
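A simulation sketch of the two-point example (illustrative only; the number of replications is an arbitrary choice): as n grows, the probability that x_n differs from zero shrinks toward zero even though E[x_n] stays at 1.

import numpy as np

rng = np.random.default_rng(9)
reps = 200_000

for n in [10, 100, 1000, 10000]:
    # x_n = n with probability 1/n, and 0 otherwise
    x = np.where(rng.uniform(size=reps) < 1.0 / n, n, 0.0)
    print(f"n = {n:6d}  mean = {x.mean():.3f}  Prob(|x_n| > 0.5) = {(np.abs(x) > 0.5).mean():.4f}")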
The conditions for convergence in mean square are usually easier to verify than those for the more general form. Fortunately, we shall rarely encounter circumstances in which it will be necessary to show convergence in probability in which we cannot rely upon convergence in mean square. Our most frequent use of this concept will be in formulating consistent estimators.
DEFINITION D.2 Consistent Estimator
An estimator θ̂_n of a parameter θ is a consistent estimator of θ if and only if
plim θ̂_n = θ. (D-4)
THEOREM D.4 Consistency of the Sample Mean
The mean of a random sample from any population with finite mean μ and finite variance σ² is a consistent estimator of μ.
Proof: E[x̄_n] = μ and Var[x̄_n] = σ²/n. Therefore, x̄_n converges in mean square to μ, or plim x̄_n = μ.