APPENDIX A
MATRIX ALGEBRA
A.1 TERMINOLOGY
A matrix is a rectangular array of numbers, denoted
    A = [a_ik] = [A]_ik =  [ a_11  a_12  ···  a_1K ]
                           [ a_21  a_22  ···  a_2K ]
                           [   ⋮      ⋮           ⋮ ]
                           [ a_n1  a_n2  ···  a_nK ].    (A-1)
The typical element is used to denote the matrix. A subscripted element of a matrix is always read as a_{row,column}. An example is given in Table A.1. In these data, the rows are identified with years and the columns with particular variables.
A vector is an ordered set of numbers arranged either in a row or a column. In view of the preceding, a row vector is also a matrix with one row, whereas a column vector is a matrix with one column. Thus, in Table A.1, the five variables observed for 1972 (including the date) constitute a row vector, whereas the time series of nine values for consumption is a column vector.
A matrix can also be viewed as a set of column vectors or as a set of row vectors.¹ The dimensions of a matrix are the numbers of rows and columns it contains. "A is an n × K matrix" (read "n by K") will always mean that A has n rows and K columns. If n equals K, then A is a square matrix. Several particular types of square matrices occur frequently in econometrics.
• A symmetric matrix is one in which a_ik = a_ki for all i and k.
• A diagonal matrix is a square matrix whose only nonzero elements appear on the main diagonal, that is, moving from upper left to lower right.
• A scalar matrix is a diagonal matrix with the same value in all diagonal elements.
• An identity matrix is a scalar matrix with ones on the diagonal. This matrix is always denoted I. A subscript is sometimes included to indicate its size, or order. For example, I_n denotes an n × n identity matrix.
• A triangular matrix is one that has only zeros either above or below the main diagonal. If the zeros are above the diagonal, the matrix is lower triangular.
A.2 ALGEBRAIC MANIPULATION OF MATRICES
A.2.1 EQUALITY OF MATRICES
Matrices (or vectors) A and B are equal if and only if they have the same dimensions and each element of A equals the corresponding element of B. That is,
    A = B if and only if a_ik = b_ik for all i and k.    (A-2)
¹ Henceforth, we shall denote a matrix by a boldfaced capital letter, as is A in (A-1), and a vector as a boldfaced lowercase letter, as in a. Unless otherwise noted, a vector will always be assumed to be a column vector.
TABLE A.1  Matrix of macroeconomic data

                    Column
       1       2                      3                      4              5
Row    Year    Consumption            GNP                    GNP Deflator   Discount Rate
               (billions of dollars)  (billions of dollars)                 (N.Y. Fed., avg.)
1      1972     737.1                 1185.9                 1.0000          4.50
2      1973     812.0                 1326.4                 1.0575          6.44
3      1974     808.1                 1434.2                 1.1508          7.83
4      1975     976.4                 1549.2                 1.2579          6.25
5      1976    1084.3                 1718.0                 1.3234          5.50
6      1977    1204.4                 1918.3                 1.4005          5.46
7      1978    1346.5                 2163.9                 1.5042          7.46
8      1979    1507.2                 2417.8                 1.6342         10.28
9      1980    1667.2                 2633.1                 1.7864         11.77

Source: Data from the Economic Report of the President (Washington, D.C.: U.S. Government Printing Office, 1983).
A.2.2 TRANSPOSITION
The transpose of a matrix A, denoted A', is obtained by creating the matrix whose ith row is the ith column of the original matrix. Thus, if B = A', then each column of A will appear as the corresponding row of B. If A is n × K, then A' is K × n.
An equivalent definition of the transpose of a matrix is
    B = A'  ⇔  b_ik = a_ki for all i and k.    (A-3)
The definition of a symmetric matrix implies that
    if (and only if) A is symmetric, then A = A'.    (A-4)
It also follows from the definition that for any A,
    (A')' = A.    (A-5)
Finally, the transpose of a column vector, a, is a row vector:
    a' = [a_1  a_2  ···  a_n].
A.2.3 MATRIX ADDITION
The operations of addition and subtraction are extended to matrices by defining
    C = A + B = [a_ik + b_ik],    (A-6)
    A − B = [a_ik − b_ik].    (A-7)
Matrices cannot be added unless they have the same dimensions, in which case they are said to be conformable for addition. A zero matrix or null matrix is one whose elements are all zero. In the addition of matrices, the zero matrix plays the same role as the scalar 0 in scalar addition; that is,
    A + 0 = A.    (A-8)
It follows from (A-6) that matrix addition is commutative,
    A + B = B + A.    (A-9)
• Transpose of a product: (AB)' = B'A'.    (A-23)
• Transpose of an extended product: (ABC)' = C'B'A'.    (A-24)
A.2.7 SUMS OF VALUES
Denote by i a vector that contains a column of ones. Then,
    Σ_{i=1}^n x_i = x_1 + x_2 + ··· + x_n = i'x.    (A-25)
If all elements in x are equal to the same constant a, then x = ai and
    Σ_{i=1}^n x_i = i'(ai) = a(i'i) = na.    (A-26)
For any constant a and vector x,
    Σ_{i=1}^n ax_i = a Σ_{i=1}^n x_i = a i'x.    (A-27)
If a = 1/n, then we obtain the arithmetic mean,
    x̄ = (1/n) Σ_{i=1}^n x_i = (1/n) i'x,    (A-28)
from which it follows that
    Σ_{i=1}^n x_i = i'x = n x̄.
The sum of squares of the elements in a vector x is
    Σ_{i=1}^n x_i² = x'x,    (A-29)
while the sum of the products of the n elements in vectors x and y is
    Σ_{i=1}^n x_i y_i = x'y.    (A-30)
By the definition of matrix multiplication,
    [X'X]_kl = Σ_{i=1}^n x_ik x_il    (A-31)
is the inner product of the kth and lth columns of X. For example, for the data set given in Table A.1, if we define X as the 9 × 3 matrix containing (year, consumption, GNP), then
    [X'X]_23 = Σ_{t=1972}^{1980} consumption_t × GNP_t = 737.1(1185.9) + ··· + 1667.2(2633.1)
             = 19,743,711.34.
If X is n × K, then [again using (A-14)]
    X'X = Σ_{i=1}^n x_i x_i'.
This form shows that the K × K matrix X'X is the sum of n K × K matrices, each formed from a single row (year) of X. For the example given earlier, this sum is of nine 3 × 3 matrices, each formed from one row (year) of the original data matrix.
A.2.8 A USEFUL IDEMPOTENT MATRIX
A fundamental matrix in statistics is the one that is used to transform data to deviations from their mean. First,
    x̄ i = [x̄  x̄  ···  x̄]' = (1/n) i i'x.    (A-32)
The matrix (1/n)ii' is an n × n matrix with every element equal to 1/n. The set of values in deviations form is
    [x_1 − x̄, x_2 − x̄, ..., x_n − x̄]' = [x − x̄i] = [x − (1/n)ii'x].    (A-33)
Since x = Ix,
    [x − (1/n)ii'x] = [Ix − (1/n)ii'x] = [I − (1/n)ii']x = M⁰x.    (A-34)
Henceforth, the symbol M⁰ will be used only for this matrix. Its diagonal elements are all (1 − 1/n), and its off-diagonal elements are −1/n. The matrix M⁰ is primarily useful in computing sums of squared deviations. Some computations are simplified by the result
    M⁰i = [I − (1/n)ii']i = i − (1/n)i(i'i) = 0,    (A-35)
which implies that i'M⁰ = 0'. The sum of deviations about the mean is then
    Σ_{i=1}^n (x_i − x̄) = i'[M⁰x] = 0'x = 0.
For a single variable x, the sum of squared deviations about the mean is
    Σ_{i=1}^n (x_i − x̄)² = (Σ_{i=1}^n x_i²) − n x̄².    (A-36)
In matrix terms,
    Σ_{i=1}^n (x_i − x̄)² = (x − x̄i)'(x − x̄i) = (M⁰x)'(M⁰x) = x'M⁰'M⁰x.
Two properties of M⁰ are useful at this point. First, since all off-diagonal elements of M⁰ equal −1/n, M⁰ is symmetric. Second, as can easily be verified by multiplication, M⁰ is equal to its square; M⁰M⁰ = M⁰.
DEFINITION A.1 Idempotent Matrix
An idempotent matrix, M, is one that is equal to its square, that is, M² = MM = M. If M is a symmetric idempotent matrix (all of the idempotent matrices we shall encounter are symmetric), then M'M = M.

Thus, M⁰ is a symmetric idempotent matrix. Combining results, we obtain
    Σ_{i=1}^n (x_i − x̄)² = x'M⁰x.    (A-37)
Consider constructing a matrix of sums of squares and cross products in deviations from the column means. For two vectors x and y,
    Σ_{i=1}^n (x_i − x̄)(y_i − ȳ) = (M⁰x)'(M⁰y),    (A-38)
so
    [ Σ_i (x_i − x̄)²             Σ_i (x_i − x̄)(y_i − ȳ) ]     [ x'M⁰x   x'M⁰y ]
    [ Σ_i (y_i − ȳ)(x_i − x̄)     Σ_i (y_i − ȳ)²          ]  =  [ y'M⁰x   y'M⁰y ].    (A-39)
If we put the two column vectors x and y in an n × 2 matrix Z = [x, y], then M⁰Z is the n × 2 matrix in which the two columns of data are in mean deviation form. Then
    (M⁰Z)'(M⁰Z) = Z'M⁰'M⁰Z = Z'M⁰Z.
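The following minimal NumPy sketch (not from the text) constructs M⁰ for a short data vector and verifies the properties just listed: symmetry, idempotency, M⁰i = 0, and x'M⁰x = Σ(x_i − x̄)².

```python
import numpy as np

x = np.array([737.1, 812.0, 808.1, 976.4, 1084.3])   # any data vector will do
n = len(x)
i = np.ones(n)
M0 = np.eye(n) - np.outer(i, i) / n                   # M0 = I - (1/n) i i'

print(np.allclose(M0, M0.T))                          # symmetric
print(np.allclose(M0 @ M0, M0))                       # idempotent: M0 M0 = M0
print(np.allclose(M0 @ i, 0))                         # M0 i = 0, (A-35)
print(np.allclose(x @ M0 @ x, np.sum((x - x.mean())**2)))   # (A-37)
```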
A.3 GEOMETRY OF MATRICES
A.3.1 VECTOR SPACES
The K elements of a column vector
    a = [a_1  a_2  ···  a_K]'
can be viewed as the coordinates of a point in a K-dimensional space, as shown in Figure A.1 for two dimensions, or as the definition of the line segment connecting the origin and the point defined by a.
Two basic arithmetic operations are defined for vectors, scalar multiplication and addition. A scalar multiple of a vector, a, is another vector, say a*, whose coordinates are the scalar multiple of a's coordinates. Thus, in Figure A.1,
[Figure: vectors plotted in two dimensions; horizontal axis: first coordinate, vertical axis: second coordinate.]
Since a* is a multiple of a, a and a* are linearly dependent. For another example, if
    a = [1  2]',   b = [3  3]',   and   c = [10  14]',
then
    2a + b − ½c = 0,
so a, b, and c are linearly dependent. Any of the three possible pairs of them, however, are linearly independent.

DEFINITION A.5 Linear Independence
A set of vectors is linearly independent if and only if the only solution to
    α_1 a_1 + α_2 a_2 + ··· + α_K a_K = 0
is
    α_1 = α_2 = ··· = α_K = 0.
The preceding implies the following equivalent definition of a basis.
DEFINITION A.6 Basis for a Vector Space
A basis for a vector space of K dimensions is any set of K linearly independent vectors in that vector space.

Since any (K + 1)st vector can be written as a linear combination of the K basis vectors, it follows that any set of more than K vectors in R^K must be linearly dependent.
A.3.4 SUBSPACES
DEFINITION A.7 Spanning Vectors
The set of all linear combinations of a set of vectors is the vector space that is spanned by those vectors.

For example, by definition, the space spanned by a basis for R^K is R^K. An implication of this is that if a and b are a basis for R² and c is another vector in R², the space spanned by [a, b, c] is, again, R². Of course, c is superfluous. Nonetheless, any vector in R² can be expressed as a linear combination of a, b, and c. (The linear combination will not be unique. Suppose, for example, that a and c are also a basis for R².)
Consider the set of three coordinate vectors whose third element is zero. In particular,
    a' = [a_1  a_2  0]   and   b' = [b_1  b_2  0].
Vectors a and b do not span the three-dimensional space R³. Every linear combination of a and b has a third coordinate equal to zero; thus, for instance, c' = [1  2  3] could not be written as a linear combination of a and b. If (a_1 b_2 − a_2 b_1) is not equal to zero [see (A-41)], however, then any vector whose third element is zero can be expressed as a linear combination of a and b. So, although a and b do not span R³, they do span something, the set of vectors in R³ whose third element is zero. This area is a plane (the "floor" of the box in a three-dimensional figure). This plane in R³ is a subspace, in this instance, a two-dimensional subspace. Note that it is not R²; it is the set of vectors in R³ whose third coordinate is 0. Any plane in R³, regardless of how it is oriented, forms a two-dimensional subspace. Any two independent vectors that lie in that subspace will span it. But without a third vector that points in some other direction, we cannot span any more of R³ than this two-dimensional part of it. By the same logic, any line in R³ is a one-dimensional subspace, in this case, the set of all vectors in R³ whose coordinates are multiples of those of the vector that defines the line. A subspace is a vector space in all the respects in which we have defined it. We emphasize that it is not a vector space of lower dimension. For example, R² is not a subspace of R³. The essential difference is the number of dimensions in the vectors. The vectors in R³ that form a two-dimensional subspace are still three-element vectors; they all just happen to lie in the same plane.
The space spanned by a set of vectors in R^K has at most K dimensions. If this space has fewer than K dimensions, it is a subspace, or hyperplane. But the important point in the preceding discussion is that every set of vectors spans some space; it may be the entire space in which the vectors reside, or it may be some subspace of it.
A.3.5 RANK OF A MATRIX
We view a matrix as a set of column vectors. The number of columns in the matrix equals the number of vectors in the set, and the number of rows equals the number of coordinates in each column vector.
DEFINITION A.8 Column Space
The column space of a matrix is the vector space that is spanned by its column
vectors.
If the matrix contains K rows, its column space might have K dimensions. But, as we have seen, it might have fewer dimensions; the column vectors might be linearly dependent, or there might be fewer than K of them. Consider the matrix
    A = [ 1  5  6 ]
        [ 2  6  8 ]
        [ 7  1  8 ].
It contains three vectors from R³, but the third is the sum of the first two, so the column space of this matrix cannot have three dimensions. Nor does it have only one, since the three columns are not all scalar multiples of one another. Hence, it has two, and the column space of this matrix is a two-dimensional subspace of R³.
DEFINITION A.9 Column Rank
The column rank of a matrix is the dimension of the vector space that is spanned by its column vectors.

It follows that the column rank of a matrix is equal to the largest number of linearly independent column vectors it contains. The column rank of A is 2. For another specific example, consider
    B = [ 1  2  3 ]
        [ 5  1  5 ]
        [ 6  4  5 ]
        [ 3  1  4 ].
It can be shown (we shall see how later) that this matrix has a column rank equal to 3. Since each column of B is a vector in R⁴, the column space of B is a three-dimensional subspace of R⁴.
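As a numerical illustration (a NumPy sketch, not part of the text), the ranks of the two example matrices can be computed directly; the last line also previews the equality of row rank and column rank discussed below.

```python
import numpy as np

A = np.array([[1.0, 5.0, 6.0],
              [2.0, 6.0, 8.0],
              [7.0, 1.0, 8.0]])      # third column = first column + second column
B = np.array([[1.0, 2.0, 3.0],
              [5.0, 1.0, 5.0],
              [6.0, 4.0, 5.0],
              [3.0, 1.0, 4.0]])

print(np.linalg.matrix_rank(A))      # 2
print(np.linalg.matrix_rank(B))      # 3
print(np.linalg.matrix_rank(B.T))    # 3: row rank equals column rank
```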
Consider, instead, the set of vectors obtained by using the rows of B instead of the columns. The new matrix would be
    C = [ 1  5  6  3 ]
        [ 2  1  4  1 ]
        [ 3  5  5  4 ].
This matrix is composed of four column vectors from R³. (Note that C is B'.) The column space of C is at most R³, since four vectors in R³ must be linearly dependent. In fact, the column space of C is R³.
For 2 × 2 matrices, the computation of the determinant is
    | a  c |
    | b  d |  =  ad − bc.    (A-50)
Notice that it is a function of all the elements of the matrix. This statement will be true, in general. For more than two dimensions, the determinant can be obtained by using an expansion by cofactors. Using any row, say i, we obtain
    |A| = Σ_{k=1}^K a_ik (−1)^{i+k} |A_(ik)|,   i = 1, ..., K,    (A-51)
where A_(ik) is the matrix obtained from A by deleting row i and column k. The determinant of A_(ik) is called a minor of A. When the correct sign, (−1)^{i+k}, is added, it becomes a cofactor. This operation can be done using any column as well. For example, a 4 × 4 determinant becomes a sum of four 3 × 3s, whereas a 5 × 5 is a sum of five 4 × 4s, each of which is a sum of four 3 × 3s, and so on. Obviously, it is a good idea to base (A-51) on a row or column with many zeros in it, if possible. In practice, this rapidly becomes a heavy burden. It is unlikely, though, that you will ever calculate any determinants over 3 × 3 without a computer. A 3 × 3, however, might be computed on occasion; if so, the following shortcut will prove useful:
    | a11  a12  a13 |
    | a21  a22  a23 |  =  a11 a22 a33 + a12 a23 a31 + a13 a21 a32 − a31 a22 a13 − a21 a12 a33 − a11 a23 a32.
    | a31  a32  a33 |
Although (A-48) and (A-49) were given for diagonal matrices, they hold for general matrices C and D. One special case of (A-48) to note is that of c = −1. Multiplying a matrix by −1 does not necessarily change the sign of its determinant. It does so only if the order of the matrix is odd. By using the expansion by cofactors formula, an additional result can be shown:
    |A| = |A'|.    (A-52)
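A small sketch of the cofactor expansion (A-51), written in NumPy and checked against the library determinant, may help; the 3 × 3 matrix used here is arbitrary.

```python
import numpy as np

def det_cofactor(A):
    """Determinant by cofactor expansion along the first row, (A-51)."""
    K = A.shape[0]
    if K == 1:
        return A[0, 0]
    total = 0.0
    for k in range(K):
        minor = np.delete(np.delete(A, 0, axis=0), k, axis=1)   # delete row 1 and column k+1
        total += A[0, k] * (-1) ** k * det_cofactor(minor)      # sign (-1)^(1 + k+1)
    return total

A = np.array([[1.0, 5.0, 2.0],
              [2.0, 6.0, 8.0],
              [7.0, 1.0, 8.0]])
print(det_cofactor(A), np.linalg.det(A))                    # equal, up to rounding
print(np.isclose(np.linalg.det(A), np.linalg.det(A.T)))     # |A| = |A'|, (A-52)
```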
A.3.7 A LEAST SQUARES PROBLEM
Given a vector y and a matrix X, we are interested in expressing y as a linear combination of the columns of X. There are two possibilities. If y lies in the column space of X, then we shall be able to find a vector b such that
    y = Xb.    (A-53)
Figure A.3 illustrates such a case for three dimensions in which the two columns of X both have a third coordinate equal to zero. Only ys whose third coordinate is zero, such as y⁰ in the figure, can be expressed as Xb for some b. For the general case, assuming that y is, indeed, in the column space of X, we can find the coefficients b by solving the set of equations in (A-53). The solution is discussed in the next section.
Suppose, however, that y is not in the column space of X. In the context of this example, suppose that y's third component is not zero. Then there is no b such that (A-53) holds. We can, however, write
    y = Xb + e,    (A-54)
where e is the difference between y and Xb. By this construction, we find an Xb that is in the column space of X, and e is the difference, or "residual." Figure A.3 shows two examples, y and
If i equals k, then the determinant is a principal minor.
[Figure A.3: the least squares projection; the axes shown are the second and third coordinates.]
y*. For the present, we consider only y. We are interested in finding the b such that y is as close as possible to Xb in the sense that e is as short as possible.

DEFINITION A.10 Length of a Vector
The length, or norm, of a vector e is
    ||e|| = √(e'e).    (A-55)

The problem is to find the b for which
    ||e|| = ||y − Xb||
is as small as possible. The solution is that b that makes e perpendicular, or orthogonal, to Xb.

DEFINITION A.11 Orthogonal Vectors
Two nonzero vectors a and b are orthogonal, written a ⊥ b, if and only if
    a'b = b'a = 0.

Returning once again to our fitting problem, we find that the b we seek is that for which
    e ⊥ Xb.
Expanding this set of equations gives the requirement
    (Xb)'e = 0
           = b'X'y − b'X'Xb
           = b'[X'y − X'Xb],
or, assuming b is not 0, the set of equations
    X'y = X'Xb.
The means of solving such a set of equations is the subject of Section A.4.
In Figure A.3, the linear combination Xb is called the projection of y into the column space of X. The figure is drawn so that, although y and y* are different, they are similar in that the projection of y lies on top of that of y*. The question we wish to pursue here is, Which vector, y or y*, is closer to its projection in the column space of X? Superficially, it would appear that y is closer, because e is shorter than e*. Yet y* is much more nearly parallel to its projection than y, so the only reason that its residual vector is longer is that y* is longer compared with y. A measure of comparison that would be unaffected by the length of the vectors is the angle between the vector and its projection (assuming that angle is not zero). By this measure, θ* is smaller than θ, which would reverse the earlier conclusion.

THEOREM A.2 The Cosine Law
The angle θ between two vectors a and b satisfies
    cos θ = a'b / (||a|| · ||b||).

The two vectors in the calculation would be y or y* and Xb or (Xb)*. A zero cosine implies that the vectors are orthogonal. If the cosine is one, then the angle is zero, which means that the vectors are the same. (They would be if y were in the column space of X.) By dividing by the lengths, we automatically compensate for the length of y. By this measure, we find in Figure A.3 that y* is closer to its projection, (Xb)*, than y is to its projection, Xb.
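The sketch below (illustrative NumPy code, with vectors chosen only for this example and not those of Figure A.3) finds b from the normal equations, confirms that the residual is orthogonal to the columns of X, and evaluates the cosine in Theorem A.2 for y and its projection.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [3.0, 1.0],
              [0.0, 0.0]])             # both columns have third coordinate zero
y = np.array([2.0, 4.0, 3.0])          # third coordinate nonzero: y is not in the column space

b = np.linalg.solve(X.T @ X, X.T @ y)  # normal equations X'y = X'X b
e = y - X @ b                          # residual

print(np.allclose(X.T @ e, 0))         # e is orthogonal to every column of X, hence to Xb
cos_theta = (y @ (X @ b)) / (np.linalg.norm(y) * np.linalg.norm(X @ b))
print(cos_theta)                       # cosine of the angle between y and its projection
```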
A.4 SOLUTION OF A SYSTEM OF LINEAR
EQUATIONS
Consider the set of n linear equations
    Ax = b,    (A-56)
in which the K elements of x constitute the unknowns, A is a known matrix of coefficients, and b is a specified vector of values. We are interested in knowing whether a solution exists; if so, then how to obtain it; and, finally, if it does exist, then whether it is unique.
A.4.1 SYSTEMS OF LINEAR EQUATIONS
For most of our applications, we shall consider only square systems of equations, that is, those in which A is a square matrix. In what follows, therefore, we take n to equal K. Since the number of rows in A is the number of equations, whereas the number of columns in A is the number of variables, this case is the familiar one of "n equations in n unknowns."
There are two types of systems of equations.
Note the condition preceding (A-64). It may be that AB is a square, nonsingular matrix when neither A nor B is even square. (Consider, for example, A'A.) Extending (A-64), we have
    (ABC)⁻¹ = C⁻¹(AB)⁻¹ = C⁻¹B⁻¹A⁻¹.    (A-65)
Recall that for a data matrix X, X'X is the sum of the outer products of the rows of X. Suppose that we have already computed (X'X)⁻¹ for a number of years of data, such as those given at the beginning of this appendix. The following result, which is called an updating formula, shows how to compute the new (X'X)⁻¹ that would result when a new row is added to X:
    [A ± bb']⁻¹ = A⁻¹ ∓ [1 / (1 ± b'A⁻¹b)] A⁻¹bb'A⁻¹.    (A-66)
Note the reversal of the sign in the inverse. Two more general forms of (A-66) that are occasionally useful are
    [A ± bc']⁻¹ = A⁻¹ ∓ [1 / (1 ± c'A⁻¹b)] A⁻¹bc'A⁻¹,    (A-66a)
    [A + BCB']⁻¹ = A⁻¹ − A⁻¹B[C⁻¹ + B'A⁻¹B]⁻¹B'A⁻¹.    (A-66b)
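A numerical check of the rank-one updating formula (A-66a), using NumPy with an arbitrary positive definite A and arbitrary vectors b and c, is sketched below.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(6, 3))
A = Z.T @ Z                      # a positive definite 3 x 3 matrix
b = rng.normal(size=3)
c = rng.normal(size=3)
Ainv = np.linalg.inv(A)

# (A-66a) with the plus sign: [A + b c']^{-1} = A^{-1} - (1/(1 + c'A^{-1}b)) A^{-1} b c' A^{-1}
lhs = np.linalg.inv(A + np.outer(b, c))
rhs = Ainv - (1.0 / (1.0 + c @ Ainv @ b)) * np.outer(Ainv @ b, c @ Ainv)
print(np.allclose(lhs, rhs))     # True
```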
A.4.3 NONHOMOGENEOUS SYSTEMS OF EQUATIONS
For the nonhomogeneous system
    Ax = b,
if A is nonsingular, then the unique solution is
    x = A⁻¹b.
A.4.4 SOLVING THE LEAST SQUARES PROBLEM
We now have the tool needed to solve the least squares problem posed in Section A.3.7. We found the solution vector, b, to be the solution to the nonhomogeneous system X'y = X'Xb. Let a equal the vector X'y and let A equal the square matrix X'X. The equation system is then
    Ab = a.
By the results above, if A is nonsingular, then
    b = A⁻¹a = (X'X)⁻¹(X'y),
assuming that the matrix to be inverted is nonsingular. We have reached the irreducible minimum. If the columns of X are linearly independent, that is, if X has full rank, then this is the solution to the least squares problem. If the columns of X are linearly dependent, then this system has no unique solution.
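A minimal NumPy sketch (random data, purely illustrative) comparing the normal-equations solution b = (X'X)⁻¹X'y with the library least squares routine:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))                     # full column rank with probability one
y = rng.normal(size=20)

b_normal = np.linalg.solve(X.T @ X, X.T @ y)     # b = (X'X)^{-1} X'y
b_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)  # library least squares solution
print(np.allclose(b_normal, b_lstsq))            # True
```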
A.5 PARTITIONED MATRICES
In formulating the elements of a matrix, it is sometimes useful to group some of the elements in submatrices. Let
    A = [ 1  4 | 5 ]
        [ 2  9 | 3 ]     [ A11  A12 ]
        [ ————————— ]  =  [ A21  A22 ].
        [ 8  9 | 6 ]
A is a partitioned matrix. The subscripts of the submatrices are defined in the same fashion as those for the elements of a matrix. A common special case is the block-diagonal matrix:
    A = [ A11    0  ]
        [   0   A22 ],
where A11 and A22 are square matrices.
A.5.1 ADDITION AND MULTIPLICATION
OF PARTITIONED MATRICES
For conformably partitioned matrices A and B,
    A + B = [ A11 + B11   A12 + B12 ]
            [ A21 + B21   A22 + B22 ],    (A-67)
and
    AB = [ A11  A12 ] [ B11  B12 ]  =  [ A11 B11 + A12 B21   A11 B12 + A12 B22 ]
         [ A21  A22 ] [ B21  B22 ]     [ A21 B11 + A22 B21   A21 B12 + A22 B22 ].    (A-68)
In all these, the matrices must be conformable for the operations involved. For addition, the dimensions of A_ij and B_ij must be the same. For multiplication, the number of columns in A_ij must equal the number of rows in B_jk for all pairs i and j. That is, all the necessary matrix products of the submatrices must be defined. Two cases frequently encountered are of the form
    [ A1 ]' [ A1 ]  =  [ A1'  A2' ] [ A1 ]  =  [ A1'A1 + A2'A2 ],    (A-69)
    [ A2 ]  [ A2 ]                  [ A2 ]
and
    [ A11    0  ]' [ A11    0  ]  =  [ A11'A11      0      ]
    [   0   A22 ]  [   0   A22 ]     [    0      A22'A22   ].    (A-70)
A.5.2 DETERMINANTS OF PARTITIONED MATRICES
The determinant of a block-diagonal matrix is obtained analogously to that of a diagonal matrix:
    | A11    0  |
    |   0   A22 |  =  |A11| · |A22|.    (A-71)
The determinant of a general 2 × 2 partitioned matrix is
    | A11  A12 |
    | A21  A22 |  =  |A22| · |A11 − A12 A22⁻¹ A21|  =  |A11| · |A22 − A21 A11⁻¹ A12|.    (A-72)
A.5.3 INVERSES OF PARTITIONED MATRICES
The inverse of a block-diagonal matrix is
    [ A11    0  ]⁻¹     [ A11⁻¹     0    ]
    [   0   A22 ]    =  [   0     A22⁻¹  ],    (A-73)
which can be verified by direct multiplication.
For the general 2 × 2 partitioned matrix, one form of the partitioned inverse is
    [ A11  A12 ]⁻¹     [ A11⁻¹(I + A12 F2 A21 A11⁻¹)    −A11⁻¹ A12 F2 ]
    [ A21  A22 ]    =  [        −F2 A21 A11⁻¹                 F2      ],    (A-74)
where
    F2 = (A22 − A21 A11⁻¹ A12)⁻¹.
The upper left block could also be written as
    F1 = (A11 − A12 A22⁻¹ A21)⁻¹.
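The partitioned inverse (A-74) can be verified numerically; the NumPy sketch below uses an arbitrary positive definite matrix partitioned into 2 × 2 blocks.

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.normal(size=(8, 4))
A = Z.T @ Z                                # 4 x 4 positive definite, partitioned 2 + 2
A11, A12 = A[:2, :2], A[:2, 2:]
A21, A22 = A[2:, :2], A[2:, 2:]

A11i = np.linalg.inv(A11)
F2 = np.linalg.inv(A22 - A21 @ A11i @ A12)
upper_left  = A11i @ (np.eye(2) + A12 @ F2 @ A21 @ A11i)
upper_right = -A11i @ A12 @ F2
lower_left  = -F2 @ A21 @ A11i
block_inv = np.block([[upper_left, upper_right],
                      [lower_left, F2]])
print(np.allclose(block_inv, np.linalg.inv(A)))   # (A-74) verified
```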
A.5.4 DEVIATIONS FROM MEANS
Suppose that we begin with a column vector of n values x and let
    A = [   n       Σᵢ xᵢ  ]     [ i'i   i'x ]
        [ Σᵢ xᵢ    Σᵢ xᵢ²  ]  =  [ x'i   x'x ].
We are interested in the lower right-hand element of A⁻¹. Upon using the definition of F2 in (A-74), this is
    F2 = [x'x − (x'i)(i'i)⁻¹(i'x)]⁻¹ = {x'[I − (1/n)ii']x}⁻¹ = (x'M⁰x)⁻¹.
Therefore, the lower right-hand value in the inverse matrix is
    (x'M⁰x)⁻¹ = 1 / Σᵢ (xᵢ − x̄)².
Now, suppose that we replace x with X, a matrix with several columns. We seek the lower right block of (Z'Z)⁻¹, where Z = [i, X]. The analogous result is
    (Z'Z)²² = [X'X − X'i(i'i)⁻¹i'X]⁻¹ = (X'M⁰X)⁻¹,
which implies that the K × K matrix in the lower right corner of (Z'Z)⁻¹ is the inverse of the K × K matrix whose jkth element is Σᵢ (x_ij − x̄_j)(x_ik − x̄_k). Thus, when a data matrix contains a column of ones, the elements of the inverse of the matrix of sums of squares and cross products will be computed from the original data in the form of deviations from the respective column means.
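A short NumPy check of this result, with arbitrary data (not from Table A.1): the lower right block of (Z'Z)⁻¹, where Z = [i, X], equals (X'M⁰X)⁻¹.

```python
import numpy as np

rng = np.random.default_rng(3)
n, K = 12, 2
X = rng.normal(size=(n, K))
i = np.ones((n, 1))
Z = np.hstack([i, X])                         # Z = [i, X]

M0 = np.eye(n) - np.ones((n, n)) / n          # M0 = I - (1/n) i i'
lower_right = np.linalg.inv(Z.T @ Z)[1:, 1:]  # K x K lower right block of (Z'Z)^{-1}
print(np.allclose(lower_right, np.linalg.inv(X.T @ M0 @ X)))   # True
```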
A.5.5 KRONECKER PRODUCTS
A calculation that helps to condense the notation when dealing with sets of regression models (see Chapters 14 and 15) is the Kronecker product. For general matrices A and B,
    A ⊗ B = [ a11 B   a12 B   ···   a1K B ]
            [ a21 B   a22 B   ···   a2K B ]
            [   ⋮       ⋮              ⋮  ]
            [ an1 B   an2 B   ···   anK B ].    (A-75)
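NumPy provides the Kronecker product directly as np.kron; the sketch below illustrates (A-75) and, as an aside not stated in the text above, the mixed-product rule.

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 5],
              [6, 7]])

print(np.kron(A, B))                     # each a_ik is replaced by the block a_ik * B, (A-75)

# Mixed-product rule (a standard property, stated here as an aside): (A kron B)(C kron D) = (AC) kron (BD).
C = np.array([[1, 1], [0, 2]])
D = np.array([[2, 0], [1, 1]])
print(np.allclose(np.kron(A, B) @ np.kron(C, D), np.kron(A @ C, B @ D)))
```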
Since the vectors are orthogonal and cᵢ'cᵢ = 1, we have
    C'C = [ c1'c1   c1'c2   ···   c1'cK ]
          [ c2'c1   c2'c2   ···   c2'cK ]
          [   ⋮       ⋮              ⋮  ]
          [ cK'c1   cK'c2   ···   cK'cK ]  =  I.    (A-81)
Result (A-81) implies that
    C' = C⁻¹.    (A-82)
Consequently,
    CC' = CC⁻¹ = I    (A-83)
as well, so the rows as well as the columns of C are orthogonal.
A.6.4 DIAGONALIZATION AND SPECTRAL DECOMPOSITION OF A MATRIX
By premultiplying (A-80) by C' and using (A-81), we can extract the characteristic roots of A.

DEFINITION A.15 Diagonalization of a Matrix
The diagonalization of a matrix A is
    C'AC = C'CΛ = IΛ = Λ.    (A-84)

Alternatively, by postmultiplying (A-80) by C' and using (A-83), we obtain a useful representation of A.

DEFINITION A.16 Spectral Decomposition of a Matrix
The spectral decomposition of A is
    A = CΛC' = Σ_{k=1}^K λ_k c_k c_k'.    (A-85)

In this representation, the K × K matrix A is written as a sum of K rank one matrices. This sum is also called the eigenvalue (or "own" value) decomposition of A. In this connection, the term signature of the matrix is sometimes used to describe the characteristic roots and vectors. Yet another pair of terms for the parts of this decomposition are the latent roots and latent vectors of A.
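The diagonalization (A-84) and spectral decomposition (A-85) can be checked numerically with NumPy's symmetric eigenvalue routine; the matrix below is an arbitrary symmetric example.

```python
import numpy as np

A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])           # a symmetric matrix

lam, C = np.linalg.eigh(A)                # characteristic roots and orthonormal vectors
print(np.allclose(C.T @ C, np.eye(3)))    # C'C = I, (A-81)
print(np.allclose(C.T @ A @ C, np.diag(lam)))                   # diagonalization, (A-84)
rank_one_sum = sum(lam[k] * np.outer(C[:, k], C[:, k]) for k in range(3))
print(np.allclose(rank_one_sum, A))       # spectral decomposition, (A-85)
```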
A.6.5 RANK OF A MATRIX
The diagonalization result enables us to obtain the rank of a matrix very easily. To do so, we can use the following result.
THEOREM A.3 Rank of a Product
For any matrix A and nonsingular matrices B and C, the rank of BAC is equal to the rank of A. The proof is simple. By (A-45), rank(BAC) = rank[(BA)C] = rank(BA). By (A-43), rank(BA) = rank(A'B'), and applying (A-45) again, rank(A'B') = rank(A') since B' is nonsingular if B is nonsingular (once again, by A-43). Finally, applying (A-43) again to obtain rank(A') = rank(A) gives the result.

Since C and C' are nonsingular, we can use them to apply this result to (A-84). By an obvious substitution,
    rank(A) = rank(Λ).    (A-86)
Finding the rank of Λ is trivial. Since Λ is a diagonal matrix, its rank is just the number of nonzero values on its diagonal. By extending this result, we can prove the following theorems. (Proofs are brief and are left for the reader.)

THEOREM A.4 Rank of a Symmetric Matrix
The rank of a symmetric matrix is the number of nonzero characteristic roots it contains.

Note how this result enters the spectral decomposition given above. If any of the characteristic roots are zero, then the number of rank one matrices in the sum is reduced correspondingly. It would appear that this simple rule will not be useful if A is not square. But recall that
    rank(A) = rank(A'A).    (A-87)
Since A'A is always square, we can use it instead of A. Indeed, we can use it even if A is square, which leads to a fully general result.

THEOREM A.5 Rank of a Matrix
The rank of any matrix A equals the number of nonzero characteristic roots in A'A.

Since the row rank and column rank of a matrix are equal, we should be able to apply Theorem A.5 to AA' as well. This process, however, requires an additional result.

THEOREM A.6 Roots of an Outer Product Matrix
The nonzero characteristic roots of AA' are the same as those of A'A.
The proof is left as an exercise. A useful special case the reader can examine is the characteristic roots of aa' and a'a, where a is an n × 1 vector.
If a characteristic root of a matrix is zero, then we have Ac = 0. Thus, if the matrix has a zero root, it must be singular. Otherwise, no nonzero c would exist. In general, therefore, a matrix is singular, that is, it does not have full rank, if and only if it has at least one zero root.
A.6.6 CONDITION NUMBER OF A MATRIX
As the preceding might suggest, there is a discrete difference between full rank and short rank matrices. In analyzing data matrices such as the one in Section A.2, however, we shall often encounter cases in which a matrix is not quite short ranked, because it has all nonzero roots, but it is close. That is, by some measure, we can come very close to being able to write one column as a linear combination of the others. This case is important; we shall examine it at length in our discussion of multicollinearity. Our definitions of rank and determinant will fail to indicate this possibility, but an alternative measure, the condition number, is designed for that purpose. Formally, the condition number for a square matrix A is
    condition number = [ maximum root / minimum root ]^{1/2}.    (A-88)
For nonsquare matrices X, such as the data matrix in the example, we use A = X'X. As a further refinement, because the characteristic roots are affected by the scaling of the columns of X, we scale the columns to have length 1 by dividing each column by its norm [see (A-55)]. For the X in Section A.2, the largest characteristic root of A is 4.9255 and the smallest is 0.0001543. Therefore, the condition number is 178.67, which is extremely large. (Values greater than 20 are large.) That the smallest root is close to zero compared with the largest means that this matrix is nearly singular. Matrices with large condition numbers are difficult to invert accurately.
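The following NumPy sketch (with artificially near-collinear data rather than the Table A.1 matrix) reproduces the computation described above: scale the columns of X to unit length, then take the square root of the ratio of the largest to the smallest characteristic root of X'X.

```python
import numpy as np

rng = np.random.default_rng(4)
x1 = rng.normal(size=50)
x2 = x1 + 1e-4 * rng.normal(size=50)          # nearly a copy of x1: near-collinear columns
X = np.column_stack([np.ones(50), x1, x2])

Xs = X / np.linalg.norm(X, axis=0)            # scale each column to unit length
roots = np.linalg.eigvalsh(Xs.T @ Xs)         # characteristic roots of the scaled X'X
gamma = np.sqrt(roots.max() / roots.min())    # condition number, (A-88)
print(gamma)                                  # very large: the matrix is nearly singular
```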
A.6.7 TRACE OF A MATRIX
The trace of a square K × K matrix is the sum of its diagonal elements:
    tr(A) = Σ_{k=1}^K a_kk.
Some easily proven results are
    tr(cA) = c(tr(A)),    (A-89)
    tr(A') = tr(A),    (A-90)
    tr(A + B) = tr(A) + tr(B),    (A-91)
    tr(I_K) = K,    (A-92)
    tr(AB) = tr(BA).    (A-93)
    a'a = tr(a'a) = tr(aa'),
    tr(A'A) = Σ_{k=1}^K a_k'a_k = Σ_{i=1}^n Σ_{k=1}^K a_ik².
The permutation rule can be extended to any cyclic permutation in a product:
    tr(ABCD) = tr(BCDA) = tr(CDAB) = tr(DABC).    (A-94)
The characteristic roots of A^r are the rth power of those of A, and the characteristic vectors are the same.
If A is only nonnegative definite, that is, has roots that are either zero or positive, then (A-105) holds only for nonnegative r.
A.6.10 IDEMPOTENT MATRICES
Idempotent matrices are equal to their squares [see (A-37) to (A-39)]. In view of their importance in econometrics, we collect a few results related to idempotent matrices at this point. First, (A-101) implies that if λ is a characteristic root of an idempotent matrix, then λ = λ^K for all nonnegative integers K. As such, if A is a symmetric idempotent matrix, then all its roots are one or zero. Assume that all the roots of A are one. Then Λ = I, and A = CΛC' = CIC' = CC' = I. If the roots are not all one, then one or more are zero. Consequently, we have the following results for symmetric idempotent matrices:⁹
• The only full rank, symmetric idempotent matrix is the identity matrix I.    (A-106)
• All symmetric idempotent matrices except the identity matrix are singular.    (A-107)
The final result on idempotent matrices is obtained by observing that the count of the nonzero roots of A is also equal to their sum. By combining Theorems A.5 and A.7 with the result that for an idempotent matrix, the roots are all zero or one, we obtain this result:
• The rank of a symmetric idempotent matrix is equal to its trace.    (A-108)
A.6.11 FACTORING A MATRIX
In some applications, we shall require a matrix P such that
    P'P = A⁻¹.
One choice is
    P = Λ^{-1/2} C',
so that
    P'P = (C')'(Λ^{-1/2})' Λ^{-1/2} C' = C Λ⁻¹ C',
as desired.¹⁰ Thus, the spectral decomposition of A⁻¹, A⁻¹ = C Λ⁻¹ C', is a useful result for this kind of computation.
The Cholesky factorization of a symmetric positive definite matrix is an alternative representation that is useful in regression analysis. Any symmetric positive definite matrix A may be written as the product of a lower triangular matrix L and its transpose (which is an upper triangular matrix) L' = U. Thus, A = LU. This result is the Cholesky decomposition of A. The square roots of the diagonal elements of L, d_i, are the Cholesky values of A. By arraying these in a diagonal matrix D, we may also write A = LD⁻¹D²D⁻¹U = L*D²U*, which is similar to the spectral decomposition in (A-85). The usefulness of this formulation arises when the inverse of A is required.
⁹ Not all idempotent matrices are symmetric. We shall not encounter any asymmetric ones in our work, however.
¹⁰ We say that this is "one" choice because if A is symmetric, as it will be in all our applications, there are other candidates. The reader can easily verify that CΛ^{-1/2}C' = A^{-1/2} works as well.
Once L is computed, finding A⁻¹ = U⁻¹L⁻¹ is also straightforward as well as extremely fast and accurate. Most recently developed econometric software packages use this technique for inverting positive definite matrices.
A third type of decomposition of a matrix is useful for numerical analysis when the inverse is difficult to obtain because the columns of A are "nearly" collinear. Any n × K matrix A for which n ≥ K can be written in the form A = UWV', where U is an orthogonal n × K matrix, that is, U'U = I_K; W is a K × K diagonal matrix such that w_i ≥ 0; and V is a K × K matrix such that V'V = I_K. This result is called the singular value decomposition (SVD) of A, and the w_i are the singular values of A.¹¹ (Note that if A is square, then the spectral decomposition is a singular value decomposition.) As with the Cholesky decomposition, the usefulness of the SVD arises in inversion, in this case, of A'A. By multiplying it out, we obtain that (A'A)⁻¹ is simply VW⁻²V'. Once the SVD of A is computed, the inversion is trivial. The other advantage of this format is its numerical stability, which is discussed at length in Press et al. (1986).
Press et al. (1986) recommend the SVD approach as the method of choice for solving least squares problems because of its accuracy and numerical stability. A commonly used alternative method similar to the SVD approach is the QR decomposition. Any n × K matrix, X, with n ≥ K can be written in the form X = QR in which the columns of Q are orthonormal (Q'Q = I) and R is an upper triangular matrix. Decomposing X in this fashion allows an extremely accurate solution to the least squares problem that does not involve inversion or direct solution of the normal equations. Press et al. suggest that this method may have problems with rounding errors in problems when X is nearly of short rank, but based on other published results, this concern seems relatively minor.¹²
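A brief NumPy sketch of the two factorizations discussed above, the Cholesky decomposition A = LL' and a QR-based least squares solution that avoids forming the normal equations (random data, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(30, 4))
y = rng.normal(size=30)

# Cholesky: A = L L' for the positive definite matrix A = X'X.
A = X.T @ X
L = np.linalg.cholesky(A)
print(np.allclose(L @ L.T, A))

# QR least squares: X = QR, then solve R b = Q'y (R is upper triangular;
# a general solver is used here for brevity instead of back-substitution).
Q, R = np.linalg.qr(X)
b_qr = np.linalg.solve(R, Q.T @ y)
b_ne = np.linalg.solve(A, X.T @ y)        # normal-equations solution, for comparison
print(np.allclose(b_qr, b_ne))            # True
```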
A.6.12 THE GENERALIZED INVERSE OF A MATRIX
Inverse matrices are fundamental in econometrics. Although we shall not require them much in our treatment in this book, there are more general forms of inverse matrices than we have considered thus far. A generalized inverse of a matrix A is another matrix A⁺ that satisfies the following requirements:
1. AA⁺A = A.
2. A⁺AA⁺ = A⁺.
3. A⁺A is symmetric.
4. AA⁺ is symmetric.
A unique A⁺ can be found for any matrix, whether A is singular or not, or even if A is not square.¹³ The unique matrix that satisfies all four requirements is called the Moore–Penrose inverse or pseudoinverse of A. If A happens to be square and nonsingular, then the generalized inverse will be the familiar ordinary inverse. But if A⁻¹ does not exist, then A⁺ can still be computed.
An important special case is the overdetermined system of equations
    Ab = y,
¹¹ Discussion of the singular value decomposition (and listings of computer programs for the computations) may be found in Press et al. (1986).
¹² The National Institute of Standards and Technology (NIST) has published a suite of benchmark problems that test the accuracy of least squares computations (http://www.nist.gov/itl/div898/strd). Using these problems, which include some extremely difficult, ill-conditioned data sets, we found that the QR method would reproduce all the NIST certified solutions to 15 digits of accuracy, which suggests that the QR method should be satisfactory for all but the worst problems.
¹³ A proof of uniqueness, with several other results, may be found in Theil (1983).
where A has n rows, K < n columns, and column rank equal to R ≤ K. Suppose that R equals K, so that (A'A)⁻¹ exists. Then the Moore–Penrose inverse of A is
    A⁺ = (A'A)⁻¹A',
which can be verified by multiplication. A "solution" to the system of equations can be written
    b = A⁺y.
This is the vector that minimizes the length of Ab − y. Recall this was the solution to the least squares problem obtained in Section A.4.4. If y lies in the column space of A, this vector will be zero, but otherwise, it will not.
Now suppose that A does not have full rank. The previous solution cannot be computed. An alternative solution can be obtained, however. We continue to use the matrix A'A. In the spectral decomposition of Section A.6.4, if A has rank R, then there are R terms in the summation in (A-85). In (A-102), the spectral decomposition using the reciprocals of the characteristic roots is used to compute the inverse. To compute the Moore–Penrose inverse, we apply this calculation to A'A, using only the nonzero roots, then postmultiply the result by A'. Let C₁ be the R characteristic vectors corresponding to the nonzero roots, and array the nonzero roots in the diagonal matrix Λ₁. Then the Moore–Penrose inverse is
    A⁺ = C₁Λ₁⁻¹C₁'A',
which is very similar to the previous result.
If A is a symmetric matrix with rank R < K, the Moore–Penrose inverse is computed precisely as in the preceding equation without postmultiplying by A'. Thus, for a symmetric matrix A,
    A⁺ = C₁Λ₁⁻¹C₁',
where Λ₁⁻¹ is a diagonal matrix containing the reciprocals of the nonzero roots of A.
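NumPy computes the Moore–Penrose inverse as np.linalg.pinv; the sketch below applies it to a short-rank matrix and verifies the four defining requirements listed above.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])      # third column = first + second, so A has short rank

Ap = np.linalg.pinv(A)               # Moore-Penrose pseudoinverse

print(np.allclose(A @ Ap @ A, A))            # 1. A A+ A = A
print(np.allclose(Ap @ A @ Ap, Ap))          # 2. A+ A A+ = A+
print(np.allclose((Ap @ A).T, Ap @ A))       # 3. A+ A is symmetric
print(np.allclose((A @ Ap).T, A @ Ap))       # 4. A A+ is symmetric
```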
A.7 QUADRATIC FORMS AND DEFINITE MATRICES
Many optimization problems involve double sums of the form
    q = Σ_{i=1}^n Σ_{j=1}^n x_i x_j a_ij.    (A-109)
This quadratic form can be written
    q = x'Ax,
where A is a symmetric matrix. In general, q may be positive, negative, or zero; it depends on A and x. There are some matrices, however, for which q will be positive regardless of x, and others for which q will always be negative (or nonnegative or nonpositive). For a given matrix A,
1. If x'Ax > (<) 0 for all nonzero x, then A is positive (negative) definite.
2. If x'Ax ≥ (≤) 0 for all nonzero x, then A is nonnegative definite or positive semidefinite (nonpositive definite).
It might seem that it would be impossible to check a matrix for definiteness, since x can be chosen arbitrarily. But we have already used the set of results necessary to do so. Recall that a
In order to establish this intuitive result, we would make use of the following, which is proved in Goldberger (1964, Chapter 2):

THEOREM A.12 Ordering for Positive Definite Matrices
If A and B are two positive definite matrices with the same dimensions and if every characteristic root of A is larger than (at least as large as) the corresponding characteristic root of B when both sets of roots are ordered from largest to smallest, then A − B is positive (nonnegative) definite.

The roots of the inverse are the reciprocals of the roots of the original matrix, so the theorem can be applied to the inverse matrices.
A.8 CALCULUS AND MATRIX ALGEBRA¹⁴
A.8.1 DIFFERENTIATION AND THE TAYLOR SERIES
A variable y is a function of another variable x written
    y = f(x),  y = g(x),  y = y(x),
and so on, if each value of x is associated with a single value of y. In this relationship, y and x are sometimes labeled the dependent variable and the independent variable, respectively. Assuming that the function f(x) is continuous and differentiable, we obtain the following derivatives:
    f'(x) = dy/dx,   f''(x) = d²y/dx²,
and so on.
A frequent use of the derivatives of f(x) is in the Taylor series approximation. A Taylor series is a polynomial approximation to f(x). Letting x⁰ be an arbitrarily chosen expansion point,
    f(x) ≈ f(x⁰) + Σ_i (1/i!) [dⁱf(x⁰)/d(x⁰)ⁱ] (x − x⁰)ⁱ.    (A-121)
The choice of the number of terms is arbitrary; the more that are used, the more accurate the approximation will be. The approximation used most frequently in econometrics is the linear approximation,
    f(x) ≈ α + βx,    (A-122)
where, by collecting terms in (A-121), α = [f(x⁰) − f'(x⁰)x⁰] and β = f'(x⁰). The superscript "0" indicates that the function is evaluated at x⁰. The quadratic approximation is
    f(x) ≈ α + βx + γx²,    (A-123)
where
    α = [f⁰ − f'⁰x⁰ + ½f''⁰(x⁰)²],   β = [f'⁰ − f''⁰x⁰],   and   γ = ½f''⁰.
¹⁴ For a complete exposition, see Magnus and Neudecker (1988).
We can regard a function y = f(x₁, x₂, ..., x_n) as a scalar-valued function of a vector; that is, y = f(x). The vector of partial derivatives, or gradient vector, or simply gradient, is
    ∂y/∂x = [ ∂y/∂x₁ ]     [ f₁ ]
            [ ∂y/∂x₂ ]  =  [ f₂ ]
            [    ⋮   ]     [  ⋮ ]
            [ ∂y/∂x_n ]    [ f_n ].    (A-124)
The vector g(x) or g is used to represent the gradient. Notice that it is a column vector. The shape of the derivative is determined by the denominator of the derivative.
A second derivatives matrix or Hessian is computed as
    H = [ ∂²y/∂x₁∂x₁    ∂²y/∂x₁∂x₂    ···   ∂²y/∂x₁∂x_n ]
        [ ∂²y/∂x₂∂x₁    ∂²y/∂x₂∂x₂    ···   ∂²y/∂x₂∂x_n ]
        [      ⋮              ⋮                   ⋮      ]
        [ ∂²y/∂x_n∂x₁   ∂²y/∂x_n∂x₂   ···   ∂²y/∂x_n∂x_n ]  =  [f_ij].    (A-125)
In general, H is a square, symmetric matrix. (The symmetry is obtained for continuous and continuously differentiable functions from Young's theorem.) Each column of H is the derivative of g with respect to the corresponding variable in x'. Therefore,
    H = [ ∂(∂y/∂x)/∂x₁   ∂(∂y/∂x)/∂x₂   ···   ∂(∂y/∂x)/∂x_n ] = ∂(∂y/∂x)/∂x' = ∂²y/∂x∂x'.
The first-order, or linear Taylor series approximation is
    y ≈ f(x⁰) + Σ_{i=1}^n f_i(x⁰)(x_i − x_i⁰).    (A-126)
The right-hand side is
    f(x⁰) + [∂f(x⁰)/∂x⁰]'(x − x⁰) = [f(x⁰) − g(x⁰)'x⁰] + g(x⁰)'x = [f⁰ − g⁰'x⁰] + g⁰'x.
This produces the linear approximation,
    y ≈ α + β'x.
The second-order, or quadratic, approximation adds the second-order terms in the expansion,
    ½ Σ_{i=1}^n Σ_{j=1}^n f_ij⁰ (x_i − x_i⁰)(x_j − x_j⁰) = ½ (x − x⁰)'H⁰(x − x⁰),
to the preceding one. Collecting terms in the same manner as in (A-126), we have
    y ≈ α + β'x + ½ x'Γx,    (A-127)
where
    α = f⁰ − g⁰'x⁰ + ½ x⁰'H⁰x⁰,   β = g⁰ − H⁰x⁰,   and   Γ = H⁰.
A linear function can be written
    y = a'x = x'a = Σ_{i=1}^n a_i x_i,
so
    ∂(a'x)/∂x = a.    (A-128)
Note, in particular, that ∂(a'x)/∂x = a, not a'. In a set of linear functions
    y = Ax,
each element y_i of y is
    y_i = a_i'x,
where a_i' is the ith row of A [see (A-14)]. Therefore,
    ∂y_i/∂x = a_i = transpose of the ith row of A,
and
    [ ∂y₁/∂x' ]     [ a₁' ]
    [ ∂y₂/∂x' ]  =  [ a₂' ]
    [    ⋮    ]     [  ⋮  ]
    [ ∂y_n/∂x' ]    [ a_n' ].
Collecting all terms, we find that ∂Ax/∂x' = A, whereas the more familiar form will be
    ∂Ax/∂x = A'.    (A-129)
A quadratic form is written
    x'Ax = Σ_{i=1}^n Σ_{j=1}^n x_i x_j a_ij.    (A-130)
For example,
    A = [ 1  3 ]
        [ 3  4 ],
so that
    x'Ax = x₁² + 4x₂² + 6x₁x₂.
Then
    ∂(x'Ax)/∂x = [ 2x₁ + 6x₂ ]     [ 2  6 ] [ x₁ ]
                 [ 6x₁ + 8x₂ ]  =  [ 6  8 ] [ x₂ ]  =  2Ax,    (A-131)
which is the general result when A is a symmetric matrix. If A is not symmetric, then
    ∂(x'Ax)/∂x = (A + A')x.    (A-132)
Referring to the preceding double summation, we find that for each term, the coefficient on a_ij is x_i x_j. Therefore,
    ∂(x'Ax)/∂a_ij = x_i x_j.
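The derivative rule (A-131) can be checked numerically; the sketch below uses the symmetric example matrix above and compares a finite-difference gradient of x'Ax with 2Ax.

```python
import numpy as np

A = np.array([[1.0, 3.0],
              [3.0, 4.0]])                 # the symmetric example matrix above
x = np.array([0.7, -1.2])

def q(v):
    return v @ A @ v                       # the quadratic form x'Ax

# Finite-difference gradient versus the analytic result 2Ax, (A-131).
h = 1e-6
grad_fd = np.array([(q(x + h * e) - q(x - h * e)) / (2 * h) for e in np.eye(2)])
print(grad_fd)                             # approximately equal to ...
print(2 * A @ x)
```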
A.8.3 CONSTRAINED OPTIMIZATION
It is often necessary to solve an optimization problem subject to some constraints on the solution. One method is merely to "solve out" the constraints. For example, in the maximization problem considered earlier, suppose that the constraint x₁ = x₂ − x₃ is imposed on the solution. For a single constraint such as this one, it is possible merely to substitute the right-hand side of this equation for x₁ in the objective function and solve the resulting problem as a function of the remaining two variables. For more general constraints, however, or when there is more than one constraint, the method of Lagrange multipliers provides a more straightforward method of solving the problem. We
    maximize_x f(x)   subject to   c₁(x) = 0,
                                   c₂(x) = 0,
                                   ···
                                   c_J(x) = 0.    (A-140)
The Lagrangean approach to this problem is to find the stationary points, that is, the points at which the derivatives are zero, of
    L*(x, λ) = f(x) + Σ_{j=1}^J λ_j c_j(x) = f(x) + λ'c(x).    (A-141)
The solutions satisfy the equations
    ∂L*/∂x = ∂f(x)/∂x + ∂λ'c(x)/∂x = 0   (n × 1),
    ∂L*/∂λ = c(x) = 0   (J × 1).    (A-142)
The second term in ∂L*/∂x is
    ∂λ'c(x)/∂x = ∂c(x)'λ/∂x = [∂c(x)'/∂x] λ = C'λ,    (A-143)
where C is the matrix of derivatives of the constraints with respect to x. The jth row of the J × n matrix C is the vector of derivatives of the jth constraint, c_j(x), with respect to x'. Upon collecting terms, the first-order conditions are
    ∂L*/∂x = ∂f(x)/∂x + C'λ = 0,    (A-144)
    ∂L*/∂λ = c(x) = 0.
There is one very important aspect of the constrained solution to consider. In the unconstrained solution, we have ∂f(x)/∂x = 0. From (A-144), we obtain, for a constrained solution,
    ∂f(x)/∂x = −C'λ,    (A-145)
which will not equal 0 unless λ = 0. This result has two important implications:
• The constrained solution cannot be superior to the unconstrained solution. This is implied by the nonzero gradient at the constrained solution. (That is, unless C = 0, which could happen if the constraints were nonlinear. But, even if so, the solution is still no better than the unconstrained optimum.)
• If the Lagrange multipliers are zero, then the constrained solution will equal the unconstrained solution.
To continue the example begun earlier, suppose that we add the following conditions:
    x₁ − x₂ + x₃ = 0,
    x₁ + x₂ + x₃ = 0.
To put this in the format of the general problem, write the constraints as c(x) = Cx = 0, where
    C = [ 1  −1  1 ]
        [ 1   1  1 ].
The Lagrangean function is
    R*(x, λ) = a'x − x'Ax + λ'Cx.
Note the dimensions and arrangement of the various parts. In particular, C is a 2 × 3 matrix, with one row for each constraint and one column for each variable in the objective function. The vector of Lagrange multipliers thus has two elements, one for each constraint. The necessary conditions are
    a − 2Ax + C'λ = 0   (three equations),    (A-146)
and
    Cx = 0   (two equations).
These may be combined in the single equation
    [ −2A  C' ] [ x ]     [ −a ]
    [  C   0  ] [ λ ]  =  [  0 ].
Using the partitioned inverse of (A-74) produces the solutions
    λ = −[CA⁻¹C']⁻¹CA⁻¹a    (A-147)
and
    x = ½ A⁻¹[I − C'(CA⁻¹C')⁻¹CA⁻¹]a.    (A-148)
The two results, (A-147) and (A-148), yield analytic solutions for λ and x. For the specific matrices and vectors of the example, these are λ = [−0.5  −7.5]' and the constrained solution vector x* = [1.5  0  −1.5]'. Note that in computing the solution to this sort of problem, it is not necessary to use the rather cumbersome form of (A-148). Once λ is obtained from (A-147), the solution can be inserted in (A-146) for a much simpler computation. The solution
    x = ½ A⁻¹a + ½ A⁻¹C'λ
suggests a useful result for the constrained optimum:
    constrained solution = unconstrained solution + [2A]⁻¹C'λ.    (A-149)
Finally, by inserting the two solutions in the original function, we find that R = 24.375 and R* = 2.25, which illustrates again that the constrained solution (in this maximization problem) is inferior to the unconstrained solution.
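The sketch below solves an equality-constrained problem of this form numerically with NumPy. The constraint matrix C is the one given above, but the a and A of the original example appear on an earlier page not reproduced here, so the values used are assumed for illustration only; the point is that the stacked first-order conditions reproduce (A-147) and (A-148).

```python
import numpy as np

# Assumed objective max a'x - x'Ax (illustrative a and A; not the text's values). C is from the example above.
a = np.array([5.0, 4.0, 2.0])
A = np.array([[2.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 3.0]])
C = np.array([[1.0, -1.0, 1.0],
              [1.0,  1.0, 1.0]])

# Stack the first-order conditions (A-146) with the constraints Cx = 0 and solve.
K = np.block([[-2 * A, C.T],
              [C, np.zeros((2, 2))]])
rhs = np.concatenate([-a, np.zeros(2)])
sol = np.linalg.solve(K, rhs)
x_c, lam = sol[:3], sol[3:]

# Compare with the analytic expressions (A-147) and (A-148).
Ai = np.linalg.inv(A)
lam_check = -np.linalg.inv(C @ Ai @ C.T) @ C @ Ai @ a
x_check = 0.5 * Ai @ (np.eye(3) - C.T @ np.linalg.inv(C @ Ai @ C.T) @ C @ Ai) @ a
print(np.allclose(lam, lam_check), np.allclose(x_c, x_check))   # True True
```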
A.8.4 TRANSFORMATIONS
If a function is strictly monotonic, then it is a one-to-one function. Each y is associated with exactly one value of x, and vice versa. In this case, an inverse function exists, which expresses x as a function of y, written
    y = f(x)
and
    x = f⁻¹(y).
An example is the inverse relationship between the log and the exponential functions.
The slope of the inverse function,
    J = dx/dy = df⁻¹(y)/dy = f⁻¹′(y),
is the Jacobian of the transformation from y to x. For example, if
    y = a + bx,
then
    x = −(a/b) + (1/b)y
is the inverse transformation and
    J = dx/dy = 1/b.
Looking ahead to the statistical application of this concept, we observe that if y = f(x) were vertical, then this would no longer be a functional relationship. The same x would be associated with more than one value of y. In this case, at this value of x, we would find that J = 0, indicating a singularity in the function.
If y is a column vector of functions, y = f(x), then
    J = ∂x/∂y' = [ ∂x₁/∂y₁    ∂x₁/∂y₂    ···   ∂x₁/∂y_n ]
                 [ ∂x₂/∂y₁    ∂x₂/∂y₂    ···   ∂x₂/∂y_n ]
                 [     ⋮           ⋮                 ⋮    ]
                 [ ∂x_n/∂y₁   ∂x_n/∂y₂   ···   ∂x_n/∂y_n ].
Consider the set of linear functions y = Ax = f(x). The inverse transformation is x = f⁻¹(y), which will be
    x = A⁻¹y,
if A is nonsingular. If A is singular, then there is no inverse transformation. Let J be the matrix of partial derivatives of the inverse functions:
    J = [ ∂x_i/∂y_j ].
The absolute value of the determinant of J,
    abs(|J|) = abs( det( [∂x/∂y'] ) ),
is the Jacobian determinant of the transformation from y to x. In the nonsingular case,
    abs(|J|) = abs(|A⁻¹|) = 1 / abs(|A|).
B.3 EXPECTATIONS OF A RANDOM VARIABLE
DEFINITION B.1 Mean of a Random Variable
The mean, or expected value, of a random variable is
    E[x] = Σ_x x f(x)         if x is discrete,
    E[x] = ∫_x x f(x) dx      if x is continuous.    (B-11)

The notation Σ_x or ∫_x, used henceforth, means the sum or integral over the entire range of values of x. The mean is usually denoted μ. It is a weighted average of the values taken by x, where the weights are the respective probabilities. It is not necessarily a value actually taken by the random variable. For example, the expected number of heads in one toss of a fair coin is ½.
Other measures of central tendency are the median, which is the value m such that Prob(X ≤ m) ≥ ½ and Prob(X ≥ m) ≥ ½, and the mode, which is the value of x at which f(x) takes its maximum. The first of these measures is more frequently used than the second. Loosely speaking, the median corresponds more closely than the mean to the middle of a distribution. It is unaffected by extreme values. In the discrete case, the modal value of x has the highest probability of occurring.
Let g(x) be a function of x. The function that gives the expected value of g(x) is denoted
    E[g(x)] = Σ_x g(x) Prob(X = x)    if X is discrete,
    E[g(x)] = ∫_x g(x) f(x) dx        if X is continuous.    (B-12)
If g(x) = a + bx for constants a and b, then
    E[a + bx] = a + bE[x].
An important case is the expected value of a constant a, which is just a.

DEFINITION B.2 Variance of a Random Variable
The variance of a random variable is
    Var[x] = E[(x − μ)²]
           = Σ_x (x − μ)² f(x)        if x is discrete,
           = ∫_x (x − μ)² f(x) dx     if x is continuous.    (B-13)

Var[x], which must be positive, is usually denoted σ². This function is a measure of the dispersion of a distribution. Computation of the variance is simplified by using the following important result:
    Var[x] = E[x²] − μ².    (B-14)
A convenient corollary to (B-14) is
    E[x²] = σ² + μ².    (B-15)
By inserting y = a + bx in (B-13) and expanding, we find that
    Var[a + bx] = b² Var[x],    (B-16)
which implies, for any constant a, that
    Var[a] = 0.    (B-17)
To describe a distribution, we usually use σ, the positive square root, which is the standard deviation of x. The standard deviation can be interpreted as having the same units of measurement as x and μ. For any random variable x and any positive constant k, the Chebychev inequality states that
    Prob(μ − kσ ≤ x ≤ μ + kσ) ≥ 1 − 1/k².    (B-18)
Two other measures often used to describe a probability distribution are
    skewness = E[(x − μ)³]
and
    kurtosis = E[(x − μ)⁴].
Skewness is a measure of the asymmetry of a distribution. For symmetric distributions,
    f(μ − x) = f(μ + x),
and
    skewness = 0.
For asymmetric distributions, the skewness will be positive if the "long tail" is in the positive direction. Kurtosis is a measure of the thickness of the tails of the distribution. A shorthand expression for other central moments is
    μ_r = E[(x − μ)^r].
Since μ_r tends to explode as r grows, the normalized measure, μ_r / σ^r, is often used for description. Two common measures are
    skewness coefficient = μ₃ / σ³
and
    degree of excess = μ₄ / σ⁴ − 3.
The second is based on the normal distribution, which has excess of zero.
For any two functions g₁(x) and g₂(x),
    E[g₁(x) + g₂(x)] = E[g₁(x)] + E[g₂(x)].    (B-19)
For the general case of a possibly nonlinear g(x),
    E[g(x)] = ∫_x g(x) f(x) dx    (B-20)
and
    Var[g(x)] = ∫_x (g(x) − E[g(x)])² f(x) dx.    (B-21)
(For convenience, we shall omit the equivalent definitions for discrete variables in the following discussion and use the integral to mean either integration or summation, whichever is appropriate.)
A device used to approximate E[g(x)] and Var[g(x)] is the linear Taylor series approximation:
    g(x) ≈ g*(x) = [g(x⁰) − g'(x⁰)x⁰] + g'(x⁰)x = β₁ + β₂x.    (B-22)
If the approximation is reasonably accurate, then the mean and variance of g*(x) will be approximately equal to the mean and variance of g(x). A natural choice for the expansion point is x⁰ = μ = E[x]. Inserting this value in (B-22) gives
    g(x) ≈ g*(x) = [g(μ) − g'(μ)μ] + g'(μ)x,    (B-23)
so that
    E[g(x)] ≈ g(μ)    (B-24)
and
    Var[g(x)] ≈ [g'(μ)]² Var[x].    (B-25)
A point to note in view of (B-22) to (B-24) is that E[g(x)] will generally not equal g(E[x]). For the special case in which g(x) is concave, that is, where g''(x) < 0, we know from Jensen's inequality that E[g(x)] ≤ g(E[x]). For example, E[log(x)] ≤ log(E[x]).
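A small simulation (NumPy, illustrative values only) of the approximations (B-24) and (B-25), using g(x) = ln x, which also shows the Jensen inequality noted above:

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma = 5.0, 0.3
x = rng.normal(mu, sigma, size=1_000_000)    # a variable fairly concentrated around its mean

g = np.log(x)                                # g(x) = ln x is concave, with g'(x) = 1/x
print(g.mean(), np.log(mu))                  # E[g(x)] is close to, but below, g(E[x]) (Jensen)
print(g.var(), (1.0 / mu) ** 2 * sigma**2)   # Var[g(x)] is close to [g'(mu)]^2 Var[x], (B-25)
```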
B.4 SOME SPECIFIC PROBABILITY
DISTRIBUTIONS
Certain experimental situations naturally give rise to specific probability distributions. In the majority of cases in economics, however, the distributions used are merely models of the observed phenomena. Although the normal distribution, which we shall discuss at length, is the mainstay of econometric research, economists have used a wide variety of other distributions. A few are discussed here.¹
B.4.1 THE NORMAL DISTRIBUTION
The general form of the normal distribution with mean μ and standard deviation σ is
    f(x | μ, σ²) = [1 / (σ√(2π))] e^{−(x − μ)²/(2σ²)}.    (B-26)
This result is usually denoted x ~ N[μ, σ²]. The standard notation x ~ f(x) is used to state that "x has probability distribution f(x)." Among the most useful properties of the normal distribution
¹ A much more complete listing appears in Maddala (1977, Chaps. 3 and 18) and in most mathematical statistics textbooks. See also Poirier (1995) and Stuart and Ord (1989). Another useful reference is Evans, Hastings, and Peacock (1993). Johnson et al. (1974, 1993, 1994, 1995, 1997) is an encyclopedic reference on the subject of statistical distributions.
[Figure: chi-squared[2] density.]
• If z is an N[0, 1] variable and x is χ²[n] and is independent of z, then the ratio
    t[n] = z / √(x/n)    (B-36)
has the t distribution with n degrees of freedom.
The t distribution has the same shape as the normal distribution but has thicker tails. Figure B.3 illustrates the t distributions with 3 and 10 degrees of freedom with the standard normal distribution. Two effects that can be seen in the figure are how the distribution changes as the degrees of freedom increases, and, overall, the similarity of the t distribution to the standard normal. This distribution is tabulated in the same manner as the chi-squared distribution, with several specific cutoff points corresponding to specified tail areas for various values of the degrees of freedom parameter.
Comparing (B-35) with m = 1 and (B-36), we see the useful relationship between the t and F distributions:
• If t ~ t[n], then t² ~ F[1, n].
If the numerator in (B-36) has a nonzero mean, then the random variable in (B-36) has a noncentral t distribution and its square has a noncentral F distribution. These distributions arise in the F tests of linear restrictions [see (6-6)] when the restrictions do not hold as follows:
1. Noncentral chi-squared distribution. If z has a normal distribution with mean μ and standard deviation 1, then the distribution of z² is noncentral chi-squared with parameters 1 and μ²/2.
   a. If z ~ N[μ, Σ] with J elements, then z'Σ⁻¹z has a noncentral chi-squared distribution with J degrees of freedom and noncentrality parameter μ'Σ⁻¹μ/2, which we denote χ²*[J, μ'Σ⁻¹μ/2].
   b. If z ~ N[μ, I] and M is an idempotent matrix with rank J, then z'Mz ~ χ²*[J, μ'Mμ/2].
[Figure B.3: Normal[0,1], t[3], and t[10] densities.]
2. Noncentral F distribution. If X₁ has a noncentral chi-squared distribution with noncentrality parameter λ and degrees of freedom n₁ and X₂ has a central chi-squared distribution with degrees of freedom n₂ and is independent of X₁, then
    F* = (X₁/n₁) / (X₂/n₂)
has a noncentral F distribution with parameters n₁, n₂, and λ.² Note that in each of these cases, the statistic and the distribution are the familiar ones, except that the effect of the nonzero mean, which induces the noncentrality, is to push the distribution to the right.
B.4.3 DISTRIBUTIONS WITH LARGE DEGREES OF FREEDOM
The chi-squared, t, and F distributions usually arise in connection with sums of sample observations. The degrees of freedom parameter in each case grows with the number of observations. We often deal with larger degrees of freedom than are shown in the tables. Thus, the standard tables are often inadequate. In all cases, however, there are limiting distributions that we can use when the degrees of freedom parameter grows large. The simplest case is the t distribution. The t distribution with infinite degrees of freedom is equivalent to the standard normal distribution. Beyond about 100 degrees of freedom, they are almost indistinguishable.
For degrees of freedom greater than 30, a reasonably good approximation for the distribution of the chi-squared variable x is
    z = (2x)^{1/2} − (2n − 1)^{1/2},    (B-37)
² The denominator chi-squared could also be noncentral, but we shall not use any statistics with doubly noncentral distributions.
which is approximately standard normally distributed. Thus,
    Prob[χ²[n] ≤ a] ≈ Φ[(2a)^{1/2} − (2n − 1)^{1/2}].
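A quick simulation check of (B-37) with NumPy (degrees of freedom and cutoff chosen arbitrarily):

```python
import numpy as np
from math import erf

rng = np.random.default_rng(7)
n = 40                                        # degrees of freedom, greater than 30
x = rng.chisquare(n, size=1_000_000)

z = np.sqrt(2 * x) - np.sqrt(2 * n - 1)       # the transformation in (B-37)
print(z.mean(), z.std())                      # close to 0 and 1

a = 45.0                                      # an arbitrary cutoff
normal_cdf = 0.5 * (1 + erf((np.sqrt(2 * a) - np.sqrt(2 * n - 1)) / np.sqrt(2)))
print((x <= a).mean(), normal_cdf)            # simulated probability vs. normal approximation
```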
As used in econometrics, the F distribution with a large denominator degrees of freedom is common. As n₂ becomes infinite, the denominator of F converges identically to one, so we can treat the variable
    x = mF    (B-38)
as a chi-squared variable with m degrees of freedom. Since the numerator degrees of freedom will typically be small, this approximation will suffice for the types of applications we are likely to encounter.³ If not, then the approximation given earlier for the chi-squared distribution can be applied to mF.
B.4.4 SIZE DISTRIBUTIONS: THE LOGNORMAL DISTRIBUTION
In modeling size distributions, such as the distribution of firm sizes in an industry or the distribution of income in a country, the lognormal distribution, denoted LN[μ, σ²], has been particularly useful.⁴
f(x) = [1/(xσ√(2π))] e^(−(1/2)[(ln x − μ)/σ]²),  x > 0.
A lognormal variable x has
E[x] = e^(μ + σ²/2)
and
Var[x] = e^(2μ + σ²)(e^(σ²) − 1).
The relation between the normal and lognormal distributions is
if y ~ LN[μ, σ²], then ln y ~ N[μ, σ²].
A useful result for transformations is given as follows:
If x has a lognormal distribution with mean θ and variance λ², then
ln x ~ N(μ, σ²), where μ = ln θ² − (1/2) ln(θ² + λ²) and σ² = ln(1 + λ²/θ²).
Since the normal distribution is preserved under linear transformation,
if y ~ LN[μ, σ²], then ln(yʳ) ~ N[rμ, r²σ²].
If y₁ and y₂ are independent lognormal variables with y₁ ~ LN[μ₁, σ₁²] and y₂ ~ LN[μ₂, σ₂²], then
y₁y₂ ~ LN[μ₁ + μ₂, σ₁² + σ₂²].
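The moment formulas above are easy to confirm numerically. The sketch below is illustrative only; the parameter values μ = 0.5 and σ = 0.8 are assumptions, and SciPy's lognormal uses s = σ and scale = e^μ.

import numpy as np
from scipy import stats

mu, sigma = 0.5, 0.8
rng = np.random.default_rng(1)
x = np.exp(rng.normal(mu, sigma, size=200_000))    # x = e^y with y ~ N[mu, sigma^2]

mean_formula = np.exp(mu + 0.5 * sigma**2)
var_formula = np.exp(2 * mu + sigma**2) * (np.exp(sigma**2) - 1)

print("E[x]  :", x.mean(), " formula:", mean_formula,
      " scipy:", stats.lognorm(s=sigma, scale=np.exp(mu)).mean())
print("Var[x]:", x.var(), " formula:", var_formula)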
³See Johnson and Kotz (1994) for other approximations.
⁴A study of applications of the lognormal distribution appears in Aitchison and Brown (1969).
[Figure: income distribution; upper panel shows the density, lower panel the relative frequencies of the reported categories.]
The simplest case is the first one. The probabilities associated with the new variable are computed according to the laws of probability. If y is derived from x and the function is one to one, then the probability that Y = y(x) equals the probability that X = x. If several values of x yield the same value of y, then Prob(Y = y) is the sum of the corresponding probabilities for x.
The second type of transformation is illustrated by the way individual data on income are typically obtained in a survey. Income in the population can be expected to be distributed according to some skewed, continuous distribution such as the one shown in the figure above.
Data are often reported categorically, as shown in the lower part of the figure. Thus, the random variable corresponding to observed income is a discrete transformation of the actual underlying continuous random variable. Suppose, for example, that the transformed variable y is the mean income in the respective interval. Then
Prob(Y = μ₁) = P(−∞ < X ≤ a),
Prob(Y = μ₂) = P(a < X ≤ b),
Prob(Y = μ₃) = P(b < X ≤ c),
and so on, which illustrates the general procedure.
If x is a continuous random variable with pdf f_x(x) and if y = g(x) is a continuous monotonic function of x, then the density of y is obtained by using the change of variable technique to find
the cdf of y:
Prob(y ≤ b) = ∫_−∞^b f_x(g⁻¹(y)) |dg⁻¹(y)/dy| dy.
This equation can now be written as
Prob(y ≤ b) = ∫_−∞^b f_y(y) dy.
Hence,
f_y(y) = f_x(g⁻¹(y)) |dg⁻¹(y)/dy|. (B-41)
To avoid the possibility of a negative pdf if g(x) is decreasing, we use the absolute value of the derivative in the previous expression. The term |dg⁻¹(y)/dy| must be nonzero for the density of y to be nonzero. In words, the probabilities associated with intervals in the range of y must be associated with intervals in the range of x. If the derivative is zero, the correspondence y = g(x) is vertical, and hence all values of y in the given range are associated with the same value of x. This single point must have probability zero.
One of the most useful applications of the preceding result is the linear transformation of a normally distributed variable. If x ~ N[μ, σ²], then the distribution of
y = (x − μ)/σ
is found using the result above. First, the derivative is obtained from the inverse transformation
x = σy + μ, so dx/dy = σ.
Therefore,
f_y(y) = [1/(σ√(2π))] e^(−(σy + μ − μ)²/(2σ²)) |σ| = [1/√(2π)] e^(−y²/2).
This is the density of a normally distributed variable with mean zero and standard deviation one, the standard normal. It is this result that makes it unnecessary to have separate tables for the different normal distributions that result from different means and variances.
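The change-of-variable formula (B-41) is easy to check numerically with any monotonic transformation. The sketch below is illustrative only: x is standard normal and y = e^x, so g⁻¹(y) = ln y and |dg⁻¹(y)/dy| = 1/y, and the result should reproduce the standard lognormal density.

import numpy as np
from scipy import stats

y = np.linspace(0.1, 5.0, 6)

# f_y(y) = f_x(g^{-1}(y)) * |d g^{-1}(y)/dy| with g(x) = exp(x)
fy_change_of_var = stats.norm.pdf(np.log(y)) * (1.0 / y)
fy_lognormal = stats.lognorm.pdf(y, s=1.0)        # standard lognormal (mu = 0, sigma = 1)

print(np.allclose(fy_change_of_var, fy_lognormal))   # True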
B.6 REPRESENTATIONS OF A PROBABILITY
DISTRIBUTION
The probability density function (pdf) is a natural and familiar way to formulate the distribution of a random variable. But there are many other functions that are used to identify or characterize a random variable, depending on the setting. In each of these cases, we can identify some other function of the random variable that has a one-to-one relationship with the density. We have already used one of these quite heavily in the preceding discussion. For a random variable which has density function f(x), the distribution function, or cdf, F(x), is an equally informative function that identifies the distribution; the relationship between f(x) and F(x) is defined in (B-6) for a discrete random variable and (B-8) for a continuous one. We now consider several other related functions.
For a continuous random variable, the survival function is S(x) = 1 − F(x) = Prob[X ≥ x]. This function is widely used in epidemiology, where x is time until some transition, such as recovery
from a disease. The hazard function for a random variable is
h(x) = f(x)/S(x) = f(x)/[1 − F(x)].
The hazard function is a conditional probability:
h(x) = lim (δ → 0) Prob(x ≤ X ≤ x + δ | X ≥ x)/δ.
Hazards have been used in econometrics in studying the duration of spells, or conditions, such as unemployment, strikes, time until business failure, and so on. The connection between the hazard and the other functions is h(x) = −d ln S(x)/dx. As an exercise, you might want to verify the interesting special case of h(x) = 1/λ, a constant; the only distribution that has this characteristic is the exponential distribution noted in Section B.4.5.
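The exponential special case is easy to verify numerically. The sketch below is illustrative only; the rate value 2.0 is assumed. It computes h(x) = f(x)/S(x) on a grid and shows that the ratio is the same constant everywhere.

import numpy as np
from scipy import stats

theta = 2.0                                # assumed exponential rate; mean = 1/theta
dist = stats.expon(scale=1.0 / theta)
x = np.linspace(0.1, 4.0, 8)

hazard = dist.pdf(x) / dist.sf(x)          # f(x)/S(x), with S(x) = 1 - F(x)
print(hazard)                              # constant, equal to theta at every x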
For the random variable X, with probability density function f(x), if the function
M(t) = E[e^(tx)]
exists, then it is the moment-generating function. Assuming the function exists, it can be shown that
dʳM(t)/dtʳ evaluated at t = 0 equals E[xʳ].
The moment-generating function, like the survival and the hazard functions, is a unique characterization of a probability distribution. When it exists, the moment-generating function has a one-to-one correspondence with the distribution. Thus, for example, if we begin with some random variable and find that a transformation of it has a particular MGF, then we may infer that the function of the random variable has the distribution associated with that MGF. A convenient application of this result is the MGF for the normal distribution. The MGF for the standard normal distribution is M_z(t) = e^(t²/2).
A useful feature of MGFs is the following:
If x and y are independent, then the MGF of x + y is M_x(t)M_y(t).
This result has been used to establish the contagion property of some distributions, that is, the property that sums of random variables with a given distribution have that same distribution. The normal distribution is a familiar example. This is usually not the case; it is true, however, for Poisson and chi-squared random variables.
One qualification of all of the preceding is that in order for these results to hold, the MGF must exist. It will for the distributions that we will encounter in our work, but in at least one important case, we cannot be sure of this. When computing sums of random variables which may have different distributions and whose specific distributions need not be so well behaved, it is likely that the MGF of the sum does not exist. However, the characteristic function,
φ(t) = E[e^(itx)],
will always exist, at least for relatively small t. The characteristic function is the device used to prove that certain sums of random variables converge to a normally distributed variable; that is, the characteristic function is a fundamental tool in proofs of the central limit theorem.
The sign of the covariance will indicate the direction of covariation of X and Y. Its magnitude depends on the scales of measurement, however. In view of this fact, a preferable measure is the correlation coefficient:
ρ[x, y] = σ_xy/(σ_x σ_y), (B-53)
where σ_x and σ_y are the standard deviations of x and y, respectively. The correlation coefficient has the same sign as the covariance but is always between −1 and 1 and is thus unaffected by any scaling of the variables.
Variables that are uncorrelated are not necessarily independent. For example, in the discrete distribution f(−1, 1) = f(0, 0) = f(1, 1) = 1/3, the correlation is zero, but f(1, 1) does not equal f_x(1)f_y(1) = (1/3)(2/3). An important exception is the joint normal distribution discussed subsequently, in which lack of correlation does imply independence.
Some general results regarding expectations in a joint distribution, which can be verified by applying the appropriate definitions, are
E[ax + by + c] = aE[x] + bE[y] + c, (B-54)
Var[ax + by + c] = a²Var[x] + b²Var[y] + 2ab Cov[x, y]
= Var[ax + by], (B-55)
and
Cov[ax + by, cx + dy] = ac Var[x] + bd Var[y] + (ad + bc)Cov[x, y]. (B-56)
If X and Y are uncorrelated, then
Var[x + y] = Var[x − y] = Var[x] + Var[y]. (B-57)
For any two functions g₁(x) and g₂(y), if x and y are independent, then
E[g₁(x)g₂(y)] = E[g₁(x)]E[g₂(y)]. (B-58)
B.7.4 DISTRIBUTION OF A FUNCTION OF BIVARIATE
RANDOM VARIABLES
The result for a function of a random variable in (B-41) must be modified for a joint distribution. Suppose that x₁ and x₂ have a joint distribution f(x₁, x₂) and that y₁ and y₂ are two monotonic functions of x₁ and x₂:
y₁ = y₁(x₁, x₂),
y₂ = y₂(x₁, x₂).
Since the functions are monotonic, the inverse transformations,
x₁ = x₁(y₁, y₂),
x₂ = x₂(y₁, y₂),
exist. The Jacobian of the transformations is the matrix of partial derivatives,
J = [∂x₁/∂y₁  ∂x₁/∂y₂ ; ∂x₂/∂y₁  ∂x₂/∂y₂] = [∂x/∂y'].
The joint distribution of y₁ and y₂ is
f_y(y₁, y₂) = f_x[x₁(y₁, y₂), x₂(y₁, y₂)] abs(|J|).
The determinant of the Jacobian must be nonzero for the transformation to exist. A zero determinant implies that the two transformations are functionally dependent.
Certainly the most common application of the preceding in econometrics is the linear transformation of a set of random variables. Suppose that x₁ and x₂ are independently distributed N[0, 1], and the transformations are
y₁ = α₁ + β₁₁x₁ + β₁₂x₂,
y₂ = α₂ + β₂₁x₁ + β₂₂x₂.
To obtain the joint distribution of y₁ and y₂, we first write the transformations as
y = a + Bx.
The inverse transformation is
x = B⁻¹(y − a),
so the absolute value of the determinant of the Jacobian is
abs|J| = abs|B⁻¹| = 1/abs|B|.
The joint distribution of x is the product of the marginal distributions since they are independent. Thus,
f_x(x₁, x₂) = (2π)⁻¹ e^(−(x₁² + x₂²)/2) = (2π)⁻¹ e^(−x'x/2).
Inserting the results for x(y) and J into f_x(x₁, x₂) gives
f_y(y₁, y₂) = (2π)⁻¹ [1/abs|B|] e^(−(y − a)'(BB')⁻¹(y − a)/2).
This bivariate normal distribution is the subject of Section B.9. Note that by formulating it as we did above, we can generalize easily to the multivariate case, that is, with an arbitrary number of variables.
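A numerical illustration of this linear transformation, with assumed values of a and B (they are not from the text): since x₁ and x₂ are independent N[0, 1], the transformed vector y = a + Bx should have mean a and covariance matrix BB'.

import numpy as np

rng = np.random.default_rng(2)
a = np.array([1.0, -0.5])                    # assumed constants
B = np.array([[2.0, 0.5],
              [0.3, 1.5]])

x = rng.standard_normal((500_000, 2))        # independent N[0, 1] pairs
y = a + x @ B.T                              # y = a + Bx, row by row

print("sample mean      :", y.mean(axis=0))  # close to a
print("sample covariance:\n", np.cov(y.T))   # close to B B'
print("B B' =\n", B @ B.T)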
Perhaps the more common situation is that in which it is necessary to find the distribution of one function of two (or more) random variables. A strategy that often works in this case is to form the joint distribution of the transformed variable and one of the original variables, then integrate (or sum) the latter out of the joint distribution to obtain the marginal distribution. Thus, to find the distribution of y₁(x₁, x₂), we might formulate
y₁ = y₁(x₁, x₂),
y₂ = x₂.
The absolute value of the determinant of the Jacobian would then be
abs|J| = abs| det [∂x₁/∂y₁  ∂x₁/∂y₂ ; 0  1] | = abs(∂x₁/∂y₁).
The density of y₁ would then be
f_y₁(y₁) = ∫_−∞^∞ f_x[x₁(y₁, y₂), y₂] abs|J| dy₂.
B.8 CONDITIONING IN A BIVARIATE DISTRIBUTION
Conditioning and the use of conditional distributions play a pivotal role in econometric modeling. We consider some general results for a bivariate distribution. (All these results can be extended directly to the multivariate case.)
In a bivariate distribution, there is a conditional distribution over y for each value of x. The conditional densities are
f(y | x) = f(x, y)/f_x(x), (B-59)
and
f(x | y) = f(x, y)/f_y(y).
It follows from (B-46) that:
If x and y are independent, then f(y | x) = f_y(y) and f(x | y) = f_x(x). (B-60)
The interpretation is that if the variables are independent, the probabilities of events relating to one variable are unrelated to the other. The definition of conditional densities implies the important result
f(x, y) = f(y | x) f_x(x) = f(x | y) f_y(y). (B-61)
B.8.1 REGRESSION: THE CONDITIONAL MEAN
A conditional mean is the mean of the conditional distribution and is defined by
E[y | x] = ∫_y y f(y | x) dy if y is continuous, and E[y | x] = Σ_y y f(y | x) if y is discrete. (B-62)
The conditional mean function E[y | x] is called the regression of y on x.
A random variable may always be written as
y = E[y | x] + (y − E[y | x])
= E[y | x] + ε.
THEOREM B.6 Linear Regression and Homoscedasticity
In a bivariate distribution, if E[y | x] = α + βx and if Var[y | x] is a constant, then
Var[y | x] = Var[y](1 − Corr²[y, x]) = σ_y²(1 − ρ²_xy). (B-71)
The proof is straightforward using Theorems B.2 to B.4.
B.8.4 THE ANALYSIS OF VARIANCE
The variance decomposition result implies that in a bivariate distribution, variation in y arises from two sources:
1. Variation because E[y | x] varies with x:
regression variance = Var_x[E[y | x]]. (B-72)
2. Variation because, in each conditional distribution, y varies around the conditional mean:
residual variance = E_x[Var[y | x]]. (B-73)
Thus,
Var[y] = regression variance + residual variance. (B-74)
In analyzing a regression, we shall usually be interested in which of the two parts of the total variance, Var[y], is the larger one. A natural measure is the ratio
coefficient of determination = regression variance / total variance. (B-75)
In the setting of a linear regression, (B-75) arises from another relationship that emphasizes the interpretation of the correlation coefficient.
If E[y | x] = α + βx, then the coefficient of determination = COD = ρ², (B-76)
where ρ² is the squared correlation between x and y. We conclude that the correlation coefficient (squared) is a measure of the proportion of the variance of y accounted for by variation in the mean of y given x. It is in this sense that correlation can be interpreted as a measure of linear association between two variables.
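A simulation sketch of the decomposition (illustrative only; the values of α, β, and the residual standard deviation are assumptions): regression variance plus residual variance should reproduce Var[y], and the coefficient of determination should equal ρ².

import numpy as np

rng = np.random.default_rng(3)
n = 200_000
alpha, beta, sigma_eps = 1.0, 0.8, 0.6       # assumed data-generating values

x = rng.standard_normal(n)
y = alpha + beta * x + sigma_eps * rng.standard_normal(n)   # E[y|x] = alpha + beta*x

regression_variance = np.var(alpha + beta * x)   # Var_x[ E[y|x] ]
residual_variance = sigma_eps**2                 # E_x[ Var[y|x] ], constant here
total = np.var(y)

print("regression + residual:", regression_variance + residual_variance)
print("Var[y]               :", total)
print("COD (ratio)          :", regression_variance / total)
print("rho^2                :", np.corrcoef(x, y)[0, 1] ** 2)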
B.9 THE BIVARIATE NORMAL DISTRIBUTION
A bivariate distribution that embodies many of the features described earlier is the bivariate normal, which is the joint distribution of two normally distributed variables. The density is
f(x, y) = [1/(2πσ_x σ_y √(1 − ρ²))] e^(−(1/2)[(ε_x² + ε_y² − 2ρε_x ε_y)/(1 − ρ²)]),
where ε_x = (x − μ_x)/σ_x and ε_y = (y − μ_y)/σ_y. (B-77)
The parameters μ_x, σ_x, μ_y, and σ_y are the means and standard deviations of the marginal distributions of x and y, respectively. The additional parameter ρ is the correlation between x and y. The covariance is
σ_xy = ρσ_x σ_y. (B-78)
The density is defined only if ρ is not 1 or −1, which in turn requires that the two variables not be linearly related. If x and y have a bivariate normal distribution, denoted
(x, y) ~ N₂[μ_x, μ_y, σ_x², σ_y², ρ],
then
• The marginal distributions are normal:
f_x(x) = N[μ_x, σ_x²],
f_y(y) = N[μ_y, σ_y²]. (B-79)
• The conditional distributions are normal:
f(y | x) = N[α + βx, σ_y²(1 − ρ²)], where α = μ_y − βμ_x and β = σ_xy/σ_x², (B-80)
and likewise for f(x | y).
• x and y are independent if and only if ρ = 0. The density factors into the product of the two marginal normal distributions if ρ = 0.
Two things to note about the conditional distributions beyond their normality are their linear regression functions and their constant conditional variances. The conditional variance is less than the unconditional variance, which is consistent with the results of the previous section.
B.10 MULTIVARIATE DISTRIBUTIONS
The extension of the results for bivariate distributions to more than two variables is direct. It is made much more convenient by using matrices and vectors. The term random vector applies to a vector whose elements are random variables. The joint density is f(x), whereas the cdf is
F(x) = ∫_−∞^(x_n) ··· ∫_−∞^(x_1) f(t) dt_1 ··· dt_n. (B-81)
Note that the cdf is an n-fold integral. The marginal distribution of any one (or more) of the n variables is obtained by integrating or summing over the other variables.
B.10.1 MOMENTS
The expected value of a vector or matrix is the vector or matrix of expected values. A mean vector is defined as
μ = [μ₁, μ₂, ..., μ_n]' = [E[x₁], E[x₂], ..., E[x_n]]' = E[x]. (B-82)
Define the matrix
(x − μ)(x − μ)' = the n × n matrix whose (i, j) element is (x_i − μ_i)(x_j − μ_j).
The expected value of each element in the matrix is the covariance of the two variables in the product. (The covariance of a variable with itself is its variance.) Thus,
E[(x − μ)(x − μ)'] =
[ σ₁₁  σ₁₂  ···  σ₁ₙ
  σ₂₁  σ₂₂  ···  σ₂ₙ
  ···
  σₙ₁  σₙ₂  ···  σₙₙ ] = E[xx'] − μμ', (B-83)
which is the covariance matrix of the random vector x. Henceforth, we shall denote the covariance matrix of a random vector in boldface, as in
Var[x] = Σ.
By dividing σ_ij by σ_i σ_j, we obtain the correlation matrix:
R =
[ 1    ρ₁₂  ρ₁₃  ···  ρ₁ₙ
  ρ₂₁  1    ρ₂₃  ···  ρ₂ₙ
  ···
  ρₙ₁  ρₙ₂  ρₙ₃  ···  1 ].
B.10.2 SETS OF LINEAR FUNCTIONS
Our earlier results for the mean and variance of a linear function can be extended to the multivariate case. For the mean,
E[a₁x₁ + a₂x₂ + ··· + a_n x_n] = E[a'x]
= a₁E[x₁] + a₂E[x₂] + ··· + a_n E[x_n]
= a₁μ₁ + a₂μ₂ + ··· + a_n μ_n
= a'μ. (B-84)
For the variance,
Var[a'x] = E[(a'x − E[a'x])²]
= E[{a'(x − μ)}²]
= E[a'(x − μ)(x − μ)'a],
as E[a'x] = a'μ and a'x − a'μ = a'(x − μ). Since a is a vector of constants,
Var[a'x] = a'E[(x − μ)(x − μ)']a = a'Σa = Σ_i Σ_j a_i a_j σ_ij. (B-85)
THEOREM B.7 (Continued)
and
x₂ ~ N(μ₂, Σ₂₂). (B-101)
The conditional distribution of x₁ given x₂ is normal as well:
x₁ | x₂ ~ N(μ₁.₂, Σ₁₁.₂), (B-102)
where
μ₁.₂ = μ₁ + Σ₁₂Σ₂₂⁻¹(x₂ − μ₂), (B-102a)
Σ₁₁.₂ = Σ₁₁ − Σ₁₂Σ₂₂⁻¹Σ₂₁. (B-102b)
Proof: We partition μ and Σ as shown above and insert the parts in (B-95). To construct the density, we use (A-72) to partition the determinant,
|Σ| = |Σ₂₂| |Σ₁₁ − Σ₁₂Σ₂₂⁻¹Σ₂₁|,
and (A-74) to partition the inverse,
[Σ₁₁  Σ₁₂ ; Σ₂₁  Σ₂₂]⁻¹ = [ Σ₁₁.₂⁻¹   −Σ₁₁.₂⁻¹B ; −B'Σ₁₁.₂⁻¹   Σ₂₂⁻¹ + B'Σ₁₁.₂⁻¹B ],
where, for simplicity, we let
B = Σ₁₂Σ₂₂⁻¹.
Inserting these in (B-95) and collecting terms produces the joint density as a product of two terms:
f(x₁, x₂) = f₁.₂(x₁ | x₂) f₂(x₂).
The first of these is a normal distribution with mean μ₁.₂ and variance Σ₁₁.₂, whereas the second is the marginal distribution of x₂.
The conditional mean vector in the multivariate normal distribution is a linear function of the unconditional mean and the conditioning variables, and the conditional covariance matrix is constant and is smaller (in the sense discussed in Section A.7.3) than the unconditional covariance matrix. Notice that the conditional covariance matrix is the inverse of the upper left block of Σ⁻¹; that is, this matrix is of the form shown in (A-74) for the partitioned inverse of a matrix.
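The conditional-mean and conditional-variance formulas (B-102a) and (B-102b) translate directly into matrix code. The sketch below is illustrative only; the 3-variable mean vector, covariance matrix, and conditioning value are assumptions, with x₁ taken as the first variable and x₂ as the remaining two.

import numpy as np

mu = np.array([1.0, 0.0, 2.0])                     # assumed mean vector
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])                # assumed covariance matrix

mu1, mu2 = mu[:1], mu[1:]
S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]

x2 = np.array([0.5, 1.0])                          # assumed conditioning value

cond_mean = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)    # (B-102a)
cond_var = S11 - S12 @ np.linalg.solve(S22, S21)          # (B-102b)

print("E[x1 | x2]   =", cond_mean)
print("Var[x1 | x2] =", cond_var)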
B.11.2 THE CLASSICAL NORMAL LINEAR REGRESSION MODEL
An important special case of the preceding is that in which x₁ is a single variable, y, and x₂ is K variables, x. Then the conditional distribution is a multivariate version of that in (B-80) with β = Σ_xx⁻¹σ_xy, where σ_xy is the vector of covariances of y with x. Recall that any random variable, y, can be written as its mean plus the deviation from the mean. If we apply this tautology to the multivariate normal, we obtain
y = E[y | x] + (y − E[y | x]) = α + β'x + ε,
where β is given above, α = μ_y − β'μ_x, and ε has a normal distribution. We thus have, in this multivariate normal distribution, the classical normal linear regression model.
B.11.3 LINEAR FUNCTIONS OF A NORMAL VECTOR
Any linear function of a vector of joint normally distributed variables is also normally distributed. The mean vector and covariance matrix of Ax, where x is normally distributed, follow the general pattern given earlier. Thus,
If x ~ N[μ, Σ], then Ax + b ~ N[Aμ + b, AΣA']. (B-103)
If A does not have full rank, then AΣA' is singular and the density does not exist in the full dimensional space of x, though it does exist in the subspace of dimension equal to the rank of Σ. Nonetheless, the individual elements of Ax + b will still be normally distributed, and the joint distribution of the full vector is still a multivariate normal.
B.11.4 QUADRATIC FORMS IN A STANDARD NORMAL VECTOR
The earlier discussion of the chi-squared distribution gives the distribution of x'x if x has a standard normal distribution. It follows from (A-36) that
x'x = Σ_i x_i² = Σ_i (x_i − x̄)² + n x̄². (B-104)
We know from (B-32) that x'x has a chi-squared distribution. It seems natural, therefore, to invoke (B-34) for the two parts on the right-hand side of (B-104). It is not yet obvious, however, that either of the two terms has a chi-squared distribution or that the two terms are independent, as required. To show these conditions, it is necessary to derive the distributions of idempotent quadratic forms and to show when they are independent.
To begin, the second term is the square of √n x̄, which can easily be shown to have a standard normal distribution. Thus, the second term is the square of a standard normal variable and has a chi-squared distribution with one degree of freedom. But the first term is the sum of nonindependent variables, and it remains to be shown that the two terms are independent.
DEFINITION B.3 Orthonormal Quadratic Form
A particular case of (B-103) is the following:
If x ~ N[0, I] and C is a square matrix such that C'C = I, then C'x ~ N[0, I].
Consider, then, a quadratic form in a standard normal vector x with symmetric matrix A:
q = x'Ax. (B-105)
Let the characteristic roots and vectors of A be arranged in a diagonal matrix Λ and an orthogonal matrix C, as in Section A.6.3. Then
q = x'CΛC'x. (B-106)
By definition, C satisfies the requirement that C'C = I. Thus, the vector y = C'x has a standard
normal distribution. Consequently,
q = y'Λy = Σ_i λ_i y_i². (B-107)
If λ_i is always one or zero, then
q = Σ_j y_j², (B-108)
which has a chi-squared distribution. The sum is taken over the j = 1, ..., J elements associated with the roots that are equal to one. A matrix whose characteristic roots are all zero or one is idempotent. Therefore, we have proved the next theorem.
THEOREM B.8 Distribution of an Idempotent Quadratic Form in a Standard Normal Vector
If x ~ N[0, I] and A is idempotent, then x'Ax has a chi-squared distribution with degrees of freedom equal to the number of unit roots of A, which is equal to the rank of A.
The rank of a matrix is equal to the number of nonzero characteristic roots it has. Therefore, the degrees of freedom in the preceding chi-squared distribution equals J, the rank of A.
We can apply this result to the earlier sum of squares. The first term is
Σ_i (x_i − x̄)² = x'M⁰x,
where M⁰ was defined in (A-34) as the matrix that transforms data to mean deviation form:
M⁰ = I − (1/n) ii'.
Since M⁰ is idempotent, the sum of squared deviations from the mean has a chi-squared distribution. The degrees of freedom equals the rank of M⁰, which is not obvious except for the useful result in (A-108), that
• The rank of an idempotent matrix is equal to its trace. (B-109)
Each diagonal element of M⁰ is 1 − (1/n); hence, the trace is n[1 − (1/n)] = n − 1. Therefore, we have an application of Theorem B.8:
If x ~ N[0, I], then Σ_i (x_i − x̄)² ~ χ²[n − 1]. (B-110)
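A Monte Carlo sketch of (B-110) (illustrative only; the sample size n = 10 is assumed): for standard normal samples, the sum of squared deviations from the mean should behave like a chi-squared variable with n − 1 degrees of freedom.

import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, reps = 10, 100_000

x = rng.standard_normal((reps, n))
q = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)   # x'M0x for each sample

print("simulated mean / variance:", q.mean(), q.var())       # theory: n-1 and 2(n-1)
print("chi2(n-1) mean / variance:", n - 1, 2 * (n - 1))
print("P(q <= 12) simulated:", (q <= 12).mean(),
      " exact:", stats.chi2.cdf(12, df=n - 1))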
We have already shown that the second term in (B-104) has a chi-squared distribution with one degree of freedom. It is instructive to set this up as a quadratic form as well:
x'[(1/n) ii']x = x'[jj']x, where j = (1/√n) i. (B-111)
The matrix in brackets is the outer product of a nonzero vector, which always has rank one. You can verify that it is idempotent by multiplication. Thus, x'x is the sum of two chi-squared variables,
THEOREM B.12 Independence of a Linear and a Quadratic Form
A linear function Lx and a symmetric idempotent quadratic form x'Ax in a standard normal vector are statistically independent if LA = 0.
The proof follows the same logic as that for two quadratic forms. Write x'Ax as x'A'Ax = (Ax)'(Ax). The covariance matrix of the variables Lx and Ax is LA = 0, which establishes the independence of these two random vectors. The independence of the linear function and the quadratic form follows since functions of independent random vectors are also independent.
The t distribution is defined as the ratio of a standard normal variable to the square root of a chi-squared variable divided by its degrees of freedom:
t[n] = N[0, 1] / {χ²[n]/n}^(1/2).
A particular case is
t[n − 1] = √n x̄ / {[1/(n − 1)] Σ_i (x_i − x̄)²}^(1/2) = √n x̄ / s,
where s is the standard deviation of the values of x. The distribution of the two variables in t[n − 1] was shown earlier; we need only show that they are independent. But
√n x̄ = (1/√n) i'x = j'x
and
s² = x'M⁰x / (n − 1).
It suffices to show that M⁰j = 0, which follows from
M⁰i = [I − (1/n)ii']i = i − (1/n)i(i'i) = 0.
APPENDIX C
ESTIMATION AND INFERENCE
C.1 INTRODUCTION
The probability distributions discussed in Appendix B serve as models for the underlying data-generating processes that produce our observed data. The goal of statistical inference in econometrics is to use the principles of mathematical statistics to combine these theoretical distributions and the observed data into an empirical model of the economy. This analysis takes place in one of two frameworks, classical or Bayesian.¹ The overwhelming majority of empirical study in econometrics
has been done in the classical framework. Our focus, therefore, will be on classical methods of inference. Bayesian methods are discussed in Chapter 16.
C.2 SAMPLES AND RANDOM SAMPLING
The classical theory of statistical inference centers on rules for using the sampled data effectively. These rules, in turn, are based on the properties of samples and sampling distributions.
A sample of n observations on one or more variables, denoted x₁, x₂, ..., x_n, is a random sample if the n observations are drawn independently from the same population, or probability distribution, f(x_i, θ). The sample may be univariate if x_i is a single random variable or multivariate if each observation contains several variables. A random sample of observations, denoted [x₁, x₂, ..., x_n] or {x_i}, i = 1, ..., n, is said to be independent, identically distributed, which we denote i.i.d. The vector θ contains one or more unknown parameters. Data are generally drawn in one of two settings. A cross section is a sample of a number of observational units all drawn at the same point in time. A time series is a set of observations drawn on the same observational unit at a number of (usually evenly spaced) points in time. Many recent studies have been based on time-series cross sections, which generally consist of the same cross-sectional units observed at several points in time. Since the typical data set of this sort consists of a large number of cross-sectional units observed at a few points in time, the common term panel data set is usually more fitting for this sort of study.
C.3 DESCRIPTIVE STATISTICS
Before attempting to estimate parameters of a population or fit models to data, we normally examine the data themselves. In raw form, the sample data are a disorganized mass of information, so we will need some organizing principles to distill the information into something meaningful. Consider, first, examining the data on a single variable. In most cases, and particularly if the number of observations in the sample is large, we shall use some summary statistics to describe the sample data. Of most interest are measures of location, that is, the center of the data, and scale, or the dispersion of the data. A few measures of central tendency are as follows:
median: M = middle ranked observation, (C-1)
sample midrange: midrange = (maximum + minimum)/2.
The dispersion of the sample observations is usually measured by the
standard deviation: s_x = [Σ_i (x_i − x̄)² / (n − 1)]^(1/2). (C-2)
Other measures, such as the average absolute deviation from the sample mean, are also used, although less frequently than the standard deviation. The shape of the distribution of values is
¹An excellent reference is Leamer (1978). A summary of the results as they apply to econometrics is contained in Zellner (1971) and in Judge et al. (1985). See, as well, Poirier (1991). A recent textbook with a heavy Bayesian emphasis is Poirier (1995).
often of interest as well. Samples of income or expenditure data, for example, tend to be highly skewed, while financial data such as asset returns and exchange rate movements are relatively more symmetrically distributed but are also more widely dispersed than other variables that might be observed. Two measures used to quantify these effects are the
skewness = Σ_i (x_i − x̄)³ / [s_x³ (n − 1)]   and   kurtosis = Σ_i (x_i − x̄)⁴ / [s_x⁴ (n − 1)].
Benchmark values for these two measures are zero for a symmetric distribution and three for one which is "normally" dispersed. The skewness coefficient has a bit less of the intuitive appeal of the mean and standard deviation, and the kurtosis measure has very little at all. The box and whisker plot is a graphical device which is often used to capture a large amount of information about the sample in a simple visual display. This plot shows in a figure the median, the range of values contained in the 25th and 75th percentile, some limits that show the normal range of values expected, such as the median plus and minus two standard deviations, and in isolation values that could be viewed as outliers. A box and whisker plot is shown in Figure C.1 for the income variable in Example C.1.
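The following sketch is illustrative only: it computes these summary statistics exactly as defined above for a simulated right-skewed sample (a chi-squared sample is an assumed stand-in for income-like data). Note that the divisors follow the text's (n − 1) and s_x normalization rather than the conventions built into some libraries.

import numpy as np

rng = np.random.default_rng(5)
x = rng.chisquare(df=3, size=1_000)      # assumed right-skewed sample

n = x.size
xbar = x.mean()
s = np.sqrt(((x - xbar) ** 2).sum() / (n - 1))          # standard deviation, (C-2)

skewness = ((x - xbar) ** 3).sum() / (s**3 * (n - 1))
kurtosis = ((x - xbar) ** 4).sum() / (s**4 * (n - 1))

print(f"mean = {xbar:.3f}, s = {s:.3f}, skewness = {skewness:.3f}, kurtosis = {kurtosis:.3f}")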
If the sample contains data on more than one variable, we will also be interested in measures of association among the variables. A scatter diagram is useful in a bivariate sample if the sample contains a reasonable number of observations. Figure C.1 shows an example for a small data set. If the sample is a multivariate one, then the degree of linear association among the variables can be measured by the pairwise measures
covariance: s_xy = Σ_i (x_i − x̄)(y_i − ȳ) / (n − 1),
correlation: r_xy = s_xy / (s_x s_y).
If the sample contains data on several variables, then it is sometimes convenient to arrange the covariances or correlations in a
covariance matrix: S = [s_ij], (C-4)
or
correlation matrix: R = [r_ij].
Some useful algebraic results for any two variables (x_i, y_i), i = 1, ..., n, and constants a and b are given in (C-5) through (C-8).
[Figure: Kernel density estimate for income; horizontal axis, income; vertical axis, density.]
Income Distribution
Range            Relative Frequency   Cumulative Frequency
<$10,000         0.15                 0.15
10,000-25,000    0.30                 0.45
25,000-50,000    0.40                 0.85
>50,000          0.15                 1.00
the population, although not perfectly. The precise manner in which these quantities reflect the population values defines the sampling distribution of a sample statistic.
DEFINITION C.1 Statistic
A statistic is any function computed from the data in a sample.
If another sample were drawn under identical conditions, different values would be obtained for the observations, as each one is a random variable. Any statistic is a function of these random values, so it is also a random variable with a probability distribution called a sampling distribution. For example, the following shows an exact result for the sampling behavior of a widely used statistic.
THEOREM C.1 Sampling Distribution of the Sample Mean
If x₁, ..., x_n are a random sample from a population with mean μ and variance σ², then x̄ is a random variable with mean μ and variance σ²/n.
Proof: x̄ = (1/n)Σ_i x_i. E[x̄] = (1/n)Σ_i μ = μ. The observations are independent, so Var[x̄] = (1/n)² Var[Σ_i x_i] = (1/n²)Σ_i σ² = σ²/n.
Example C.3 illustrates the behavior of the sample mean in samples of four observations drawn from a chi-squared population with one degree of freedom. The crucial concepts illustrated in this example are, first, the mean and variance results in Theorem C.1 and, second, the phenomenon of sampling variability.
Notice that the fundamental result in Theorem C.1 does not assume a distribution for x_i. Indeed, looking back at Section C.3, nothing we have done so far has required any assumption about a particular distribution.
Example C.3 Sampling Distribution of a Sample Mean
Figure C.3 shows a frequency plot of the means of 1,000 random samples of four observations
drawn from a chi-squared distribution with one degree of freedom, which has mean 1 and
variance 2.
We are often interested in how a statistic behaves as the sample size increases. Example C.4 illustrates one such case. Figure C.4 shows two sampling distributions, one based on samples of three and a second, of the same statistic, but based on samples of six. The effect of increasing sample size in this figure is unmistakable. It is easy to visualize the behavior of this statistic if we extrapolate the experiment in Example C.4 to samples of, say, 100.
Example C.4 Sampling Distribution of the Sample Minimum
If x₁, ..., x_n are a random sample from an exponential distribution with f(x) = θe^(−θx), then the sampling distribution of the sample minimum in a sample of n observations, denoted x_(1), is
f(x_(1)) = (nθ)e^(−nθ x_(1)).
Since E[x] = 1/θ and Var[x] = 1/θ², by analogy E[x_(1)] = 1/(nθ) and Var[x_(1)] = 1/(nθ)². Thus, in increasingly larger samples, the minimum will be arbitrarily close to 0. [The Chebychev inequality in Theorem D.2 can be used to prove this intuitively appealing result.]
Figure C.4 shows the results of a simple sampling experiment you can do to demonstrate this effect. It requires software that will allow you to produce pseudorandom numbers uniformly distributed in the range zero to one and that will let you plot a histogram and control the axes. (We used LimDep. This can be done with Stata, Excel, or several other packages.) The experiment consists of drawing 1,000 sets of nine random values, U_ij, i = 1, ..., 1,000, j = 1, ..., 9. To transform these uniform draws to exponential with parameter θ (we used θ = 1.5), use the inverse probability transform; see Section E.2.3. For an exponentially distributed variable, the transformation is z_ij = −(1/θ) log(1 − U_ij). We then created z_(1)|3 from the first three draws and z_(1)|6 from the other six. The two histograms show clearly the effect on the sampling distribution of increasing sample size from just 3 to 6.
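The experiment just described is straightforward to replicate in most languages. The sketch below is illustrative only; it uses θ = 1.5 as in the text, applies the inverse probability transform, and compares the minima of the first three and the remaining six draws with the theoretical values.

import numpy as np

rng = np.random.default_rng(6)
theta = 1.5
reps = 1_000

u = rng.uniform(size=(reps, 9))                # 1,000 sets of nine uniform draws
z = -(1.0 / theta) * np.log(1.0 - u)           # inverse probability transform to exponential

min3 = z[:, :3].min(axis=1)                    # sample minimum of the first three draws
min6 = z[:, 3:].min(axis=1)                    # sample minimum of the other six draws

# Theory: E[x_(1)] = 1/(n*theta), Var[x_(1)] = 1/(n*theta)^2
print("n = 3:", min3.mean(), " theory:", 1 / (3 * theta))
print("n = 6:", min6.mean(), " theory:", 1 / (6 * theta))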
Sampling distributions are used to make inferences about the population. To consider a perhaps obvious example, because the sampling distribution of the mean of a set of normally distributed observations has mean μ, the sample mean is a natural candidate for an estimate of μ. The observation that the sample "mimics" the population is a statement about the sampling
[Figure C.3: Distribution of means of 1,000 samples; mean = 0.9038, variance = 0.5637; horizontal axis, sample mean; vertical axis, frequency.]
DEFINITION C.4 Mean-Squared Error
The mean-squared error of an estimator is
MSE[θ̂ | θ] = E[(θ̂ − θ)²]
= Var[θ̂] + (Bias[θ̂ | θ])² if θ is a scalar, (C-9)
MSE[θ̂ | θ] = Var[θ̂] + Bias[θ̂ | θ]Bias[θ̂ | θ]' if θ is a vector.
Figure C.5 illustrates the effect. On average, the biased estimator will be closer to the true parameter than will the unbiased estimator.
Which of these criteria should be used in a given situation depends on the particulars of that setting and our objectives in the study. Unfortunately, the MSE criterion is rarely operational; minimum mean-squared error estimators, when they exist at all, usually depend on unknown parameters. Thus, we are usually less demanding. A commonly used criterion is minimum variance unbiasedness.
Example C.5 Mean-Squared Error of the Sample Variance
In sampling from a normal distribution, the most frequently used estimator for σ² is
s² = Σ_i (x_i − x̄)² / (n − 1).
It is straightforward to show that s² is unbiased, so
Var[s²] = 2σ⁴/(n − 1) = MSE[s² | σ²].
[Figure C.5: Sampling distributions of an unbiased and a biased estimator; horizontal axis, estimator; vertical axis, density.]
[A proof is based on the distribution of the idempotent quadratic form (x − iμ)'M⁰(x − iμ), which we discussed in Section B.11.4.] A less frequently used estimator is
σ̂² = (1/n) Σ_i (x_i − x̄)² = [(n − 1)/n] s².
This estimator is slightly biased downward:
E[σ̂²] = (n − 1)E[s²]/n = (n − 1)σ²/n,
so its bias is
E[σ̂² − σ²] = Bias[σ̂² | σ²] = −σ²/n.
But it has a smaller variance than s²:
Var[σ̂²] = [(n − 1)/n]² [2σ⁴/(n − 1)] < Var[s²].
To compare the two estimators, we can use the difference in their mean-squared errors:
MSE[σ̂² | σ²] − MSE[s² | σ²] = σ⁴ [(2n − 1)/n² − 2/(n − 1)] < 0.
The biased estimator is a bit more precise. The difference will be negligible in a large sample, but, for example, it is about 1.2 percent in a sample of 16.
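A simulation sketch of this comparison (illustrative only; n = 16 comes from the example, σ² = 1 is assumed): it estimates the MSE of s² and of σ̂² = [(n − 1)/n]s² and compares them with the analytic expressions.

import numpy as np

rng = np.random.default_rng(7)
n, sigma2, reps = 16, 1.0, 200_000

x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
s2 = x.var(axis=1, ddof=1)                 # unbiased estimator s^2
s2_hat = (n - 1) / n * s2                  # biased estimator sigma-hat^2

mse_s2 = np.mean((s2 - sigma2) ** 2)
mse_hat = np.mean((s2_hat - sigma2) ** 2)

print("MSE[s^2]         :", mse_s2, "  theory:", 2 * sigma2**2 / (n - 1))
print("MSE[sigma-hat^2] :", mse_hat, " theory:", (2 * n - 1) * sigma2**2 / n**2)
print("difference (biased - unbiased):", mse_hat - mse_s2)    # negative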
C.5.2 EFFICIENT UNBIASED ESTIMATION
In a random sample of n observations, the density of each observation is f(x_i, θ). Since the n observations are independent, their joint density is
f(x₁, x₂, ..., x_n; θ) = f(x₁; θ) f(x₂; θ) ··· f(x_n; θ) = Π_i f(x_i; θ) = L(θ | x₁, x₂, ..., x_n). (C-10)
This function, denoted L(θ | X), is called the likelihood function for θ given the data X. It is frequently abbreviated to L(θ). Where no ambiguity can arise, we shall abbreviate it further to L.
Example C.6 Likelihood Functions for Exponential
and Normal Distributions
If x₁, ..., x_n are a sample of n observations from an exponential distribution with parameter θ, then
L(θ) = Π_i θe^(−θx_i) = θⁿ e^(−θ Σ_i x_i).
If x₁, ..., x_n are a sample of n observations from a normal distribution with mean μ and standard deviation σ, then
L(μ, σ) = Π_i (2πσ²)^(−1/2) e^(−(x_i − μ)²/(2σ²)) (C-11)
= (2πσ²)^(−n/2) e^(−Σ_i (x_i − μ)²/(2σ²)).
The likelihood function is the cornerstone for most of our theory of parameter estimation. An important result for efficient estimation is the following.
THEOREM C.2 Cramér-Rao Lower Bound
Assuming that the density of x satisfies certain regularity conditions, the variance of an unbiased estimator of a parameter θ will always be at least as large as
[I(θ)]⁻¹ = (−E[∂² ln L(θ)/∂θ²])⁻¹ = (E[(∂ ln L(θ)/∂θ)²])⁻¹. (C-12)
The quantity I(θ) is the information number for the sample. We will prove the result that the negative of the expected second derivative equals the expected square of the first derivative in Chapter 17. Proof of the main result of the theorem is quite involved. See, for example, Stuart and Ord (1989).
The regularity conditions are technical in nature. (See Section 17.4.1.) Loosely, they are conditions imposed on the density of the random variable that appears in the likelihood function; these conditions will ensure that the Lindeberg-Levy central limit theorem will apply to the sample of observations on the random vector y = ∂ ln f(x | θ)/∂θ. Among the conditions are finite moments of x up to order 3. An additional condition normally included in the set is that the range of the random variable be independent of the parameters.
In some cases, the second derivative of the log likelihood is a constant, so the Cramér-Rao bound is simple to obtain. For instance, in sampling from an exponential distribution, from Example C.6,
ln L = n ln θ − θ Σ_i x_i,
∂ ln L/∂θ = n/θ − Σ_i x_i,
so ∂² ln L/∂θ² = −n/θ² and the variance bound is [I(θ)]⁻¹ = θ²/n. In most situations, the second derivative is a random variable with a distribution of its own. The following examples show two such cases.
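A sketch (illustrative only; θ = 2 and n = 50 are assumed values) of the variance bound at work: setting the first derivative above to zero gives the maximum likelihood estimator θ̂ = 1/x̄, whose sampling variance should be close to, and in finite samples somewhat above, the bound θ²/n.

import numpy as np

rng = np.random.default_rng(8)
theta, n, reps = 2.0, 50, 100_000

x = rng.exponential(scale=1.0 / theta, size=(reps, n))
theta_hat = 1.0 / x.mean(axis=1)             # maximum likelihood estimator of theta

print("Var[theta-hat] (simulated):", theta_hat.var())
print("Cramer-Rao bound theta^2/n:", theta**2 / n)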
Example C.7 Variance Bound for the Poisson Distribution
For the Poisson distribution,
ln L = −nθ + ln θ Σ_i x_i − Σ_i ln(x_i!),
∂ ln L/∂θ = −n + (1/θ) Σ_i x_i,
∂² ln L/∂θ² = −(1/θ²) Σ_i x_i.
Example C.9 Estimated Confidence Intervals for a Normal Mean and Variance
In a sample of 25, x̄ = 1.63 and s = 0.51. Construct a 95 percent confidence interval for μ. Assuming that the sample of 25 is from a normal distribution,
Prob(−2.064 ≤ 5(x̄ − μ)/s ≤ 2.064) = 0.95,
where 2.064 is the critical value from a t distribution with 24 degrees of freedom. Thus, the confidence interval is 1.63 ± [2.064(0.51)/5] or [1.4195, 1.8405].
Remark: Had the parent distribution not been specified, it would have been natural to use the
standard normal distribution instead, perhaps relying on the central limit theorem. But a sam-
ple size of 25 is small enough that the more conservative t distribution might still be preferable.
The chi-squared distribution is used to construct a confidence interval for the variance of
a normal distribution. Using the data from Example 4.29, we find that the usual procedure
would use
Prob(12.4 ≤ 24s²/σ² ≤ 39.4) = 0.95,
where 12.4 and 39.4 are the 0.025 and 0.975 cutoff points from the chi-squared (24) distribu-
tion. This procedure leads to the 95 percent confidence interval [0.1581, 0.5032]. By making
use of the asymmetry of the distribution, a narrower interval can be constructed. Allocating 4
percent to the left-hand tail and 1 percent to the right instead of 2.5 percent to each, the two
cutoff points are 13.4 and 42.9, and the resulting 95 percent confidence interval is [0.1455,
0.4659].
Finally, the confidence interval can be manipulated to obtain a confidence interval for a function of a parameter. For example, based on the preceding, a 95 percent confidence interval for σ would be [√0.1581, √0.5032] = [0.3976, 0.7094].
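The critical values used above come straight from the t and chi-squared distributions. The sketch below (illustrative only, using the same x̄ = 1.63, s = 0.51, and n = 25 as the example) reproduces the two symmetric-tail intervals.

import numpy as np
from scipy import stats

n, xbar, s = 25, 1.63, 0.51

# 95 percent interval for the mean, t with n-1 degrees of freedom
t_crit = stats.t.ppf(0.975, df=n - 1)                       # about 2.064
mean_ci = (xbar - t_crit * s / np.sqrt(n), xbar + t_crit * s / np.sqrt(n))

# 95 percent interval for the variance, chi-squared with n-1 degrees of freedom
lo, hi = stats.chi2.ppf([0.025, 0.975], df=n - 1)           # about 12.4 and 39.4
var_ci = ((n - 1) * s**2 / hi, (n - 1) * s**2 / lo)

print("mean interval     :", mean_ci)      # about [1.4195, 1.8405]
print("variance interval :", var_ci)       # about [0.158, 0.503]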
C.7 HYPOTHESIS TESTING
The second major group of statistical inference procedures is hypothesis tests. The classical testing procedures are based on constructing a statistic from a random sample that will enable the analyst to decide, with reasonable confidence, whether or not the data in the sample would have been generated by a hypothesized population. The formal procedure involves a statement of the hypothesis, usually in terms of a "null" or maintained hypothesis and an "alternative," conventionally denoted H₀ and H₁, respectively. The procedure itself is a rule, stated in terms of the data, that dictates whether the null hypothesis should be rejected or not. For example, the hypothesis might state that a parameter is equal to a specified value. The decision rule might state that the hypothesis should be rejected if a sample estimate of that parameter is too far away from that value (where "far" remains to be defined). The classical, or Neyman-Pearson, methodology involves partitioning the sample space into two regions. If the observed data (i.e., the test statistic) fall in the rejection region (sometimes called the critical region), then the null hypothesis is rejected; if they fall in the acceptance region, then it is not.
C.7.1 CLASSICAL TESTING PROCEDURES
Since the sample is random, the test statistic, however defined, is also random. The same test procedure can lead to different conclusions in different samples. As such, there are two ways such a procedure can be in error:
1. Type I error. The procedure may lead to rejection of the null hypothesis when it is true.
2. Type II error. The procedure may fail to reject the null hypothesis when it is false.
To continue the previous example, there is some probability that the estimate of the parameter will be quite far from the hypothesized value, even if the hypothesis is true. This outcome might cause a type I error.
DEFINITION C.6 Size of a Test
The probability of a type I error is the size of the test. This is conventionally denoted α and is also called the significance level.
The size of the test is under the control of the analyst. It can be changed just by changing the decision rule. Indeed, the type I error could be eliminated altogether just by making the rejection region very small, but this would come at a cost. By eliminating the probability of a type I error, that is, by making it unlikely that the hypothesis is rejected, we must increase the probability of a type II error. Ideally, we would like both probabilities to be as small as possible. It is clear, however, that there is a tradeoff between the two. The best we can hope for is that for a given probability of type I error, the procedure we choose will have as small a probability of type II error as possible.
DEFINITION C.7 Power of a Test
The power of a test is the probability that it will correctly lead to rejection of a false null hypothesis:
power = 1 − β = 1 − Prob(type II error). (C-16)
For a given significance level α, we would like β to be as small as possible. Since β is defined in terms of the alternative hypothesis, it depends on the value of the parameter.
Example C.10 Testing a Hypothesis About a Mean
For testing H₀: μ = μ⁰ in a normal distribution with known variance σ², the decision rule is to reject the hypothesis if the absolute value of the z statistic, √n(x̄ − μ⁰)/σ, exceeds the predetermined critical value. For a test at the 5 percent significance level, we set the critical value at 1.96. The power of the test, therefore, is the probability that the absolute value of the test statistic will exceed 1.96 given that the true value of μ is, in fact, not μ⁰. This value depends on the alternative value of μ, as shown in Figure C.6. Notice that for this test the power is equal to the size at the point where μ equals μ⁰. As might be expected, the test becomes more powerful the farther the true mean is from the hypothesized value.
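The power curve described above can be computed directly. The sketch below is illustrative only; the values n = 25, σ = 1, and μ⁰ = 0 are assumptions, not from the text. Under the alternative, z = √n(x̄ − μ⁰)/σ is normal with mean d = √n(μ − μ⁰)/σ and unit variance, so the power is the probability that |z| exceeds 1.96.

import numpy as np
from scipy import stats

n, sigma, mu0 = 25, 1.0, 0.0        # assumed illustration values
crit = 1.96

def power(mu):
    d = np.sqrt(n) * (mu - mu0) / sigma        # mean of the z statistic under mu
    return stats.norm.sf(crit - d) + stats.norm.cdf(-crit - d)

for mu in [0.0, 0.1, 0.2, 0.4, 0.6]:
    print(f"mu = {mu:.1f}  power = {power(mu):.3f}")   # equals 0.05 (the size) at mu = mu0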
Testing procedures, like estimators, can be compared using a number of criteria.
DEFINITION C.8 Most Powerful Test
A test is most powerful if it has greater power than any other test of the same size.
This requirement is very strong. Since the power depends on the alternative hypothesis, we might require that the test be uniformly most powerful (UMP), that is, have greater power than any other test of the same size for all admissible values of the parameter. There are few situations in which a UMP test is available. We usually must be less stringent in our requirements. Nonetheless, the criteria for comparing hypothesis testing procedures are generally based on their respective power functions. A common and very modest requirement is that the test be unbiased.
DEFINITION C.9 Unbiased Test
A test is unbiased if its power (1 − β) is greater than or equal to its size α for all values of the parameter.
If a test is biased, then, for some values of the parameter, we are more likely to accept the null hypothesis when it is false than when it is true.
The use of the term unbiased here is unrelated to the concept of an unbiased estimator. Fortunately, there is little chance of confusion. Tests and estimators are clearly connected, however. The following criterion derives, in general, from the corresponding attribute of a parameter estimate.
DEFINITION C.10 Consistent Test
A test is consistent if its power goes to one as the sample size grows to infinity.
and results needed for this analysis. A few additional results will be developed in the discussion
of time series analysis later in the book.
D.2 LARGE-SAMPLE DISTRIBUTION THEORY¹
In most cases, whether an estimator is exactly unbiased or what its exact sampling variance is in samples of a given size will be unknown. But we may be able to obtain approximate results about the behavior of the distribution of an estimator as the sample becomes large. For example, it is well known that the distribution of the mean of a sample tends to approximate normality as the sample size grows, regardless of the distribution of the individual observations. Knowledge about the limiting behavior of the distribution of an estimator can be used to infer an approximate distribution for the estimator in a finite sample. To describe how this is done, it is necessary, first, to present some results on convergence of random variables.
D.2.1 CONVERGENCE IN PROBABILITY
Limiting arguments in this discussion will be with respect to the sample size n. Let x_n be a sequence random variable indexed by the sample size.
DEFINITION D.1 Convergence in Probability
The random variable x_n converges in probability to a constant c if
lim (n → ∞) Prob(|x_n − c| > ε) = 0 for any positive ε.
Convergence in probability implies that the values that the variable may take that are not close to c become increasingly unlikely as n increases. To consider one example, suppose that the random variable x_n takes two values, zero and n, with probabilities 1 − (1/n) and (1/n), respectively. As n increases, the second point will become ever more remote from any constant but, at the same time, will become increasingly less probable. In this example, x_n converges in probability to zero. The crux of this form of convergence is that all the mass of the probability distribution becomes concentrated at points close to c. If x_n converges in probability to c, then we write
plim x_n = c. (D-1)
We will make frequent use of a special case of convergence in probability, convergence in mean square or convergence in quadratic mean.
THEOREM D.1 Convergence in Quadratic Mean
If x_n has mean μ_n and variance σ_n² such that the ordinary limits of μ_n and σ_n² are c and 0, respectively, then x_n converges in mean square to c, and plim x_n = c.
¹A comprehensive summary of many results in large-sample theory appears in White (2001). The results discussed here will apply to samples of independent observations. Time-series cases in which observations are correlated are analyzed in Chapters 19 and 20.
A proof of Theorem D.1 can be based on another useful theorem.
THEOREM D.2 Chebychev's Inequality
If x_n is a random variable and c and ε are constants, then Prob(|x_n − c| > ε) ≤ E[(x_n − c)²]/ε².
To establish the Chebychev inequality, we use another result [see Goldberger (1991, p. 31)].
THEOREM D.3 Markov's Inequality
If y_n is a nonnegative random variable and δ is a positive constant, then
Prob[y_n ≥ δ] ≤ E[y_n]/δ.
Proof: E[y_n] = Prob[y_n < δ] E[y_n | y_n < δ] + Prob[y_n ≥ δ] E[y_n | y_n ≥ δ]. Since y_n is nonnegative, both terms must be nonnegative, so E[y_n] ≥ Prob[y_n ≥ δ] E[y_n | y_n ≥ δ]. Since E[y_n | y_n ≥ δ] must be greater than or equal to δ, E[y_n] ≥ Prob[y_n ≥ δ]δ, which is the result.
Now, to prove Theorem D.1, let y_n be (x_n − c)² and δ be ε² in Theorem D.3. Then, (x_n − c)² > δ implies that |x_n − c| > ε. Finally, we will use a special case of the Chebychev inequality, where c = μ_n, so that we have
Prob(|x_n − μ_n| > ε) ≤ σ_n²/ε². (D-2)
Taking the limits of μ_n and σ_n² in (D-2), we see that if
lim (n → ∞) E[x_n] = c and lim (n → ∞) Var[x_n] = 0, (D-3)
then
plim x_n = c.
We have shown that convergence in mean square implies convergence in probability. Mean-square convergence implies that the distribution of x_n collapses to a spike at plim x_n, as shown in Figure D.1.
Example D.1 Mean Square Convergence of the Sample Minimum
in Exponential Sampling
As noted in Example C.4, in sampling of n observations from an exponential distribution, for the sample minimum x_(1),
lim (n → ∞) E[x_(1)] = lim (n → ∞) 1/(nθ) = 0
and
lim (n → ∞) Var[x_(1)] = lim (n → ∞) 1/(nθ)² = 0.
Therefore,
plim x_(1) = 0.
Note, in particular, that the variance is divided by n². Thus, this estimator converges very rapidly to 0.
[Figure D.1: The distribution of x_n collapsing to a spike at plim x_n; horizontal axis, estimator; vertical axis, density.]
Convergence in probability does not imply convergence in mean square. Consider the simple example given earlier in which x_n equals either zero or n with probabilities 1 − (1/n) and (1/n). The exact expected value of x_n is 1 for all n, which is not the probability limit. Indeed, if we let Prob(x_n = n²) = (1/n) instead, the mean of the distribution explodes, but the probability limit is still zero. Again, the point x_n = n² becomes ever more extreme but, at the same time, becomes ever less likely.
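A simulation sketch of the two-point example (illustrative only; the number of replications is an arbitrary choice): as n grows, the probability that x_n differs from zero shrinks toward zero even though E[x_n] stays at 1.

import numpy as np

rng = np.random.default_rng(9)
reps = 200_000

for n in [10, 100, 1000, 10000]:
    # x_n = n with probability 1/n, and 0 otherwise
    x = np.where(rng.uniform(size=reps) < 1.0 / n, n, 0.0)
    print(f"n = {n:6d}  mean = {x.mean():.3f}  Prob(|x_n| > 0.5) = {(np.abs(x) > 0.5).mean():.4f}")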
The conditions for convergence in mean square are usually easier to verify than those for the more general form. Fortunately, we shall rarely encounter circumstances in which it will be necessary to show convergence in probability in which we cannot rely upon convergence in mean square. Our most frequent use of this concept will be in formulating consistent estimators.
DEFINITION D.2 Consistent Estimator
An estimator θ̂_n of a parameter θ is a consistent estimator of θ if and only if
plim θ̂_n = θ. (D-4)
THEOREM D.4 Consistency of the Sample Mean
The mean of a random sample from any population with finite mean μ and finite variance σ² is a consistent estimator of μ.
Proof: E[x̄_n] = μ and Var[x̄_n] = σ²/n. Therefore, x̄_n converges in mean square to μ, or plim x̄_n = μ.