4 Information and Asymptotics

Information Matrices

This section introduces the notion of information for a multivariate likelihood. We first recall some results from MATH230.

Consider a vector $Y = (Y_1, \ldots, Y_d)^T$. Then $\operatorname{Var}(Y)$ is a $d \times d$ matrix:

\[
\operatorname{Var}(Y) = \begin{bmatrix}
\operatorname{Var}(Y_1) & \cdots & \operatorname{Cov}(Y_1, Y_d) \\
\vdots & \operatorname{Cov}(Y_i, Y_j) & \vdots \\
\operatorname{Cov}(Y_d, Y_1) & \cdots & \operatorname{Var}(Y_d)
\end{bmatrix}.
\]

The diagonal of the variance matrix consists of the variances of the individual variables and the off-diagonal entries are covariances. The matrix is symmetric and positive semi-definite (positive definite unless some linear combination of the $Y_i$ has zero variance).
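For example, with $d = 2$ the variance matrix is

\[
\operatorname{Var}(Y) = \begin{bmatrix}
\operatorname{Var}(Y_1) & \operatorname{Cov}(Y_1, Y_2) \\
\operatorname{Cov}(Y_2, Y_1) & \operatorname{Var}(Y_2)
\end{bmatrix};
\]

with the purely illustrative values $\operatorname{Var}(Y_1) = 4$, $\operatorname{Var}(Y_2) = 9$ and $\operatorname{Cov}(Y_1, Y_2) = 3$ this is

\[
\operatorname{Var}(Y) = \begin{bmatrix} 4 & 3 \\ 3 & 9 \end{bmatrix}.
\]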

The correlation between variables $Y_i$ and $Y_j$ is given by:

\[
\operatorname{corr}(Y_i, Y_j) = \frac{\operatorname{Cov}(Y_i, Y_j)}{\sqrt{\operatorname{Var}(Y_i)\operatorname{Var}(Y_j)}}.
\]
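Continuing the illustrative values above, $\operatorname{corr}(Y_1, Y_2) = 3/\sqrt{4 \times 9} = 0.5$.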

Let $Y$ be a $d$-dimensional random vector with $E(Y) = 0$ and variance-covariance matrix $\Sigma$, and let $A$ be a $d \times d$ matrix; then

\[
E(AY) = 0
\]

and

\[
\operatorname{Var}(AY) = A \Sigma A^T.
\]
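As an illustration of this result (with an arbitrary choice of $A$ and $\Sigma$), take $d = 2$ and let $A$ map $Y$ to its sum and difference:

\[
A = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix}, \qquad
\Sigma = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}.
\]

Then $AY = (Y_1 + Y_2, \; Y_1 - Y_2)^T$ and

\[
\operatorname{Var}(AY) = A \Sigma A^T =
\begin{bmatrix}
\sigma_1^2 + 2\rho\sigma_1\sigma_2 + \sigma_2^2 & \sigma_1^2 - \sigma_2^2 \\
\sigma_1^2 - \sigma_2^2 & \sigma_1^2 - 2\rho\sigma_1\sigma_2 + \sigma_2^2
\end{bmatrix},
\]

recovering the familiar formulae for $\operatorname{Var}(Y_1 + Y_2)$ and $\operatorname{Var}(Y_1 - Y_2)$.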

Notation and Results

As in the one-parameter case, the approximation of log-likelihood or deviance surfaces in the vicinity of $\hat{\theta}$ by quadratics (an approximation which improves with increasing sample size) forms the basis of an asymptotic distribution theory which can be used to obtain approximate confidence intervals.

The results are similar to those of the one-parameter case. Now, the score function is a vector function, U(θ), defined by

\[
\begin{aligned}
U(\theta) &= (U_1(\theta), \ldots, U_d(\theta))^T \\
&= \left( \frac{\partial \ell}{\partial \theta_1}(\theta), \ldots, \frac{\partial \ell}{\partial \theta_d}(\theta) \right)^T;
\end{aligned}
\]

that is, the gradient vector for the log-likelihood. Thus at the MLE we have $U(\hat{\theta}) = 0$.
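As a worked illustration (one possible choice of model, not the only one), suppose $Y_1, \ldots, Y_n$ are independent $N(\mu, \sigma^2)$ with $\theta = (\mu, \sigma^2)^T$, so that

\[
\ell(\theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i - \mu)^2.
\]

The score vector is

\[
U(\theta) = \left( \frac{1}{\sigma^2}\sum_{i=1}^n (y_i - \mu), \; -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^n (y_i - \mu)^2 \right)^T,
\]

and solving $U(\hat{\theta}) = 0$ gives $\hat{\mu} = \bar{y}$ and $\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^n (y_i - \bar{y})^2$.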

The information measures now become matrices:

\[
I_O(\theta) = \begin{bmatrix}
-\dfrac{\partial^2 \ell}{\partial \theta_1^2}(\theta) & \cdots & -\dfrac{\partial^2 \ell}{\partial \theta_1 \partial \theta_d}(\theta) \\
\vdots & -\dfrac{\partial^2 \ell}{\partial \theta_i \partial \theta_j}(\theta) & \vdots \\
-\dfrac{\partial^2 \ell}{\partial \theta_d \partial \theta_1}(\theta) & \cdots & -\dfrac{\partial^2 \ell}{\partial \theta_d^2}(\theta)
\end{bmatrix}
\]

and

\[
I_E(\theta) = \begin{bmatrix}
E\!\left\{-\dfrac{\partial^2 \ell}{\partial \theta_1^2}(\theta)\right\} & \cdots & E\!\left\{-\dfrac{\partial^2 \ell}{\partial \theta_1 \partial \theta_d}(\theta)\right\} \\
\vdots & E\!\left\{-\dfrac{\partial^2 \ell}{\partial \theta_i \partial \theta_j}(\theta)\right\} & \vdots \\
E\!\left\{-\dfrac{\partial^2 \ell}{\partial \theta_d \partial \theta_1}(\theta)\right\} & \cdots & E\!\left\{-\dfrac{\partial^2 \ell}{\partial \theta_d^2}(\theta)\right\}
\end{bmatrix}.
\]

These are, respectively, the negative Hessian matrix of the log-likelihood function and its expectation. A general property of $I_O(\hat{\theta})$ and $I_E(\theta)$ (for any $\theta$) is that they are positive definite matrices, measuring respectively the observed curvature at $\hat{\theta}$ and the expected curvature, at $\theta$, of the log-likelihood surface.
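Continuing the Normal illustration above, differentiating the score components once more gives

\[
I_O(\theta) = \begin{bmatrix}
\dfrac{n}{\sigma^2} & \dfrac{1}{\sigma^4}\sum_{i=1}^n (y_i - \mu) \\[1.5ex]
\dfrac{1}{\sigma^4}\sum_{i=1}^n (y_i - \mu) & -\dfrac{n}{2\sigma^4} + \dfrac{1}{\sigma^6}\sum_{i=1}^n (y_i - \mu)^2
\end{bmatrix},
\]

and since $E(Y_i - \mu) = 0$ and $E\{(Y_i - \mu)^2\} = \sigma^2$,

\[
I_E(\theta) = \begin{bmatrix} \dfrac{n}{\sigma^2} & 0 \\[1ex] 0 & \dfrac{n}{2\sigma^4} \end{bmatrix}.
\]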

For the one-parameter case, we had an asymptotic result that said the variance of the MLE was the reciprocal of the Fisher information. The corresponding result for the multi-parameter case is that the variance matrix of the MLE vector is the matrix inverse of the Fisher information matrix.
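In the Normal illustration, for instance, inverting the expected information gives

\[
I_E(\theta)^{-1} = \begin{bmatrix} \sigma^2/n & 0 \\ 0 & 2\sigma^4/n \end{bmatrix},
\]

so the asymptotic variances of $\hat{\mu}$ and $\hat{\sigma}^2$ are $\sigma^2/n$ and $2\sigma^4/n$ respectively.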

In the results that follow, let $\theta$ be a $d$-dimensional unknown parameter and let its true value be $\theta_0$.

Proposition: Asymptotic consistency of the MLE.
For regular estimation problems, in the limit as $n \to \infty$, if $\theta_0$ is the true parameter vector, then

\[
\hat{\theta}_n \xrightarrow{\,p\,} \theta_0.
\]

Theorem 6: Multivariate asymptotic distribution of the MLE vector.
For regular estimation problems, in the limit as $n \to \infty$, if $\theta_0$ is the true parameter vector, then

\[
\hat{\theta} \sim \text{MVN}_d\!\left(\theta_0, I_E(\theta_0)^{-1}\right).
\]

Thus, the asymptotic distribution of $\hat{\theta}$ is multivariate normal, with variance-covariance matrix given by the inverse of the expected information matrix.

For practical problems we use alternatives to $I_E(\theta_0)$ which are asymptotically equivalent:

\[
\begin{aligned}
\hat{\theta} &\sim \text{MVN}_d\!\left(\theta_0, I_E(\hat{\theta})^{-1}\right), \\
\hat{\theta} &\sim \text{MVN}_d\!\left(\theta_0, I_O(\hat{\theta})^{-1}\right).
\end{aligned}
\]
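In practice, therefore, the standard error of the $j$th component of $\hat{\theta}$ is taken as the square root of the $(j,j)$ entry of the inverse information matrix evaluated at $\hat{\theta}$, giving approximate 95% confidence intervals of the form

\[
\hat{\theta}_j \pm 1.96 \sqrt{\left[ I_O(\hat{\theta})^{-1} \right]_{jj}}.
\]

In the Normal illustration this yields, for example, $\hat{\mu} \pm 1.96\,\hat{\sigma}/\sqrt{n}$.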

Theorem 7: Asymptotic distribution of the deviance with a d-dimensional MLE vector.
For a regular estimation problem, the deviance

\[
D(\theta) = 2\left[\ell(\hat{\theta}) - \ell(\theta)\right]
\]

satisfies, in the limit as $n \to \infty$, $D(\theta_0) \xrightarrow{\,d\,} \chi^2_d$, while for $\theta \neq \theta_0$, $D(\theta) \to \infty$.
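This suggests an alternative route to confidence statements: an approximate $100(1-\alpha)\%$ confidence region for $\theta$ is the set of parameter values whose deviance does not exceed the corresponding chi-squared quantile,

\[
\left\{ \theta : D(\theta) \le \chi^2_d(1-\alpha) \right\},
\]

where $\chi^2_d(1-\alpha)$ is the $(1-\alpha)$ quantile of the $\chi^2_d$ distribution; for example, with $d = 2$ the 95% cut-off is approximately $5.99$.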