Technical Documentation

Gaussian Linear Mixed Model Formulation

Consider the linear mixed model

(1)\[\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{b} + \boldsymbol{\varepsilon},\]

where:

  • \(\mathbf{Y} \in \mathbb{R}^{n \times 1}\) is the response vector,

  • \(\mathbf{X} \in \mathbb{R}^{n \times p}\) is the fixed-effects design matrix,

  • \(\boldsymbol{\beta} \in \mathbb{R}^{p \times 1}\) is the vector of fixed-effect coefficients,

  • \(\mathbf{Z} \in \mathbb{R}^{n \times q}\) is the random-effects design matrix,

  • \(\mathbf{b} \in \mathbb{R}^{q \times 1}\) is the vector of random effects,

  • \(\boldsymbol{\varepsilon} \in \mathbb{R}^{n \times 1}\) is the vector of residual errors.

We assume

\[\mathbf{b} \sim \mathcal{N}(\mathbf{0}, \mathbf{G}), \quad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{R}), \quad \mathbf{b} \perp \boldsymbol{\varepsilon},\]

where \(\mathbf{G} \in \mathbb{R}^{q \times q}\) and \(\mathbf{R} \in \mathbb{R}^{n \times n}\) are covariance matrices.

Let \(\boldsymbol{\theta}\) denote the collection of variance components that parameterise \(\mathbf{G}\) and \(\mathbf{R}\), i.e.,

\[\mathbf{G} = \mathbf{G}(\boldsymbol{\theta}), \quad \mathbf{R} = \mathbf{R}(\boldsymbol{\theta}).\]

Under these assumptions, the marginal distribution of \(\mathbf{Y}\) is given by

\[\mathbf{Y} \sim \mathcal{N}\left(\mathbf{X}\boldsymbol{\beta},\; \mathbf{V}(\boldsymbol{\theta})\right),\]

where

(2)\[\mathbf{V}(\boldsymbol{\theta}) = \mathbf{Z}\mathbf{G}(\boldsymbol{\theta})\mathbf{Z}^\top + \mathbf{R}(\boldsymbol{\theta}).\]
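
As a concrete illustration, the following sketch assembles \(\mathbf{V}(\boldsymbol{\theta})\) as in (2) for a simple random-intercept model. The parameterisation \(\mathbf{G} = \sigma_b^2 \mathbf{I}_q\), \(\mathbf{R} = \sigma_e^2 \mathbf{I}_n\) is only one illustrative choice, and the helper name is ours:

```python
# Toy construction of V(theta) = Z G(theta) Z^T + R(theta), assuming the
# (illustrative) parameterisation G = sigma_b^2 * I_q, R = sigma_e^2 * I_n.
import numpy as np

def marginal_covariance(Z, sigma_b2, sigma_e2):
    """Assemble V = Z G Z^T + R for a random-intercept model."""
    n, q = Z.shape
    G = sigma_b2 * np.eye(q)   # random-effect covariance G(theta)
    R = sigma_e2 * np.eye(n)   # residual covariance R(theta)
    return Z @ G @ Z.T + R

# Example: n = 6 observations in q = 2 groups (one random intercept each).
Z = np.kron(np.eye(2), np.ones((3, 1)))
V = marginal_covariance(Z, sigma_b2=2.0, sigma_e2=1.0)
```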

BLUE

Given \(\boldsymbol{\theta}\), the Best Linear Unbiased Estimator (BLUE) of \(\boldsymbol{\beta}\) is the estimator with minimum variance among all linear unbiased estimators of \(\boldsymbol{\beta}\). Under the linear mixed model with covariance matrix \(\mathbf{V}(\boldsymbol{\theta})\), the BLUE of \(\boldsymbol{\beta}\) is given by

\[\widehat{\boldsymbol{\beta}}(\boldsymbol{\theta}) = \left(\mathbf{X}^\top \mathbf{V}(\boldsymbol{\theta})^{-1} \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{V}(\boldsymbol{\theta})^{-1} \mathbf{Y}.\]

This estimator is the generalised least squares (GLS) estimator of \(\boldsymbol{\beta}\).
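
A minimal numerical sketch of the BLUE/GLS estimator; for stability it solves linear systems via a Cholesky factorisation instead of forming \(\mathbf{V}^{-1}\) explicitly, and assumes \(\mathbf{V}\) is symmetric positive definite (the helper name `blue` is ours):

```python
# Minimal sketch of the BLUE/GLS estimator via a Cholesky solve.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def blue(X, Y, V):
    """Return (X^T V^{-1} X)^{-1} X^T V^{-1} Y."""
    c = cho_factor(V)
    ViX = cho_solve(c, X)                      # V^{-1} X
    ViY = cho_solve(c, Y)                      # V^{-1} Y
    return np.linalg.solve(X.T @ ViX, X.T @ ViY)
```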

BLUP

Given \(\boldsymbol{\theta}\), we often wish to predict the random effects \(\mathbf{b}\). The Best Linear Unbiased Predictor (BLUP) of \(\mathbf{b}\) is given by

(3)\[\widehat{\mathbf{b}}(\boldsymbol{\theta}) = \mathbf{G}(\boldsymbol{\theta})\mathbf{Z}^\top \mathbf{V}(\boldsymbol{\theta})^{-1} \big(\mathbf{Y} - \mathbf{X}\widehat{\boldsymbol{\beta}}(\boldsymbol{\theta})\big).\]
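
A corresponding sketch of (3); the hypothetical helper `blup` recomputes the BLUE internally so the snippet stands alone, with \(\mathbf{G}\) and \(\mathbf{V}\) evaluated at a fixed \(\boldsymbol{\theta}\):

```python
# Minimal sketch of the BLUP (3), reusing the Cholesky factor of V.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def blup(X, Y, Z, G, V):
    """Return G Z^T V^{-1} (Y - X beta_hat(theta))."""
    c = cho_factor(V)
    ViX = cho_solve(c, X)
    beta_hat = np.linalg.solve(X.T @ ViX, X.T @ cho_solve(c, Y))  # BLUE
    return G @ Z.T @ cho_solve(c, Y - X @ beta_hat)
```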

ML Estimation

Maximum likelihood (ML) estimation jointly estimates the fixed effects \(\boldsymbol{\beta}\) and variance components \(\boldsymbol{\theta}\) by maximising the marginal likelihood of the observed data. After integrating out the random effects \(\mathbf{b}\) from the joint model, the marginal distribution of \(\mathbf{Y}\) is

\[\mathbf{Y} \sim \mathcal{N}\!\left(\mathbf{X}\boldsymbol{\beta},\; \mathbf{V}(\boldsymbol{\theta})\right),\]

where \(\mathbf{V}(\boldsymbol{\theta})\) is the marginal covariance matrix defined in (2). The corresponding log-likelihood is given by

\[\ell(\boldsymbol{\beta}, \boldsymbol{\theta}) = -\frac{1}{2}\left[ n\log(2\pi) + \log\left|\mathbf{V}(\boldsymbol{\theta})\right| + (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^\top \mathbf{V}(\boldsymbol{\theta})^{-1} (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) \right].\]

For fixed \(\boldsymbol{\theta}\), the log-likelihood is quadratic in \(\boldsymbol{\beta}\), and the maximiser can be obtained in closed form by setting \(\partial \ell / \partial \boldsymbol{\beta} = \mathbf{0}\). This yields the generalised least squares (GLS) estimator

\[\widehat{\boldsymbol{\beta}}(\boldsymbol{\theta}) = \left( \mathbf{X}^\top \mathbf{V}(\boldsymbol{\theta})^{-1} \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{V}(\boldsymbol{\theta})^{-1} \mathbf{Y}.\]

Substituting this expression back into the log-likelihood eliminates \(\boldsymbol{\beta}\) and produces the profile log-likelihood

\[\ell_p(\boldsymbol{\theta}) = -\frac{1}{2}\left[ n\log(2\pi) + \log\left|\mathbf{V}(\boldsymbol{\theta})\right| + \mathbf{Y}^\top \mathbf{P}(\boldsymbol{\theta})\, \mathbf{Y} \right],\]

where

\[\mathbf{P}(\boldsymbol{\theta}) = \mathbf{V}(\boldsymbol{\theta})^{-1} - \mathbf{V}(\boldsymbol{\theta})^{-1}\mathbf{X} \left( \mathbf{X}^\top\mathbf{V}(\boldsymbol{\theta})^{-1}\mathbf{X} \right)^{-1} \mathbf{X}^\top\mathbf{V}(\boldsymbol{\theta})^{-1}\]

is the residual projection matrix. The resulting optimisation problem depends only on \(\boldsymbol{\theta}\).
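
The profile log-likelihood and \(\mathbf{P}(\boldsymbol{\theta})\) translate directly into code. The sketch below forms \(\mathbf{P}\) explicitly only to mirror the formulas (this costs \(O(n^2)\) storage, which production code avoids); `theta_to_V` is an assumed user-supplied map \(\boldsymbol{\theta} \mapsto \mathbf{V}(\boldsymbol{\theta})\):

```python
# Direct transcription of P(theta) and the profile log-likelihood.
import numpy as np

def projection_matrix(X, V):
    """P = V^{-1} - V^{-1} X (X^T V^{-1} X)^{-1} X^T V^{-1}."""
    Vi = np.linalg.inv(V)
    ViX = Vi @ X
    return Vi - ViX @ np.linalg.solve(X.T @ ViX, ViX.T)

def profile_loglik(theta, theta_to_V, X, Y):
    """l_p(theta) = -0.5 * (n log(2 pi) + log|V| + Y^T P Y)."""
    V = theta_to_V(theta)
    P = projection_matrix(X, V)
    _, logdetV = np.linalg.slogdet(V)
    n = len(Y)
    return -0.5 * (n * np.log(2 * np.pi) + logdetV + Y @ P @ Y)
```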

Since \(\ell_p(\boldsymbol{\theta})\) is nonlinear in \(\boldsymbol{\theta}\) and does not admit a closed-form maximiser, numerical optimisation methods are required. These methods rely on the score vector and a curvature approximation. Differentiating the profile log-likelihood with respect to a variance component \(\theta_k\) gives

\[\frac{\partial \ell_p}{\partial \theta_k} = -\frac{1}{2}\left[ \operatorname{tr}\!\left( \mathbf{V}^{-1}\frac{\partial \mathbf{V}}{\partial \theta_k} \right) - \mathbf{Y}^\top \mathbf{P} \frac{\partial \mathbf{V}}{\partial \theta_k} \mathbf{P}\,\mathbf{Y} \right],\]

where the identity \(\partial \log|\mathbf{V}|/\partial \theta_k = \operatorname{tr}(\mathbf{V}^{-1}\partial \mathbf{V}/\partial \theta_k)\) has been used, together with \(\partial \mathbf{P}/\partial \theta_k = -\mathbf{P}(\partial \mathbf{V}/\partial \theta_k)\mathbf{P}\).

A common curvature approximation is given by the Fisher information matrix, whose \((k,j)\) entry is

\[\mathcal{I}_{kj}(\boldsymbol{\theta}) = \frac{1}{2}\operatorname{tr}\!\left( \mathbf{V}^{-1}\frac{\partial\mathbf{V}}{\partial\theta_k} \mathbf{V}^{-1}\frac{\partial\mathbf{V}}{\partial\theta_j} \right).\]

Starting from an initial value \(\boldsymbol{\theta}^{(0)}\), the Fisher scoring iteration updates \(\boldsymbol{\theta}\) according to

\[\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \mathcal{I}\!\left(\boldsymbol{\theta}^{(t)}\right)^{-1} \mathbf{s}\!\left(\boldsymbol{\theta}^{(t)}\right),\]

where \(\mathbf{s}(\boldsymbol{\theta})\) denotes the score vector with components \(\partial \ell_p / \partial \theta_k\). In practice, variance components are typically reparameterised (e.g. on a logarithmic scale) to ensure positivity during optimisation. Iterations continue until convergence, for example until \(\|\boldsymbol{\theta}^{(t+1)} - \boldsymbol{\theta}^{(t)}\| < \varepsilon\) for a prescribed tolerance \(\varepsilon > 0\).
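
A sketch of the Fisher scoring loop, assuming user-supplied callables `theta_to_V` and `theta_to_dV` (the latter returning the list \([\partial\mathbf{V}/\partial\theta_1, \ldots]\)); for brevity it updates \(\boldsymbol{\theta}\) on the raw scale and omits the safeguards noted above:

```python
# Fisher scoring for the profile likelihood (raw-scale updates, no guards).
import numpy as np

def fisher_scoring_ml(theta0, theta_to_V, theta_to_dV, X, Y,
                      tol=1e-8, max_iter=100):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        V, dV = theta_to_V(theta), theta_to_dV(theta)
        Vi = np.linalg.inv(V)
        ViX = Vi @ X
        P = Vi - ViX @ np.linalg.solve(X.T @ ViX, ViX.T)
        PY = P @ Y
        # Score: dl_p/dtheta_k = -0.5 * (tr(V^{-1} dVk) - Y^T P dVk P Y)
        score = np.array([-0.5 * (np.trace(Vi @ dVk) - PY @ dVk @ PY)
                          for dVk in dV])
        # Fisher information: I_kj = 0.5 * tr(V^{-1} dVk V^{-1} dVj)
        info = 0.5 * np.array([[np.trace(Vi @ dVk @ Vi @ dVj)
                                for dVj in dV] for dVk in dV])
        step = np.linalg.solve(info, score)
        theta = theta + step
        if np.linalg.norm(step) < tol:
            return theta
    return theta
```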

Upon convergence, the ML estimate of \(\boldsymbol{\theta}\) is denoted \(\widehat{\boldsymbol{\theta}}_{ML}\), and the ML estimate of \(\boldsymbol{\beta}\) is recovered analytically as

\[\widehat{\boldsymbol{\beta}}_{ML} = \widehat{\boldsymbol{\beta}}\!\left(\widehat{\boldsymbol{\theta}}_{ML}\right) = \left( \mathbf{X}^\top \mathbf{V}(\widehat{\boldsymbol{\theta}}_{ML})^{-1} \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{V}(\widehat{\boldsymbol{\theta}}_{ML})^{-1} \mathbf{Y}.\]

Remark

ML estimates \(\boldsymbol{\beta}\) and \(\boldsymbol{\theta}\) jointly from the full marginal likelihood and, under standard regularity conditions, is consistent as \(n \to \infty\). In finite samples, however, ML tends to underestimate the variance components because it does not account for the degrees of freedom consumed by estimating \(\boldsymbol{\beta}\). This bias motivates restricted maximum likelihood (REML), which maximises a likelihood formed from residual contrasts orthogonal to the column space of \(\mathbf{X}\), thereby adjusting for the estimation of the fixed effects when estimating \(\boldsymbol{\theta}\).

REML Estimation

Restricted maximum likelihood (REML) constructs a likelihood based on linear combinations of \(\mathbf{Y}\) that are free of \(\boldsymbol{\beta}\). Assuming \(\mathbf{X}\) has full column rank \(p\), let \(\mathbf{A}\) be an \((n-p)\times n\) matrix satisfying \(\mathbf{A}\mathbf{X}=\mathbf{0}\) and \(\operatorname{rank}(\mathbf{A})=n-p\). Then

\[\mathbf{A}\mathbf{Y} \sim \mathcal{N}\big(\mathbf{0},\; \mathbf{A}\mathbf{V}(\boldsymbol{\theta})\mathbf{A}^\top\big),\]

and the corresponding log-likelihood is

\[\ell_R(\boldsymbol{\theta}) = -\frac{1}{2}\left[ (n-p)\log(2\pi) + \log\big|\mathbf{A}\mathbf{V}(\boldsymbol{\theta})\mathbf{A}^\top\big| + (\mathbf{A}\mathbf{Y})^\top \big(\mathbf{A}\mathbf{V}(\boldsymbol{\theta})\mathbf{A}^\top\big)^{-1} (\mathbf{A}\mathbf{Y}) \right].\]

Using standard matrix identities, this expression can be rewritten as

\[\boxed{ \ell_R(\boldsymbol{\theta}) = -\frac{1}{2}\left[ \log\big|\mathbf{V}(\boldsymbol{\theta})\big| + \log\big|\mathbf{X}^\top \mathbf{V}(\boldsymbol{\theta})^{-1} \mathbf{X}\big| + \mathbf{Y}^\top \mathbf{P}(\boldsymbol{\theta}) \mathbf{Y} \right] + C },\]

where

\[\boxed{ \mathbf{P}(\boldsymbol{\theta}) = \mathbf{V}(\boldsymbol{\theta})^{-1} - \mathbf{V}(\boldsymbol{\theta})^{-1}\mathbf{X} \left( \mathbf{X}^\top \mathbf{V}(\boldsymbol{\theta})^{-1}\mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{V}(\boldsymbol{\theta})^{-1} },\]

and \(C\) is a constant independent of \(\boldsymbol{\theta}\). Since additive constants do not affect maximisation with respect to \(\boldsymbol{\theta}\), the REML log-likelihood is taken to be the above expression up to an additive constant.
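
The boxed expression can be evaluated directly; a minimal sketch, dropping the constant \(C\):

```python
# Sketch of the REML log-likelihood (up to the additive constant C).
import numpy as np

def reml_loglik(V, X, Y):
    """l_R(theta) + const = -0.5 * (log|V| + log|X^T V^{-1} X| + Y^T P Y)."""
    Vi = np.linalg.inv(V)
    ViX = Vi @ X
    XtViX = X.T @ ViX
    P = Vi - ViX @ np.linalg.solve(XtViX, ViX.T)
    _, logdetV = np.linalg.slogdet(V)
    _, logdetM = np.linalg.slogdet(XtViX)
    return -0.5 * (logdetV + logdetM + Y @ P @ Y)
```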

Score Function

Let \(\theta_k\) denote the \(k\)-th component of \(\boldsymbol{\theta}\), and define

\[\dot{\mathbf{V}}_k = \frac{\partial \mathbf{V}(\boldsymbol{\theta})}{\partial \theta_k}.\]

Starting from

\[\ell_R(\boldsymbol{\theta}) = -\frac{1}{2}\left[ \log|\mathbf{V}| + \log\big|\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}\big| + \mathbf{Y}^\top \mathbf{P}\mathbf{Y} \right] + C,\]

we differentiate each term with respect to \(\theta_k\).

Derivative of \(\log|\mathbf{V}|\)

\[\frac{\partial}{\partial \theta_k}\log|\mathbf{V}| = \operatorname{tr}\!\big(\mathbf{V}^{-1}\dot{\mathbf{V}}_k\big).\]

Derivative of \(\log|\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}|\)

Set \(\mathbf{M} = \mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}\). Then

\[\frac{\partial}{\partial \theta_k}\log|\mathbf{M}| = \operatorname{tr}\!\big(\mathbf{M}^{-1} \dot{\mathbf{M}}_k\big),\]

where

\[\dot{\mathbf{M}}_k = \frac{\partial}{\partial \theta_k} \big(\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}\big) = -\,\mathbf{X}^\top \mathbf{V}^{-1} \dot{\mathbf{V}}_k \mathbf{V}^{-1}\mathbf{X}.\]

Hence,

\[\frac{\partial}{\partial \theta_k} \log\big|\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}\big| = -\,\operatorname{tr}\!\Big( \big(\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}\big)^{-1} \mathbf{X}^\top \mathbf{V}^{-1} \dot{\mathbf{V}}_k \mathbf{V}^{-1}\mathbf{X} \Big).\]

Derivative of \(\mathbf{Y}^\top \mathbf{P}\mathbf{Y}\)

\[\frac{\partial}{\partial \theta_k} \big(\mathbf{Y}^\top \mathbf{P}\mathbf{Y}\big) = \mathbf{Y}^\top \frac{\partial \mathbf{P}}{\partial \theta_k} \mathbf{Y}.\]

Using

\[\frac{\partial \mathbf{P}}{\partial \theta_k} = -\,\mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P},\]

we obtain

\[\frac{\partial}{\partial \theta_k} \big(\mathbf{Y}^\top \mathbf{P}\mathbf{Y}\big) = -\,\mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\mathbf{Y}.\]

Compact Form

Combining the three components and noting that \(C\) does not depend on \(\boldsymbol{\theta}\), the score function is

\[S_k(\boldsymbol{\theta}) = \frac{\partial \ell_R(\boldsymbol{\theta})}{\partial \theta_k} = -\frac{1}{2} \left[ \operatorname{tr}\!\big(\mathbf{V}^{-1}\dot{\mathbf{V}}_k\big) - \operatorname{tr}\!\Big( \big(\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}\big)^{-1} \mathbf{X}^\top \mathbf{V}^{-1} \dot{\mathbf{V}}_k \mathbf{V}^{-1}\mathbf{X} \Big) - \mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\mathbf{Y} \right].\]

To obtain a more compact form, recall the identity

\[\mathbf{P} = \mathbf{V}^{-1} - \mathbf{V}^{-1}\mathbf{X} \left( \mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{V}^{-1}.\]

Hence,

\[\operatorname{tr}\!\big(\mathbf{P}\,\dot{\mathbf{V}}_k\big) = \operatorname{tr}\!\big(\mathbf{V}^{-1}\dot{\mathbf{V}}_k\big) - \operatorname{tr}\!\Big( \mathbf{V}^{-1}\mathbf{X} \left( \mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{V}^{-1} \dot{\mathbf{V}}_k \Big).\]

Using the cyclic property of the trace, \(\operatorname{tr}(\mathbf{D}\mathbf{A}\mathbf{B}\mathbf{C}) = \operatorname{tr}(\mathbf{A}\mathbf{B}\mathbf{C}\mathbf{D})\), the second term becomes

\[\operatorname{tr}\!\Big( \mathbf{V}^{-1}\mathbf{X} \left( \mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{V}^{-1} \dot{\mathbf{V}}_k \Big) = \operatorname{tr}\!\Big( \left( \mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{V}^{-1} \dot{\mathbf{V}}_k \mathbf{V}^{-1}\mathbf{X} \Big).\]

Therefore,

\[\operatorname{tr}\!\big(\mathbf{P}\,\dot{\mathbf{V}}_k\big) = \operatorname{tr}\!\big(\mathbf{V}^{-1}\dot{\mathbf{V}}_k\big) - \operatorname{tr}\!\Big( \big(\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}\big)^{-1} \mathbf{X}^\top \mathbf{V}^{-1} \dot{\mathbf{V}}_k \mathbf{V}^{-1}\mathbf{X} \Big).\]

Substituting this identity into the previous expression for \(S_k(\boldsymbol{\theta})\) yields

(4)\[\boxed{ S_k(\boldsymbol{\theta}) = \frac{1}{2} \left[ \mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\mathbf{Y} - \operatorname{tr}\!\big(\mathbf{P}\,\dot{\mathbf{V}}_k\big) \right] }.\]
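
A one-function sketch of the boxed score (4), taking \(\mathbf{P}\) and \(\dot{\mathbf{V}}_k\) as precomputed dense arrays:

```python
# One component of the REML score (4).
import numpy as np

def reml_score_component(P, dVk, Y):
    """S_k = 0.5 * (Y^T P dVk P Y - tr(P dVk))."""
    PY = P @ Y
    return 0.5 * (PY @ dVk @ PY - np.trace(P @ dVk))
```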

Average Information Matrix

The score function for \(\theta_k\) is

\[S_k(\boldsymbol{\theta}) = \frac{1}{2} \left[ \mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\mathbf{Y} - \operatorname{tr}\!\big(\mathbf{P}\,\dot{\mathbf{V}}_k\big) \right],\]

where

\[\dot{\mathbf{V}}_k = \frac{\partial \mathbf{V}(\boldsymbol{\theta})}{\partial \theta_k}.\]

Observed Information

Differentiating \(S_k\) with respect to \(\theta_j\) gives

\[-\frac{\partial^2 \ell_R}{\partial \theta_k \partial \theta_j} = -\frac{1}{2} \left[ \mathbf{Y}^\top \frac{\partial}{\partial \theta_j} \big(\mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\big) \mathbf{Y} - \frac{\partial}{\partial \theta_j} \operatorname{tr}\!\big(\mathbf{P}\,\dot{\mathbf{V}}_k\big) \right].\]

For the quadratic term, using the product rule and \(\partial \mathbf{P}/\partial \theta_j = -\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\), we obtain

\[\frac{\partial}{\partial \theta_j} \big(\mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\big) = -\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P} - \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P} + \mathbf{P}\,\ddot{\mathbf{V}}_{kj}\,\mathbf{P}.\]

For the trace term,

\[\frac{\partial}{\partial \theta_j} \operatorname{tr}\!\big(\mathbf{P}\,\dot{\mathbf{V}}_k\big) = -\,\operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\,\dot{\mathbf{V}}_k \big) + \operatorname{tr}\!\big( \mathbf{P}\,\ddot{\mathbf{V}}_{kj} \big).\]

Putting everything together, and noting that the two quadratic forms \(\mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\mathbf{Y}\) and \(\mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\,\mathbf{Y}\) are equal scalars (each is the transpose of the other, since \(\mathbf{P}\) and the \(\dot{\mathbf{V}}\) matrices are symmetric), we obtain the observed information:

\[\mathcal{J}_{kj}(\boldsymbol{\theta}) = -\frac{\partial^2 \ell_R}{\partial \theta_k \partial \theta_j} = \mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P} \mathbf{Y} -\frac{1}{2} \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\,\dot{\mathbf{V}}_k \big) +\frac{1}{2} \operatorname{tr}\!\big( \mathbf{P}\,\ddot{\mathbf{V}}_{kj} \big) - \frac{1}{2} \mathbf{Y}^\top \mathbf{P}\,\ddot{\mathbf{V}}_{kj}\,\mathbf{P}\mathbf{Y}.\]
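
For reference, a sketch evaluating one entry of this observed information; `ddVkj` stands for \(\ddot{\mathbf{V}}_{kj}\):

```python
# One entry of the observed information J_kj.
import numpy as np

def observed_information_entry(P, dVk, dVj, ddVkj, Y):
    PY = P @ Y
    quad_1 = PY @ (dVk @ (P @ (dVj @ PY)))   # Y^T P dVk P dVj P Y
    quad_2 = PY @ (ddVkj @ PY)               # Y^T P ddVkj P Y
    tr_1 = np.trace(P @ dVj @ P @ dVk)
    tr_2 = np.trace(P @ ddVkj)
    return quad_1 - 0.5 * tr_1 + 0.5 * tr_2 - 0.5 * quad_2
```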

Expected Information

We now derive the expected Fisher information matrix by taking expectations of the negative second derivatives. Recall that under the model, \(\mathbb{E}[\mathbf{Y}] = \mathbf{X}\boldsymbol{\beta}\). However, a fundamental property of the projection matrix \(\mathbf{P}\) is that \(\mathbf{P}\mathbf{X} = \mathbf{0}\), which implies \(\mathbf{P}\mathbf{Y} = \mathbf{P}(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})\). Consequently, when computing expectations of expressions involving \(\mathbf{P}\), we may without loss of generality treat \(\mathbf{Y}\) as centered at zero. We therefore take \(\mathbb{E}[\mathbf{Y}] = \mathbf{0}\) and \(\mathbb{E}[\mathbf{Y}\mathbf{Y}^\top] = \mathbf{V}\) for the remainder of this derivation.

The observed information matrix is given by

\[-\frac{\partial^2 \ell_R}{\partial \theta_k \partial \theta_j} = \underbrace{\mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P} \mathbf{Y}}_{T_1} -\frac12 \underbrace{\operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\,\dot{\mathbf{V}}_k \big)}_{T_2} +\frac12 \underbrace{\operatorname{tr}\!\big( \mathbf{P}\,\ddot{\mathbf{V}}_{kj} \big)}_{T_3} - \frac12 \underbrace{\mathbf{Y}^\top \mathbf{P}\,\ddot{\mathbf{V}}_{kj}\,\mathbf{P}\mathbf{Y}}_{T_4}.\]

We evaluate the expectation of each term in turn.

Expectation of \(T_1\)

The quantity \(T_1\) is a quadratic form in \(\mathbf{Y}\). For a zero-mean random vector with covariance \(\mathbf{V}\), the identity \(\mathbb{E}[\mathbf{Y}^\top \mathbf{A} \mathbf{Y}] = \operatorname{tr}(\mathbf{A} \mathbf{V})\) holds for any conformable matrix \(\mathbf{A}\). Applying this with \(\mathbf{A} = \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\) yields

\[\mathbb{E}[T_1] = \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P} \, \mathbf{V} \big).\]

A key identity satisfied by \(\mathbf{P}\) is \(\mathbf{P}\mathbf{V}\mathbf{P} = \mathbf{P}\), which can be verified by direct substitution of the definition \(\mathbf{P} = \mathbf{V}^{-1} - \mathbf{V}^{-1}\mathbf{X}(\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{V}^{-1}\). Using the cyclic property of the trace,

\begin{align*} \mathbb{E}[T_1] &= \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\, (\mathbf{P}\mathbf{V}) \big) \\ &= \operatorname{tr}\!\big( \dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\mathbf{V}\mathbf{P} \big) \quad \text{(cyclic permutation)} \\ &= \operatorname{tr}\!\big( \dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P} \big) \quad \text{(since $\mathbf{P}\mathbf{V}\mathbf{P} = \mathbf{P}$)} \\ &= \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j \big) \quad \text{(cyclic permutation again)}. \end{align*}

Thus

\[\mathbb{E}[T_1] = \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j \big).\]
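
Both ingredients of this derivation, \(\mathbf{P}\mathbf{V}\mathbf{P} = \mathbf{P}\) and the resulting trace simplification, can be checked numerically on random inputs:

```python
# Numerical check of P V P = P and tr(P dVk P dVj P V) = tr(P dVk P dVj).
import numpy as np

rng = np.random.default_rng(0)
n, p = 8, 2
X = rng.standard_normal((n, p))
A = rng.standard_normal((n, n))
V = A @ A.T + n * np.eye(n)                      # a well-conditioned SPD V
B = rng.standard_normal((n, n)); dVk = B + B.T   # symmetric "derivatives"
C = rng.standard_normal((n, n)); dVj = C + C.T

Vi = np.linalg.inv(V)
ViX = Vi @ X
P = Vi - ViX @ np.linalg.solve(X.T @ ViX, ViX.T)

assert np.allclose(P @ V @ P, P)
assert np.allclose(np.trace(P @ dVk @ P @ dVj @ P @ V),
                   np.trace(P @ dVk @ P @ dVj))
```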

Expectation of \(T_2\)

The term \(T_2\) contains no random quantities; it is a deterministic function of the parameters. Hence

\[\mathbb{E}[T_2] = \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\,\dot{\mathbf{V}}_k \big).\]

Expectation of \(T_3\)

Similarly, \(T_3\) is non-random, giving

\[\mathbb{E}[T_3] = \operatorname{tr}\!\big( \mathbf{P}\,\ddot{\mathbf{V}}_{kj} \big).\]

Expectation of \(T_4\)

The term \(T_4\) is again a quadratic form. Applying the same identity as for \(T_1\),

\[\mathbb{E}[T_4] = \operatorname{tr}\!\big( \mathbf{P}\,\ddot{\mathbf{V}}_{kj}\,\mathbf{P} \, \mathbf{V} \big).\]

Using the cyclic property followed by \(\mathbf{P}\mathbf{V}\mathbf{P} = \mathbf{P}\),

\begin{align*} \mathbb{E}[T_4] &= \operatorname{tr}\!\big( \ddot{\mathbf{V}}_{kj}\,\mathbf{P} \mathbf{V} \mathbf{P} \big) \\ &= \operatorname{tr}\!\big( \ddot{\mathbf{V}}_{kj}\,\mathbf{P} \big) \\ &= \operatorname{tr}\!\big( \mathbf{P}\,\ddot{\mathbf{V}}_{kj} \big). \end{align*}

Combining the expectations

Assembling the four components,

\begin{align*} \mathbb{E}\!\left[ -\frac{\partial^2 \ell_R}{\partial \theta_k \partial \theta_j} \right] &= \mathbb{E}[T_1] - \frac12 \mathbb{E}[T_2] + \frac12 \mathbb{E}[T_3] - \frac12 \mathbb{E}[T_4] \\[4pt] &= \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j \big) - \frac12 \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\,\dot{\mathbf{V}}_k \big) \\ &\quad + \frac12 \operatorname{tr}\!\big( \mathbf{P}\,\ddot{\mathbf{V}}_{kj} \big) - \frac12 \operatorname{tr}\!\big( \mathbf{P}\,\ddot{\mathbf{V}}_{kj} \big). \end{align*}

The two terms involving the second derivatives \(\ddot{\mathbf{V}}_{kj}\) cancel identically. Moreover, by the cyclic property of the trace,

\[\operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\,\dot{\mathbf{V}}_k \big) = \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j \big).\]

Substituting this into the expression above,

\begin{align*} \mathbb{E}\!\left[ -\frac{\partial^2 \ell_R}{\partial \theta_k \partial \theta_j} \right] &= \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j \big) - \frac12 \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j \big) \\[4pt] &= \frac12 \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j \big). \end{align*}

This final expression defines the \((k,j)\)-th entry of the REML Fisher information matrix,

\[\mathcal{I}_{kj}(\boldsymbol{\theta}) = \frac12 \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j \big).\]

Definition of the Average Information

The expected information (Fisher information) is

\[\mathcal{I}_{kj}(\boldsymbol{\theta}) = \frac12 \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j \big).\]

The observed information is denoted \(\mathcal{J}_{kj}(\boldsymbol{\theta})\), and the average information matrix is defined as

\[\mathcal{A}_{kj}(\boldsymbol{\theta}) = \frac12 \Big( \mathcal{I}_{kj}(\boldsymbol{\theta}) + \mathcal{J}_{kj}(\boldsymbol{\theta}) \Big).\]

Substituting the expressions and simplifying, the trace terms cancel, yielding

\begin{align*} \mathcal{A}_{kj} &= \frac12 \mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P} \mathbf{Y} \\ &\quad + \frac14 \operatorname{tr}\!\big( \mathbf{P}\,\ddot{\mathbf{V}}_{kj} \big) - \frac14 \mathbf{Y}^\top \mathbf{P}\,\ddot{\mathbf{V}}_{kj}\,\mathbf{P}\mathbf{Y}. \end{align*}

Simplification for Linear Variance Component Models

In the common case where \(\mathbf{V}(\boldsymbol{\theta}) = \sum_{t=1}^{m} \theta_t \mathbf{M}_t\) for known symmetric matrices \(\mathbf{M}_t\), so that \(\dot{\mathbf{V}}_t = \mathbf{M}_t\), the second derivatives vanish: \(\ddot{\mathbf{V}}_{kj} = \mathbf{0}\) for all \(k,j\). The average information matrix then reduces to

(5)\[\boxed{ \mathcal{A}_{kj} = \frac12 \,\mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P} \mathbf{Y} }.\]
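
Equation (5) involves only first derivatives and the vector \(\mathbf{P}\mathbf{Y}\): with \(\mathbf{u}_k = \dot{\mathbf{V}}_k \mathbf{P}\mathbf{Y}\), each entry is \(\mathcal{A}_{kj} = \tfrac12 \mathbf{u}_k^\top \mathbf{P}\, \mathbf{u}_j\). A sketch assembling the whole matrix at once:

```python
# Average information (5): A = 0.5 * U^T P U with columns u_k = dVk P Y.
import numpy as np

def average_information(P, dV, Y):
    PY = P @ Y
    U = np.column_stack([dVk @ PY for dVk in dV])
    return 0.5 * (U.T @ P @ U)
```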

This is the form commonly used in the AI-REML algorithm [Gilmour1995], in which the average information matrix is computed directly from the data without requiring second derivatives. Writing \(\mathbf{S}(\boldsymbol{\theta})\) for the vector of score components \(S_k(\boldsymbol{\theta})\) in (4), the algorithm proceeds by iteratively solving

\[\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \big( \mathcal{A}(\boldsymbol{\theta}^{(t)}) \big)^{-1} \mathbf{S}(\boldsymbol{\theta}^{(t)}).\]
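
Combining (4) and (5) gives a compact AI-REML iteration. The self-contained sketch below assumes the linear variance model \(\mathbf{V}(\boldsymbol{\theta}) = \sum_t \theta_t \mathbf{M}_t\); a production implementation would add step-halving, a positivity-preserving reparameterisation, and a likelihood-based convergence check:

```python
# Sketch of the AI-REML loop for V(theta) = sum_t theta_t * M_t,
# with M a list of known symmetric matrices (so dV/dtheta_t = M_t).
import numpy as np

def ai_reml(theta0, M, X, Y, tol=1e-8, max_iter=100):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        V = sum(t * Mt for t, Mt in zip(theta, M))
        Vi = np.linalg.inv(V)
        ViX = Vi @ X
        P = Vi - ViX @ np.linalg.solve(X.T @ ViX, ViX.T)
        PY = P @ Y
        # Score components (4): S_k = 0.5 * (Y^T P M_k P Y - tr(P M_k))
        S = np.array([0.5 * (PY @ Mk @ PY - np.trace(P @ Mk)) for Mk in M])
        # Average information (5) from the columns u_k = M_k P Y
        U = np.column_stack([Mk @ PY for Mk in M])
        AI = 0.5 * (U.T @ P @ U)
        step = np.linalg.solve(AI, S)
        theta = theta + step
        if np.linalg.norm(step) < tol:
            return theta
    return theta
```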

References

[Gilmour1995]

Gilmour, A. R., Thompson, R., & Cullis, B. R. (1995). Average information REML: An efficient algorithm for variance parameter estimation in linear mixed models. Biometrics, 51(4), 1440-1450.