Technical Documentation
========================================

Gaussian Linear Mixed Model Formulation
---------------------------------------

Consider the linear mixed model

.. math::
   :label: eq-main

   \mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{b} + \boldsymbol{\varepsilon},

where:

- :math:`\mathbf{Y} \in \mathbb{R}^{n \times 1}` is the response vector,
- :math:`\mathbf{X} \in \mathbb{R}^{n \times p}` is the fixed-effects design matrix,
- :math:`\boldsymbol{\beta} \in \mathbb{R}^{p \times 1}` is the vector of fixed-effect coefficients,
- :math:`\mathbf{Z} \in \mathbb{R}^{n \times q}` is the random-effects design matrix,
- :math:`\mathbf{b} \in \mathbb{R}^{q \times 1}` is the vector of random effects,
- :math:`\boldsymbol{\varepsilon} \in \mathbb{R}^{n \times 1}` is the vector of residual errors.

We assume

.. math::

   \mathbf{b} \sim \mathcal{N}(\mathbf{0}, \mathbf{G}), \quad
   \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{R}), \quad
   \mathbf{b} \perp \boldsymbol{\varepsilon},

where :math:`\mathbf{G} \in \mathbb{R}^{q \times q}` and :math:`\mathbf{R} \in \mathbb{R}^{n \times n}` are covariance matrices. Let :math:`\boldsymbol{\theta}` denote the collection of variance components that parameterise :math:`\mathbf{G}` and :math:`\mathbf{R}`, i.e.,

.. math::

   \mathbf{G} = \mathbf{G}(\boldsymbol{\theta}), \quad
   \mathbf{R} = \mathbf{R}(\boldsymbol{\theta}).

Under these assumptions, the marginal distribution of :math:`\mathbf{Y}` is given by

.. math::

   \mathbf{Y} \sim \mathcal{N}\left(\mathbf{X}\boldsymbol{\beta},\; \mathbf{V}(\boldsymbol{\theta})\right),

where

.. math::
   :label: eq-sigma

   \mathbf{V}(\boldsymbol{\theta}) = \mathbf{Z}\mathbf{G}(\boldsymbol{\theta})\mathbf{Z}^\top + \mathbf{R}(\boldsymbol{\theta}).

BLUE
~~~~

Given :math:`\boldsymbol{\theta}`, the Best Linear Unbiased Estimator (BLUE) of :math:`\boldsymbol{\beta}` is the linear unbiased estimator that minimises the variance among all such estimators. Under the linear mixed model with covariance matrix :math:`\mathbf{V}(\boldsymbol{\theta})`, the BLUE of :math:`\boldsymbol{\beta}` is given by

.. math::

   \widehat{\boldsymbol{\beta}}(\boldsymbol{\theta})
   = \left(\mathbf{X}^\top \mathbf{V}(\boldsymbol{\theta})^{-1} \mathbf{X}\right)^{-1}
     \mathbf{X}^\top \mathbf{V}(\boldsymbol{\theta})^{-1} \mathbf{Y}.

This estimator is the generalised least squares (GLS) estimator of :math:`\boldsymbol{\beta}`.

BLUP
~~~~

Given :math:`\boldsymbol{\theta}`, we often wish to predict the random effects :math:`\mathbf{b}`. The Best Linear Unbiased Predictor (BLUP) of :math:`\mathbf{b}` is given by

.. math::
   :label: eq-blup

   \widehat{\mathbf{b}}(\boldsymbol{\theta})
   = \mathbf{G}(\boldsymbol{\theta})\mathbf{Z}^\top \mathbf{V}(\boldsymbol{\theta})^{-1}
     \left(\mathbf{Y} - \mathbf{X}\widehat{\boldsymbol{\beta}}(\boldsymbol{\theta})\right).
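The closed-form expressions for the BLUE and BLUP translate directly into a few lines of linear algebra. The following is a minimal NumPy sketch assuming :math:`\mathbf{G}` and :math:`\mathbf{R}` are known; the function name ``blue_blup`` and the toy random-intercept design are illustrative only, and linear solves are used in place of explicit inverses for numerical stability.

.. code-block:: python

   import numpy as np

   def blue_blup(Y, X, Z, G, R):
       """BLUE of beta and BLUP of b for known G and R (illustrative sketch)."""
       V = Z @ G @ Z.T + R                        # marginal covariance, eq-sigma
       Vinv_X = np.linalg.solve(V, X)             # V^{-1} X
       Vinv_Y = np.linalg.solve(V, Y)             # V^{-1} Y
       beta = np.linalg.solve(X.T @ Vinv_X, X.T @ Vinv_Y)   # GLS estimator (BLUE)
       b = G @ Z.T @ np.linalg.solve(V, Y - X @ beta)       # BLUP, eq-blup
       return beta, b

   # Toy example: two groups of three observations with a random intercept.
   rng = np.random.default_rng(0)
   n, q = 6, 2
   X = np.column_stack([np.ones(n), rng.normal(size=n)])
   Z = np.kron(np.eye(q), np.ones((3, 1)))        # group-membership indicators
   G, R = 0.5 * np.eye(q), 1.0 * np.eye(n)
   Y = (X @ np.array([1.0, 2.0])
        + Z @ rng.normal(scale=0.7, size=q)
        + rng.normal(size=n))
   beta_hat, b_hat = blue_blup(Y, X, Z, G, R)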
ML Estimation
~~~~~~~~~~~~~

Maximum likelihood (ML) estimation jointly estimates the fixed effects :math:`\boldsymbol{\beta}` and the variance components :math:`\boldsymbol{\theta}` by maximising the marginal likelihood of the observed data. After integrating the random effects :math:`\mathbf{b}` out of the joint model, the marginal distribution of :math:`\mathbf{Y}` is :math:`\mathcal{N}\left(\mathbf{X}\boldsymbol{\beta},\, \mathbf{V}(\boldsymbol{\theta})\right)` with :math:`\mathbf{V}(\boldsymbol{\theta})` as in :eq:`eq-sigma`. The corresponding log-likelihood is

.. math::

   \ell(\boldsymbol{\beta}, \boldsymbol{\theta})
   = -\frac{1}{2}\left[
   n\log(2\pi)
   + \log\left|\mathbf{V}(\boldsymbol{\theta})\right|
   + (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^\top
     \mathbf{V}(\boldsymbol{\theta})^{-1}
     (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})
   \right].

For fixed :math:`\boldsymbol{\theta}`, the log-likelihood is quadratic in :math:`\boldsymbol{\beta}`, and the maximiser is obtained in closed form by setting :math:`\partial \ell / \partial \boldsymbol{\beta} = \mathbf{0}`. This yields the generalised least squares (GLS) estimator

.. math::

   \widehat{\boldsymbol{\beta}}(\boldsymbol{\theta})
   = \left( \mathbf{X}^\top \mathbf{V}(\boldsymbol{\theta})^{-1} \mathbf{X} \right)^{-1}
     \mathbf{X}^\top \mathbf{V}(\boldsymbol{\theta})^{-1} \mathbf{Y}.

Substituting this expression back into the log-likelihood eliminates :math:`\boldsymbol{\beta}` and produces the profile log-likelihood

.. math::

   \ell_p(\boldsymbol{\theta})
   = -\frac{1}{2}\left[
   n\log(2\pi)
   + \log\left|\mathbf{V}(\boldsymbol{\theta})\right|
   + \mathbf{Y}^\top \mathbf{P}(\boldsymbol{\theta})\, \mathbf{Y}
   \right],

where

.. math::

   \mathbf{P}(\boldsymbol{\theta})
   = \mathbf{V}(\boldsymbol{\theta})^{-1}
   - \mathbf{V}(\boldsymbol{\theta})^{-1}\mathbf{X}
     \left( \mathbf{X}^\top\mathbf{V}(\boldsymbol{\theta})^{-1}\mathbf{X} \right)^{-1}
     \mathbf{X}^\top\mathbf{V}(\boldsymbol{\theta})^{-1}

is the residual projection matrix. The resulting optimisation problem depends only on :math:`\boldsymbol{\theta}`. Since :math:`\ell_p(\boldsymbol{\theta})` is nonlinear in :math:`\boldsymbol{\theta}` and does not admit a closed-form maximiser, numerical optimisation methods are required. These methods rely on the score vector and a curvature approximation. Differentiating the profile log-likelihood with respect to a variance component :math:`\theta_k` gives

.. math::

   \frac{\partial \ell_p}{\partial \theta_k}
   = -\frac{1}{2}\left[
   \operatorname{tr}\!\left( \mathbf{V}^{-1}\frac{\partial \mathbf{V}}{\partial \theta_k} \right)
   - \mathbf{Y}^\top \mathbf{P} \frac{\partial \mathbf{V}}{\partial \theta_k} \mathbf{P}\,\mathbf{Y}
   \right],

where the identity :math:`\partial \log|\mathbf{V}|/\partial \theta_k = \operatorname{tr}(\mathbf{V}^{-1}\partial \mathbf{V}/\partial \theta_k)` has been used, together with :math:`\partial \mathbf{P}/\partial \theta_k = -\mathbf{P}(\partial \mathbf{V}/\partial \theta_k)\mathbf{P}`. A common curvature approximation is the Fisher information matrix, whose :math:`(k,j)` entry is

.. math::

   \mathcal{I}_{kj}(\boldsymbol{\theta})
   = \frac{1}{2}\operatorname{tr}\!\left(
   \mathbf{V}^{-1}\frac{\partial\mathbf{V}}{\partial\theta_k}
   \mathbf{V}^{-1}\frac{\partial\mathbf{V}}{\partial\theta_j}
   \right).

Starting from an initial value :math:`\boldsymbol{\theta}^{(0)}`, the Fisher scoring iteration updates :math:`\boldsymbol{\theta}` according to

.. math::

   \boldsymbol{\theta}^{(t+1)}
   = \boldsymbol{\theta}^{(t)}
   + \mathcal{I}\!\left(\boldsymbol{\theta}^{(t)}\right)^{-1}
     \mathbf{s}\!\left(\boldsymbol{\theta}^{(t)}\right),

where :math:`\mathbf{s}(\boldsymbol{\theta})` denotes the score vector. In practice, variance components are typically reparameterised (e.g. on a logarithmic scale) to ensure positivity during optimisation. Iterations continue until convergence, for example when :math:`\|\boldsymbol{\theta}^{(t+1)} - \boldsymbol{\theta}^{(t)}\| < \varepsilon` for a prescribed tolerance :math:`\varepsilon > 0`.
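Before turning to the estimates at convergence, the sketch below makes this machinery concrete for the common linear parameterisation :math:`\mathbf{V}(\boldsymbol{\theta}) = \sum_t \theta_t \mathbf{M}_t`, so that :math:`\partial\mathbf{V}/\partial\theta_t = \mathbf{M}_t`. The function name and the explicit inverses are simplifications for exposition, not part of the method.

.. code-block:: python

   import numpy as np

   def profile_loglik_and_score(theta, Y, X, Ms):
       """Profile log-likelihood ell_p(theta) and its score vector for
       V(theta) = sum_t theta[t] * Ms[t] (illustrative sketch)."""
       n = Y.shape[0]
       V = sum(t * M for t, M in zip(theta, Ms))
       Vinv = np.linalg.inv(V)                  # explicit inverse: fine for small n
       XtVinvX = X.T @ Vinv @ X
       P = Vinv - Vinv @ X @ np.linalg.solve(XtVinvX, X.T @ Vinv)
       PY = P @ Y
       _, logdetV = np.linalg.slogdet(V)
       loglik = -0.5 * (n * np.log(2 * np.pi) + logdetV + Y @ PY)
       # Score: -0.5 * [tr(V^{-1} M_k) - Y'P M_k P Y] for each component k.
       score = np.array([-0.5 * (np.trace(Vinv @ M) - PY @ M @ PY) for M in Ms])
       return loglik, score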
Upon convergence, the ML estimate of :math:`\boldsymbol{\theta}` is :math:`\widehat{\boldsymbol{\theta}}_{ML}`, and the ML estimate of :math:`\boldsymbol{\beta}` is recovered analytically as

.. math::

   \widehat{\boldsymbol{\beta}}_{ML}
   = \widehat{\boldsymbol{\beta}}\!\left(\widehat{\boldsymbol{\theta}}_{ML}\right)
   = \left( \mathbf{X}^\top \mathbf{V}(\widehat{\boldsymbol{\theta}}_{ML})^{-1} \mathbf{X} \right)^{-1}
     \mathbf{X}^\top \mathbf{V}(\widehat{\boldsymbol{\theta}}_{ML})^{-1} \mathbf{Y}.

.. admonition:: Remark

   ML estimates :math:`\boldsymbol{\beta}` and :math:`\boldsymbol{\theta}` jointly from the same likelihood and is consistent as :math:`n \to \infty`. However, in finite samples ML underestimates variance components because it does not account for the degrees of freedom consumed by estimating :math:`\boldsymbol{\beta}`. This bias motivates restricted maximum likelihood (REML), which maximises a likelihood formed from residual contrasts orthogonal to the column space of :math:`\mathbf{X}`, thereby adjusting for the estimation of the fixed effects when recovering :math:`\boldsymbol{\theta}`.

REML Estimation
---------------

Restricted maximum likelihood (REML) constructs a likelihood based on linear combinations of :math:`\mathbf{Y}` that are free of :math:`\boldsymbol{\beta}`. Let :math:`\mathbf{A}` be an :math:`(n-p)\times n` matrix satisfying :math:`\mathbf{A}\mathbf{X}=\mathbf{0}` and :math:`\operatorname{rank}(\mathbf{A})=n-p`. Then

.. math::

   \mathbf{A}\mathbf{Y} \sim \mathcal{N}\big(\mathbf{0},\; \mathbf{A}\mathbf{V}(\boldsymbol{\theta})\mathbf{A}^\top\big),

and the corresponding log-likelihood is

.. math::

   \ell_R(\boldsymbol{\theta})
   = -\frac{1}{2}\left[
   (n-p)\log(2\pi)
   + \log\big|\mathbf{A}\mathbf{V}(\boldsymbol{\theta})\mathbf{A}^\top\big|
   + (\mathbf{A}\mathbf{Y})^\top \big(\mathbf{A}\mathbf{V}(\boldsymbol{\theta})\mathbf{A}^\top\big)^{-1} (\mathbf{A}\mathbf{Y})
   \right].

Using standard matrix identities, this expression can be rewritten as

.. math::

   \boxed{
   \ell_R(\boldsymbol{\theta})
   = -\frac{1}{2}\left[
   \log\big|\mathbf{V}(\boldsymbol{\theta})\big|
   + \log\big|\mathbf{X}^\top \mathbf{V}(\boldsymbol{\theta})^{-1} \mathbf{X}\big|
   + \mathbf{Y}^\top \mathbf{P}(\boldsymbol{\theta}) \mathbf{Y}
   \right] + C
   },

where

.. math::

   \boxed{
   \mathbf{P}(\boldsymbol{\theta})
   = \mathbf{V}(\boldsymbol{\theta})^{-1}
   - \mathbf{V}(\boldsymbol{\theta})^{-1}\mathbf{X}
     \left( \mathbf{X}^\top \mathbf{V}(\boldsymbol{\theta})^{-1}\mathbf{X} \right)^{-1}
     \mathbf{X}^\top \mathbf{V}(\boldsymbol{\theta})^{-1}
   },

and :math:`C` is a constant independent of :math:`\boldsymbol{\theta}`. Since additive constants do not affect maximisation with respect to :math:`\boldsymbol{\theta}`, the REML log-likelihood is taken to be the above expression up to an additive constant.
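The boxed form of :math:`\ell_R` is convenient computationally because it never constructs :math:`\mathbf{A}` explicitly. A minimal sketch, again assuming the linear parameterisation :math:`\mathbf{V}(\boldsymbol{\theta}) = \sum_t \theta_t \mathbf{M}_t` and evaluating :math:`\ell_R` only up to the constant :math:`C`:

.. code-block:: python

   import numpy as np

   def reml_loglik(theta, Y, X, Ms):
       """REML log-likelihood ell_R(theta) up to the additive constant C,
       for V(theta) = sum_t theta[t] * Ms[t] (illustrative sketch)."""
       V = sum(t * M for t, M in zip(theta, Ms))
       Vinv = np.linalg.inv(V)
       XtVinvX = X.T @ Vinv @ X
       P = Vinv - Vinv @ X @ np.linalg.solve(XtVinvX, X.T @ Vinv)
       _, logdetV = np.linalg.slogdet(V)        # log|V|
       _, logdetXVX = np.linalg.slogdet(XtVinvX)  # log|X' V^{-1} X|
       return -0.5 * (logdetV + logdetXVX + Y @ P @ Y)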
Score Function
~~~~~~~~~~~~~~

Let :math:`\theta_k` denote the :math:`k`-th component of :math:`\boldsymbol{\theta}`, and define

.. math::

   \dot{\mathbf{V}}_k = \frac{\partial \mathbf{V}(\boldsymbol{\theta})}{\partial \theta_k}.

Starting from

.. math::

   \ell_R(\boldsymbol{\theta})
   = -\frac{1}{2}\left[
   \log|\mathbf{V}|
   + \log\big|\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}\big|
   + \mathbf{Y}^\top \mathbf{P}\mathbf{Y}
   \right] + C,

we differentiate each term with respect to :math:`\theta_k`.

**Derivative of** :math:`\log|\mathbf{V}|`

.. math::

   \frac{\partial}{\partial \theta_k}\log|\mathbf{V}|
   = \operatorname{tr}\!\big(\mathbf{V}^{-1}\dot{\mathbf{V}}_k\big).

**Derivative of** :math:`\log|\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}|`

Set :math:`\mathbf{M} = \mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}`. Then

.. math::

   \frac{\partial}{\partial \theta_k}\log|\mathbf{M}|
   = \operatorname{tr}\!\big(\mathbf{M}^{-1} \dot{\mathbf{M}}_k\big),

where

.. math::

   \dot{\mathbf{M}}_k
   = \frac{\partial}{\partial \theta_k} \big(\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}\big)
   = -\,\mathbf{X}^\top \mathbf{V}^{-1} \dot{\mathbf{V}}_k \mathbf{V}^{-1}\mathbf{X}.

Hence,

.. math::

   \frac{\partial}{\partial \theta_k} \log\big|\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}\big|
   = -\,\operatorname{tr}\!\Big(
   \big(\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}\big)^{-1}
   \mathbf{X}^\top \mathbf{V}^{-1} \dot{\mathbf{V}}_k \mathbf{V}^{-1}\mathbf{X}
   \Big).

**Derivative of** :math:`\mathbf{Y}^\top \mathbf{P}\mathbf{Y}`

.. math::

   \frac{\partial}{\partial \theta_k} \big(\mathbf{Y}^\top \mathbf{P}\mathbf{Y}\big)
   = \mathbf{Y}^\top \frac{\partial \mathbf{P}}{\partial \theta_k} \mathbf{Y}.

Using

.. math::

   \frac{\partial \mathbf{P}}{\partial \theta_k}
   = -\,\mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P},

we obtain

.. math::

   \frac{\partial}{\partial \theta_k} \big(\mathbf{Y}^\top \mathbf{P}\mathbf{Y}\big)
   = -\,\mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\mathbf{Y}.

**Compact Form**

Combining the three components and noting that :math:`C` does not depend on :math:`\boldsymbol{\theta}`, the score function is

.. math::

   S_k(\boldsymbol{\theta})
   = \frac{\partial \ell_R(\boldsymbol{\theta})}{\partial \theta_k}
   = -\frac{1}{2} \left[
   \operatorname{tr}\!\big(\mathbf{V}^{-1}\dot{\mathbf{V}}_k\big)
   - \operatorname{tr}\!\Big(
     \big(\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}\big)^{-1}
     \mathbf{X}^\top \mathbf{V}^{-1} \dot{\mathbf{V}}_k \mathbf{V}^{-1}\mathbf{X}
     \Big)
   - \mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\mathbf{Y}
   \right].

To obtain a more compact form, recall the identity

.. math::

   \mathbf{P}
   = \mathbf{V}^{-1}
   - \mathbf{V}^{-1}\mathbf{X}
     \left( \mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X} \right)^{-1}
     \mathbf{X}^\top \mathbf{V}^{-1}.

Hence,

.. math::

   \operatorname{tr}\!\big(\mathbf{P}\,\dot{\mathbf{V}}_k\big)
   = \operatorname{tr}\!\big(\mathbf{V}^{-1}\dot{\mathbf{V}}_k\big)
   - \operatorname{tr}\!\Big(
     \mathbf{V}^{-1}\mathbf{X}
     \left( \mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X} \right)^{-1}
     \mathbf{X}^\top \mathbf{V}^{-1} \dot{\mathbf{V}}_k
     \Big).

By the cyclic property of the trace, :math:`\operatorname{tr}(\mathbf{D}\mathbf{A}\mathbf{B}\mathbf{C}) = \operatorname{tr}(\mathbf{A}\mathbf{B}\mathbf{C}\mathbf{D})`, the second term equals

.. math::

   \operatorname{tr}\!\Big(
   \big(\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}\big)^{-1}
   \mathbf{X}^\top \mathbf{V}^{-1} \dot{\mathbf{V}}_k \mathbf{V}^{-1}\mathbf{X}
   \Big),

which is exactly the second trace in the expression for :math:`S_k(\boldsymbol{\theta})` above. Replacing the two trace terms by :math:`\operatorname{tr}(\mathbf{P}\,\dot{\mathbf{V}}_k)` therefore yields

.. math::
   :label: eq-score

   \boxed{
   S_k(\boldsymbol{\theta})
   = \frac{1}{2} \left[
   \mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\mathbf{Y}
   - \operatorname{tr}\!\big(\mathbf{P}\,\dot{\mathbf{V}}_k\big)
   \right]
   }.
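Equation :eq:`eq-score` is straightforward to implement. The sketch below computes the score vector under the linear parameterisation and, as a sanity check, compares it with a central finite difference of ``reml_loglik`` from the previous sketch; the helper name ``reml_score_fd`` and the step size ``eps`` are arbitrary choices for illustration.

.. code-block:: python

   import numpy as np

   def reml_score(theta, Y, X, Ms):
       """REML score S_k = 0.5 * (Y'P Vdot_k P Y - tr(P Vdot_k)), eq-score,
       with Vdot_k = Ms[k] under the linear parameterisation."""
       V = sum(t * M for t, M in zip(theta, Ms))
       Vinv = np.linalg.inv(V)
       XtVinvX = X.T @ Vinv @ X
       P = Vinv - Vinv @ X @ np.linalg.solve(XtVinvX, X.T @ Vinv)
       PY = P @ Y
       return np.array([0.5 * (PY @ M @ PY - np.trace(P @ M)) for M in Ms])

   def reml_score_fd(theta, Y, X, Ms, eps=1e-6):
       """Finite-difference check of the analytic score; requires the
       reml_loglik function from the earlier sketch."""
       grad = np.zeros(len(theta))
       for k in range(len(theta)):
           e = np.zeros(len(theta))
           e[k] = eps
           grad[k] = (reml_loglik(theta + e, Y, X, Ms)
                      - reml_loglik(theta - e, Y, X, Ms)) / (2 * eps)
       return grad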
Average Information Matrix
~~~~~~~~~~~~~~~~~~~~~~~~~~

Newton-type updates for :math:`\boldsymbol{\theta}` pair the score :math:`S_k(\boldsymbol{\theta})` of :eq:`eq-score` with a curvature matrix. We derive the observed and expected information matrices first, and then define the average information matrix used by AI-REML. Throughout, write

.. math::

   \dot{\mathbf{V}}_k = \frac{\partial \mathbf{V}(\boldsymbol{\theta})}{\partial \theta_k},
   \qquad
   \ddot{\mathbf{V}}_{kj} = \frac{\partial^2 \mathbf{V}(\boldsymbol{\theta})}{\partial \theta_k \partial \theta_j}.

Observed Information
~~~~~~~~~~~~~~~~~~~~

Differentiating :math:`S_k` with respect to :math:`\theta_j` gives

.. math::

   -\frac{\partial^2 \ell_R}{\partial \theta_k \partial \theta_j}
   = -\frac{1}{2} \left[
   \mathbf{Y}^\top \frac{\partial}{\partial \theta_j}
   \big(\mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\big) \mathbf{Y}
   - \frac{\partial}{\partial \theta_j}
   \operatorname{tr}\!\big(\mathbf{P}\,\dot{\mathbf{V}}_k\big)
   \right].

For the quadratic term, the product rule and :math:`\partial \mathbf{P}/\partial \theta_j = -\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}` give

.. math::

   \frac{\partial}{\partial \theta_j} \big(\mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\big)
   = -\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}
   - \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}
   + \mathbf{P}\,\ddot{\mathbf{V}}_{kj}\,\mathbf{P}.

For the trace term,

.. math::

   \frac{\partial}{\partial \theta_j}
   \operatorname{tr}\!\big(\mathbf{P}\,\dot{\mathbf{V}}_k\big)
   = -\,\operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\,\dot{\mathbf{V}}_k \big)
   + \operatorname{tr}\!\big( \mathbf{P}\,\ddot{\mathbf{V}}_{kj} \big).

Putting everything together, and noting that the two first-derivative quadratic forms coincide because :math:`\mathbf{P}` and the :math:`\dot{\mathbf{V}}`'s are symmetric, yields the observed information:

.. math::

   \mathcal{J}_{kj}(\boldsymbol{\theta})
   = -\frac{\partial^2 \ell_R}{\partial \theta_k \partial \theta_j}
   = \mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P} \mathbf{Y}
   - \frac{1}{2} \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\,\dot{\mathbf{V}}_k \big)
   + \frac{1}{2} \operatorname{tr}\!\big( \mathbf{P}\,\ddot{\mathbf{V}}_{kj} \big)
   - \frac{1}{2} \mathbf{Y}^\top \mathbf{P}\,\ddot{\mathbf{V}}_{kj}\,\mathbf{P}\mathbf{Y}.
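Under the linear parameterisation of the earlier sketches, :math:`\ddot{\mathbf{V}}_{kj} = \mathbf{0}`, so the second-derivative terms drop out of :math:`\mathcal{J}_{kj}`. A sketch under that assumption:

.. code-block:: python

   import numpy as np

   def reml_observed_info(theta, Y, X, Ms):
       """Observed information J_kj for V(theta) = sum_t theta[t] * Ms[t];
       the Vddot terms vanish under this linear parameterisation."""
       V = sum(t * M for t, M in zip(theta, Ms))
       Vinv = np.linalg.inv(V)
       XtVinvX = X.T @ Vinv @ X
       P = Vinv - Vinv @ X @ np.linalg.solve(XtVinvX, X.T @ Vinv)
       PY = P @ Y
       m = len(theta)
       J = np.zeros((m, m))
       for k in range(m):
           for j in range(m):
               # J_kj = Y'P Vdot_k P Vdot_j P Y - 0.5 * tr(P Vdot_j P Vdot_k)
               J[k, j] = (PY @ Ms[k] @ P @ Ms[j] @ PY
                          - 0.5 * np.trace(P @ Ms[j] @ P @ Ms[k]))
       return J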
Expected Information
~~~~~~~~~~~~~~~~~~~~

We now derive the expected Fisher information matrix by taking expectations of the negative second derivatives. Recall that under the model, :math:`\mathbb{E}[\mathbf{Y}] = \mathbf{X}\boldsymbol{\beta}`. However, a fundamental property of the projection matrix :math:`\mathbf{P}` is that :math:`\mathbf{P}\mathbf{X} = \mathbf{0}`, which implies :math:`\mathbf{P}\mathbf{Y} = \mathbf{P}(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})`. Consequently, when computing expectations of expressions involving :math:`\mathbf{P}`, we may without loss of generality treat :math:`\mathbf{Y}` as centered at zero. We therefore take :math:`\mathbb{E}[\mathbf{Y}] = \mathbf{0}` and :math:`\mathbb{E}[\mathbf{Y}\mathbf{Y}^\top] = \mathbf{V}` for the remainder of this derivation.

The observed information matrix is given by

.. math::

   -\frac{\partial^2 \ell_R}{\partial \theta_k \partial \theta_j}
   = \underbrace{\mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P} \mathbf{Y}}_{T_1}
   - \frac12 \underbrace{\operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\,\dot{\mathbf{V}}_k \big)}_{T_2}
   + \frac12 \underbrace{\operatorname{tr}\!\big( \mathbf{P}\,\ddot{\mathbf{V}}_{kj} \big)}_{T_3}
   - \frac12 \underbrace{\mathbf{Y}^\top \mathbf{P}\,\ddot{\mathbf{V}}_{kj}\,\mathbf{P}\mathbf{Y}}_{T_4}.

We evaluate the expectation of each term in turn.

**Expectation of** :math:`T_1`

The quantity :math:`T_1` is a quadratic form in :math:`\mathbf{Y}`. For a zero-mean random vector with covariance :math:`\mathbf{V}`, the identity :math:`\mathbb{E}[\mathbf{Y}^\top \mathbf{A} \mathbf{Y}] = \operatorname{tr}(\mathbf{A} \mathbf{V})` holds for any conformable matrix :math:`\mathbf{A}`. Applying this with :math:`\mathbf{A} = \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}` yields

.. math::

   \mathbb{E}[T_1]
   = \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P} \, \mathbf{V} \big).

A key identity satisfied by :math:`\mathbf{P}` is :math:`\mathbf{P}\mathbf{V}\mathbf{P} = \mathbf{P}`, which can be verified by direct substitution of the definition :math:`\mathbf{P} = \mathbf{V}^{-1} - \mathbf{V}^{-1}\mathbf{X}(\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{V}^{-1}`. Using the cyclic property of the trace,

.. math::
   :nowrap:

   \begin{align*}
   \mathbb{E}[T_1]
   &= \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\, (\mathbf{P}\mathbf{V}) \big) \\
   &= \operatorname{tr}\!\big( \dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\mathbf{V}\mathbf{P} \big)
   \quad \text{(cyclic permutation)} \\
   &= \operatorname{tr}\!\big( \dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P} \big)
   \quad \text{(since $\mathbf{P}\mathbf{V}\mathbf{P} = \mathbf{P}$)} \\
   &= \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j \big)
   \quad \text{(cyclic permutation again)}.
   \end{align*}

Thus

.. math::

   \mathbb{E}[T_1] = \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j \big).

**Expectation of** :math:`T_2`

The term :math:`T_2` contains no random quantities; it is a deterministic function of the parameters. Hence

.. math::

   \mathbb{E}[T_2] = \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\,\dot{\mathbf{V}}_k \big).

**Expectation of** :math:`T_3`

Similarly, :math:`T_3` is non-random, giving

.. math::

   \mathbb{E}[T_3] = \operatorname{tr}\!\big( \mathbf{P}\,\ddot{\mathbf{V}}_{kj} \big).

**Expectation of** :math:`T_4`

The term :math:`T_4` is again a quadratic form. Applying the same identity as for :math:`T_1`,

.. math::

   \mathbb{E}[T_4]
   = \operatorname{tr}\!\big( \mathbf{P}\,\ddot{\mathbf{V}}_{kj}\,\mathbf{P} \, \mathbf{V} \big).

Using the cyclic property followed by :math:`\mathbf{P}\mathbf{V}\mathbf{P} = \mathbf{P}`,

.. math::
   :nowrap:

   \begin{align*}
   \mathbb{E}[T_4]
   &= \operatorname{tr}\!\big( \ddot{\mathbf{V}}_{kj}\,\mathbf{P} \mathbf{V} \mathbf{P} \big) \\
   &= \operatorname{tr}\!\big( \ddot{\mathbf{V}}_{kj}\,\mathbf{P} \big) \\
   &= \operatorname{tr}\!\big( \mathbf{P}\,\ddot{\mathbf{V}}_{kj} \big).
   \end{align*}

**Combining the expectations**

Assembling the four components,

.. math::
   :nowrap:

   \begin{align*}
   \mathbb{E}\!\left[ -\frac{\partial^2 \ell_R}{\partial \theta_k \partial \theta_j} \right]
   &= \mathbb{E}[T_1] - \frac12 \mathbb{E}[T_2] + \frac12 \mathbb{E}[T_3] - \frac12 \mathbb{E}[T_4] \\[4pt]
   &= \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j \big)
   - \frac12 \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\,\dot{\mathbf{V}}_k \big) \\
   &\quad + \frac12 \operatorname{tr}\!\big( \mathbf{P}\,\ddot{\mathbf{V}}_{kj} \big)
   - \frac12 \operatorname{tr}\!\big( \mathbf{P}\,\ddot{\mathbf{V}}_{kj} \big).
   \end{align*}

The two terms involving the second derivatives :math:`\ddot{\mathbf{V}}_{kj}` cancel identically. Moreover, by the cyclic property of the trace,

.. math::

   \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\,\dot{\mathbf{V}}_k \big)
   = \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j \big).

Substituting this into the expression above,

.. math::
   :nowrap:

   \begin{align*}
   \mathbb{E}\!\left[ -\frac{\partial^2 \ell_R}{\partial \theta_k \partial \theta_j} \right]
   &= \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j \big)
   - \frac12 \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j \big) \\[4pt]
   &= \frac12 \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j \big).
   \end{align*}

This final expression defines the :math:`(k,j)`-th entry of the REML Fisher information matrix,

.. math::

   \mathcal{I}_{kj}(\boldsymbol{\theta})
   = \frac12 \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j \big).
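The expected information is cheap to assemble once :math:`\mathbf{P}` is available. The sketch below, under the same assumptions as the earlier ones, also asserts the identity :math:`\mathbf{P}\mathbf{V}\mathbf{P} = \mathbf{P}` used in the derivation; note that :math:`\mathcal{I}` involves no data at all.

.. code-block:: python

   import numpy as np

   def reml_expected_info(theta, X, Ms):
       """Expected information I_kj = 0.5 * tr(P Vdot_k P Vdot_j);
       the data Y do not enter (illustrative sketch)."""
       V = sum(t * M for t, M in zip(theta, Ms))
       Vinv = np.linalg.inv(V)
       XtVinvX = X.T @ Vinv @ X
       P = Vinv - Vinv @ X @ np.linalg.solve(XtVinvX, X.T @ Vinv)
       assert np.allclose(P @ V @ P, P)       # key identity PVP = P
       m = len(theta)
       info = np.zeros((m, m))
       for k in range(m):
           PMk = P @ Ms[k]
           for j in range(m):
               info[k, j] = 0.5 * np.trace(PMk @ P @ Ms[j])
       return info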
Definition of the Average Information
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The expected information (Fisher information) is

.. math::

   \mathcal{I}_{kj}(\boldsymbol{\theta})
   = \frac12 \operatorname{tr}\!\big( \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j \big),

the observed information is :math:`\mathcal{J}_{kj}(\boldsymbol{\theta})` as derived above, and the *average information* matrix is defined as their elementwise mean,

.. math::

   \mathcal{A}_{kj}(\boldsymbol{\theta})
   = \frac12 \Big( \mathcal{I}_{kj}(\boldsymbol{\theta}) + \mathcal{J}_{kj}(\boldsymbol{\theta}) \Big).

Substituting the two expressions, the first-derivative trace terms cancel, yielding

.. math::
   :nowrap:

   \begin{align*}
   \mathcal{A}_{kj}
   &= \frac12 \mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P} \mathbf{Y} \\
   &\quad + \frac14 \operatorname{tr}\!\big( \mathbf{P}\,\ddot{\mathbf{V}}_{kj} \big)
   - \frac14 \mathbf{Y}^\top \mathbf{P}\,\ddot{\mathbf{V}}_{kj}\,\mathbf{P}\mathbf{Y}.
   \end{align*}

Simplification for Linear Variance Component Models
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In the common case where :math:`\mathbf{V}(\boldsymbol{\theta}) = \sum_{t=1}^{r} \theta_t \mathbf{M}_t` with known matrices :math:`\mathbf{M}_t`, the second derivatives vanish: :math:`\ddot{\mathbf{V}}_{kj} = \mathbf{0}` for all :math:`k,j`. The average information matrix then reduces to

.. math::
   :label: eq-AI

   \boxed{
   \mathcal{A}_{kj}
   = \frac12 \,\mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P} \mathbf{Y}
   }.

This is the form commonly used in the AI-REML algorithm [Gilmour1995]_, where the average information matrix is computed directly from the data without requiring second derivatives. The algorithm proceeds by iteratively solving

.. math::

   \boldsymbol{\theta}^{(t+1)}
   = \boldsymbol{\theta}^{(t)}
   + \big( \mathcal{A}(\boldsymbol{\theta}^{(t)}) \big)^{-1}
     \mathbf{S}(\boldsymbol{\theta}^{(t)}),

where :math:`\mathbf{S}(\boldsymbol{\theta})` is the vector of REML scores :eq:`eq-score`.
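Putting the pieces together, a bare-bones AI-REML loop might look as follows. This is a sketch under the linear parameterisation of the earlier examples; a production implementation would add safeguards such as step halving and positivity constraints on the variance components, which are omitted here.

.. code-block:: python

   import numpy as np

   def ai_reml(Y, X, Ms, theta0, max_iter=50, tol=1e-8):
       """Bare-bones AI-REML iteration for V(theta) = sum_t theta[t] * Ms[t]."""
       theta = np.asarray(theta0, dtype=float)
       m = len(theta)
       for _ in range(max_iter):
           V = sum(t * M for t, M in zip(theta, Ms))
           Vinv = np.linalg.inv(V)
           XtVinvX = X.T @ Vinv @ X
           P = Vinv - Vinv @ X @ np.linalg.solve(XtVinvX, X.T @ Vinv)
           PY = P @ Y
           # Score vector S_k (eq-score) and average information A_kj (eq-AI).
           S = np.array([0.5 * (PY @ M @ PY - np.trace(P @ M)) for M in Ms])
           A = np.zeros((m, m))
           for k in range(m):
               for j in range(m):
                   A[k, j] = 0.5 * (PY @ Ms[k] @ P @ Ms[j] @ PY)
           step = np.linalg.solve(A, S)
           theta = theta + step
           if np.linalg.norm(step) < tol:
               break
       return theta

Given :math:`\widehat{\boldsymbol{\theta}}` from such a loop, the fixed effects and random effects are recovered with the BLUE and BLUP expressions from the first sketch.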
.. rubric:: References

.. [Gilmour1995] Gilmour, A. R., Thompson, R., & Cullis, B. R. (1995). Average information REML: An efficient algorithm for variance parameter estimation in linear mixed models. *Biometrics*, 51(4), 1440-1450.