Technical Documentation¶
Gaussian Linear Mixed Model Formulation¶
Consider the linear mixed model
\[
\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{b} + \boldsymbol{\varepsilon},
\]
where:
\(\mathbf{Y} \in \mathbb{R}^{n \times 1}\) is the response vector,
\(\mathbf{X} \in \mathbb{R}^{n \times p}\) is the fixed-effects design matrix,
\(\boldsymbol{\beta} \in \mathbb{R}^{p \times 1}\) is the vector of fixed-effect coefficients,
\(\mathbf{Z} \in \mathbb{R}^{n \times q}\) is the random-effects design matrix,
\(\mathbf{b} \in \mathbb{R}^{q \times 1}\) is the vector of random effects,
\(\boldsymbol{\varepsilon} \in \mathbb{R}^{n \times 1}\) is the vector of residual errors.
We assume
\[
\mathbf{b} \sim \mathcal{N}(\mathbf{0}, \mathbf{G}), \qquad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{R}), \qquad \operatorname{Cov}(\mathbf{b}, \boldsymbol{\varepsilon}) = \mathbf{0},
\]
where \(\mathbf{G} \in \mathbb{R}^{q \times q}\) and \(\mathbf{R} \in \mathbb{R}^{n \times n}\) are covariance matrices.
Let \(\boldsymbol{\theta}\) denote the collection of variance components that parameterise \(\mathbf{G}\) and \(\mathbf{R}\), i.e.,
\[
\mathbf{G} = \mathbf{G}(\boldsymbol{\theta}), \qquad \mathbf{R} = \mathbf{R}(\boldsymbol{\theta}).
\]
Under these assumptions, the marginal distribution of \(\mathbf{Y}\) is given by
\[
\mathbf{Y} \sim \mathcal{N}\!\big(\mathbf{X}\boldsymbol{\beta},\, \mathbf{V}(\boldsymbol{\theta})\big),
\]
where
\[
\mathbf{V}(\boldsymbol{\theta}) = \mathbf{Z}\,\mathbf{G}(\boldsymbol{\theta})\,\mathbf{Z}^\top + \mathbf{R}(\boldsymbol{\theta}).
\]
BLUE¶
Given \(\boldsymbol{\theta}\), the Best Linear Unbiased Estimator (BLUE) of \(\boldsymbol{\beta}\) is the linear unbiased estimator that minimises the variance among all such estimators. Under the linear mixed model with covariance matrix \(\mathbf{V}(\boldsymbol{\theta})\), the BLUE of \(\boldsymbol{\beta}\) is given by
\[
\widehat{\boldsymbol{\beta}}(\boldsymbol{\theta}) = \big(\mathbf{X}^\top \mathbf{V}(\boldsymbol{\theta})^{-1}\mathbf{X}\big)^{-1}\mathbf{X}^\top \mathbf{V}(\boldsymbol{\theta})^{-1}\mathbf{Y}.
\]
This estimator is the generalised least squares (GLS) estimator of \(\boldsymbol{\beta}\).
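As a sketch of how the GLS/BLUE computation looks in practice, the following NumPy snippet simulates a toy model and computes \(\widehat{\boldsymbol{\beta}} = (\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{Y}\) via linear solves rather than forming \(\mathbf{V}^{-1}\) explicitly. All dimensions, designs, and parameter values are illustrative assumptions, not from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 50, 5

# Illustrative designs and covariances (assumed, for demonstration only)
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # fixed-effects design
Z = rng.integers(0, 2, size=(n, q)).astype(float)       # random-effects design
G = 0.5 * np.eye(q)                                     # Cov(b)
R = np.eye(n)                                           # Cov(eps)
V = Z @ G @ Z.T + R                                     # marginal covariance V(theta)

beta_true = np.array([1.0, -2.0])
Y = X @ beta_true + Z @ rng.normal(scale=np.sqrt(0.5), size=q) + rng.normal(size=n)

# BLUE / GLS: beta_hat = (X' V^{-1} X)^{-1} X' V^{-1} Y,
# computed with np.linalg.solve instead of an explicit inverse of V
VinvX = np.linalg.solve(V, X)
VinvY = np.linalg.solve(V, Y)
beta_hat = np.linalg.solve(X.T @ VinvX, X.T @ VinvY)
print(beta_hat)
```

An equivalent computation whitens the model with the Cholesky factor \(\mathbf{L}\) of \(\mathbf{V}\) and runs ordinary least squares on \(\mathbf{L}^{-1}\mathbf{X}\) and \(\mathbf{L}^{-1}\mathbf{Y}\), which gives the same \(\widehat{\boldsymbol{\beta}}\).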
BLUP¶
Given \(\boldsymbol{\theta}\), we often wish to predict the random effects \(\mathbf{b}\). The Best Linear Unbiased Predictor (BLUP) of \(\mathbf{b}\) is given by:
\[
\widehat{\mathbf{b}}(\boldsymbol{\theta}) = \mathbf{G}\mathbf{Z}^\top \mathbf{V}(\boldsymbol{\theta})^{-1}\big(\mathbf{Y} - \mathbf{X}\widehat{\boldsymbol{\beta}}(\boldsymbol{\theta})\big).
\]
ML Estimation¶
Maximum likelihood (ML) estimation jointly estimates the fixed effects \(\boldsymbol{\beta}\) and variance components \(\boldsymbol{\theta}\) by maximising the marginal likelihood of the observed data. After integrating out the random effects \(\mathbf{b}\) from the joint model, the marginal distribution of \(\mathbf{Y}\) is
\[
\mathbf{Y} \sim \mathcal{N}\!\big(\mathbf{X}\boldsymbol{\beta},\, \mathbf{V}(\boldsymbol{\theta})\big),
\]
where \(\mathbf{V}(\boldsymbol{\theta}) = \mathbf{Z}\mathbf{G}(\boldsymbol{\theta})\mathbf{Z}^\top + \mathbf{R}(\boldsymbol{\theta})\). The corresponding log-likelihood is given by
\[
\ell(\boldsymbol{\beta}, \boldsymbol{\theta}) = -\frac{1}{2}\Big[n\log(2\pi) + \log|\mathbf{V}(\boldsymbol{\theta})| + (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^\top \mathbf{V}(\boldsymbol{\theta})^{-1}(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})\Big].
\]
For fixed \(\boldsymbol{\theta}\), the log-likelihood is quadratic in \(\boldsymbol{\beta}\), and the maximiser can be obtained in closed form by setting \(\partial \ell / \partial \boldsymbol{\beta} = \mathbf{0}\). This yields the generalised least squares (GLS) estimator
\[
\widehat{\boldsymbol{\beta}}(\boldsymbol{\theta}) = \big(\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}\big)^{-1}\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{Y}.
\]
Substituting this expression back into the log-likelihood eliminates \(\boldsymbol{\beta}\) and produces the profile log-likelihood
\[
\ell_p(\boldsymbol{\theta}) = -\frac{1}{2}\Big[n\log(2\pi) + \log|\mathbf{V}(\boldsymbol{\theta})| + \mathbf{Y}^\top \mathbf{P}\mathbf{Y}\Big],
\]
where
\[
\mathbf{P} = \mathbf{V}^{-1} - \mathbf{V}^{-1}\mathbf{X}\big(\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}\big)^{-1}\mathbf{X}^\top \mathbf{V}^{-1}
\]
is the residual projection matrix. The resulting optimisation problem depends only on \(\boldsymbol{\theta}\).
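The profiling step can be checked numerically: by construction, \(\mathbf{Y}^\top\mathbf{P}\mathbf{Y}\) equals the residual quadratic form \((\mathbf{Y} - \mathbf{X}\widehat{\boldsymbol{\beta}})^\top\mathbf{V}^{-1}(\mathbf{Y} - \mathbf{X}\widehat{\boldsymbol{\beta}})\) at the GLS estimate. A sketch with assumed toy inputs (any full-rank \(\mathbf{X}\) and symmetric positive definite \(\mathbf{V}\) will do):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 30

# Assumed toy inputs, for illustration only
X = np.column_stack([np.ones(n), rng.normal(size=n)])
B = rng.normal(size=(n, n))
V = B @ B.T + n * np.eye(n)     # an arbitrary SPD covariance
Y = rng.normal(size=n)

Vinv = np.linalg.inv(V)
M = X.T @ Vinv @ X
P = Vinv - Vinv @ X @ np.linalg.solve(M, X.T @ Vinv)   # residual projection

beta_hat = np.linalg.solve(M, X.T @ Vinv @ Y)          # GLS estimate
resid = Y - X @ beta_hat

def profile_loglik(V, P, Y):
    """Profile log-likelihood l_p(theta) for a given V(theta)."""
    return -0.5 * (n * np.log(2 * np.pi) + np.linalg.slogdet(V)[1] + Y @ P @ Y)

print(profile_loglik(V, P, Y))
```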
Since \(\ell_p(\boldsymbol{\theta})\) is nonlinear in \(\boldsymbol{\theta}\) and does not admit a closed-form maximiser, numerical optimisation methods are required. These methods rely on the score vector and a curvature approximation. Differentiating the profile log-likelihood with respect to a variance component \(\theta_k\) gives
\[
\frac{\partial \ell_p}{\partial \theta_k} = -\frac{1}{2}\operatorname{tr}\!\Big(\mathbf{V}^{-1}\frac{\partial \mathbf{V}}{\partial \theta_k}\Big) + \frac{1}{2}\,\mathbf{Y}^\top \mathbf{P}\,\frac{\partial \mathbf{V}}{\partial \theta_k}\,\mathbf{P}\mathbf{Y},
\]
where the identity \(\partial \log|\mathbf{V}|/\partial \theta_k = \operatorname{tr}(\mathbf{V}^{-1}\partial \mathbf{V}/\partial \theta_k)\) has been used, together with \(\partial \mathbf{P}/\partial \theta_k = -\mathbf{P}(\partial \mathbf{V}/\partial \theta_k)\mathbf{P}\).
A common curvature approximation is given by the Fisher information matrix, whose \((k,j)\) entry is
\[
\mathcal{I}_{kj}(\boldsymbol{\theta}) = \frac{1}{2}\operatorname{tr}\!\Big(\mathbf{V}^{-1}\frac{\partial \mathbf{V}}{\partial \theta_k}\,\mathbf{V}^{-1}\frac{\partial \mathbf{V}}{\partial \theta_j}\Big).
\]
Starting from an initial value \(\boldsymbol{\theta}^{(0)}\), the Fisher scoring iteration updates \(\boldsymbol{\theta}\) according to
\[
\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \mathcal{I}\big(\boldsymbol{\theta}^{(t)}\big)^{-1}\,\mathbf{s}\big(\boldsymbol{\theta}^{(t)}\big),
\]
where \(\mathbf{s}(\boldsymbol{\theta})\) denotes the score vector. In practice, variance components are typically reparameterised (e.g. on a logarithmic scale) to ensure positivity during optimisation. Iterations are continued until convergence, for example when \(\|\boldsymbol{\theta}^{(t+1)} - \boldsymbol{\theta}^{(t)}\| < \varepsilon\) for a prescribed tolerance \(\varepsilon > 0\).
Upon convergence, the ML estimate of \(\boldsymbol{\theta}\) is \(\widehat{\boldsymbol{\theta}}_{ML}\), and the ML estimate \(\widehat{\boldsymbol{\beta}}_{ML}\) of \(\boldsymbol{\beta}\) is recovered analytically as
\[
\widehat{\boldsymbol{\beta}}_{ML} = \big(\mathbf{X}^\top \widehat{\mathbf{V}}^{-1}\mathbf{X}\big)^{-1}\mathbf{X}^\top \widehat{\mathbf{V}}^{-1}\mathbf{Y}, \qquad \widehat{\mathbf{V}} = \mathbf{V}\big(\widehat{\boldsymbol{\theta}}_{ML}\big).
\]
Remark
ML estimates \(\boldsymbol{\beta}\) and \(\boldsymbol{\theta}\) jointly from the same marginal likelihood and yields consistent estimators as \(n \to \infty\). However, in finite samples ML underestimates variance components because it does not account for the degrees of freedom consumed by estimating \(\boldsymbol{\beta}\). This bias motivates restricted maximum likelihood (REML), which maximises a likelihood formed from residual contrasts that are orthogonal to the column space of \(\mathbf{X}\), thereby adjusting for the estimation of fixed effects when estimating \(\boldsymbol{\theta}\).
REML Estimation¶
Restricted maximum likelihood (REML) constructs a likelihood based on linear combinations of \(\mathbf{Y}\) that are free of \(\boldsymbol{\beta}\). Let \(\mathbf{A}\) be an \((n-p)\times n\) matrix satisfying \(\mathbf{A}\mathbf{X}=\mathbf{0}\) and \(\operatorname{rank}(\mathbf{A})=n-p\). Then
\[
\mathbf{A}\mathbf{Y} \sim \mathcal{N}\!\big(\mathbf{0},\, \mathbf{A}\mathbf{V}(\boldsymbol{\theta})\mathbf{A}^\top\big),
\]
and the corresponding log-likelihood is
\[
\ell_R(\boldsymbol{\theta}) = -\frac{1}{2}\Big[(n-p)\log(2\pi) + \log|\mathbf{A}\mathbf{V}\mathbf{A}^\top| + \mathbf{Y}^\top\mathbf{A}^\top\big(\mathbf{A}\mathbf{V}\mathbf{A}^\top\big)^{-1}\mathbf{A}\mathbf{Y}\Big].
\]
Using standard matrix identities, this expression can be rewritten as
\[
\ell_R(\boldsymbol{\theta}) = -\frac{1}{2}\Big[\log|\mathbf{V}| + \log|\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X}| + \mathbf{Y}^\top\mathbf{P}\mathbf{Y}\Big] + C,
\]
where
\[
\mathbf{P} = \mathbf{V}^{-1} - \mathbf{V}^{-1}\mathbf{X}\big(\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X}\big)^{-1}\mathbf{X}^\top\mathbf{V}^{-1}
\]
and \(C\) is a constant independent of \(\boldsymbol{\theta}\). Since additive constants do not affect maximisation with respect to \(\boldsymbol{\theta}\), the REML log-likelihood is taken to be the above expression up to an additive constant.
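The equivalence of the two forms of \(\ell_R\) up to a \(\boldsymbol{\theta}\)-independent constant can be verified numerically. The sketch below builds an \(\mathbf{A}\) with \(\mathbf{A}\mathbf{X}=\mathbf{0}\) from the left singular vectors of \(\mathbf{X}\), evaluates both kernels at two different parameter values, and checks that their difference is the same constant. All data and the variance structure \(\mathbf{V}(\boldsymbol{\theta}) = \theta_1\mathbf{I} + \theta_2\mathbf{Z}\mathbf{Z}^\top\) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 25, 2

X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = rng.integers(0, 2, size=(n, 3)).astype(float)
Y = rng.normal(size=n)

# A: (n-p) x n with A X = 0 and rank n-p, from the left singular vectors of X
U, _, _ = np.linalg.svd(X, full_matrices=True)
A = U[:, p:].T

def kernel_contrasts(V):
    """log|A V A'| + Y'A'(A V A')^{-1} A Y."""
    W = A @ V @ A.T
    AY = A @ Y
    return np.linalg.slogdet(W)[1] + AY @ np.linalg.solve(W, AY)

def kernel_profile(V):
    """log|V| + log|X'V^{-1}X| + Y'PY."""
    Vinv = np.linalg.inv(V)
    M = X.T @ Vinv @ X
    P = Vinv - Vinv @ X @ np.linalg.solve(M, X.T @ Vinv)
    return np.linalg.slogdet(V)[1] + np.linalg.slogdet(M)[1] + Y @ P @ Y

V1 = 1.0 * np.eye(n) + 0.5 * Z @ Z.T   # V(theta) at two parameter values
V2 = 2.0 * np.eye(n) + 0.1 * Z @ Z.T
c1 = kernel_contrasts(V1) - kernel_profile(V1)
c2 = kernel_contrasts(V2) - kernel_profile(V2)
print(c1, c2)   # the same theta-independent constant
```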
Score Function¶
Let \(\theta_k\) denote the \(k\)-th component of \(\boldsymbol{\theta}\), and define
\[
\dot{\mathbf{V}}_k = \frac{\partial \mathbf{V}}{\partial \theta_k}.
\]
Starting from
\[
\ell_R(\boldsymbol{\theta}) = -\frac{1}{2}\Big[\log|\mathbf{V}| + \log|\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X}| + \mathbf{Y}^\top\mathbf{P}\mathbf{Y}\Big] + C,
\]
we differentiate each term with respect to \(\theta_k\).
Derivative of \(\log|\mathbf{V}|\)
\[
\frac{\partial}{\partial \theta_k}\log|\mathbf{V}| = \operatorname{tr}\big(\mathbf{V}^{-1}\dot{\mathbf{V}}_k\big).
\]
Derivative of \(\log|\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}|\)
Set \(\mathbf{M} = \mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}\). Then
\[
\frac{\partial}{\partial \theta_k}\log|\mathbf{M}| = \operatorname{tr}\!\Big(\mathbf{M}^{-1}\frac{\partial \mathbf{M}}{\partial \theta_k}\Big),
\]
where
\[
\frac{\partial \mathbf{M}}{\partial \theta_k} = -\mathbf{X}^\top \mathbf{V}^{-1}\dot{\mathbf{V}}_k\,\mathbf{V}^{-1}\mathbf{X}.
\]
Hence,
\[
\frac{\partial}{\partial \theta_k}\log|\mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X}| = -\operatorname{tr}\big(\mathbf{M}^{-1}\mathbf{X}^\top \mathbf{V}^{-1}\dot{\mathbf{V}}_k\,\mathbf{V}^{-1}\mathbf{X}\big).
\]
Derivative of \(\mathbf{Y}^\top \mathbf{P}\mathbf{Y}\)
Using
\[
\frac{\partial \mathbf{P}}{\partial \theta_k} = -\mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P},
\]
we obtain
\[
\frac{\partial}{\partial \theta_k}\,\mathbf{Y}^\top \mathbf{P}\mathbf{Y} = -\mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\mathbf{Y}.
\]
Compact Form
Combining the three components and noting that \(C\) does not depend on \(\boldsymbol{\theta}\), the score function is
\[
S_k(\boldsymbol{\theta}) = \frac{\partial \ell_R}{\partial \theta_k} = -\frac{1}{2}\Big[\operatorname{tr}\big(\mathbf{V}^{-1}\dot{\mathbf{V}}_k\big) - \operatorname{tr}\big(\mathbf{M}^{-1}\mathbf{X}^\top \mathbf{V}^{-1}\dot{\mathbf{V}}_k\,\mathbf{V}^{-1}\mathbf{X}\big) - \mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\mathbf{Y}\Big].
\]
To obtain a more compact form, recall the identity
\[
\mathbf{P} = \mathbf{V}^{-1} - \mathbf{V}^{-1}\mathbf{X}\mathbf{M}^{-1}\mathbf{X}^\top \mathbf{V}^{-1}.
\]
Hence,
\[
\operatorname{tr}\big(\mathbf{P}\,\dot{\mathbf{V}}_k\big) = \operatorname{tr}\big(\mathbf{V}^{-1}\dot{\mathbf{V}}_k\big) - \operatorname{tr}\big(\mathbf{V}^{-1}\mathbf{X}\mathbf{M}^{-1}\mathbf{X}^\top \mathbf{V}^{-1}\dot{\mathbf{V}}_k\big).
\]
Using the cyclic property of the trace, \(\operatorname{tr}(\mathbf{D}\mathbf{A}\mathbf{B}\mathbf{C}) = \operatorname{tr}(\mathbf{A}\mathbf{B}\mathbf{C}\mathbf{D})\), the second term becomes
\[
\operatorname{tr}\big(\mathbf{V}^{-1}\mathbf{X}\mathbf{M}^{-1}\mathbf{X}^\top \mathbf{V}^{-1}\dot{\mathbf{V}}_k\big) = \operatorname{tr}\big(\mathbf{M}^{-1}\mathbf{X}^\top \mathbf{V}^{-1}\dot{\mathbf{V}}_k\,\mathbf{V}^{-1}\mathbf{X}\big).
\]
Therefore,
\[
\operatorname{tr}\big(\mathbf{P}\,\dot{\mathbf{V}}_k\big) = \operatorname{tr}\big(\mathbf{V}^{-1}\dot{\mathbf{V}}_k\big) - \operatorname{tr}\big(\mathbf{M}^{-1}\mathbf{X}^\top \mathbf{V}^{-1}\dot{\mathbf{V}}_k\,\mathbf{V}^{-1}\mathbf{X}\big).
\]
Substituting this identity into the previous expression for \(S_k(\boldsymbol{\theta})\) yields
\[
S_k(\boldsymbol{\theta}) = -\frac{1}{2}\Big[\operatorname{tr}\big(\mathbf{P}\,\dot{\mathbf{V}}_k\big) - \mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\mathbf{Y}\Big].
\]
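The compact score \(S_k = -\tfrac{1}{2}[\operatorname{tr}(\mathbf{P}\dot{\mathbf{V}}_k) - \mathbf{Y}^\top\mathbf{P}\dot{\mathbf{V}}_k\mathbf{P}\mathbf{Y}]\) can be validated against a finite-difference derivative of \(\ell_R\). The toy variance structure \(\mathbf{V}(\boldsymbol{\theta}) = \theta_1\mathbf{I} + \theta_2\mathbf{Z}\mathbf{Z}^\top\) and all data below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n, q = 30, 5

# Assumed toy model with V(theta) = theta_1 * I + theta_2 * ZZ'
X = np.column_stack([np.ones(n), rng.normal(size=n)])
Z = rng.integers(0, 2, size=(n, q)).astype(float)
Y = rng.normal(size=n)
Vdots = [np.eye(n), Z @ Z.T]            # dV/dtheta_k (constant in theta here)

def P_of(V):
    Vinv = np.linalg.inv(V)
    M = X.T @ Vinv @ X
    P = Vinv - Vinv @ X @ np.linalg.solve(M, X.T @ Vinv)
    return P, M

def loglik_R(theta):
    """REML log-likelihood up to the additive constant C."""
    V = theta[0] * Vdots[0] + theta[1] * Vdots[1]
    P, M = P_of(V)
    return -0.5 * (np.linalg.slogdet(V)[1] + np.linalg.slogdet(M)[1] + Y @ P @ Y)

def score(theta):
    """S_k = -1/2 [tr(P Vdot_k) - Y' P Vdot_k P Y]."""
    V = theta[0] * Vdots[0] + theta[1] * Vdots[1]
    P, _ = P_of(V)
    PY = P @ Y
    return np.array([-0.5 * (np.trace(P @ Vk) - PY @ Vk @ PY) for Vk in Vdots])

theta = np.array([1.0, 0.5])
print(score(theta))
```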
Average Information Matrix¶
The score function for \(\theta_k\) is
\[
S_k(\boldsymbol{\theta}) = -\frac{1}{2}\Big[\operatorname{tr}\big(\mathbf{P}\,\dot{\mathbf{V}}_k\big) - \mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\mathbf{Y}\Big],
\]
where
\[
\dot{\mathbf{V}}_k = \frac{\partial \mathbf{V}}{\partial \theta_k}, \qquad \ddot{\mathbf{V}}_{kj} = \frac{\partial^2 \mathbf{V}}{\partial \theta_k\,\partial \theta_j}.
\]
Observed Information¶
Differentiating \(S_k\) with respect to \(\theta_j\) gives
\[
\frac{\partial S_k}{\partial \theta_j} = -\frac{1}{2}\Big[\frac{\partial}{\partial \theta_j}\operatorname{tr}\big(\mathbf{P}\,\dot{\mathbf{V}}_k\big) - \frac{\partial}{\partial \theta_j}\,\mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\mathbf{Y}\Big].
\]
For the quadratic term, using the product rule and \(\partial \mathbf{P}/\partial \theta_j = -\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\), we obtain
\[
\frac{\partial}{\partial \theta_j}\,\mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\mathbf{Y} = -2\,\mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\mathbf{Y} + \mathbf{Y}^\top \mathbf{P}\,\ddot{\mathbf{V}}_{kj}\,\mathbf{P}\mathbf{Y}.
\]
For the trace term,
\[
\frac{\partial}{\partial \theta_j}\operatorname{tr}\big(\mathbf{P}\,\dot{\mathbf{V}}_k\big) = -\operatorname{tr}\big(\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\,\dot{\mathbf{V}}_k\big) + \operatorname{tr}\big(\mathbf{P}\,\ddot{\mathbf{V}}_{kj}\big).
\]
Putting everything together yields the observed information:
\[
\mathcal{J}_{kj}(\boldsymbol{\theta}) = -\frac{\partial S_k}{\partial \theta_j} = \mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\mathbf{Y} + \frac{1}{2}\operatorname{tr}\big(\mathbf{P}\,\ddot{\mathbf{V}}_{kj}\big) - \frac{1}{2}\operatorname{tr}\big(\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\,\dot{\mathbf{V}}_k\big) - \frac{1}{2}\,\mathbf{Y}^\top \mathbf{P}\,\ddot{\mathbf{V}}_{kj}\,\mathbf{P}\mathbf{Y}.
\]
Expected Information¶
We now derive the expected Fisher information matrix by taking expectations of the negative second derivatives. Recall that under the model, \(\mathbb{E}[\mathbf{Y}] = \mathbf{X}\boldsymbol{\beta}\). However, a fundamental property of the projection matrix \(\mathbf{P}\) is that \(\mathbf{P}\mathbf{X} = \mathbf{0}\), which implies \(\mathbf{P}\mathbf{Y} = \mathbf{P}(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})\). Consequently, when computing expectations of expressions involving \(\mathbf{P}\), we may without loss of generality treat \(\mathbf{Y}\) as centered at zero. We therefore take \(\mathbb{E}[\mathbf{Y}] = \mathbf{0}\) and \(\mathbb{E}[\mathbf{Y}\mathbf{Y}^\top] = \mathbf{V}\) for the remainder of this derivation.
The observed information matrix is given by
\[
\mathcal{J}_{kj}(\boldsymbol{\theta}) = T_1 + T_2 + T_3 + T_4,
\]
where
\[
T_1 = \mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\mathbf{Y}, \qquad
T_2 = \frac{1}{2}\operatorname{tr}\big(\mathbf{P}\,\ddot{\mathbf{V}}_{kj}\big), \qquad
T_3 = -\frac{1}{2}\operatorname{tr}\big(\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\,\dot{\mathbf{V}}_k\big), \qquad
T_4 = -\frac{1}{2}\,\mathbf{Y}^\top \mathbf{P}\,\ddot{\mathbf{V}}_{kj}\,\mathbf{P}\mathbf{Y}.
\]
We evaluate the expectation of each term in turn.
Expectation of \(T_1\)
The quantity \(T_1\) is a quadratic form in \(\mathbf{Y}\). For a zero-mean random vector with covariance \(\mathbf{V}\), the identity \(\mathbb{E}[\mathbf{Y}^\top \mathbf{A} \mathbf{Y}] = \operatorname{tr}(\mathbf{A} \mathbf{V})\) holds for any conformable matrix \(\mathbf{A}\). Applying this with \(\mathbf{A} = \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\) yields
\[
\mathbb{E}[T_1] = \operatorname{tr}\big(\mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\mathbf{V}\big).
\]
A key identity satisfied by \(\mathbf{P}\) is \(\mathbf{P}\mathbf{V}\mathbf{P} = \mathbf{P}\), which can be verified by direct substitution of the definition \(\mathbf{P} = \mathbf{V}^{-1} - \mathbf{V}^{-1}\mathbf{X}(\mathbf{X}^\top\mathbf{V}^{-1}\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{V}^{-1}\). Using the cyclic property of the trace,
\[
\operatorname{tr}\big(\mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\mathbf{V}\big) = \operatorname{tr}\big(\mathbf{P}\mathbf{V}\mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\big) = \operatorname{tr}\big(\mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\big).
\]
Thus
\[
\mathbb{E}[T_1] = \operatorname{tr}\big(\mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\big).
\]
Expectation of \(T_2\)
The term \(T_2\) contains no random quantities; it is a deterministic function of the parameters. Hence
\[
\mathbb{E}[T_2] = T_2 = \frac{1}{2}\operatorname{tr}\big(\mathbf{P}\,\ddot{\mathbf{V}}_{kj}\big).
\]
Expectation of \(T_3\)
Similarly, \(T_3\) is non-random, giving
\[
\mathbb{E}[T_3] = T_3 = -\frac{1}{2}\operatorname{tr}\big(\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\,\dot{\mathbf{V}}_k\big).
\]
Expectation of \(T_4\)
The term \(T_4\) is again a quadratic form. Applying the same identity as for \(T_1\),
\[
\mathbb{E}[T_4] = -\frac{1}{2}\operatorname{tr}\big(\mathbf{P}\,\ddot{\mathbf{V}}_{kj}\,\mathbf{P}\mathbf{V}\big).
\]
Using the cyclic property followed by \(\mathbf{P}\mathbf{V}\mathbf{P} = \mathbf{P}\),
\[
\mathbb{E}[T_4] = -\frac{1}{2}\operatorname{tr}\big(\mathbf{P}\mathbf{V}\mathbf{P}\,\ddot{\mathbf{V}}_{kj}\big) = -\frac{1}{2}\operatorname{tr}\big(\mathbf{P}\,\ddot{\mathbf{V}}_{kj}\big).
\]
Combining the expectations
Assembling the four components,
\[
\mathbb{E}[\mathcal{J}_{kj}] = \operatorname{tr}\big(\mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\big) + \frac{1}{2}\operatorname{tr}\big(\mathbf{P}\,\ddot{\mathbf{V}}_{kj}\big) - \frac{1}{2}\operatorname{tr}\big(\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\,\dot{\mathbf{V}}_k\big) - \frac{1}{2}\operatorname{tr}\big(\mathbf{P}\,\ddot{\mathbf{V}}_{kj}\big).
\]
The two terms involving the second derivatives \(\ddot{\mathbf{V}}_{kj}\) cancel identically. Moreover, the trace is invariant under cyclic permutation of its arguments, so
\[
\operatorname{tr}\big(\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\,\dot{\mathbf{V}}_k\big) = \operatorname{tr}\big(\mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\big).
\]
Substituting this into the expression above,
\[
\mathbb{E}[\mathcal{J}_{kj}] = \frac{1}{2}\operatorname{tr}\big(\mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\big).
\]
This final expression defines the \((k,j)\)-th entry of the REML Fisher information matrix,
\[
\mathcal{I}_{kj}(\boldsymbol{\theta}) = \frac{1}{2}\operatorname{tr}\big(\mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\big).
\]
Definition of the Average Information¶
The expected information (Fisher information) is
\[
\mathcal{I}_{kj}(\boldsymbol{\theta}) = \frac{1}{2}\operatorname{tr}\big(\mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\big).
\]
The observed information is denoted \(\mathcal{J}_{kj}(\boldsymbol{\theta})\), and the average information matrix is defined as
\[
\mathcal{A}_{kj}(\boldsymbol{\theta}) = \frac{1}{2}\big(\mathcal{I}_{kj}(\boldsymbol{\theta}) + \mathcal{J}_{kj}(\boldsymbol{\theta})\big).
\]
Substituting the expressions and simplifying, the first-derivative trace terms cancel, yielding
\[
\mathcal{A}_{kj}(\boldsymbol{\theta}) = \frac{1}{2}\,\mathbf{Y}^\top \mathbf{P}\,\dot{\mathbf{V}}_k\,\mathbf{P}\,\dot{\mathbf{V}}_j\,\mathbf{P}\mathbf{Y} + \frac{1}{4}\operatorname{tr}\big(\mathbf{P}\,\ddot{\mathbf{V}}_{kj}\big) - \frac{1}{4}\,\mathbf{Y}^\top \mathbf{P}\,\ddot{\mathbf{V}}_{kj}\,\mathbf{P}\mathbf{Y}.
\]
Simplification for Linear Variance Component Models¶
In the common case where \(\mathbf{V}(\boldsymbol{\theta}) = \sum_{t=1}^{m} \theta_t \mathbf{M}_t\) with known matrices \(\mathbf{M}_t\) (writing \(m\) for the number of variance components, to avoid clashing with the number of fixed effects \(p\)), the second derivatives vanish: \(\ddot{\mathbf{V}}_{kj} = \mathbf{0}\) for all \(k,j\). The average information matrix then reduces to
\[
\mathcal{A}_{kj}(\boldsymbol{\theta}) = \frac{1}{2}\,\mathbf{Y}^\top \mathbf{P}\,\mathbf{M}_k\,\mathbf{P}\,\mathbf{M}_j\,\mathbf{P}\mathbf{Y}.
\]
This is the form commonly used in the AI-REML algorithm [Gilmour1995], where the average information matrix is computed directly from the data without requiring second derivatives. The algorithm proceeds by iteratively solving
\[
\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \mathcal{A}\big(\boldsymbol{\theta}^{(t)}\big)^{-1}\,\mathbf{S}\big(\boldsymbol{\theta}^{(t)}\big),
\]
where \(\mathbf{S}(\boldsymbol{\theta})\) is the vector of REML scores.
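A minimal AI-REML sketch for a random-intercept model with \(\mathbf{V}(\boldsymbol{\theta}) = \theta_1\mathbf{I} + \theta_2\mathbf{Z}\mathbf{Z}^\top\). This is a dense-matrix toy implementation under assumed simulated data, not the sparse formulation used in production software; the positivity guard on \(\boldsymbol{\theta}\) stands in for the reparameterisations a real implementation would use.

```python
import numpy as np

rng = np.random.default_rng(5)
n_groups, per_group = 25, 6
n = n_groups * per_group

# Simulated random-intercept data (all values assumed for illustration):
# Y = X beta + Z b + e, with V(theta) = theta_1 * I + theta_2 * ZZ'
groups = np.repeat(np.arange(n_groups), per_group)
Z = np.zeros((n, n_groups))
Z[np.arange(n), groups] = 1.0
X = np.column_stack([np.ones(n), rng.normal(size=n)])
b = rng.normal(scale=np.sqrt(0.6), size=n_groups)
Y = X @ np.array([2.0, -1.0]) + Z @ b + rng.normal(size=n)

Vdots = [np.eye(n), Z @ Z.T]   # dV/dtheta_k for the linear structure

def P_of(theta):
    V = theta[0] * Vdots[0] + theta[1] * Vdots[1]
    Vinv = np.linalg.inv(V)
    M = X.T @ Vinv @ X
    return Vinv - Vinv @ X @ np.linalg.solve(M, X.T @ Vinv)

def reml_score(theta):
    """S_k = -1/2 [tr(P Vdot_k) - Y' P Vdot_k P Y]."""
    P = P_of(theta)
    PY = P @ Y
    return np.array([-0.5 * (np.trace(P @ Vk) - PY @ Vk @ PY) for Vk in Vdots])

def average_information(theta):
    """A_kj = 1/2 Y' P Vdot_k P Vdot_j P Y."""
    P = P_of(theta)
    PY = P @ Y
    W = [P @ (Vj @ PY) for Vj in Vdots]       # P Vdot_j P Y
    return 0.5 * np.array([[PY @ (Vk @ Wj) for Wj in W] for Vk in Vdots])

# AI-REML iterations with a crude positivity guard
theta = np.array([1.0, 1.0])
for _ in range(100):
    step = np.linalg.solve(average_information(theta), reml_score(theta))
    theta = np.maximum(theta + step, 1e-6)
    if np.linalg.norm(step) < 1e-10:
        break

print(theta)   # estimated (residual, group) variance components
```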
References
Gilmour, A. R., Thompson, R., & Cullis, B. R. (1995). Average information REML: An efficient algorithm for variance parameter estimation in linear mixed models. Biometrics, 51(4), 1440-1450.