Introduction to REML Estimation

The REML class implements restricted maximum likelihood (REML) estimation for linear mixed models using the average information (AI) algorithm. The AI algorithm combines the stability of the expected information matrix with the curvature information of the observed information matrix, leading to efficient and robust estimation of variance components.

This vignette introduces the REML model formulation, explains how to construct covariance models, demonstrates optimisation workflows, and covers post-estimation utilities such as BLUEs, BLUPs, predictions, residuals, and convergence diagnostics.

The model

The REML class assumes the linear mixed model

\[\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{b} + \boldsymbol{\varepsilon},\]

where

  • \(\mathbf{y}\) is the response vector of length \(n\),

  • \(\mathbf{X}\) is the fixed-effects design matrix,

  • \(\boldsymbol{\beta}\) is the vector of fixed effects,

  • \(\mathbf{Z}\) is the random-effects design matrix,

  • \(\mathbf{b}\) is the vector of random effects,

  • \(\boldsymbol{\varepsilon}\) is the residual error vector.

The random effects and residuals are assumed independent with

\[\mathbf{b} \sim \mathcal{N}(\mathbf{0}, \mathbf{G}), \qquad \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{R}).\]

The marginal covariance of \(\mathbf{y}\) is therefore

\[\mathbf{V}(\boldsymbol{\theta}) = \mathbf{Z}\mathbf{G}(\boldsymbol{\theta})\mathbf{Z}^\top + \mathbf{R}(\boldsymbol{\theta}),\]

where \(\boldsymbol{\theta}\) denotes the variance-component parameters.

The REML estimator maximises the restricted log-likelihood

\[\ell_R(\boldsymbol{\theta}) = -\frac{1}{2} \Bigl( \log |\mathbf{V}| + \log |\mathbf{X}^\top \mathbf{V}^{-1} \mathbf{X}| + \mathbf{y}^\top \mathbf{P} \mathbf{y} \Bigr),\]

where

\[\mathbf{P} = \mathbf{V}^{-1} - \mathbf{V}^{-1}\mathbf{X} \left( \mathbf{X}^\top \mathbf{V}^{-1}\mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{V}^{-1}\]

is the residual projection matrix.

Important

Internally, REML repeatedly evaluates

  • the covariance matrix \(\mathbf{V}\),

  • its derivatives \(\partial \mathbf{V} / \partial \theta_k\),

  • the score vector,

  • and the average information matrix,

until convergence criteria are satisfied.

Constructing the REML object

The REML constructor supports multiple ways to define the covariance structure. The appropriate approach depends on the complexity of the model and the desired level of control.

The most important constructor arguments are:

map_theta_to_v

Callable mapping \(\boldsymbol{\theta}\) to the marginal covariance matrix \(\mathbf{V}\).

map_theta_to_dv

Callable returning the derivatives \(\partial\mathbf{V}/\partial\theta_k\).

v_builder

A covariance Matrix object representing the covariance structure.

map_theta_to_g

Optional mapping from parameters to the random-effects covariance matrix \(\mathbf{G}\). Required for BLUPs and conditional predictions.

mask_theta_to_g

Boolean mask selecting which parameters are passed to map_theta_to_g.

Exactly one of map_theta_to_v or v_builder must be supplied.

Via map_theta_to_v only

The minimal interface requires only a function mapping parameters to the marginal covariance matrix.

import torch
from torch_openreml import REML

n = 50

def map_theta_to_v(theta):

    sigma2 = theta[0].exp()
    rho = torch.sigmoid(theta[1])

    I = torch.eye(n)
    J = torch.ones(n, n)

    return sigma2 * (I + rho * J)

reml = REML(map_theta_to_v=map_theta_to_v)

In this example:

  • theta[0] controls the variance scale,

  • theta[1] controls a correlation parameter,

  • exponential and sigmoid transforms enforce parameter constraints.

The covariance matrix is

\[\mathbf{V} = \sigma^2 (\mathbf{I} + \rho \mathbf{J}),\]

where \(\mathbf{J}\) is the all-ones matrix.

When only map_theta_to_v is supplied, REML computes derivatives automatically using torch.func.jacrev().

This approach is convenient for prototyping, but automatic differentiation can become expensive for large covariance matrices. At the same time, it offers essentially unrestricted flexibility: the marginal covariance can be constructed through arbitrary differentiable PyTorch operations, ranging from simple parameterizations to highly sophisticated matrix algebra, iterative procedures, decompositions, simulation-based constructions, dynamically assembled covariance components, or even neural networks that directly output or parameterize covariance structure.

Because the covariance construction is defined directly in Python/PyTorch code rather than through a fixed covariance specification, users are free to incorporate conditional branching, stochastic generation, adaptive logic, external modules, or even entirely different covariance structures across optimization iterations (although such behavior is usually not statistically meaningful in practice). In effect, any covariance model that can be expressed as a differentiable computational graph in PyTorch can be used within this framework.

Via map_theta_to_v and map_theta_to_dv

For better performance, analytical derivatives can be supplied explicitly.

import torch
from torch_openreml import REML

n = 50

def map_theta_to_v(theta):

    sigma2 = theta[0].exp()
    rho = torch.sigmoid(theta[1])

    I = torch.eye(n)
    J = torch.ones(n, n)

    return sigma2 * (I + rho * J)

def map_theta_to_dv(theta):

    sigma2 = theta[0].exp()
    rho = torch.sigmoid(theta[1])

    dsigma2 = sigma2
    drho = rho * (1 - rho)

    I = torch.eye(n)
    J = torch.ones(n, n)

    dV_dtheta0 = dsigma2 * (I + rho * J)
    dV_dtheta1 = sigma2 * drho * J

    return torch.stack([dV_dtheta0, dV_dtheta1])

reml = REML(
    map_theta_to_v=map_theta_to_v,
    map_theta_to_dv=map_theta_to_dv,
)

The derivative function must return a tensor of shape (num_free_params, n, n), where slice k corresponds to

\[\frac{\partial \mathbf{V}} {\partial \theta_k}.\]

Providing analytical derivatives is strongly recommended for large models because derivative evaluation is typically one of the most computationally expensive parts of REML optimisation.

Via a covariance Matrix builder

For realistic mixed models, the recommended workflow is to construct the covariance structure using the torch_openreml.covariance matrix system.

The covariance builder API provides reusable covariance components such as:

along with composition operators including:

Example:

import torch

from torch_openreml import REML
from torch_openreml.covariance import (
    DummyMatrix,
    ScalarMatrix,
    CovariancePropagation,
    Sum,
)

n, p = 50, 2

y = torch.randn(n)
X = torch.randn(n, p)

Z = DummyMatrix(["a", "b"] * 25)()

V = Sum(
    CovariancePropagation(
        Z,
        ScalarMatrix(2),
    ),
    ScalarMatrix(n),
)

reml = REML(v_builder=V)

The builder automatically manages:

  • parameter bookkeeping,

  • parameter transforms,

  • covariance assembly,

  • covariance derivatives.

Internally, REML calls

  • map_theta_to_v()

  • map_theta_to_dv()

provided by the builder.

The derivative calculation uses grad(), which attempts a closed-form manual_grad() implementation first and falls back to automatic differentiation when necessary.

Providing map_theta_to_g

Some methods require access to the random-effects covariance matrix \(\mathbf{G}\) separately from the marginal covariance matrix \(\mathbf{V}\).

These include:

The map_theta_to_g argument may be:

  • a callable,

  • a covariance Matrix,

  • or None.

Example:

G = ScalarMatrix(2)

reml = REML(
    v_builder=V,
    map_theta_to_g=G,
    mask_theta_to_g=torch.tensor([True, False]),
)

The mask specifies which elements of \(\boldsymbol{\theta}\) are passed into G.

This is useful when the full parameter vector contains both random-effect and residual variance parameters.

Running the optimiser

Once the REML object is constructed, optimisation is performed using optimize().

import torch

from torch_openreml import REML
from torch_openreml.covariance import (
    DummyMatrix,
    ScalarMatrix,
    CovariancePropagation,
    Sum,
)

n, p = 50, 2

y = torch.randn(n)
X = torch.randn(n, p)

Z = DummyMatrix(["a", "b"] * 25)()

V = Sum(
    CovariancePropagation(
        Z,
        ScalarMatrix(2),
    ),
    ScalarMatrix(n),
)

reml = REML(v_builder=V)

theta_start = torch.zeros(V.num_free_params)

theta_hat, beta_hat, n_iter = reml.optimize(
    y,
    X,
    theta_start,
    verbose=2,
)

∥∇∥:       9.8709, ∥Δ∥: 23.0035, η: 1.00, ∥Δᶜ∥: 23.0035, log 𝓛: -25.2022
∥∇∥:       0.7636, ∥Δ∥: 0.0078, η: 1.00, ∥Δᶜ∥: 0.0078, log 𝓛: -21.8203 (+3.3819)
∥∇∥:       0.0060, ∥Δ∥: 0.0001, η: 1.00, ∥Δᶜ∥: 0.0001, log 𝓛: -21.8173 (+0.0030)
∥∇∥:       0.0000, ∥Δ∥: 0.0000, η: 1.00, ∥Δᶜ∥: 0.0000, log 𝓛: -21.8173 (+0.0000)

[∇: score, Δ: 𝐉⁻¹∇, η: learning rate, Δᶜ: clip(𝛉 + ηΔ, lb, ub) - 𝛉, 𝓛: restricted likelihood]

✓ Converged at iteration 4

Inputs

y

Response vector of shape (n,).

X

Fixed-effects design matrix of shape (n, p).

theta_start

Initial parameter vector.

Choosing sensible starting values can substantially improve convergence, especially for complex covariance structures.

Returned values

theta_hat

Final variance-component estimates.

beta_hat

Estimated fixed effects evaluated at theta_hat.

n_iter

Number of completed optimisation iterations.

Optimisation controls

The optimisation routine supports several keyword arguments.

max_iter (default 200)

Maximum number of iterations.

eta (default 1.0)

Step-size multiplier applied to the AI update.

The parameter update is

\[\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} + \eta \, \mathbf{AI}^{-1}\mathbf{s}.\]

Smaller values of eta can improve stability when the optimisation oscillates or diverges.

lb and ub

Optional lower and upper parameter bounds applied after each update.

verbose (default 0)

Controls optimisation output.

  • 0: silent

  • 1: progress bar

  • 2: detailed iteration diagnostics

Detailed diagnostics include:

  • score norm,

  • update norm,

  • log-likelihood,

  • learning rate,

  • iteration count.

Convergence criteria

At each iteration, convergence is evaluated using three criteria.

Score norm

The score vector must satisfy

\[\|\mathbf{s}\| < \texttt{tol\_score}.\]

This checks whether the gradient is close to zero.

Parameter update norm

The parameter update must satisfy

\[\|\Delta\| < \texttt{tol\_delta}.\]

This checks whether the parameters have stabilised.

Log-likelihood change

The restricted log-likelihood change must satisfy

\[|\ell_R^{(t)} - \ell_R^{(t-1)}| < \texttt{tol\_loglik}.\]

Default tolerances

The default tolerances are:

  • tol_score = 1e-4

  • tol_delta = 1e-4

  • tol_loglik = 1e-4

All enabled criteria must be satisfied simultaneously.

Criteria may be individually disabled using:

  • check_score=False

  • check_delta=False

  • check_loglik=False

At least two iterations are required before convergence can be declared.

Post-estimation methods

After optimisation, the fitted REML object provides several utilities for extracting estimates and predictions.

Retrieving parameter estimates

theta_last = reml.get_theta(select="last")
theta_best = reml.get_theta(select="best")

beta_last = reml.get_beta(select="last")
beta_best = reml.get_beta(select="best")
select="last"

Returns the final optimisation iterate.

select="best"

Returns the iterate with the highest restricted log-likelihood.

The "best" option is useful if optimisation temporarily overshoots or oscillates near convergence.

BLUE — fixed effects

The best linear unbiased estimator (BLUE) of the fixed effects is

\[\widehat{\boldsymbol{\beta}} = \left( \mathbf{X}^\top \mathbf{V}^{-1} \mathbf{X} \right)^{-1} \mathbf{X}^\top \mathbf{V}^{-1} \mathbf{y}.\]

Compute it using:

beta_hat = reml.blue(y, X, theta_hat)

BLUP — random effects

The best linear unbiased predictor (BLUP) of the random effects is

\[\widehat{\mathbf{b}} = \mathbf{G} \mathbf{Z}^\top \mathbf{V}^{-1} \left( \mathbf{y} - \mathbf{X}\widehat{\boldsymbol{\beta}} \right).\]

Compute it using:

b_hat = reml.blup(
    y,
    X,
    Z,
    theta_hat,
    map_theta_to_g=ScalarMatrix(2),
    mask_theta_to_g=torch.tensor([True, False]),
)

This method requires map_theta_to_g.

Predictions

Marginal predictions use only fixed effects:

\[\widehat{\mathbf{y}}_{\mathrm{marginal}} = \mathbf{X}\widehat{\boldsymbol{\beta}}.\]

Conditional predictions additionally include random effects:

\[\widehat{\mathbf{y}}_{\mathrm{conditional}} = \mathbf{X}\widehat{\boldsymbol{\beta}} + \mathbf{Z}\widehat{\mathbf{b}}.\]

Example:

y_hat_marginal = reml.marginal_predict(
    y,
    X,
    theta_hat,
)

y_hat_conditional = reml.predict(
    y,
    X,
    Z,
    theta_hat,
    map_theta_to_g=ScalarMatrix(2),
    mask_theta_to_g=torch.tensor([True, False]),
)

Residuals

Marginal residuals:

\[\mathbf{e}_{\mathrm{marginal}} = \mathbf{y} - \widehat{\mathbf{y}}_{\mathrm{marginal}}.\]

Conditional residuals:

\[\mathbf{e}_{\mathrm{conditional}} = \mathbf{y} - \widehat{\mathbf{y}}_{\mathrm{conditional}}.\]

Example:

e_marginal = reml.marginal_residual(
    y,
    X,
    theta_hat,
)

e_conditional = reml.residual(
    y,
    X,
    Z,
    theta_hat,
    map_theta_to_g=ScalarMatrix(2),
    mask_theta_to_g=torch.tensor([True, False]),
)

Evaluating the log-likelihood

The REML log-likelihood can be evaluated directly without optimisation.

loglik = reml.loglik(y, X, theta_hat)

This is useful for:

  • debugging,

  • likelihood profiling,

  • model comparison,

  • monitoring optimisation trajectories.

Complete example

The following example demonstrates a mixed model with:

  • genotype random effects,

  • replicate-block random effects,

  • residual variance.

import torch

from torch_openreml import REML
from torch_openreml.utils import augment, n_distinct

from torch_openreml.covariance import (
    DummyMatrix,
    IdentityMatrix,
    ScalarMatrix,
    Sum,
    CovariancePropagation,
    KroneckerProduct,
)

from torch_openreml.example_data import john_alpha

# --- response ---
y = torch.tensor(john_alpha["yield"].values)

# --- fixed effects ---
X = augment(
    torch.ones(len(john_alpha), 1),
    DummyMatrix(john_alpha["rep"], drop_first=True)()
)

# --- random effect design matrices ---
Z_gen = DummyMatrix(john_alpha["gen"])
Z_rep_block = DummyMatrix(john_alpha["rep"], john_alpha["block"])

# --- covariance components ---
G_gen = ScalarMatrix(n_distinct(john_alpha["gen"]))
G_rep = IdentityMatrix(n_distinct(john_alpha["rep"]))
G_block = ScalarMatrix(n_distinct(john_alpha["block"]))

R = ScalarMatrix(len(john_alpha))

# --- marginal covariance ---
V = Sum(
    CovariancePropagation(Z_gen, G_gen),
    CovariancePropagation(
        Z_rep_block,
        KroneckerProduct(G_rep, G_block)
    ),
    R
)

# --- REML fit ---
reml = REML(v_builder=V)

theta_start = torch.zeros(V.num_free_params)

theta_hat, beta_hat, n_iter = reml.optimize(
    y,
    X,
    theta_start,
    verbose=2,
)

# --- results ---
print("theta:", theta_hat)

print("variance components:", V.build_params(theta_hat))

print("fixed effects:", beta_hat)

print("loglik:", reml.loglik(y, X, theta_hat))

∥∇∥:      41.8197, ∥Δ∥: 8.9438, η: 1.00, ∥Δᶜ∥: 8.9438, log 𝓛: -34.3129
∥∇∥:  315227.4375, ∥Δ∥: 0.8838, η: 1.00, ∥Δᶜ∥: 0.8838, log 𝓛: -201185.3906 (-201151.0777)
∥∇∥:  119229.9453, ∥Δ∥: 0.8520, η: 1.00, ∥Δᶜ∥: 0.8520, log 𝓛: -71844.4531 (+129340.9375)
∥∇∥:   44273.4570, ∥Δ∥: 0.8143, η: 1.00, ∥Δᶜ∥: 0.8143, log 𝓛: -26065.4551 (+45778.9980)
∥∇∥:   16330.5117, ∥Δ∥: 0.7587, η: 1.00, ∥Δᶜ∥: 0.7587, log 𝓛: -9457.1396 (+16608.3154)
∥∇∥:    6001.4102, ∥Δ∥: 0.7142, η: 1.00, ∥Δᶜ∥: 0.7142, log 𝓛: -3392.8696 (+6064.2700)
∥∇∥:    2194.8989, ∥Δ∥: 0.7031, η: 1.00, ∥Δᶜ∥: 0.7031, log 𝓛: -1181.2805 (+2211.5891)
∥∇∥:     793.7851, ∥Δ∥: 0.6952, η: 1.00, ∥Δᶜ∥: 0.6952, log 𝓛: -382.4908 (+798.7897)
∥∇∥:     278.6875, ∥Δ∥: 0.6715, η: 1.00, ∥Δᶜ∥: 0.6715, log 𝓛: -102.2992 (+280.1916)
∥∇∥:      90.4116, ∥Δ∥: 0.5957, η: 1.00, ∥Δᶜ∥: 0.5957, log 𝓛: -11.4274 (+90.8718)
∥∇∥:      23.9216, ∥Δ∥: 0.4006, η: 1.00, ∥Δᶜ∥: 0.4006, log 𝓛:  12.6792 (+24.1066)
∥∇∥:       3.8674, ∥Δ∥: 0.1392, η: 1.00, ∥Δᶜ∥: 0.1392, log 𝓛:  16.5908 (+3.9116)
∥∇∥:       0.2448, ∥Δ∥: 0.0157, η: 1.00, ∥Δᶜ∥: 0.0157, log 𝓛:  16.8078 (+0.2170)
∥∇∥:       0.0104, ∥Δ∥: 0.0007, η: 1.00, ∥Δᶜ∥: 0.0007, log 𝓛:  16.8099 (+0.0021)
∥∇∥:       0.0004, ∥Δ∥: 0.0000, η: 1.00, ∥Δᶜ∥: 0.0000, log 𝓛:  16.8100 (+0.0001)
∥∇∥:       0.0000, ∥Δ∥: 0.0000, η: 1.00, ∥Δᶜ∥: 0.0000, log 𝓛:  16.8099 (-0.0001)

[∇: score, Δ: 𝐉⁻¹∇, η: learning rate, Δᶜ: clip(𝛉 + ηΔ, lb, ub) - 𝛉, 𝓛: restricted likelihood]

✓ Converged at iteration 16
theta: tensor([-0.9728, -1.3281, -1.2529])
variance components: tensor([0.1429, 0.0702, 0.0816])
fixed effects: tensor([ 4.5183,  0.2978, -0.4140])
loglik: tensor(16.8098)

Optimisation history

After optimisation, the full iteration history is stored in reml.history.

This dictionary contains per-iteration records including:

  • theta

  • beta

  • loglik

  • score

  • ai

  • delta

  • update

This history is useful for:

  • diagnosing convergence problems,

  • inspecting optimisation trajectories,

  • plotting likelihood curves,

  • monitoring parameter stability.

Example:

scores = [
    torch.norm(s).item()
    for s in reml.history["score"]
]

logliks = [
    ll.item()
    for ll in reml.history["loglik"]
]

print(
    "Score norms:",
    [f"{s:.6f}" for s in scores],
)

print(
    "Log-likelihoods:",
    [f"{ll:.4f}" for ll in logliks],
)
Score norms: ['41.819736', '315227.437500', '119229.945312', '44273.457031', '16330.511719', '6001.410156', '2194.898926', '793.785095', '278.687469', '90.411598', '23.921585', '3.867368', '0.244757', '0.010371', '0.000418', '0.000021']
Log-likelihoods: ['-34.3129', '-201185.3906', '-71844.4531', '-26065.4551', '-9457.1396', '-3392.8696', '-1181.2805', '-382.4908', '-102.2992', '-11.4274', '12.6792', '16.5908', '16.8078', '16.8099', '16.8100', '16.8099']