Chapter 13: Penalised and Bayesian Regression for Small Samples

Learning Objectives

By the end of this chapter, you will be able to explain why ordinary maximum likelihood can fail with sparse regression data, recognise separation in logistic regression, use penalised estimates to stabilise small-sample models, describe how weakly informative Bayesian priors regularise estimates, and report sensitivity checks without presenting regularisation as a substitute for information.

The Problem of Sparse Data in Regression

Classical maximum likelihood estimation can become unstable when sample sizes are small, events are rare, or predictors nearly separate outcome groups. In logistic regression, separation occurs when a predictor or predictor combination nearly or perfectly predicts the binary outcome. The fitted probabilities then approach 0 or 1, coefficient estimates become very large, and Wald standard errors stop being useful. A practical diagnostic is to inspect simple predictor-by-outcome tables for zero cells and to check whether the logistic model warns that fitted probabilities are numerically 0 or 1.
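
As a minimal sketch of those two checks in R, using a small illustrative dataset (the variable names and counts are hypothetical):

```r
# Illustrative small dataset: a banded predictor and a binary outcome
dat <- data.frame(
  planning_band = rep(c("low", "high"), times = c(12, 8)),
  success       = c(rep(0, 11), 1, rep(1, 8))
)

# Check 1: a zero cell in the predictor-by-outcome table signals separation risk
table(dat$planning_band, dat$success)

# Check 2: glm() warns when fitted probabilities are numerically 0 or 1;
# the range of fitted values makes the problem visible
fit_ml <- glm(success ~ planning_band, data = dat, family = binomial)
range(fitted(fit_ml))
```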

Penalised regression and Bayesian regression respond to the same problem in different language. Penalised regression adds a constraint to the likelihood so that extreme coefficients are pulled back toward more stable values. Bayesian regression combines the likelihood with prior distributions. Weakly informative priors serve as regularisation when the data alone are too thin to support precise estimates. Neither approach creates information that is not in the data. Both make the modelling assumptions more explicit.

Choosing Among Regularisation Strategies

Regularisation covers a family of strategies, and the appropriate choice depends on the outcome, the modelling goal, and the specific source of instability.

Sparse binary outcome or separation warnings in logistic regression: start with Firth logistic regression, which produces finite, bias-reduced estimates when ordinary maximum likelihood breaks down.
Continuous outcome with correlated predictors: start with ridge regression, which shrinks all slopes and reduces instability from collinearity without selecting a single “winner”.
Continuous outcome with many candidate predictors and a screening goal: start with the LASSO, which can set weak or unstable coefficients to zero, although that selection should be treated as exploratory.
Strong prior knowledge, or a need to express assumptions directly: start with Bayesian regression with weakly informative priors, which regularises estimates through explicit prior distributions and requires convergence checks.
Main goal is explanation of one pre-specified effect: start with a simpler pre-specified model plus a sensitivity analysis, because penalised selection can obscure the estimand when the target effect was already known.
Main goal is prediction: start with a penalised model with transparent tuning and validation, because prediction requires checking out-of-sample behaviour, not only coefficient significance.

For small samples, the safest workflow is to fit the simplest scientifically meaningful model first, then use regularised estimates as sensitivity checks or as explicitly labelled prediction tools. If regularisation changes the substantive conclusion, report that instability rather than hiding it behind a single preferred model.

Firth-Penalised Logistic Regression

Firth’s method reduces small-sample bias in likelihood estimation and is especially useful for sparse binary outcomes or separation (Firth 1993; Heinze and Schemper 2002). Table 13.1 shows the structure of a small project-success example. Most low-planning projects fail and most high-planning projects succeed, which creates a separation risk.

Table 13.1

Sparse project-success data used for the logistic-regression example

Planning band No success Success Success rate
Planning score 1-5 11 1 8%
Planning score 6-9 0 8 100%

Note. The outcome is nearly separated by planning score. This is the setting where ordinary logistic regression can produce unstable estimates.

The ordinary logistic model fits in R, but the fitted probabilities are close to 0 or 1. That is a warning sign even if the software returns coefficients. Table 13.2 compares ordinary maximum-likelihood estimates with Firth-penalised estimates when the logistf package is available.
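
A minimal sketch of both fits is shown below; the success counts match Table 13.1, but the individual planning scores and prior-experience values are assumed for illustration.

```r
# Illustrative data: success counts match Table 13.1; the individual
# planning scores and prior-experience values are assumed
set.seed(13)
dat <- data.frame(
  planning_score   = c(sample(1:5, 12, replace = TRUE),
                       sample(6:9, 8, replace = TRUE)),
  prior_experience = rbinom(20, 1, 0.5),
  success          = c(rep(0, 11), 1, rep(1, 8))
)

# Ordinary maximum likelihood: expect very large coefficients, enormous Wald
# standard errors, and the fitted-probabilities warning
fit_ml <- glm(success ~ planning_score + prior_experience,
              data = dat, family = binomial)
summary(fit_ml)

# Firth-penalised fit: finite, bias-reduced estimates (requires logistf)
library(logistf)
fit_firth <- logistf(success ~ planning_score + prior_experience, data = dat)
summary(fit_firth)

# Coefficients are on the log-odds scale; exponentiate for odds ratios
exp(fit_firth$coefficients)
```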

Table 13.2

Standard and Firth-penalised logistic-regression estimates

Method Term Estimate Std. error p-value
Standard ML Planning score 43.85 42168.42 0.999
Standard ML Prior experience 44.72 57464.52 0.999
Firth Planning score 1.02 0.80 0.218
Firth Prior experience 1.66 1.54 0.328

Note. The standard-model standard errors are Wald estimates from glm(); under separation they can become extremely large or effectively unbounded, especially when R warns that fitted probabilities are numerically 0 or 1. That warning did occur in this example: the fitted probabilities were numerically close to 0 and 1. When it appears, switch to logistf or another penalised method. Firth's method adds a penalty proportional to the log determinant of the Fisher information matrix, reducing small-sample bias and preventing infinite estimates under separation. Coefficients remain on the log-odds scale and can be exponentiated for odds-ratio interpretation.

Interpretation should focus on direction, uncertainty, and model fragility. A finite Firth coefficient is a more stable estimate under a penalised likelihood, not confirmation that the effect size is precisely known. In small samples, report the event counts, variables in the model, penalisation method, confidence intervals, and whether ordinary logistic regression showed separation warnings.

Ridge Regression as Shrinkage

Ridge regression shrinks regression coefficients toward zero by adding a penalty proportional to the squared coefficient size. It is useful when predictors are correlated, sample size is modest, or the goal is prediction rather than unpenalised coefficient interpretation (Harrell 2015). Table 13.3 and Figure 13.1 show the same small customer-satisfaction regression under increasing ridge penalties.
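
The sketch below computes ridge estimates in closed form, assuming a data frame satisfaction_data with columns satisfaction, wait_time and friendliness (the names are illustrative); implementations scale the penalty differently, so the numbers need not reproduce Table 13.3 exactly.

```r
# Ridge solution in closed form: (X'X + lambda I)^(-1) X'y on standardised
# predictors, with an unpenalised intercept equal to the outcome mean
ridge_fit <- function(X, y, lambda) {
  Xs   <- scale(X)  # mean 0, SD 1, because the L2 penalty is scale-dependent
  beta <- solve(crossprod(Xs) + lambda * diag(ncol(Xs)),
                crossprod(Xs, y - mean(y)))
  c(intercept = mean(y), beta[, 1])
}

# Hypothetical data frame from the running example
X <- as.matrix(satisfaction_data[, c("wait_time", "friendliness")])
y <- satisfaction_data$satisfaction

# Slopes shrink toward zero as the penalty grows, as in Table 13.3
sapply(c(0, 1, 5, 20), function(l) ridge_fit(X, y, l))
```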

Table 13.3

Ridge coefficient estimates under increasing penalty strength

Lambda Intercept Wait time Friendliness
0 5.94 -0.24 1.14
1 5.94 -0.42 0.91
5 5.94 -0.51 0.68
20 5.94 -0.40 0.45

Note. Predictors were standardised to mean = 0 and SD = 1 before fitting because the L2 penalty is scale-dependent. Lambda = 0 is the ordinary least-squares solution. Coefficients apply to standardised predictors unless back-transformed.

Figure 13.1: Ridge coefficient paths for the customer-satisfaction example.

As lambda increases, the slope estimates move toward zero. That shrinkage can reduce overfitting and improve prediction, but it also changes the estimand: the coefficients are penalised estimates, not ordinary least-squares coefficients. Report the penalty-selection method, whether predictors were standardised, and whether the model was used for prediction or interpretation.

Choosing the Ridge Penalty with glmnet

The code below shows the same ridge idea using glmnet, which is the package readers are most likely to use in practice. The predictors are standardised internally, alpha = 0 requests ridge rather than lasso, and cross-validation selects a penalty. The sample is deliberately small, so the cross-validation curve should be read as a tuning aid rather than as a precise estimate of out-of-sample performance.
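
A sketch of that workflow, assuming the same illustrative satisfaction_data data frame used above:

```r
library(glmnet)

# Predictor matrix without an intercept column; glmnet standardises internally
X <- as.matrix(satisfaction_data[, c("wait_time", "friendliness")])
y <- satisfaction_data$satisfaction

# alpha = 0 requests the ridge (L2) penalty; cross-validation selects lambda
set.seed(42)
cv_ridge <- cv.glmnet(X, y, alpha = 0)
plot(cv_ridge)  # cross-validation curve, as in Figure 13.2

# Coefficients at the error-minimising penalty, as in Table 13.4
coef(cv_ridge, s = "lambda.min")
```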

Table 13.4

Ridge estimates from glmnet at the selected penalty

Term Estimate
(Intercept) 2.183
wait_time -0.180
friendliness 0.740

Note. The selected lambda is the value minimising cross-validated prediction error. In small samples, repeat the analysis under plausible modelling choices rather than treating one cross-validation split as definitive.

Figure 13.2: Cross-validation curve for ridge penalty selection with glmnet.

LASSO for Predictor Screening

The LASSO uses an L1 penalty rather than the squared L2 penalty used by ridge regression. This means that, as the penalty increases, some coefficients can be shrunk exactly to zero. That property makes LASSO useful for cautious predictor screening when a small dataset contains more candidate predictors than the sample can estimate reliably. It should not be treated as proof that excluded variables are irrelevant. With small samples, selected predictors can change under modest resampling or under a different set of candidate variables.

The example below uses 30 observations and five candidate predictors. Cross-validation supplies two common penalty choices: lambda.min, the value that minimises cross-validated error, and lambda.1se, the largest penalty whose error lies within one standard error of that minimum and which therefore yields a more parsimonious model.
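
A sketch of that comparison, assuming an illustrative data frame customer_data containing the outcome and the five candidate predictors named below:

```r
library(glmnet)

# Outcome and five candidate predictors (hypothetical data frame and names)
predictors <- c("service_quality", "response_speed", "price_fairness",
                "staff_training", "waiting_room_comfort")
X <- as.matrix(customer_data[, predictors])
y <- customer_data$overall_satisfaction

# alpha = 1 requests the LASSO (L1) penalty; cross-validation tunes lambda
set.seed(42)
cv_lasso <- cv.glmnet(X, y, alpha = 1)

# Compare the two common penalty rules; zeros are predictors dropped at that penalty
coef_min <- as.matrix(coef(cv_lasso, s = "lambda.min"))
coef_1se <- as.matrix(coef(cv_lasso, s = "lambda.1se"))
data.frame(term       = rownames(coef_min),
           lambda_min = coef_min[, 1],
           lambda_1se = coef_1se[, 1])
```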

Table 13.5

LASSO coefficients under two cross-validated penalty choices

Term lambda.min lambda.1se
Intercept 60.21 60.30
Service quality 3.10 2.27
Response speed -1.61 -1.08
Price fairness -0.81 0.00
Staff training 1.14 0.06
Waiting-room comfort -0.19 0.00

Note. Coefficients were estimated with standardised predictors. Values equal to 0 indicate variables removed by the LASSO penalty at that tuning value.

Figure 13.3: LASSO coefficient paths for five candidate predictors.

The main reporting point is the penalty rule, not just the final coefficients. If lambda.1se removes a predictor that lambda.min retains, describe that predictor as unstable rather than definitively absent. For explanatory work, LASSO is best used as a sensitivity analysis or screening tool before a simpler, pre-specified model is reported.

Bayesian Priors as Regularisation

Weakly informative priors are regularisation tools, not a way to force a preferred conclusion. A prior such as Normal(0, 2.5) on a logistic-regression coefficient keeps roughly 95% of its mass within two standard deviations of zero, so odds ratios between about exp(-5) ≈ 0.007 and exp(5) ≈ 148 remain plausible a priori while extreme estimates are still regularised. Stronger priors such as Normal(0, 0.5) exert more shrinkage and should be justified by substantive knowledge or prior evidence (Gelman, Simpson, and Betancourt 2017).

Table 13.6 illustrates prior sensitivity for the wait-time slope in the customer-satisfaction example using a normal approximation to the likelihood. The table is not a replacement for full Bayesian computation, but it shows the main principle: when the prior is tight, the posterior moves toward zero. When the prior is weak, the posterior resembles the data-driven estimate.

Table 13.6

Prior-sensitivity illustration for a Bayesian regression slope

Prior on wait-time slope Posterior mean Posterior SD 95% credible interval
Normal(0, 0.5) -0.17 0.27 -0.69 to 0.36
Normal(0, 1) -0.21 0.30 -0.81 to 0.38
Normal(0, 2.5) -0.23 0.32 -0.85 to 0.39
Normal(0, 10) -0.24 0.32 -0.86 to 0.39

Note. The approximation uses the ordinary least-squares wait-time slope and standard error as a normal likelihood. Full Bayesian analyses should still check convergence diagnostics, such as R-hat < 1.01 and effective sample size > 400, and should inspect posterior predictive fit.
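
The conjugate normal update behind Table 13.6 can be sketched directly; the ordinary least-squares slope (about -0.24) is taken from Table 13.3, and its standard error (about 0.32) is assumed here for illustration.

```r
# Conjugate normal update: a Normal(0, prior_sd) prior combined with a normal
# likelihood centred on the OLS slope; precision is 1 / variance
prior_posterior <- function(prior_sd, b_ols, se_ols) {
  post_prec <- 1 / prior_sd^2 + 1 / se_ols^2
  post_mean <- (0 / prior_sd^2 + b_ols / se_ols^2) / post_prec  # prior mean is 0
  post_sd   <- sqrt(1 / post_prec)
  c(mean  = post_mean, sd = post_sd,
    lower = post_mean - 1.96 * post_sd,
    upper = post_mean + 1.96 * post_sd)
}

# Wait-time slope of about -0.24 (Table 13.3); its standard error of about
# 0.32 is assumed for illustration
sapply(c(0.5, 1, 2.5, 10), prior_posterior, b_ols = -0.24, se_ols = 0.32)
```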

For a full Bayesian fit, use an MCMC package and report diagnostics. The following code is intentionally not evaluated in the book render because brms requires a working Stan toolchain, but it is the minimal workflow expected in a manuscript: specify priors, sample, check R-hat and effective sample size, and inspect posterior predictive fit. Before running it locally, verify the Stan toolchain with cmdstanr::check_cmdstan_toolchain() if using CmdStan, or confirm the installed rstan version with rstan::stan_version() before fitting a small test model.
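
A minimal sketch of that workflow, assuming the illustrative satisfaction_data data frame from the ridge example:

```r
library(brms)

# Weakly informative Normal(0, 2.5) priors on the slopes; brms defaults are
# kept for the intercept and the residual standard deviation
priors <- set_prior("normal(0, 2.5)", class = "b")

fit_bayes <- brm(
  satisfaction ~ wait_time + friendliness,
  data   = satisfaction_data,
  family = gaussian(),
  prior  = priors,
  chains = 4, iter = 2000, seed = 13
)

summary(fit_bayes)   # check R-hat (< 1.01) and effective sample sizes
pp_check(fit_bayes)  # posterior predictive check against the observed outcome
loo(fit_bayes)       # approximate leave-one-out cross-validation (see below)
```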

Bayesian regression reports posterior intervals rather than frequentist confidence intervals. A 95% credible interval describes the range containing 95% of posterior probability given the model, data, and priors. That statement is conditional on the prior choice, so small-sample Bayesian reports should include the priors, convergence diagnostics, posterior predictive checks, and at least one plausible prior-sensitivity analysis. For leave-one-out cross-validation or WAIC in Bayesian models, use them as approximate predictive checks, not as automatic proof that one small-sample model is correct (Vehtari, Gelman, and Gabry 2017).

Reporting Regularised Models

A regularised model report should make the stabilising assumption visible. State the outcome, sample size, event count where relevant, candidate predictors, standardisation, penalty or prior, tuning method and sensitivity checks. Do not report a penalised coefficient as if it were an ordinary unpenalised estimate.

For Firth logistic regression, report the sparse event table, the separation warning or diagnostic that motivated the method, coefficient scale, odds ratios if used, confidence intervals and software. For ridge and LASSO, report whether predictors were standardised, the value of lambda, how lambda was chosen, and whether conclusions change under lambda.min versus lambda.1se or under a simpler unpenalised model. For Bayesian models, report priors, seed, chains, iterations, R-hat, effective sample size, posterior intervals and posterior predictive checks.

A concise reporting sentence could read: “Because the ordinary logistic model produced fitted probabilities close to 0 and 1, we estimated a Firth-penalised logistic regression. The model included planning score and prior experience, reported coefficients on the log-odds scale with confidence intervals, and was interpreted as a stabilised sensitivity analysis rather than as precise evidence from a large sample.” That level of detail is enough for readers to see both the method and the limitation.

Key Takeaways

Penalised and Bayesian regression methods are valuable in small samples because they make unstable estimates finite and reduce overfitting. Firth logistic regression is especially useful for sparse binary outcomes and separation. Ridge regression shrinks correlated or noisy linear-model coefficients. LASSO can screen candidate predictors by setting unstable coefficients to zero. Bayesian priors regularise estimates and express assumptions directly. The reporting obligation is the same in all cases: show the data structure, state the penalty or prior, check sensitivity, and avoid treating regularised estimates as more precise than the sample supports.

Self-Assessment Quiz

Question 1. What does separation mean in logistic regression?

Explanation.

Separation occurs when the outcome can be nearly or perfectly predicted from the covariates. Standard logistic maximum likelihood can then produce extremely large or infinite coefficients.

Question 2. Why is Firth logistic regression useful with sparse binary outcomes?

Explanation.

Firth’s method uses a penalised likelihood that reduces small-sample bias and avoids infinite estimates. Event counts and uncertainty still need to be reported.

Question 3. What does ridge regression do to coefficients?

Explanation.

Ridge regression adds a squared-coefficient penalty. Larger penalties pull estimates toward zero and can reduce overfitting.

Question 4. In a small-sample Bayesian regression, why should prior sensitivity be reported?

Explanation.

When the data are thin, different plausible priors can lead to noticeably different posterior estimates. Reporting sensitivity shows how much the conclusion depends on modelling assumptions.

Question 5. Which statement is the safest interpretation of a regularised estimate?

Explanation.

Regularisation stabilises estimates by adding assumptions. Those assumptions, along with intervals and diagnostics, must be reported.