Chapter 2: Questions and Outcomes that Fit Small n

Small-sample studies are most effective when the research question, outcome, and design are chosen to match the information the data can realistically support. This chapter shows how to narrow broad ideas into answerable questions, how outcome scales affect what can be learned from modest samples, and why estimation often matters more than a binary significant or non-significant result. The goal is practical: to help you design studies that extract the most from limited data and communicate findings at a level the evidence can honestly support.

Learning Objectives

By the end of this chapter, you will be able to distinguish exploratory from confirmatory aims, formulate focused research questions that fit limited data, choose outcome measures that preserve useful information with small n, and calculate the minimum detectable effect implied by a realistic design.

Framing Realistic Research Questions

Small-sample studies work best when the research question is narrow. Questions that ask the data to estimate many predictors, interactions, mediation paths, or measurement parameters such as factor loadings for a long questionnaire usually require far more observations. Focused questions about a single outcome or a few key comparisons are much more realistic with modest samples.

When planning a small-sample study, prioritise clarity and specificity. A question such as “Does a brief reminder intervention improve adherence compared to standard care?” is focused, has a clear comparison, and can be tested in a small randomised trial. A question like “What are all the factors that influence patient adherence?” spreads the available information across too many unknowns.

It also matters whether the study is exploratory or confirmatory. Exploratory studies generate hypotheses, describe patterns, and refine measurement instruments. They can be useful with modest samples, provided the findings are framed as provisional and replication is expected. Because exploratory work often inspects several outcomes or subgroups without adjustment for multiplicity, some apparent findings will reflect chance alone. Confirmatory studies test prespecified hypotheses and therefore need enough power to support that stronger claim. With small samples, confirmatory aims should be modest and carefully justified.
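
To make the multiplicity concern concrete, the short R sketch below computes the chance of at least one false-positive result when several outcomes are each tested at an unadjusted 0.05 level. It assumes independent tests and no true effects, which is a simplification, and the numbers of outcomes are illustrative only.

```r
# Chance of at least one false positive across k unadjusted tests,
# assuming independent tests and no true effects (a simplification).
alpha <- 0.05
n_outcomes <- c(1, 3, 5, 10, 20)  # illustrative numbers of outcomes inspected

familywise_error <- 1 - (1 - alpha)^n_outcomes

multiplicity <- data.frame(
  outcomes_tested = n_outcomes,
  prob_any_false_positive = round(familywise_error, 3)
)
print(multiplicity)
```

Under these assumptions, ten unadjusted tests carry roughly a 40% chance of at least one spurious "finding", which is why exploratory results should be framed as provisional.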

From Objective to Hypothesis

A small-sample study benefits from a clear hierarchy. In a confirmatory design, that usually means one primary objective, one primary research question, and one primary hypothesis. In an exploratory or pilot design, the hypothesis is often replaced with an estimation or feasibility objective because the data cannot support several formal confirmatory claims well. The objective states what the study is trying to learn, the research question identifies the population, comparison, outcome, and timeframe, and the hypothesis states the expected difference or association on that specific outcome.

For example, a focused small-sample study might use the following sequence:

Objective: To assess whether a brief reminder intervention improves medication adherence over four weeks compared with standard care.
Research question: Among adults attending a primary-care clinic, do participants receiving the reminder intervention have higher four-week adherence scores than those receiving standard care?
Confirmatory hypothesis: Participants assigned to the reminder intervention will have higher mean adherence scores at four weeks than participants assigned to standard care.
Exploratory objective: To estimate the difference in adherence score between groups and assess recruitment, retention, and intervention uptake in preparation for a larger confirmatory trial.

When writing hypotheses for small-sample studies, keep them narrow and defensible. A good small-sample hypothesis names one primary outcome, one main comparison, and a plausible expected pattern. Directional hypotheses are best reserved for situations where prior theory or evidence is strong enough to justify specifying the direction in advance. Avoid omnibus statements that bundle several outcomes, subgroup effects, mediators, and interactions into a single claim. Avoid phrasing hypotheses around achieving statistical significance. The hypothesis should describe the expected substantive pattern, while the analysis later evaluates its uncertainty.

Choosing Appropriate Outcomes

The type of outcome variable influences which methods are feasible and how much information can be extracted from limited data. Binary outcomes (yes/no, success/failure) are common but carry less information per observation than continuous or ordinal measures. If your sample is small, consider whether a continuous or ordinal outcome might capture more variation and yield more precise inferences.

For example, rather than dichotomising patient improvement into “improved” versus “not improved”, use a continuous measure of symptom severity or an ordinal scale with several levels. This preserves information and increases statistical efficiency. When the outcome is inherently binary, such as survival within 30 days, keep it in that form.
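
The efficiency cost of dichotomising can be illustrated with a small simulation. The sketch below is not a general result: the sample size, true effect, and cut-point are assumed values chosen for illustration, and the two analyses (a Welch t-test and Fisher's exact test) are just one reasonable pairing.

```r
# Simulation sketch: power of a continuous analysis versus a dichotomised one.
# All settings below are illustrative assumptions, not values from a real study.
set.seed(123)
n_per_group <- 20    # participants per arm
true_d      <- 0.8   # true standardised mean difference
cutoff      <- 0.4   # hypothetical "improved" threshold on the outcome scale
n_sims      <- 1000

simulate_once <- function() {
  control <- rnorm(n_per_group, mean = 0, sd = 1)
  treated <- rnorm(n_per_group, mean = true_d, sd = 1)

  # Analysis 1: keep the continuous outcome (Welch two-sample t-test).
  p_continuous <- t.test(treated, control)$p.value

  # Analysis 2: dichotomise at the cutoff, then use Fisher's exact test.
  arm      <- factor(rep(c("control", "treated"), each = n_per_group))
  improved <- factor(c(control, treated) > cutoff, levels = c(FALSE, TRUE))
  p_binary <- fisher.test(table(arm, improved))$p.value

  c(continuous = p_continuous, binary = p_binary)
}

p_values  <- replicate(n_sims, simulate_once())
power_est <- rowMeans(p_values < 0.05)  # share of simulations with p < 0.05
print(round(power_est, 2))
```

Changing `cutoff` or `true_d` shows how the gap between the two analyses depends on where the threshold falls relative to the group means.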

Count outcomes (number of adverse events, number of customer complaints) are also informative but may be sparse when samples are small. Exact Poisson tests and negative binomial models can handle low counts, but very sparse data (many zeros, few events) may require careful interpretation or resampling methods.
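
As a minimal sketch of an exact count analysis, base R's `poisson.test()` can compare two event rates directly. The event counts and follow-up times below are hypothetical.

```r
# Exact comparison of two event rates with sparse counts.
# The counts and follow-up times are hypothetical.
events      <- c(3, 9)     # adverse events: intervention, control
person_time <- c(20, 20)   # person-months of follow-up in each arm

rate_comparison <- poisson.test(events, person_time)

rate_ratio <- unname(rate_comparison$estimate)  # intervention rate / control rate
print(rate_ratio)
print(rate_comparison$conf.int)   # exact 95% CI for the rate ratio
print(rate_comparison$p.value)
```

With so few events the interval for the rate ratio will be wide, which is the "careful interpretation" the paragraph above calls for.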

Outcome Selection Decision Guide

Figure 2.1 turns outcome selection into a sequence of questions. It starts with the construct itself, then separates continuous, count, ordinal, and binary outcomes in the order that preserves the most defensible information from small samples.

Figure 2.1: Outcome selection guide for small-sample studies.

Read Figure 2.1 from top to bottom. The ambiguous cases usually arise between ordinal and binary coding, or between count outcomes and simpler event/no-event summaries. The guiding principle is to keep the scale that preserves the most defensible information. Binary coding is still appropriate when the construct genuinely has only two meaningful states, or when the substantive decision itself is binary. After choosing the outcome family, explain why that scale balances information, measurement burden, and feasibility for the study.
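
The question sequence in Figure 2.1 can also be written down as a small function, which some readers find easier to audit than a diagram. This is only a sketch: the function and argument names are hypothetical, and the returned text paraphrases the terminal boxes of the figure.

```r
# Sketch of the Figure 2.1 question sequence as a function.
# The function and argument names are hypothetical.
choose_outcome_family <- function(approx_continuous, repeatable_count, ordered_levels) {
  if (approx_continuous) {
    "continuous: report a mean or median with a confidence interval"
  } else if (repeatable_count) {
    "count: Poisson, negative binomial, or exact count methods"
  } else if (ordered_levels) {
    "ordinal: rank-based or ordinal models"
  } else {
    "binary: exact or penalised methods"
  }
}

# Example: a symptom rating with five ordered categories.
print(choose_outcome_family(approx_continuous = FALSE,
                            repeatable_count  = FALSE,
                            ordered_levels    = TRUE))
```

The order of the checks mirrors the figure: scales that retain more information are considered before binary coding.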

Effect Sizes and Estimation

In small-sample research, effect size estimates (differences in means, odds ratios, correlation coefficients) reported with confidence intervals are often more useful than p-values alone. Even when a small sample has limited power, the point estimate indicates the likely magnitude of the effect, and the width of its confidence interval shows how precisely that magnitude has been estimated.

A non-significant result in a small study does not by itself imply that the effect is trivial or absent. It may simply indicate that the data are not precise enough to distinguish a moderate effect from zero with confidence.

When reporting results, emphasise effect sizes and uncertainty intervals. For example, “The mean difference in satisfaction scores was 1.5 points (95% CI: 0.5 to 2.5)” is more informative than “The difference was statistically significant (p = 0.03)”. Effect size estimates help readers judge practical importance and facilitate meta-analysis or future sample size planning. When the sample size is fixed in advance, it is also useful to report the minimum detectable effect: the smallest effect your study would be well-positioned to detect under the planned design.
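
A reporting line of this kind can be assembled directly from a standard test object. The sketch below uses made-up satisfaction scores for two hypothetical programmes, so the numbers it prints are not those quoted above.

```r
# Reporting an estimate with its interval rather than a p-value alone.
# The satisfaction scores are made-up illustrative data.
programme_a <- c(6, 7, 5, 8, 7, 6, 9, 7)
programme_b <- c(5, 6, 4, 6, 5, 7, 5, 6)

comparison <- t.test(programme_a, programme_b)   # Welch two-sample t-test

mean_difference <- mean(programme_a) - mean(programme_b)
ci <- comparison$conf.int

cat(sprintf(
  "Mean difference = %.2f points (95%% CI: %.2f to %.2f), p = %.3f\n",
  mean_difference, ci[1], ci[2], comparison$p.value
))
```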

For example, if a two-group study is limited to 15 participants per group and targets 80% power under a two-sided $\alpha = 0.05$, the minimum detectable standardised effect is approximately d = 1.06. That means the study is only sensitive to very large differences. Under the same two-sided $\alpha = 0.05$ and 80% power assumptions, detecting a small effect such as d = 0.2 would require about 393 participants per group. Thinking in terms of minimum detectable effects helps researchers decide whether a question is realistically answerable with the sample size they can obtain.
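
These figures can be checked with base R's `power.t.test()`, which solves for whichever quantity is left unspecified. The sketch below assumes equal group sizes and the function's default `sd = 1`, so that `delta` is on the standardised scale.

```r
# Minimum detectable effect and required sample size for a two-sample t-test.
# With the default sd = 1, delta is expressed in standard deviation units.
mde_15 <- power.t.test(n = 15, power = 0.80, sig.level = 0.05,
                       type = "two.sample", alternative = "two.sided")$delta

n_small_effect <- power.t.test(delta = 0.2, power = 0.80, sig.level = 0.05,
                               type = "two.sample", alternative = "two.sided")$n

print(round(mde_15, 2))          # minimum detectable d with 15 per group
print(ceiling(n_small_effect))   # per-group n for d = 0.2, rounded up
```

The solver returns a fractional sample size of roughly 393.4 for the small effect, so a real design would round up to the next whole participant per group.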

As a practical planning check, suppose your budget allows n = 20 per group and the planned analysis is a two-sample t-test with two-sided $\alpha = 0.05$ and 80% power. Such a design is well positioned to detect only effects of about d = 0.91 or larger. The next question is substantive rather than computational: would a difference of nearly one pooled standard deviation be the smallest effect worth detecting in your field? If the answer is no, the honest options are to narrow the question, improve measurement precision, use a more efficient paired or stratified design, or frame the study as exploratory rather than confirmatory.
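
When several budgets are under discussion, the same check can be run across candidate sample sizes at once. The sketch below tabulates the minimum detectable standardised effect for a few illustrative per-group sizes under the same two-sided 0.05 level and 80% power.

```r
# Minimum detectable standardised effect across candidate per-group sample sizes
# (two-sample t-test, two-sided alpha = 0.05, 80% power).
candidate_n <- c(10, 15, 20, 30, 50)   # illustrative budget scenarios

mde <- sapply(candidate_n, function(n) {
  power.t.test(n = n, power = 0.80, sig.level = 0.05)$delta
})

planning_table <- data.frame(
  n_per_group          = candidate_n,
  minimum_detectable_d = round(mde, 2)
)
print(planning_table)
```

Reading down the table shows how slowly the detectable effect shrinks as n grows, which helps decide whether a modest increase in recruitment is worth its cost.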

Example: Outcome Selection in a Pilot Study

Suppose you are evaluating a pilot training programme with 18 participants. You have two outcome options: (1) binary pass/fail on a final assessment, or (2) a continuous score (0–100) on the same assessment.

With the continuous score, the sample mean is 69.9 points and the standard deviation is 8.4, with a 95% confidence interval from 65.7 to 74.0. If we dichotomise the same data, the pass rate is 17 out of 18, or 94.4%, with an exact 95% confidence interval from 72.7% to 99.9%. The binary summary still gives a pass rate and its uncertainty, but it no longer shows how far above or below the threshold participants scored.

Interpretation

The continuous outcome lets us estimate average performance and quantify uncertainty directly. If the goal is to understand typical performance rather than only whether participants crossed a cut-point, the continuous measure is more informative.
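
Both intervals in this example can be approximately rebuilt from the rounded summaries quoted in the text. Because the inputs here are rounded, the last digit may differ slightly from the values computed on the raw scores.

```r
# Rebuilding the two interval estimates from the rounded summaries in the text.
# Rounded inputs mean the final digit can differ slightly from the raw-data result.
n          <- 18
mean_score <- 69.9
sd_score   <- 8.4
n_pass     <- 17

# t-based 95% CI for the mean score.
margin  <- qt(0.975, df = n - 1) * sd_score / sqrt(n)
mean_ci <- mean_score + c(-1, 1) * margin

# Exact (Clopper-Pearson) 95% CI for the pass rate.
pass_ci <- binom.test(n_pass, n)$conf.int

print(round(mean_ci, 1))
print(round(100 * pass_ci, 1))
```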

Research Design Considerations

Small-sample studies benefit from tight experimental control. Paired or matched designs (before–after, crossover, matched-pair comparisons) reduce variability by comparing each unit to itself or a closely matched control. This within-unit comparison can yield precise inferences even when the number of units is small.
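
The gain from pairing can be sketched by comparing minimum detectable differences under the two designs. The outcome SD and the correlation between paired measurements below are assumed values; the advantage depends heavily on that correlation and largely disappears when it is weak.

```r
# Minimum detectable raw difference: independent groups versus a paired design.
# outcome_sd and rho are assumed values chosen only for illustration.
outcome_sd <- 1     # SD of the outcome in each condition
rho        <- 0.7   # assumed correlation between paired measurements
n_units    <- 15

# SD of within-pair differences when both conditions share the same SD.
sd_diff <- outcome_sd * sqrt(2 * (1 - rho))

mde_independent <- power.t.test(n = n_units, sd = outcome_sd, power = 0.80,
                                type = "two.sample")$delta   # 15 per group
mde_paired <- power.t.test(n = n_units, sd = sd_diff, power = 0.80,
                           type = "paired")$delta            # 15 pairs

print(round(c(independent = mde_independent, paired = mde_paired), 2))
```

Note that the paired design here uses 15 units measured twice, whereas the independent design uses 30 units measured once.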

Stratification and blocking can also improve efficiency by accounting for known sources of variation. For example, if you are comparing two teaching methods in a small class, stratify by prior achievement level to reduce heterogeneity within each comparison.
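
One simple way to implement this is to randomise separately within each stratum. The sketch below uses a hypothetical roster of 12 students and method labels "A" and "B"; it is one possible allocation scheme, not a prescribed procedure.

```r
# Stratified (blocked) assignment of two teaching methods within
# prior-achievement strata. The roster is hypothetical.
set.seed(2024)
students <- data.frame(
  id    = 1:12,
  prior = rep(c("low", "middle", "high"), each = 4)
)

students$method <- NA_character_
for (rows in split(seq_len(nrow(students)), students$prior)) {
  # Within each stratum, assign half to each method in random order.
  students$method[rows] <- sample(rep(c("A", "B"), length.out = length(rows)))
}

allocation <- table(students$prior, students$method)
print(allocation)
```

Every stratum ends up with the same number of students in each method, so prior achievement cannot be unbalanced between the groups by chance.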

Finally, consider sequential or adaptive designs if feasible. Rather than committing to a fixed sample size in advance, you might prespecify an interim review to decide whether recruitment is working as planned, whether variance estimates are much larger than expected, or whether the study should stop early because the signal is already clear. Bayesian methods are well-suited to this style of design because posterior distributions update naturally as data accumulate: the posterior after the first wave becomes the evidence base that is updated when the next wave arrives. These designs still require advance decision rules and transparent reporting so that flexibility remains planned rather than ad hoc.
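
The idea that one wave's posterior becomes the starting point for the next can be sketched with a conjugate Beta-binomial model for a success proportion. The flat prior and both waves of data are hypothetical, and the example illustrates updating only; it does not address stopping rules.

```r
# Sequential Bayesian updating for a success proportion with a Beta prior.
# The prior and both waves of data are hypothetical.
prior <- c(a = 1, b = 1)   # Beta(1, 1): flat prior

update_beta <- function(params, successes, failures) {
  c(a = params[["a"]] + successes, b = params[["b"]] + failures)
}

after_wave1 <- update_beta(prior, successes = 6, failures = 4)
after_wave2 <- update_beta(after_wave1, successes = 9, failures = 3)

# Updating once with the pooled data gives the same posterior.
pooled <- update_beta(prior, successes = 15, failures = 7)

# 95% credible interval for the proportion after both waves.
credible_interval <- qbeta(c(0.025, 0.975),
                           after_wave2[["a"]], after_wave2[["b"]])
print(after_wave2)
print(round(credible_interval, 2))
```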

Designing Pilot Studies

Pilot studies serve specific purposes: assessing feasibility (recruitment rates, attrition, protocol adherence), refining measurement instruments, and estimating variability to inform future sample size calculations. With very small n (often 10–30 participants), focus on collecting process metrics and precision estimates rather than hypothesis testing. Report:

Primary feasibility outcomes (e.g., proportion screened who consent, time to complete assessments).
Preliminary effect estimates with wide confidence intervals, making clear that they are exploratory.
Adaptations for the main study, especially where procedures proved onerous or data quality issues emerged; describe what was changed and why so that reviewers can see how the pilot informed the main design.
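
Because pilots are often used to estimate variability for later sample size calculations, it helps to see how imprecise a pilot standard deviation can be. The sketch below gives a chi-square-based 95% interval for the SD, expressed as multiples of the observed value; it assumes roughly normal data, and the pilot sizes are illustrative.

```r
# How precisely does a pilot estimate the outcome SD?
# Chi-square-based 95% CI for the SD, assuming roughly normal data,
# expressed as multiples of the observed SD.
pilot_n <- c(12, 20, 30)   # illustrative per-arm pilot sizes
df <- pilot_n - 1

sd_ci <- data.frame(
  n           = pilot_n,
  lower_ratio = round(sqrt(df / qchisq(0.975, df)), 2),
  upper_ratio = round(sqrt(df / qchisq(0.025, df)), 2)
)
print(sd_ci)
```

With 12 participants, the true SD could plausibly lie anywhere from about 0.7 to 1.7 times the pilot estimate, so sample size calculations built on a pilot SD deserve caution.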

In planning terms, choose a pilot sample large enough to detect major logistical problems; 12–20 per arm is often sufficient for estimating key feasibility parameters, though not for testing efficacy. Prespecify success criteria such as an acceptable recruitment rate, and plan in advance how you will decide whether to proceed to a full trial (Teare et al. 2014).
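
A prespecified success criterion can be checked against pilot data with an exact interval for the observed rate. In the sketch below the counts, the 50% threshold, and the three-way decision rule are all hypothetical; a real study would define its own progression criteria in the protocol.

```r
# Checking a prespecified feasibility criterion against pilot data.
# The counts, threshold, and decision rule are hypothetical.
n_screened          <- 40
n_consented         <- 26
min_acceptable_rate <- 0.50   # prespecified progression criterion

consent      <- binom.test(n_consented, n_screened)
consent_rate <- unname(consent$estimate)
consent_ci   <- consent$conf.int   # exact 95% CI for the consent rate

decision <- if (consent_ci[1] >= min_acceptable_rate) {
  "proceed"
} else if (consent_rate >= min_acceptable_rate) {
  "proceed with caution: interval includes rates below the criterion"
} else {
  "revise recruitment before the main trial"
}

print(round(c(rate = consent_rate, lower = consent_ci[1], upper = consent_ci[2]), 3))
print(decision)
```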

Key Takeaways

Small-sample studies work best when the question is narrow, the design is realistic, and the outcome preserves as much defensible information as possible. Exploratory aims, continuous or ordinal measures, and efficient designs such as paired or stratified comparisons often make limited data more informative than an overly ambitious confirmatory plan would. When reporting, present effect sizes and confidence intervals alongside power or feasibility considerations so readers can judge what the study could genuinely show.

```{r}
#| include: false
suppressPackageStartupMessages({
  library(dplyr)
  library(ggplot2)
  library(tibble)
})
```

# Chapter 2: Questions and Outcomes that Fit Small n

Small-sample studies are most effective when the research question, outcome, and design are chosen to match the information the data can realistically support. This chapter shows how to narrow broad ideas into answerable questions, how outcome scales affect what can be learned from modest samples, and why estimation often matters more than a binary significant or non-significant result. The goal is practical: to help you design studies that extract the most from limited data and communicate findings at a level the evidence can honestly support.

### Learning Objectives

By the end of this chapter, you will be able to distinguish exploratory from confirmatory aims, formulate focused research questions that fit limited data, choose outcome measures that preserve useful information with small *n*, and calculate the minimum detectable effects implied by a realistic design.

### Framing Realistic Research Questions

Small-sample studies work best when the research question is narrow. Questions that ask the data to estimate many predictors, interactions, mediation paths, or measurement parameters such as factor loadings for a long questionnaire usually require far more observations. Focused questions about a single outcome or a few key comparisons are much more realistic with modest samples.

When planning a small-sample study, prioritise clarity and specificity. A question such as "Does a brief reminder intervention improve adherence compared to standard care?" is focused, has a clear comparison, and can be tested in a small randomised trial. A question like "What are all the factors that influence patient adherence?" spreads the available information across too many unknowns.

Similarly, consider whether the study is exploratory or confirmatory.
Exploratory studies generate hypotheses, describe patterns, and refine measurement instruments. They can be useful with modest samples, provided the findings are framed as provisional and replication is expected. Because exploratory work often examines several patterns at once, apparent findings may reflect chance, especially when researchers inspect several outcomes or subgroup patterns without adjustment. Confirmatory studies test prespecified hypotheses and therefore require enough power to support that stronger claim. With small samples, confirmatory aims should be modest and carefully justified.

### From Objective to Hypothesis

A small-sample study benefits from a clear hierarchy. In a confirmatory design, that usually means one primary objective, one primary research question, and one primary hypothesis. In an exploratory or pilot design, the hypothesis is often replaced with an estimation or feasibility objective because the data cannot support several formal confirmatory claims well. The objective states what the study is trying to learn, the research question identifies the population, comparison, outcome, and timeframe, and the hypothesis states the expected difference or association on that specific outcome.

For example, a focused small-sample study might use the following sequence:

- **Objective:** To assess whether a brief reminder intervention improves medication adherence over four weeks compared with standard care.
- **Research question:** Among adults attending a primary-care clinic, do participants receiving the reminder intervention have higher four-week adherence scores than those receiving standard care?
- **Confirmatory hypothesis:** Participants assigned to the reminder intervention will have higher mean adherence scores at four weeks than participants assigned to standard care.
- **Exploratory objective:** To estimate the difference in adherence score between groups and assess recruitment, retention, and intervention uptake in preparation for a larger confirmatory trial.

When writing hypotheses for small-sample studies, keep them narrow and defensible. A good small-sample hypothesis names one primary outcome, one main comparison, and a plausible expected pattern. Directional hypotheses are best reserved for situations where prior theory or evidence is strong enough to justify specifying the direction in advance. Avoid omnibus statements that bundle several outcomes, subgroup effects, mediators, and interactions into a single claim. Avoid phrasing hypotheses around achieving statistical significance. The hypothesis should describe the expected substantive pattern, while the analysis later evaluates its uncertainty.

### Choosing Appropriate Outcomes

The type of outcome variable influences which methods are feasible and how much information can be extracted from limited data. Binary outcomes (yes/no, success/failure) are common but carry less information per observation than continuous or ordinal measures. If your sample is small, consider whether a continuous or ordinal outcome might capture more variation and yield more precise inferences.

For example, rather than dichotomising patient improvement into "improved" versus "not improved", use a continuous measure of symptom severity or an ordinal scale with several levels. This preserves information and increases statistical efficiency. When the outcome is inherently binary, such as survival within 30 days, keep it in that form.

Count outcomes (number of adverse events, number of customer complaints) are also informative but may be sparse when samples are small. Exact Poisson tests and negative binomial models can handle low counts, but very sparse data (many zeros, few events) may require careful interpretation or resampling methods.

#### Outcome Selection Decision Guide Figure 2.1 turns outcome selection into a sequence of questions. It starts with the construct itself, then separates continuous, count, ordinal, and binary outcomes in the order that preserves the most defensible information from small samples. ```{r} #| label: qfig-ch2-outcome-selection-guide #| echo: false #| fig-cap: "Figure 2.1: Outcome selection guide for small-sample studies." #| alt: "Outcome selection guide for small-sample studies." #| fig-width: 12 #| fig-height: 10 #| out-width: "100%" library(grid) wrap_label <- function(text, width) { paste(strwrap(text, width = width), collapse = "\n") } pick_flowchart_font <- function() { if (knitr::is_latex_output()) { return("Times") } if (requireNamespace("systemfonts", quietly = TRUE)) { available_families <- unique(systemfonts::system_fonts()$family) for (candidate in c("Georgia", "Times New Roman", "Liberation Serif", "DejaVu Serif")) { if (candidate %in% available_families) { return(candidate) } } } "serif" } flowchart_font <- pick_flowchart_font() if (.Platform$OS.type == "windows" && identical(flowchart_font, "Times New Roman")) { grDevices::windowsFonts(`Times New Roman` = grDevices::windowsFont("Times New Roman")) } nodes <- tibble( id = c("start", "q1", "q2", "q3", "cont", "count", "ordinal", "binary", "document"), type = c("start", "question", "question", "question", "terminal", "terminal", "terminal", "terminal", "document"), x = c(0, 0, 0, 0, -7.5, -2.5, 2.5, 7.5, 0), y = c(11.2, 8.5, 5.6, 2.7, -1.0, -1.0, -1.0, -1.0, -4.2), label = c( wrap_label("Start with the construct of interest", 28), wrap_label("Can it be measured on an approximately continuous numeric scale?", 30), wrap_label("Is it a count of events that can exceed one per subject?", 30), wrap_label("Can respondents make ordered distinctions beyond yes/no?", 30), "Choose a continuous outcome\nReport mean or median\nwith a confidence interval", "Choose a count outcome\nUse Poisson, negative binomial,\nor 
exact count methods", "Choose an ordinal outcome\nUse rank-based or\nordinal models", "Choose a binary outcome\nUse exact or penalised\nmethods", wrap_label("Document why this scale balances information, burden, and feasibility", 40) ), fill = c("#F3F3F3", "#E7EEF8", "#E7EEF8", "#E7EEF8", "#FFFFFF", "#FFFFFF", "#FFFFFF", "#FFFFFF", "#F3F3F3") ) decision_segments <- tribble( ~x, ~y, ~xend, ~yend, ~label, 0.0, 10.4, 0.0, 9.3, "", 0.0, 7.4, 0.0, 6.0, "", 0.0, 6.0, -7.5, 6.0, "Yes", -7.5, 6.0, -7.5, -0.2, "", 0.0, 7.4, 0.0, 6.4, "No", 0.0, 4.5, 0.0, 3.7, "", 0.0, 3.7, -2.5, 3.7, "Yes", -2.5, 3.7, -2.5, -0.2, "", 0.0, 4.5, 0.0, 3.5, "No", 0.0, 1.6, 0.0, 0.8, "", 0.0, 0.8, 2.5, 0.8, "Yes", 2.5, 0.8, 2.5, -0.2, "", 0.0, 0.8, 7.5, 0.8, "No", 7.5, 0.8, 7.5, -0.2, "" ) collector_segments <- tribble( ~x, ~y, ~xend, ~yend, ~arrowed, -7.5, -1.85, -7.5, -2.0, FALSE, -2.5, -1.85, -2.5, -2.0, FALSE, 2.5, -1.85, 2.5, -2.0, FALSE, 7.5, -1.85, 7.5, -2.0, FALSE, -7.5, -2.0, 7.5, -2.0, FALSE, 0.0, -2.0, 0.0, -3.35, TRUE ) edge_labels <- decision_segments %>% dplyr::filter(label != "") %>% dplyr::mutate( label_x = dplyr::if_else(abs(y - yend) < 1e-9, (x + xend) / 2, x + 0.35), label_y = dplyr::if_else(abs(x - xend) < 1e-9, (y + yend) / 2, y + 0.35) ) ggplot() + geom_segment( data = decision_segments, aes(x = x, y = y, xend = xend, yend = yend), linewidth = 0.55, colour = "#4A4A4A", arrow = arrow(length = unit(0.14, "inches"), type = "closed") ) + geom_segment( data = dplyr::filter(collector_segments, !arrowed), aes(x = x, y = y, xend = xend, yend = yend), linewidth = 0.55, colour = "#4A4A4A" ) + geom_segment( data = dplyr::filter(collector_segments, arrowed), aes(x = x, y = y, xend = xend, yend = yend), linewidth = 0.55, colour = "#4A4A4A", arrow = arrow(length = unit(0.14, "inches"), type = "closed") ) + geom_text( data = edge_labels, aes(x = label_x, y = label_y, label = label), family = flowchart_font, size = 4.6, colour = "#333333" ) + geom_label( data = dplyr::filter(nodes, type 
%in% c("start", "question")), aes(x = x, y = y, label = label, fill = fill), label.size = 0.45, label.padding = unit(0.42, "lines"), label.r = unit(0.18, "lines"), family = flowchart_font, size = 4.9, lineheight = 1.08, colour = "#111111" ) + geom_label( data = dplyr::filter(nodes, type == "terminal"), aes(x = x, y = y, label = label, fill = fill), label.size = 0.45, label.padding = unit(0.32, "lines"), label.r = unit(0.18, "lines"), family = flowchart_font, size = 4.2, lineheight = 1.08, colour = "#111111" ) + geom_label( data = dplyr::filter(nodes, type == "document"), aes(x = x, y = y, label = label, fill = fill), label.size = 0.45, label.padding = unit(0.32, "lines"), label.r = unit(0.18, "lines"), family = flowchart_font, size = 4.3, lineheight = 1.08, colour = "#111111" ) + scale_fill_identity() + coord_cartesian( xlim = c(-10, 10), ylim = c(-5.3, 12.6), clip = "off" ) + theme_void() + theme( plot.margin = margin(10, 18, 10, 18), plot.caption = element_text(family = flowchart_font, size = 10, hjust = 0) ) ``` Read Figure 2.1 from top to bottom. The ambiguous cases usually arise between ordinal and binary coding, or between count outcomes and simpler event/no-event summaries. The guiding principle is to keep the scale that preserves the most defensible information. Binary coding is still appropriate when the construct genuinely has only two meaningful states, or when the substantive decision itself is binary. After choosing the outcome family, explain why that scale balances information, measurement burden, and feasibility for the study. ### Effect Sizes and Estimation In small-sample research, point estimates of effect sizes (differences in means, odds ratios, correlation coefficients) are often more useful than p-values alone. Even when a small sample has limited power, the estimated effect size and its confidence interval indicate the likely magnitude and precision of the effect. 
A non-significant result in a small study does not by itself imply that the effect is trivial or absent. It may simply indicate that the data are not precise enough to distinguish a moderate effect from zero with confidence. When reporting results, emphasise effect sizes and uncertainty intervals. For example, "The mean difference in satisfaction scores was 1.5 points (95% CI: 0.5 to 2.5)" is more informative than "The difference was statistically significant (p = 0.03)". Effect size estimates help readers judge practical importance and facilitate meta-analysis or future sample size planning. When the sample size is fixed in advance, it is also useful to report the minimum detectable effect: the smallest effect your study would be well-positioned to detect under the planned design. For example, if a two-group study is limited to 15 participants per group and targets 80% power under a two-sided $\alpha = 0.05$, the minimum detectable standardised effect is approximately *d* = 1.06. That means the study is only sensitive to very large differences. Under the same two-sided $\alpha = 0.05$ and 80% power assumptions, detecting a small effect such as *d* = 0.2 would require about 393 participants per group. Thinking in terms of minimum detectable effects helps researchers decide whether a question is realistically answerable with the sample size they can obtain. As a practical planning check, suppose your budget allows *n* = 20 per group and the planned analysis is a two-sample t-test with two-sided $\alpha = 0.05$ and 80% power. The design is only well positioned to detect about *d* = 0.91 or larger. The next question is substantive rather than computational: would a difference of nearly one pooled standard deviation be the smallest effect worth detecting in your field? 
If the answer is no, the honest options are to narrow the question, improve measurement precision, use a more efficient paired or stratified design, or frame the study as exploratory rather than confirmatory. ### Example: Outcome Selection in a Pilot Study Suppose you are evaluating a pilot training programme with 18 participants. You have two outcome options: (1) binary pass/fail on a final assessment, or (2) a continuous score (0–100) on the same assessment. ```{r} #| label: pilot-outcome-data #| include: false set.seed(2025) n <- 18 scores <- round(rnorm(n, mean = 68, sd = 12)) scores <- pmax(0, pmin(100, scores)) pass_fail <- ifelse(scores >= 60, "Pass", "Fail") data_pilot <- tibble( participant = 1:n, score = scores, outcome = pass_fail ) pilot_mean <- mean(data_pilot$score) pilot_sd <- sd(data_pilot$score) pilot_se <- pilot_sd / sqrt(n) pilot_t_crit <- qt(0.975, df = n - 1) pilot_ci_lower <- pilot_mean - pilot_t_crit * pilot_se pilot_ci_upper <- pilot_mean + pilot_t_crit * pilot_se pilot_pass_n <- sum(data_pilot$outcome == "Pass") pilot_pass_rate <- pilot_pass_n / n pilot_pass_ci <- binom.test(pilot_pass_n, n)$conf.int pilot_summary_table <- tibble( Representation = c("Continuous score", "Pass/fail"), Estimate = c( sprintf("Mean = %.1f; SD = %.1f", pilot_mean, pilot_sd), sprintf("%d/%d passed (%.1f%%)", pilot_pass_n, n, 100 * pilot_pass_rate) ), `Uncertainty / detail` = c( sprintf("95%% CI for mean: %.1f to %.1f", pilot_ci_lower, pilot_ci_upper), sprintf("Exact 95%% CI for pass rate: %.1f%% to %.1f%%", 100 * pilot_pass_ci[1], 100 * pilot_pass_ci[2]) ) ) ``` :::: {.content-visible when-format="html"} ::::: {.panel-tabset group="part-a-foundations-chapter-2-questions-and-outcomes-that-fit-small-n-cell-1"} #### Output ```{webr-r} #| label: pilot-outcome-simulation-html #| context: output library(dplyr) library(tibble) library(knitr) library(htmltools) set.seed(2025) # Inline values quoted in the prose are computed in a separate hidden chunk using the same seed. 
n <- 18 scores <- round(rnorm(n, mean = 68, sd = 12)) scores <- pmax(0, pmin(100, scores)) pass_fail <- ifelse(scores >= 60, "Pass", "Fail") data_pilot <- tibble( participant = 1:n, score = scores, outcome = pass_fail ) mean_score <- mean(data_pilot$score) sd_score <- sd(data_pilot$score) se_score <- sd_score / sqrt(n) t_crit <- qt(0.975, df = n - 1) ci_lower <- mean_score - t_crit * se_score ci_upper <- mean_score + t_crit * se_score pass_n <- sum(data_pilot$outcome == "Pass") pass_rate <- pass_n / n pass_ci <- binom.test(pass_n, n)$conf.int pilot_summary_table <- tibble( Representation = c("Continuous score", "Pass/fail"), Estimate = c( sprintf("Mean = %.1f; SD = %.1f", mean_score, sd_score), sprintf("%d/%d passed (%.1f%%)", pass_n, n, 100 * pass_rate) ), `Uncertainty / detail` = c( sprintf("95%% CI for mean: %.1f to %.1f", ci_lower, ci_upper), sprintf("Exact 95%% CI for pass rate: %.1f%% to %.1f%%", 100 * pass_ci[1], 100 * pass_ci[2]) ) ) pilot_summary_view <- tagList( tags$style(HTML(" .pilot-apa-table-block { font-family: 'Times New Roman', Georgia, serif; color: #111; max-width: 52rem; } .pilot-apa-table-caption { margin: 0 0 0.08rem 0; text-align: center; font-size: 1rem; font-weight: 700; } .pilot-apa-table-title { margin: 0 0 0.6rem 0; text-align: center; font-size: 1rem; font-style: italic; } .pilot-apa-table { width: 100%; border-collapse: collapse; font-size: 1rem; line-height: 1.3; } .pilot-apa-table th, .pilot-apa-table td { padding: 0.28rem 0.45rem; border-left: none !important; border-right: none !important; background: transparent !important; text-align: left; vertical-align: top; } .pilot-apa-table thead th { border-top: 2px solid #000; border-bottom: 1px solid #000; font-weight: 400; } .pilot-apa-table tbody tr:last-child td { border-bottom: 2px solid #000; } .pilot-apa-table-note { margin: 0.55rem 0 0 0; font-size: 0.98rem; line-height: 1.3; } ")), tags$div( class = "pilot-apa-table-block", tags$p( class = "pilot-apa-table-caption", "Table 2.1" 
), tags$p( class = "pilot-apa-table-title", "Information retained under two outcome representations." ), HTML( knitr::kable( pilot_summary_table, format = "html", align = c("l", "l", "l"), table.attr = "class='pilot-apa-table'" ) ), tags$p( class = "pilot-apa-table-note", HTML("<em>Note.</em>"), " The pass/fail summary still estimates a pass rate and its uncertainty, but it no longer shows how far above or below the threshold each participant scored." ) ) ) html_print(pilot_summary_view) ``` #### R Code ```{webr-r} #| label: pilot-outcome-simulation-code #| context: interactive library(dplyr) library(tibble) library(knitr) library(htmltools) set.seed(2025) # Inline values quoted in the prose are computed in a separate hidden chunk using the same seed. n <- 18 scores <- round(rnorm(n, mean = 68, sd = 12)) scores <- pmax(0, pmin(100, scores)) pass_fail <- ifelse(scores >= 60, "Pass", "Fail") data_pilot <- tibble( participant = 1:n, score = scores, outcome = pass_fail ) mean_score <- mean(data_pilot$score) sd_score <- sd(data_pilot$score) se_score <- sd_score / sqrt(n) t_crit <- qt(0.975, df = n - 1) ci_lower <- mean_score - t_crit * se_score ci_upper <- mean_score + t_crit * se_score pass_n <- sum(data_pilot$outcome == "Pass") pass_rate <- pass_n / n pass_ci <- binom.test(pass_n, n)$conf.int pilot_summary_table <- tibble( Representation = c("Continuous score", "Pass/fail"), Estimate = c( sprintf("Mean = %.1f; SD = %.1f", mean_score, sd_score), sprintf("%d/%d passed (%.1f%%)", pass_n, n, 100 * pass_rate) ), `Uncertainty / detail` = c( sprintf("95%% CI for mean: %.1f to %.1f", ci_lower, ci_upper), sprintf("Exact 95%% CI for pass rate: %.1f%% to %.1f%%", 100 * pass_ci[1], 100 * pass_ci[2]) ) ) pilot_summary_view <- tagList( tags$style(HTML(" .pilot-apa-table-block { font-family: 'Times New Roman', Georgia, serif; color: #111; max-width: 52rem; } .pilot-apa-table-caption { margin: 0 0 0.08rem 0; text-align: center; font-size: 1rem; font-weight: 700; } 
    .pilot-apa-table-title {
      margin: 0 0 0.6rem 0;
      text-align: center;
      font-size: 1rem;
      font-style: italic;
    }
    .pilot-apa-table {
      width: 100%;
      border-collapse: collapse;
      font-size: 1rem;
      line-height: 1.3;
    }
    .pilot-apa-table th,
    .pilot-apa-table td {
      padding: 0.28rem 0.45rem;
      border-left: none !important;
      border-right: none !important;
      background: transparent !important;
      text-align: left;
      vertical-align: top;
    }
    .pilot-apa-table thead th {
      border-top: 2px solid #000;
      border-bottom: 1px solid #000;
      font-weight: 400;
    }
    .pilot-apa-table tbody tr:last-child td {
      border-bottom: 2px solid #000;
    }
    .pilot-apa-table-note {
      margin: 0.55rem 0 0 0;
      font-size: 0.98rem;
      line-height: 1.3;
    }
  ")),
  tags$div(
    class = "pilot-apa-table-block",
    tags$p(
      class = "pilot-apa-table-caption",
      "Table 2.1"
    ),
    tags$p(
      class = "pilot-apa-table-title",
      "Information retained under two outcome representations."
    ),
    HTML(
      knitr::kable(
        pilot_summary_table,
        format = "html",
        align = c("l", "l", "l"),
        table.attr = "class='pilot-apa-table'"
      )
    ),
    tags$p(
      class = "pilot-apa-table-note",
      HTML("<em>Note.</em>"),
      " The pass/fail summary still estimates a pass rate and its uncertainty, but it no longer shows how far above or below the threshold each participant scored."
    )
  )
)

html_print(pilot_summary_view)
```

:::::

::::

:::: {.content-visible unless-format="html"}

```{r}
#| label: pilot-outcome-simulation-print
#| results: asis
#| tbl-cap: "Information retained under two outcome representations."
library(dplyr)
library(tibble)

set.seed(2025)
# Inline values quoted in the prose are computed in a separate hidden chunk using the same seed.
n <- 18
scores <- round(rnorm(n, mean = 68, sd = 12))
scores <- pmax(0, pmin(100, scores))
pass_fail <- ifelse(scores >= 60, "Pass", "Fail")

data_pilot <- tibble(
  participant = 1:n,
  score = scores,
  outcome = pass_fail
)

mean_score <- mean(data_pilot$score)
sd_score <- sd(data_pilot$score)
se_score <- sd_score / sqrt(n)
t_crit <- qt(0.975, df = n - 1)
ci_lower <- mean_score - t_crit * se_score
ci_upper <- mean_score + t_crit * se_score

pass_n <- sum(data_pilot$outcome == "Pass")
pass_rate <- pass_n / n
pass_ci <- binom.test(pass_n, n)$conf.int

pilot_summary_table <- tibble(
  Representation = c("Continuous score", "Pass/fail"),
  Estimate = c(
    sprintf("Mean = %.1f; SD = %.1f", mean_score, sd_score),
    sprintf("%d/%d passed (%.1f%%)", pass_n, n, 100 * pass_rate)
  ),
  `Uncertainty / detail` = c(
    sprintf("95%% CI for mean: %.1f to %.1f", ci_lower, ci_upper),
    sprintf("Exact 95%% CI for pass rate: %.1f%% to %.1f%%", 100 * pass_ci[1], 100 * pass_ci[2])
  )
)

knitr::kable(
  pilot_summary_table,
  align = c("l", "l", "l"),
  booktabs = TRUE
)
```

::::

With the continuous score, the sample mean is `r sprintf("%.1f", pilot_mean)` points and the standard deviation is `r sprintf("%.1f", pilot_sd)`, with a 95% confidence interval from `r sprintf("%.1f", pilot_ci_lower)` to `r sprintf("%.1f", pilot_ci_upper)`. If we dichotomise the same data, the pass rate is `r pilot_pass_n` out of `r n`, or `r sprintf("%.1f", 100 * pilot_pass_rate)`%, with an exact 95% confidence interval from `r sprintf("%.1f", 100 * pilot_pass_ci[1])`% to `r sprintf("%.1f", 100 * pilot_pass_ci[2])`%. The binary summary still gives a pass rate and its uncertainty, but it no longer shows how far above or below the threshold participants scored.

::: {.callout-note appearance="simple" icon=false}
## Interpretation

The continuous outcome lets us estimate average performance and quantify uncertainty directly.
If the goal is to understand typical performance rather than only whether participants crossed a cut-point, the continuous measure is more informative.
:::

### Research Design Considerations

Small-sample studies benefit from tight experimental control. Paired or matched designs (before–after, crossover, matched-pair comparisons) reduce variability by comparing each unit to itself or a closely matched control. This within-unit comparison can yield precise inferences even when the number of units is small.

Stratification and blocking can also improve efficiency by accounting for known sources of variation. For example, if you are comparing two teaching methods in a small class, stratify by prior achievement level to reduce heterogeneity within each comparison.

Finally, consider sequential or adaptive designs if feasible. Rather than committing to a fixed sample size in advance, you might prespecify an interim review to decide whether recruitment is working as planned, whether variance estimates are much larger than expected, or whether the study should stop early because the signal is already clear. Bayesian methods are well suited to this style of design because posterior distributions update naturally as data accumulate: the posterior after the first wave becomes the prior for the next wave. These designs still require advance decision rules and transparent reporting so that flexibility remains planned rather than ad hoc.

### Designing Pilot Studies

Pilot studies serve specific purposes: assessing feasibility (recruitment rates, attrition, protocol adherence), refining measurement instruments, and estimating variability to inform future sample size calculations. With very small *n* (often 10–30 participants), focus on collecting process metrics and precision estimates rather than hypothesis testing. Report:

- **Primary feasibility outcomes** (e.g., proportion screened who consent, time to complete assessments).
- **Preliminary effect estimates** with wide confidence intervals, making clear that they are exploratory.
- **Adaptations for the main study**, especially where procedures proved onerous or data quality issues emerged; describe what was changed and why so that reviewers can see how the pilot informed the main design.

In planning terms, choose a pilot sample large enough to detect major logistical problems (often 12–20 per arm is sufficient for estimating key feasibility parameters rather than testing efficacy), prespecify success criteria such as an acceptable recruitment rate, and plan in advance how you will decide whether to proceed to a full trial [@teare2014].

### Key Takeaways

Small-sample studies work best when the question is narrow, the design is realistic, and the outcome preserves as much defensible information as possible. Exploratory aims, continuous or ordinal measures, and efficient designs such as paired or stratified comparisons often make limited data more informative than an overly ambitious confirmatory plan would. Throughout the design process, report effect sizes and confidence intervals alongside power or feasibility considerations so readers can judge what the study could genuinely show.

---

### Self-Assessment Quiz

Test your understanding of the key concepts from Chapter 2.
```{r}
#| echo: false
#| results: asis
quiz_helpers_path <- normalizePath(
  file.path(dirname(knitr::current_input(dir = TRUE)), "..", "R", "quiz_helpers.R"),
  mustWork = FALSE
)

if (file.exists(quiz_helpers_path)) {
  source(quiz_helpers_path)
} else {
  cat("Quiz helper file not found at:", quiz_helpers_path, "\nPlease ensure 'quiz_helpers.R' exists in the R directory.\n")
}

smallsamplelab_render_quiz(list(
  list(
    prompt = "Which research question is better suited to small samples?",
    options = c(
      "\"What is the relationship between 20 personality traits and job performance?\"",
      "\"Does a brief mindfulness intervention reduce test anxiety compared to control?\"",
      "\"Can we predict customer churn using all available behavioural data?\"",
      "\"How do socioeconomic factors interact to predict health outcomes?\""
    ),
    answer = 2L,
    explanation = "Focused comparative questions with a single primary outcome are feasible with small samples. Multivariate questions (A, C, D) require large samples to estimate many parameters reliably. The chapter emphasises: \"Focused questions about a single outcome or a few key comparisons are much more realistic with modest samples.\""
  ),
  list(
    prompt = "An exploratory study (n=20) finds that meditation reduces anxiety (p=0.04, d=0.7). How should this be framed?",
    options = c(
      "\"Meditation is proven effective\"",
      "\"Preliminary evidence suggests meditation may reduce anxiety; replication needed\"",
      "\"No conclusions can be drawn from n=20\"",
      "\"The effect is definitely due to chance\""
    ),
    answer = 2L,
    explanation = "Exploratory studies with small samples are useful for generating hypotheses, but their findings should be treated as provisional. The chapter emphasises that such results should be interpreted cautiously, especially because exploratory work can surface patterns that reflect chance and therefore require replication."
  ),
  list(
    prompt = "A researcher dichotomises a continuous outcome (0–100 scale) into \"high\" (70 or above) vs \"low\" (<70). With n=25, what is the consequence?",
    options = c(
      "Power increases because binary outcomes are simpler",
      "Power decreases because information is discarded",
      "No effect on statistical power",
      "Analysis becomes impossible"
    ),
    answer = 2L,
    explanation = "Dichotomising continuous variables discards information about the magnitude of differences, reduces statistical power, and can create spurious findings at arbitrary cut-points. The chapter clearly states: \"rather than dichotomising patient improvement into 'improved' versus 'not improved', use a continuous measure... This preserves information and increases statistical efficiency.\""
  ),
  list(
    prompt = "A study aims to detect a \"small\" effect (d=0.2) with 80% power. Approximately how many participants per group are needed?",
    options = c(
      "n=20 per group",
      "n=50 per group",
      "n=200 per group",
      "approximately n=393 per group"
    ),
    answer = 4L,
    explanation = "Detecting small effects requires large samples. For d=0.2 with 80% power and alpha = 0.05 (two-tailed), approximately n=393 per group is needed. With n = 15 per group, the chapter shows that a study is only positioned to detect very large effects, approximately d = 1.06 or larger. This aligns with the power curve in Chapter 1, where the d = 0.3 curve remains well below the conventional power target across typical small-sample settings."
  ),
  list(
    prompt = "Which statement about pilot studies is CORRECT?",
    options = c(
      "Pilot studies should always test hypotheses",
      "Pilot studies assess feasibility and refine procedures",
      "Pilot studies require the same sample size as main studies",
      "Pilot studies never provide useful effect size estimates"
    ),
    answer = 2L,
    explanation = "Pilot studies (typically n=10–30) assess feasibility (recruitment rates, protocol adherence, measurement properties), refine procedures, and provide preliminary effect size estimates for sample size planning, but are not sufficiently large for definitive hypothesis testing. The chapter's \"Designing Pilot Studies\" section explicitly states: \"focus on collecting process metrics and precision estimates rather than hypothesis testing.\""
  ),
  list(
    prompt = "A researcher plans a study with n=15 per group but calculates they need n=50 per group for 80% power. What should they do?",
    options = c(
      "Proceed exactly as planned and treat the study as confirmatory",
      "Reframe the study as exploratory and report the minimum detectable effect for n=15",
      "Drop the study because useful information cannot be learned from n=15",
      "Keep the original design and add more predictors to recover power"
    ),
    answer = 2L,
    explanation = "When the feasible sample size is much smaller than the confirmatory target, the chapter's guidance is to narrow the claim, treat the work as exploratory or pilot-based, and report what effect sizes the study can realistically detect. That combines the framing guidance from the opening section with the emphasis on minimum detectable effects in the estimation section."
  ),
  list(
    prompt = "Which outcome is LEAST appropriate for n=20?",
    options = c(
      "Binary outcome (success/failure)",
      "Ordinal outcome (1–7 Likert scale)",
      "Continuous outcome (0–100 scale)",
      "50-item questionnaire with subscale factor analysis"
    ),
    answer = 4L,
    explanation = "A 50-item factor analysis asks the data to estimate far too many relationships for n=20. That makes the model unstable and the results difficult to trust. This follows the chapter's broader point that complex multivariate questions are usually unrealistic with very small samples."
  ),
  list(
    prompt = "A study comparing two teaching methods (n=12 per class) finds no significant difference (p=0.18, d=0.45). The conclusion should be:",
    options = c(
      "\"The two methods are equally effective\"",
      "\"The study found no evidence of a difference, but was underpowered to detect medium effects\"",
      "\"Teaching method has no effect on learning\"",
      "\"The null hypothesis is confirmed\""
    ),
    answer = 2L,
    explanation = "With small samples, a non-significant result means the study did not provide clear evidence of a difference. An observed effect of d = 0.45 may still be practically meaningful, and the study lacked power to detect it definitively. The chapter emphasises: \"Even when a small sample has limited power, the estimated effect size and its confidence interval indicate the likely magnitude and precision of the effect.\""
  ),
  list(
    prompt = "When choosing between a paired and independent-groups design with small samples, which is generally preferable?",
    options = c(
      "Always use independent groups; pairing is only for large samples",
      "Paired designs remove between-subject variability from the comparison and increase power",
      "The choice makes no difference statistically",
      "Paired designs require larger samples than independent designs"
    ),
    answer = 2L,
    explanation = "Because each unit serves as its own control, paired designs remove between-subject variability from the comparison, which increases power. The chapter's \"Research Design Considerations\" section states: \"Paired or matched designs (before–after, crossover, matched-pair comparisons) reduce variability by comparing each unit to itself or a closely matched control. This within-unit comparison can yield precise inferences even when the number of units is small.\""
  ),
  list(
    prompt = "A pilot study with n=18 yields a mean difference of 5 points (95% CI: [0.2, 9.8]). What is the appropriate next step?",
    options = c(
      "Conclude the intervention is effective and implement widely",
      "Use this estimate to plan a fully-powered confirmatory study",
      "Abandon the research because the sample was too small",
      "Report only the p-value and ignore the confidence interval"
    ),
    answer = 2L,
    explanation = "Use this estimate to plan a fully-powered confirmatory study. Pilot studies provide preliminary effect estimates and variability information needed for sample size planning. The chapter recommends reporting preliminary effect estimates with wide confidence intervals, making clear that they are exploratory, and using pilots to estimate variability for future sample size calculations."
  )
))
```
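
The sample-size figures quoted in the quiz (roughly 393 per group for d = 0.2, and a minimum detectable effect near d = 1.06 with 15 per group) can be checked with a short calculation. The sketch below uses base R's `stats::power.t.test()` and assumes a two-sided, two-sample *t*-test with equal group sizes, alpha = 0.05, 80% power, and effects expressed in standard deviation units (`sd = 1`). The object names `n_small_effect` and `mde_n15` are illustrative and are not used elsewhere in the chapter.

```r
# Sketch under stated assumptions: two-sided, two-sample t-test,
# equal group sizes, alpha = 0.05, effects in SD units (sd = 1).

# Per-group sample size needed to detect d = 0.2 with 80% power.
# Leaving `n` unspecified asks power.t.test() to solve for it.
n_small_effect <- power.t.test(
  delta = 0.2,
  sd = 1,
  sig.level = 0.05,
  power = 0.80,
  type = "two.sample",
  alternative = "two.sided"
)$n

# Minimum detectable effect (in SD units) with n = 15 per group.
# Leaving `delta` unspecified asks power.t.test() to solve for it.
mde_n15 <- power.t.test(
  n = 15,
  sd = 1,
  sig.level = 0.05,
  power = 0.80,
  type = "two.sample",
  alternative = "two.sided"
)$delta

cat(sprintf("n per group for d = 0.2: %.1f\n", n_small_effect))
cat(sprintf("Minimum detectable d with n = 15 per group: %.2f\n", mde_n15))
```

Changing `n`, `power`, or `sig.level` shows how quickly the minimum detectable effect grows as the sample shrinks. A paired design would need `type = "paired"` and a standard deviation of the within-pair differences, which this sketch does not cover.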