```{r}
#| include: false
suppressPackageStartupMessages({
library(dplyr)
library(ggplot2)
library(purrr)
library(tibble)
library(tidyr)
})
```
# Chapter 1: Why Small-Sample Research Matters
Small samples appear routinely across many substantive disciplines: in clinical piloting, ethnographic research, specialised occupational settings, and programme evaluations. In each of these contexts, the productive response to limited *n* is careful reasoning about power, uncertainty, and method choice rather than apology or retreat. This chapter explains why large-sample approximations can mislead when data are modest, shows how sample size changes what can realistically be detected, and introduces the families of methods that remain useful when information is scarce. The broader aim is to set up the rest of the book: choosing analyses that fit the data you actually have, rather than the data you wish you had.
### Learning Objectives
By the end of this chapter, you will be able to explain why small samples are common in applied research, identify where large-sample approximations become unreliable, construct and interpret power curves for different effect sizes, and select methods that fit the data and question at hand rather than the sample size you might have preferred.
### Why Small Samples Are Often Unavoidable
Many textbooks assume that researchers can collect hundreds or thousands of observations. In practice, however, numerous research contexts yield small samples. Clinical studies of rare diseases, evaluations of pilot programmes, classroom-based educational interventions, community-based participatory research, and studies in Small Island Developing States (SIDS) often involve fewer than 100 participants. Resource constraints, logistical barriers, and ethical considerations (such as minimising burden on vulnerable populations) make small samples the norm rather than the exception.
For this book, a "small sample" usually means a dataset where routine large-sample approximations cannot be taken for granted: often *n* ≤ 50 per group for two-group comparisons, *n* ≤ 100 total for simple regression, or fewer than about 20 outcome events for binary or count models. These are working boundaries rather than universal cut-offs. A sample of 80 can still be small for a multivariable logistic model, while a sample of 24 paired observations may be informative for a tightly controlled within-person comparison.
In health sciences, rare-disease trials, feasibility studies, and single-site hospital evaluations may each involve only a few dozen participants. In education, a classroom-based intervention may be tested in one class of 15 to 25 students, while in business and the social sciences the accessible population may be small from the outset, as in niche-market A/B tests, studies of remote communities, or work with specialised occupational groups. Across these settings, the statistical problem runs deeper than a shortfall in recruitment: the population available for study may itself be limited.
Despite their ubiquity, small samples are often treated as deficient or temporary. This apologetic framing appears when authors describe modest *n* as a weakness by default, or when reviewers demand larger samples without asking whether a larger pool actually exists, whether recruitment would be ethical, or whether the research question is already well matched to a small but carefully analysed dataset. The framing is methodologically unwarranted because sample size is not a free-floating quality marker: its adequacy depends on the population, the effect size, the design, and the inferential goal. Focused questions about a single outcome or a few key comparisons can often be addressed with modest samples, and even small studies can be informative when effects are large and variability is low. Treating small samples as deficiencies also overlooks the fact that many important questions can only be addressed with small datasets. Rather than apologising, researchers should select methods that are appropriate for the sample size at hand.
### When Large-Sample Approximations Fail
Classical parametric tests (t-tests, ANOVA, standard logistic regression) rely on asymptotic theory. They assume that sampling distributions approximate normality as sample size increases. With small samples, these approximations can be inaccurate. P-values may be misleading, confidence intervals may have poor coverage, and maximum likelihood estimates may be unstable or even undefined (for example, in logistic regression with separation).
Small samples also amplify the impact of outliers and violations of distributional assumptions. A single extreme value can dominate a mean or distort a regression slope. Skewed or heavy-tailed distributions, which cause few problems in large samples, become serious concerns when n is small. When *n* is limited, researchers should also select outcome measures carefully, because continuous or ordinal outcomes usually preserve more information per observation than coarse binary outcomes.
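As a rough illustration of that last point, the short sketch below compares the power of a two-sample t-test on a continuous outcome with the power of a two-proportion test after the same outcome has been split at its common median. The mapping from *d* = 0.5 to roughly 40% versus 60% of observations above the cut assumes normally distributed outcomes, and the per-group size of 25 is arbitrary; the sketch is indicative rather than definitive.

```{r}
#| label: ch1-dichotomisation-sketch
# Power for the continuous outcome: two-sample t-test, d = 0.5, n = 25 per group.
power.t.test(n = 25, delta = 0.5, sd = 1, sig.level = 0.05,
             type = "two.sample", alternative = "two.sided")$power

# Power after a median split: under normality, a d = 0.5 separation puts roughly
# 60% of one group and 40% of the other above the common median (pnorm(0.25) is
# about 0.60), so the comparison becomes one between two proportions.
power.prop.test(n = 25, p1 = 0.40, p2 = 0.60, sig.level = 0.05)$power
```

With these settings the continuous comparison retains noticeably more power than the dichotomised version of the same underlying effect, which is the practical cost of coarsening the outcome.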
The sample mean remains meaningful, but the reference distribution used by a familiar test may no longer match the actual sampling behaviour well enough. The simulation below shows the point using a one-sample t-test at a nominal two-sided $\alpha = 0.05$ when observations come from a strongly right-skewed distribution with true mean zero. The test is still centred on the correct null value, but the Type I error rate is higher than the nominal 5% level at small *n*.
```{r}
#| label: tab-ch1-type-i-simulation
#| echo: false
#| results: asis
set.seed(2025)
simulation_reps <- 10000
simulate_type_i <- function(n, reps = simulation_reps) {
  # Each row of x is one simulated sample of size n from a shifted exponential
  # distribution: strongly right-skewed, but with true mean zero.
  x <- matrix(rexp(n * reps, rate = 1) - 1, nrow = reps)
  sample_means <- rowMeans(x)
  centred <- sweep(x, 1, sample_means, FUN = "-")
  sample_sds <- sqrt(rowSums(centred^2) / (n - 1))
  # One-sample t statistic for each simulated sample, testing H0: mean = 0.
  t_values <- sample_means / (sample_sds / sqrt(n))
  # Proportion of two-sided p-values below 0.05, i.e. the simulated Type I error rate.
  mean(2 * pt(-abs(t_values), df = n - 1) < 0.05)
}
type_i_table <- tibble(
`Sample size` = c(5, 10, 20, 30),
`Nominal alpha` = "0.05",
`Simulated Type I error` = sprintf("%.3f", purrr::map_dbl(`Sample size`, simulate_type_i))
)
knitr::kable(
type_i_table,
booktabs = TRUE,
align = c("r", "r", "r"),
caption = "Simulated Type I error for a one-sample t-test under strong right skew."
)
```
Interpretation: This simulation is deliberately simple, but it gives a concrete reason for caution. At *n* = 5 or 10, a nominal 5% t-test rejects too often under this skewed null distribution. The error rate moves closer to the target as *n* increases, but the improvement is gradual rather than automatic.
### Visualising Power Trade-offs
Even modest reductions in sample size can have a dramatic impact on statistical power. Figure 1.1 uses base R's `power.t.test()` function to illustrate how power declines as per-group sample size falls from 60 to 10, shown separately for small, medium, and large effects (Cohen's *d* = 0.3, 0.5, and 0.8).
To calculate one point on the figure, you supply the per-group sample size `n`, the expected mean difference `delta`, the within-group standard deviation `sd`, the significance level `sig.level`, and the test design through `type = "two.sample"`. For example, `power.t.test(n = 35, delta = 0.5, sd = 1, sig.level = 0.05, type = "two.sample", alternative = "two.sided")$power` returns a value a little above 0.50, meaning that a two-group study with 35 participants per group has only a little better than a fifty-fifty chance of detecting a true effect of *d* = 0.5 at the 5% level. Reading Figure 1.1 in this way helps translate abstract power calculations into concrete design choices.
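The call quoted above can also be turned around to ask what effect a given design could realistically detect. The minimal sketch below first reproduces that power calculation and then, by omitting `delta` and supplying a target power, solves for the minimum detectable effect; both calls use only base R's `power.t.test()`.

```{r}
#| label: ch1-power-t-test-sketch
# Power for n = 35 per group, d = 0.5, two-sided alpha = 0.05 (the call quoted above).
power.t.test(n = 35, delta = 0.5, sd = 1, sig.level = 0.05,
             type = "two.sample", alternative = "two.sided")$power

# Omitting `delta` and supplying `power` asks the same function to solve for
# the minimum detectable effect at 80% power with this design.
power.t.test(n = 35, sd = 1, sig.level = 0.05, power = 0.80,
             type = "two.sample", alternative = "two.sided")$delta
```

For this design the minimum detectable effect at 80% power works out to roughly *d* = 0.68, which is consistent with the *d* = 0.5 curve in Figure 1.1 sitting below the dashed 80% line across the plotted range of sample sizes.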
:::: {.content-visible when-format="html"}
::: {.panel-tabset group="chapter1-power-curve"}
#### Output
```{webr-r}
#| label: qfig-ch1-power-curves-web
#| context: output
#| fig-cap: "Figure 1.1: Power curves illustrating sensitivity to sample size."
#| fig-alt: "Power curves illustrating sensitivity to sample size."
library(dplyr)
library(ggplot2)
library(purrr)
library(tibble)
library(tidyr)
effect_sizes <- c(0.3, 0.5, 0.8)
n_values <- seq(10, 60, by = 5)
alpha <- 0.05
power_grid <- crossing(n = n_values, d = effect_sizes) %>%
mutate(power = map2_dbl(
n,
d,
~ power.t.test(
n = .x,
delta = .y,
sd = 1,
sig.level = alpha,
type = "two.sample",
alternative = "two.sided"
)$power
))
ggplot(power_grid, aes(x = n, y = power, colour = factor(d))) +
geom_line(linewidth = 1) +
geom_point() +
geom_hline(yintercept = 0.80, linetype = "dashed", colour = "grey40") +
scale_colour_viridis_d(name = "Effect size (d)", option = "D", end = 0.85) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(
x = "Sample size per group",
y = "Power",
title = "Statistical power rises with sample size — but slowly at small n",
subtitle = "Dashed line marks the conventional 80% power threshold"
) +
theme_classic(base_size = 12)
```
#### R Code
```{webr-r}
#| label: ch1-power-curves-code
#| context: interactive
library(dplyr)
library(ggplot2)
library(purrr)
library(tibble)
library(tidyr)
effect_sizes <- c(0.3, 0.5, 0.8)
n_values <- seq(10, 60, by = 5)
alpha <- 0.05
power_grid <- crossing(n = n_values, d = effect_sizes) %>%
mutate(power = map2_dbl(
n,
d,
~ power.t.test(
n = .x,
delta = .y,
sd = 1,
sig.level = alpha,
type = "two.sample",
alternative = "two.sided"
)$power
))
ggplot(power_grid, aes(x = n, y = power, colour = factor(d))) +
geom_line(linewidth = 1) +
geom_point() +
geom_hline(yintercept = 0.80, linetype = "dashed", colour = "grey40") +
scale_colour_viridis_d(name = "Effect size (d)", option = "D", end = 0.85) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(
x = "Sample size per group",
y = "Power",
title = "Statistical power rises with sample size — but slowly at small n",
subtitle = "Dashed line marks the conventional 80% power threshold"
) +
theme_classic(base_size = 12)
```
:::
::::
:::: {.content-visible unless-format="html"}
```{r}
#| label: qfig-ch1-power-curves
#| fig-cap: "Figure 1.1: Power curves illustrating sensitivity to sample size."
#| fig-alt: "Power curves illustrating sensitivity to sample size."
effect_sizes <- c(0.3, 0.5, 0.8)
n_values <- seq(10, 60, by = 5)
alpha <- 0.05
power_grid <- crossing(n = n_values, d = effect_sizes) %>%
mutate(power = map2_dbl(
n,
d,
~ power.t.test(
n = .x,
delta = .y,
sd = 1,
sig.level = alpha,
type = "two.sample",
alternative = "two.sided"
)$power
))
ggplot(power_grid, aes(x = n, y = power, colour = factor(d))) +
geom_line(linewidth = 1) +
geom_point() +
geom_hline(yintercept = 0.80, linetype = "dashed", colour = "grey40") +
scale_colour_viridis_d(name = "Effect size (d)", option = "D", end = 0.85) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
labs(
x = "Sample size per group",
y = "Power",
title = "Statistical power rises with sample size — but slowly at small n",
subtitle = "Dashed line marks the conventional 80% power threshold"
) +
theme_classic(base_size = 12)
```
::::
::: {.callout-note appearance="simple" icon=false}
## Interpretation
Figure 1.1 shows that with medium effects (about *d* = 0.5), power does not rise above 50% until per-group sample size reaches roughly the mid-30s, and it still remains below the conventional 80% threshold even at 60 participants per group. Detecting smaller effects (about *d* = 0.3) would require far more observations than are typically feasible in small-sample settings. This visual reinforces the need to report minimum detectable effects and to focus on estimation rather than binary significance testing when *n* is limited.
:::
### Appropriate Methods for Small Samples
When standard asymptotic approximations cannot be trusted, three broad classes of method remain available. Exact tests (such as Fisher's exact test, exact binomial tests, and exact Poisson tests) compute p-values directly from the combinatorial distribution of the data and are especially useful for small discrete datasets. Resampling methods (bootstrap and permutation tests) use the observed data to approximate the sampling distribution, often yielding more accurate inferences than large-sample formulas when an exact calculation is unavailable or when the statistic of interest is more complex than a standard test handles well.
Nonparametric rank-based tests (Mann–Whitney U, Wilcoxon signed-rank, Kruskal–Wallis) make fewer distributional assumptions and are less sensitive to outliers, making them natural choices for ordinal or skewed outcomes. Penalised regression (Firth logistic regression, ridge, LASSO) can stabilise coefficient estimates when events are sparse (that is, when outcome events are rare relative to the number of predictors, as in binary logistic regression with few observed cases of the outcome). Bayesian methods incorporate prior information and quantify uncertainty through posterior distributions, which remain well-defined even when data are limited. In practice, these families complement rather than replace one another: exact tests are often most natural for small discrete tables, resampling methods are flexible for estimation and custom statistics, and penalised or Bayesian models become practical when the research question requires regression rather than a single comparison.
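To make the first two of these families concrete, the sketch below applies them to small, purely illustrative datasets: the 2×2 table and the two score vectors are hypothetical values invented for this example, not data from any study. The functions used (`fisher.test()`, `chisq.test()`, `replicate()`, `quantile()`) are all base R.

```{r}
#| label: ch1-exact-and-bootstrap-sketch
set.seed(101)

# Exact inference for a small 2x2 table (hypothetical counts: 2/10 vs 7/10 successes).
small_table <- matrix(c(2, 8, 7, 3), nrow = 2,
                      dimnames = list(Outcome = c("Success", "Failure"),
                                      Group = c("A", "B")))
fisher.test(small_table)$p.value  # exact p-value from the hypergeometric distribution
# Uncorrected large-sample chi-squared approximation, for comparison; chisq.test()
# would itself warn here that the approximation may be incorrect (expected counts < 5).
suppressWarnings(chisq.test(small_table, correct = FALSE)$p.value)

# Percentile bootstrap interval for a difference in medians (hypothetical scores).
group_1 <- c(4.1, 5.3, 3.8, 6.0, 4.6, 5.1, 4.9, 5.7)
group_2 <- c(3.2, 4.0, 3.6, 4.4, 2.9, 3.8, 4.1, 3.5)
boot_diffs <- replicate(5000, {
  median(sample(group_1, replace = TRUE)) - median(sample(group_2, replace = TRUE))
})
quantile(boot_diffs, c(0.025, 0.975))  # rough 95% percentile interval
```

The exact p-value and the large-sample chi-squared approximation can differ appreciably in tables this small, which is the basic motivation for preferring the exact calculation; the bootstrap interval, likewise, is built from the observed values rather than from a normal-theory formula.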
### Example: Comparing Two Small Groups
Suppose we wish to compare customer satisfaction scores (on a 1–10 scale) between two service branches, each with only 12 observations. The scores are ordinal and may not be normally distributed.
:::: {.content-visible when-format="html"}
::: {.panel-tabset group="chapter1-mann-whitney"}
#### Output
```{webr-r}
#| label: ch1-mann-whitney-output
#| context: output
library(dplyr)
library(htmltools)
library(tibble)
branch_a <- c(7, 8, 6, 7, 9, 8, 7, 6, 8, 7, 9, 8)
branch_b <- c(5, 6, 7, 5, 6, 5, 7, 6, 5, 6, 7, 6)
n1 <- length(branch_a)
n2 <- length(branch_b)
all_scores <- c(branch_a, branch_b)
mw_result <- wilcox.test(branch_a, branch_b, conf.int = TRUE, exact = FALSE)
# exact = FALSE avoids tie-related warnings because the satisfaction scores contain repeated values.
tie_counts <- table(all_scores)
tie_term <- sum(tie_counts^3 - tie_counts)
var_u <- n1 * n2 / 12 * ((n1 + n2 + 1) - tie_term / ((n1 + n2) * (n1 + n2 - 1))) # variance of U statistic with ties, see Conover (1999)
z_value <- (as.numeric(mw_result$statistic) - (n1 * n2 / 2)) / sqrt(var_u)
# Standardized effect size r for Mann–Whitney U test (see e.g., Rosenthal & Rubin, 2003)
wilcox_r <- abs(z_value) / sqrt(n1 + n2)
summary_table <- tibble(
branch = c("A", "B"),
N = c(n1, n2),
Median = c(
formatC(median(branch_a), format = "f", digits = 1),
formatC(median(branch_b), format = "f", digits = 1)
),
IQR = c(
paste0(
formatC(as.numeric(quantile(branch_a, 0.25)), format = "f", digits = 1),
"–",
formatC(as.numeric(quantile(branch_a, 0.75)), format = "f", digits = 1)
),
paste0(
formatC(as.numeric(quantile(branch_b, 0.25)), format = "f", digits = 1),
"–",
formatC(as.numeric(quantile(branch_b, 0.75)), format = "f", digits = 1)
)
),
Range = c(
paste0(min(branch_a), "–", max(branch_a)),
paste0(min(branch_b), "–", max(branch_b))
)
)
p_text <- if (mw_result$p.value < 0.001) "p < 0.001" else sprintf("p = %.3f", mw_result$p.value)
summary_view <- tagList(
tags$style(HTML("
.apa-table-block {
font-family: 'Times New Roman', Georgia, serif;
color: #111;
max-width: 44rem;
}
.apa-table-number {
margin: 0 0 0.1rem 0;
font-weight: 700;
}
.apa-table-title {
margin: 0 0 0.5rem 0;
font-style: italic;
}
.apa-table {
width: 100%;
border-collapse: collapse;
font-size: 0.98rem;
line-height: 1.35;
}
.apa-table th,
.apa-table td {
padding: 0.35rem 0.5rem;
border-left: none !important;
border-right: none !important;
background: transparent !important;
}
.apa-table thead th {
border-top: 2px solid #000;
border-bottom: 1px solid #000;
font-weight: 600;
}
.apa-table tbody tr:last-child td {
border-bottom: 2px solid #000;
}
.apa-table-note {
margin: 0.45rem 0 0 0;
font-size: 0.92rem;
line-height: 1.35;
}
")),
tags$div(
class = "apa-table-block",
tags$p(class = "apa-table-number", "Table 1.1"),
tags$p(class = "apa-table-title", "Customer satisfaction scores by branch"),
HTML(
knitr::kable(
summary_table,
format = "html",
escape = FALSE,
col.names = c("Branch", "<em>n</em>", "Median", "IQR", "Range"),
align = c("l", "r", "r", "l", "l"),
table.attr = "class='apa-table'"
)
),
tags$p(
class = "apa-table-note",
HTML("<em>Note.</em>"),
sprintf(
" Wilcoxon rank-sum test: W = %.0f, %s, Hodges–Lehmann shift = %.2f, standardised r = %.2f.",
as.numeric(mw_result$statistic),
p_text,
unname(mw_result$estimate),
wilcox_r
)
),
tags$p(
class = "apa-table-note",
sprintf(
"95%% confidence interval for the location shift: %.2f to %.2f.",
mw_result$conf.int[1],
mw_result$conf.int[2]
)
)
)
)
html_print(summary_view)
```
#### R Code
```{webr-r}
#| label: ch1-mann-whitney-code
#| context: interactive
library(dplyr)
library(htmltools)
library(tibble)
branch_a <- c(7, 8, 6, 7, 9, 8, 7, 6, 8, 7, 9, 8)
branch_b <- c(5, 6, 7, 5, 6, 5, 7, 6, 5, 6, 7, 6)
n1 <- length(branch_a)
n2 <- length(branch_b)
all_scores <- c(branch_a, branch_b)
mw_result <- wilcox.test(branch_a, branch_b, conf.int = TRUE, exact = FALSE)
# exact = FALSE avoids tie-related warnings because the satisfaction scores contain repeated values.
tie_counts <- table(all_scores)
tie_term <- sum(tie_counts^3 - tie_counts)
var_u <- n1 * n2 / 12 * ((n1 + n2 + 1) - tie_term / ((n1 + n2) * (n1 + n2 - 1)))
z_value <- (as.numeric(mw_result$statistic) - (n1 * n2 / 2)) / sqrt(var_u)
wilcox_r <- abs(z_value) / sqrt(n1 + n2)
summary_table <- tibble(
branch = c("A", "B"),
N = c(n1, n2),
Median = c(
formatC(median(branch_a), format = "f", digits = 1),
formatC(median(branch_b), format = "f", digits = 1)
),
IQR = c(
paste0(
formatC(as.numeric(quantile(branch_a, 0.25)), format = "f", digits = 1),
"–",
formatC(as.numeric(quantile(branch_a, 0.75)), format = "f", digits = 1)
),
paste0(
formatC(as.numeric(quantile(branch_b, 0.25)), format = "f", digits = 1),
"–",
formatC(as.numeric(quantile(branch_b, 0.75)), format = "f", digits = 1)
)
),
Range = c(
paste0(min(branch_a), "–", max(branch_a)),
paste0(min(branch_b), "–", max(branch_b))
)
)
p_text <- if (mw_result$p.value < 0.001) "p < 0.001" else sprintf("p = %.3f", mw_result$p.value)
summary_view <- tagList(
tags$style(HTML("
.apa-table-block {
font-family: 'Times New Roman', Georgia, serif;
color: #111;
max-width: 44rem;
}
.apa-table-number {
margin: 0 0 0.1rem 0;
font-weight: 700;
}
.apa-table-title {
margin: 0 0 0.5rem 0;
font-style: italic;
}
.apa-table {
width: 100%;
border-collapse: collapse;
font-size: 0.98rem;
line-height: 1.35;
}
.apa-table th,
.apa-table td {
padding: 0.35rem 0.5rem;
border-left: none !important;
border-right: none !important;
background: transparent !important;
}
.apa-table thead th {
border-top: 2px solid #000;
border-bottom: 1px solid #000;
font-weight: 600;
}
.apa-table tbody tr:last-child td {
border-bottom: 2px solid #000;
}
.apa-table-note {
margin: 0.45rem 0 0 0;
font-size: 0.92rem;
line-height: 1.35;
}
")),
tags$div(
class = "apa-table-block",
tags$p(class = "apa-table-number", "Table 1.1"),
tags$p(class = "apa-table-title", "Customer satisfaction scores by branch"),
HTML(
knitr::kable(
summary_table,
format = "html",
escape = FALSE,
col.names = c("Branch", "<em>n</em>", "Median", "IQR", "Range"),
align = c("l", "r", "r", "l", "l"),
table.attr = "class='apa-table'"
)
),
tags$p(
class = "apa-table-note",
HTML("<em>Note.</em>"),
sprintf(
" Wilcoxon rank-sum test: W = %.0f, %s, Hodges–Lehmann shift = %.2f, standardised r = %.2f.",
as.numeric(mw_result$statistic),
p_text,
unname(mw_result$estimate),
wilcox_r
)
),
tags$p(
class = "apa-table-note",
sprintf(
"95%% confidence interval for the location shift: %.2f to %.2f.",
mw_result$conf.int[1],
mw_result$conf.int[2]
)
)
)
)
html_print(summary_view)
```
:::
::::
:::: {.content-visible unless-format="html"}
```{r}
#| label: example-mann-whitney
#| message: false
#| warning: false
#| results: asis
#| tbl-cap: "Customer satisfaction scores by branch"
branch_a <- c(7, 8, 6, 7, 9, 8, 7, 6, 8, 7, 9, 8)
branch_b <- c(5, 6, 7, 5, 6, 5, 7, 6, 5, 6, 7, 6)
n1 <- length(branch_a)
n2 <- length(branch_b)
all_scores <- c(branch_a, branch_b)
mw_result <- wilcox.test(branch_a, branch_b, conf.int = TRUE, exact = FALSE)
# exact = FALSE avoids tie-related warnings because the satisfaction scores contain repeated values.
tie_counts <- table(all_scores)
tie_term <- sum(tie_counts^3 - tie_counts)
var_u <- n1 * n2 / 12 * ((n1 + n2 + 1) - tie_term / ((n1 + n2) * (n1 + n2 - 1)))
z_value <- (as.numeric(mw_result$statistic) - (n1 * n2 / 2)) / sqrt(var_u)
wilcox_r <- abs(z_value) / sqrt(n1 + n2)
summary_table <- tibble(
branch = c("A", "B"),
N = c(n1, n2),
Median = c(
formatC(median(branch_a), format = "f", digits = 1),
formatC(median(branch_b), format = "f", digits = 1)
),
IQR = c(
paste0(
formatC(as.numeric(quantile(branch_a, 0.25)), format = "f", digits = 1),
"–",
formatC(as.numeric(quantile(branch_a, 0.75)), format = "f", digits = 1)
),
paste0(
formatC(as.numeric(quantile(branch_b, 0.25)), format = "f", digits = 1),
"–",
formatC(as.numeric(quantile(branch_b, 0.75)), format = "f", digits = 1)
)
),
Range = c(
paste0(min(branch_a), "–", max(branch_a)),
paste0(min(branch_b), "–", max(branch_b))
)
)
p_text <- if (mw_result$p.value < 0.001) "p < 0.001" else sprintf("p = %.3f", mw_result$p.value)
knitr::kable(
summary_table,
align = c("l", "r", "r", "l", "l"),
booktabs = TRUE
)
cat(
sprintf(
"\n\n*Note.* Wilcoxon rank-sum test: W = %.0f, %s, Hodges–Lehmann shift = %.2f, standardised r = %.2f. 95%% confidence interval for the location shift: %.2f to %.2f.\n",
as.numeric(mw_result$statistic),
p_text,
unname(mw_result$estimate),
wilcox_r,
mw_result$conf.int[1],
mw_result$conf.int[2]
)
)
```
::::
The Mann–Whitney U test (equivalently, the Wilcoxon rank-sum test) compares the distributions of the two groups without assuming normality. The p-value is the probability of observing a rank-sum statistic at least as extreme as the one obtained, assuming the two groups share the same underlying distribution. Because the test is based on ranks, it is robust to skewness and outliers.
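The p-value reported above came from a normal approximation with a tie correction. A useful sanity check in very small samples is to approximate the permutation null distribution directly: shuffle the group labels many times and count how often the rank sum is at least as extreme as the one observed. The sketch below does this for the same branch data; because it is a Monte Carlo approximation, the p-value will vary slightly from run to run.

```{r}
#| label: ch1-permutation-sketch
set.seed(42)
branch_a <- c(7, 8, 6, 7, 9, 8, 7, 6, 8, 7, 9, 8)
branch_b <- c(5, 6, 7, 5, 6, 5, 7, 6, 5, 6, 7, 6)
scores <- c(branch_a, branch_b)
group <- rep(c("A", "B"), each = 12)

# Observed rank sum for Branch A (ties receive average ranks).
observed <- sum(rank(scores)[group == "A"])
expected <- length(branch_a) * (length(scores) + 1) / 2  # null expectation of the rank sum

# Permutation null: reassign the 24 scores to the two branches at random
# and recompute the rank sum each time.
perm_sums <- replicate(10000, {
  shuffled <- sample(group)
  sum(rank(scores)[shuffled == "A"])
})

# Two-sided Monte Carlo p-value: how often is a permuted rank sum at least
# as far from its null expectation as the observed one?
mean(abs(perm_sums - expected) >= abs(observed - expected))
```

Comparing this Monte Carlo p-value with the approximate one in Table 1.1 is a quick check on the normal approximation; if the two disagreed badly, the approximation would be the one to distrust.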
::: {.callout-note appearance="simple" icon=false}
## Interpretation
In this example, Branch A scores higher than Branch B, with a Hodges–Lehmann shift estimate of 2 points and a large standardised rank-based effect size (Rosenthal's *r* = 0.67). The small p-value (*p* = 0.001) suggests that the observed rank difference would be unlikely if the two branches had the same underlying distribution. With only 12 observations per branch, the example shows how a rank-based method can still produce a clear, interpretable result when the outcome is ordinal and the effect is sizeable.
A conventional sample-size rule of thumb would ask for considerably more than 12 observations per group; the gap between what such rules demand and what is actually achievable in constrained settings is precisely the problem this book addresses.
:::
### Key Takeaways
Small samples are a common feature of many substantive fields rather than a deficiency to be apologised for, and large-sample approximations can fail when *n* is modest, leading to inaccurate p-values and confidence intervals. Exact tests, resampling methods, and rank-based procedures offer valid alternatives that do not require large samples to behave well. In practice, the choice of method should match the research question, outcome type, and available sample size. Small studies are strongest when the analysis is calibrated to the information actually observed and when uncertainty is reported transparently rather than hidden behind a binary significant/non-significant decision. That principle will guide the chapters that follow.
---
### Self-Assessment Quiz
Test your understanding of the key concepts from Chapter 1.
```{r}
#| echo: false
#| results: asis
quiz_helpers_path <- normalizePath(file.path(dirname(knitr::current_input(dir = TRUE)), "..", "R", "quiz_helpers.R"), mustWork = FALSE)
if (file.exists(quiz_helpers_path)) {
source(quiz_helpers_path)
} else {
cat("Quiz helper file not found at:", quiz_helpers_path, "\nPlease ensure 'quiz_helpers.R' exists in the R directory.\n")
}
# Render the quiz only if the helper function was successfully sourced.
if (exists("smallsamplelab_render_quiz")) {
  smallsamplelab_render_quiz(list(
list(
prompt = "A study with n = 12 per group has 25% power to detect d = 0.5. What does this mean?",
options = c("There is a 25% chance the treatment is effective", "If the true effect is d = 0.5, there is a 25% probability of detecting it (p < 0.05)", "The Type I error rate is 25%", "25% of participants will show the effect"),
answer = 2L,
explanation = "Statistical power is the probability of correctly rejecting a false null hypothesis when a specific effect size exists. With 25% power, there is a 75% chance of a Type II error (failing to detect a real effect of d = 0.5). This concept is directly illustrated in the power curve figure in this chapter, which shows how power declines as sample size decreases."
),
list(
prompt = "Why might large-sample approximations fail with n = 15?",
options = c("Computers cannot process small datasets", "Sampling distributions may not be approximately normal", "Effect sizes cannot be calculated", "P-values are always incorrect"),
answer = 2L,
explanation = "The Central Limit Theorem requires sufficient sample size for sampling distributions to approximate normality. With n = 15, especially if data are skewed or have outliers, parametric test assumptions may be violated. This is why the chapter emphasizes that \"large-sample approximations can fail when n is small, leading to inaccurate p-values and confidence intervals.\""
),
list(
prompt = "Which research question is MOST appropriate for n = 20?",
options = c("\"What are all factors that predict customer loyalty?\" (testing 15 predictors)", "\"Is there a difference in satisfaction between two service approaches?\"", "\"How do age, gender, income, education, and occupation interact to predict outcomes?\"", "\"Can we build a machine learning model to predict customer behavior?\""),
answer = 2L,
explanation = "Focused comparisons are feasible with small samples, whereas broad multivariate prediction questions are not. Complex questions like A, C, and D require many more observations to estimate parameters stably and avoid overfitting."
),
list(
prompt = "A pilot study with n = 8 finds a mean difference of 5 points (95% CI: [-2, 12], p = 0.14). The correct interpretation is:",
options = c("There is no effect", "The effect is exactly 5 points", "The study is underpowered; effects from -2 to 12 points are plausible", "The null hypothesis is proven true"),
answer = 3L,
explanation = "The wide confidence interval reflects substantial uncertainty. The study cannot rule out small negative effects (-2) or large positive effects (12). Non-significance with small n indicates insufficient evidence, not absence of effect. This aligns with the chapter's emphasis on focusing \"on estimation rather than binary significance testing when n is limited.\""
),
list(
prompt = "Which outcome measure provides the MOST statistical information per observation?",
options = c("Binary (pass/fail)", "Ordinal (grade A-F)", "Continuous (test score 0-100)", "All provide equal information"),
answer = 3L,
explanation = "Continuous measures preserve all variation in the data. Dichotomising or coarsening into categories discards information, reduces statistical power, and limits the ability to detect effects. This is why the chapter notes that, when n is limited, researchers should select outcome measures carefully."
),
list(
prompt = "With n = 10 per group, which statement about power is TRUE?",
options = c("Power is always 50% regardless of effect size", "Power increases as the true effect size increases", "Power is unrelated to sample size", "Power cannot be calculated for small samples"),
answer = 2L,
explanation = "Statistical power increases with larger effect sizes, larger sample sizes, and lower variance. Even with n = 10, a very large effect (d = 1.5) might have adequate power, while a small effect (d = 0.2) would not. This is demonstrated in the power curve plot showing different effect sizes (d = 0.3, 0.5, 0.8)."
),
list(
prompt = "A study finds p = 0.048 with n = 8 per group. Which concern is MOST valid?",
options = c("The result is definitely a false positive", "With small n, results near the significance threshold should be interpreted cautiously", "Small samples always produce spurious results", "The p-value is meaningless with n < 30"),
answer = 2L,
explanation = "P-values near cutoffs (0.05) are highly variable with small samples. A slight change in data or analysis could flip the result. Emphasis should be on effect size magnitude and confidence intervals, not borderline p-values. The chapter warns that \"small samples amplify the impact of outliers and violations of distributional assumptions.\""
),
list(
prompt = "When is a small sample (n < 30) potentially sufficient?",
options = c("Never—all research requires n >= 100", "When the effect is very large and variance is low", "Only for qualitative research", "When using machine learning methods"),
answer = 2L,
explanation = "Small samples can still be informative when effects are large and variability is low. The chapter argues that some important questions can only be addressed with small datasets, provided the method and interpretation are matched to the limited information available."
),
list(
prompt = "Which is a legitimate reason for small sample size?",
options = c("The researcher is lazy", "The population is rare (e.g., a genetic disorder affecting 1 in 100,000)", "The researcher wants to save time", "Small samples are always preferable"),
answer = 2L,
explanation = "Rare populations, pilot studies, ethical constraints (minimizing burden on vulnerable groups), and resource limitations in SIDS contexts are all legitimate reasons for small samples. The chapter explicitly mentions \"clinical studies of rare diseases\" as one context where \"small samples are the norm rather than the exception.\""
),
list(
prompt = "A researcher states: \"My study has n = 15, so I'll just use nonparametric tests.\" What is the problem with this reasoning?",
options = c("Nonparametric tests require n >= 30", "The choice of test should depend on the data characteristics and research question, not just sample size", "Nonparametric tests are always inferior", "Parametric tests always work regardless of assumptions"),
answer = 2L,
explanation = "Test selection should consider outcome type, distributional properties, and the research question. Small n is one consideration, but not the sole criterion. As the chapter concludes, \"The choice of method should match the research question, outcome type, and available sample size.\""
)
  ))
}
```