Chapter 10: Exact Tests and Resampling Methods

Small-sample analysis often requires reference distributions that do not depend on large-sample theory.

Exact tests and resampling methods become especially useful when sample sizes are modest and large-sample approximations carry meaningful risk of inaccuracy. This chapter explains when to use exact tests for discrete outcomes, when permutation tests provide a cleaner reference distribution than a parametric model, and when bootstrap resampling is useful for interval estimation. The emphasis is practical: matching the method to the design, the outcome type, and the inferential goal.

Learning Objectives

By the end of this chapter, you will be able to explain when exact tests are preferable to large-sample approximations, distinguish conditional exact tests from unconditional and mid-p alternatives, implement exact binomial, exact Poisson, permutation, and bootstrap procedures in R, and report resampling analyses with enough detail for readers to reproduce the statistic, number of resamples, random seed, and inferential target.

When to Use Exact and Resampling Methods

Exact tests calculate p-values directly from the null distribution of the statistic. They are especially useful when the outcome is discrete, when expected cell counts are small, or when the sample is too limited for asymptotic results to be reliable.

Resampling methods use the observed data to approximate a sampling distribution. Permutation tests reassign labels under the null hypothesis to create a reference distribution for a test statistic. Bootstrap methods resample with replacement from the observed data to approximate the variability of an estimator and to construct confidence intervals.

In practice, exact tests are most natural for sparse binary or count data. Permutation tests are useful when exchangeability under the null is plausible. Bootstrap methods are especially helpful when the statistic of interest lacks a simple closed-form standard error.

For any resampling analysis, report the statistic being resampled and the number of permutations or bootstrap resamples. Also report the random seed and whether the result is exact or Monte Carlo. Those details determine the stability and reproducibility of the reported inference.

Fisher’s Exact Test for 2×2 Tables

Fisher’s exact test is the standard conditional test for association in a 2×2 contingency table when some expected cell counts are small. It conditions on the observed margins and calculates the probability of the observed table, and all equally or more extreme tables, under the null hypothesis of no association.

Because the sample space is discrete, Fisher’s test can be conservative: its p-value is guaranteed to control the Type I error rate, but that guarantee can come at the cost of reduced power. That trade-off is often acceptable in confirmatory work, especially when one of the margins is fixed by design.

Example: Fisher’s Exact Test

Suppose we are evaluating a new training intervention. Of 10 employees who received training, 8 met their performance target; of 10 who did not receive training, 3 met the target. We test whether training is associated with meeting the target.

Table 10.1

Training outcomes by group

Training group Met target Did not meet target
Training 8 2
No training 3 7

Note. The smallest expected cell count under independence is 4.5, so exact inference is preferable to the chi-square approximation.

Fisher’s exact test gives an odds ratio of 8.15, with a 95% confidence interval from 0.88 to 127.06, and an exact two-sided p-value of 0.070. The point estimate is large, but the interval includes 1 and the p-value sits above 0.05, so the evidence is suggestive rather than conclusive; a larger study would be well placed to settle the question.
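These numbers come directly from base R's fisher.test(); a minimal sketch using the counts in Table 10.1:

```r
# 2x2 table from Table 10.1 (rows: training / no training,
# columns: met target / did not meet target)
tab <- matrix(c(8, 2, 3, 7), nrow = 2, byrow = TRUE)

res <- fisher.test(tab)
round(res$p.value, 3)  # exact two-sided p-value, 0.070 to three decimals
res$estimate           # conditional MLE odds ratio, about 8.15
res$conf.int           # 95% confidence interval for the odds ratio
```

Note that fisher.test() reports the conditional maximum likelihood estimate of the odds ratio, which differs slightly from the simple cross-product ratio of the cell counts.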

Visualising the 2×2 Table

A mosaic-style plot helps readers see the same 2×2 structure graphically. The two training groups have equal width because each contains 10 employees, and the vertical split within each block shows the proportion who met the target. In this example, the training group has a visibly larger share of employees meeting the target.

Figure 10.1: Mosaic-style plot of training status and target attainment. Block width reflects group size; block height within each group reflects the proportion meeting the target.
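A plot in this style can be produced with base R's mosaicplot(); a minimal sketch (the title and labels are illustrative):

```r
# Table 10.1 as a labelled matrix; rows are the training groups
tab <- matrix(c(8, 2, 3, 7), nrow = 2, byrow = TRUE,
              dimnames = list(Group   = c("Training", "No training"),
                              Outcome = c("Met target", "Did not meet")))

# Block widths reflect group sizes; the split within each block shows
# the proportion meeting the target, mirroring Figure 10.1
mosaicplot(tab, main = "Training status and target attainment",
           color = c("grey70", "grey90"))
```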

When Fisher’s Exact Test Is Conservative

Fisher’s exact test conditions on both row and column margins of the 2×2 table. In prospective studies, only the group sizes are fixed by design; conditioning on the observed outcome margin as well can make the test conservative. In case-control studies, both margins may align more closely with the sampling plan, though Fisher’s conditioning remains stricter than the design itself.

  • Prospective study or trial: group sizes are fixed by design, so one margin matches the design directly. Fisher’s conditioning additionally fixes the observed outcome margin, which is random, and that extra conditioning can make the test conservative.
  • Case-control study: case and control totals are fixed by design, so the conditioning aligns more closely with the sampling plan, though Fisher’s exact conditioning is still stricter than the design itself.
  • Cross-sectional or convenience sample: neither margin is fixed by design, so Fisher’s conditioning departs furthest from the sampling plan and the resulting conservatism tends to be most pronounced.

The practical implication is straightforward: a Fisher p-value can stay above 0.05 even when a less conservative exact method gives stronger evidence. Fisher’s test remains valid, but the conditioning scheme affects how much power is available.

Unconditional Exact Tests

The unconditional approach, developed by Barnard, avoids conditioning on the outcome margin: modern implementations offer unconditional exact tests that condition only on the group sizes. These tests are often more powerful than Fisher’s exact test when the margins are not fixed by design.

In R, the exact2x2 package provides unconditional exact procedures through uncondExact2x2() (Fay 2010). Install it with install.packages("exact2x2") if it is not already available. The score version shown below is Barnard-style in spirit and compares the observed proportions while conditioning only on the group sizes.
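A sketch of the call, assuming exact2x2 is installed; the argument names follow the package's documented interface, but check ?uncondExact2x2 for the version you have, including its sign convention for the risk difference:

```r
library(exact2x2)  # install.packages("exact2x2") if needed

# Unconditional exact score test for Table 10.1:
# x1/n1 = training group, x2/n2 = no-training group
uncond <- uncondExact2x2(x1 = 8, n1 = 10, x2 = 3, n2 = 10,
                         parmtype = "difference", method = "score",
                         conf.int = TRUE)
uncond$p.value   # two-sided p-value, about 0.041
uncond$conf.int  # 95% CI for the risk difference (check sign convention)
```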

Table 10.2

Comparison of exact p-values for the training example

Method Conditioning Two-sided p-value
Fisher's exact test Conditions on both margins 0.070
Fisher's exact test with mid-p correction Conditions on both margins, then applies a mid-p adjustment 0.038
Unconditional exact score test Conditions on group sizes only 0.041

Note. The unconditional exact score test estimates a risk difference of 0.50, with a 95% confidence interval from 0.02 to 0.83.

For this table, Fisher’s exact p-value is 0.070, whereas the mid-p correction gives 0.038 and the unconditional exact score test gives 0.041. The unconditional method is less conservative here because it does not condition on both margins. Reported as training minus no training, the estimated risk difference is 0.50, with a 95% confidence interval from 0.02 to 0.83.

Mid-p Corrections

Mid-p corrections reduce conservatism by subtracting half the probability of the observed table from the tail area. This yields a test that does not guarantee Type I error control at the nominal level in every configuration. Report mid-p results as a sensitivity analysis alongside the standard Fisher test rather than as a replacement default.
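The adjustment is easy to verify by hand from the hypergeometric null of Table 10.1. The sketch below uses the central two-sided convention (double the smaller mid-corrected tail), which reproduces the 0.038 reported in Table 10.2:

```r
# Hypergeometric null for Table 10.1: 11 "met target" out of 20 people,
# 10 of whom are in the training group; observed 8 successes in training
p_obs  <- dhyper(8, m = 11, n = 9, k = 10)          # P(exactly the observed table)
p_tail <- sum(dhyper(8:10, m = 11, n = 9, k = 10))  # P(8 or more successes)

# Mid-p: count only half the observed table's probability in the tail,
# then double the smaller tail for a two-sided value
midp <- 2 * (p_tail - 0.5 * p_obs)
round(midp, 3)  # 0.038
```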

Common Misconception: “Exact” Means “Automatically Best”

Myth: “If a test is exact, it must be the most appropriate or most informative option.”

Reality: Exactness describes how the p-value is computed under the null hypothesis. The appropriate conditioning scheme still depends on the design. In the training example above, Fisher’s exact test gives 0.070, while the mid-p and unconditional exact alternatives give 0.038 and 0.041 respectively.

Lesson:

  1. Exact computation and design appropriateness are separate questions.
  2. Fisher’s exact test is a strong default for confirmatory sparse 2×2 analyses.
  3. Mid-p and unconditional exact tests are useful sensitivity checks when Fisher’s conditioning may be overly restrictive.
  4. With very small samples, the effect estimate and the underlying table can be more informative than a p-value alone.

Exact Binomial Test

The exact binomial test assesses whether the observed number of successes is compatible with a hypothesised success probability. It is appropriate for a single binary outcome when the null benchmark is known in advance or supplied by design.

Example: Exact Binomial Test

A clinic claims that 70% of patients improve with standard care. In a small audit of 15 patients, 13 improved. We test whether the observed proportion is consistent with the clinic’s claim.

Table 10.3

Exact binomial test summary

Observed successes Observed proportion Hypothesised proportion 95% exact CI Exact p-value
13 of 15 0.867 0.70 0.60 to 0.98 0.258

The observed improvement proportion is 0.867, and the exact binomial p-value is 0.258. With such a small audit, the data remain compatible with the clinic’s claimed 70% rate. The 95% exact confidence interval from 0.60 to 0.98 is wide enough to include both the claimed value and substantially higher improvement rates.
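Base R's binom.test() reproduces every number in Table 10.3:

```r
# 13 improvements out of 15 patients, against a claimed rate of 0.70
res <- binom.test(13, 15, p = 0.7)
round(res$p.value, 3)  # exact two-sided p-value: 0.258
res$conf.int           # exact (Clopper-Pearson) 95% CI, about 0.60 to 0.98
13 / 15                # observed proportion, 0.867
```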

Exact Poisson Test

The exact Poisson test is used for count data when the quantity of interest is the number of events in a fixed amount of time, area, or exposure. It tests whether an observed count is consistent with a specified Poisson rate.

Example: Exact Poisson Test

A manufacturing process is expected to produce 3 defects per batch on average. In a random sample of 8 batches, we observe 32 total defects. We test whether the observed rate is consistent with the expected rate of 3 per batch.

Table 10.4

Exact Poisson test summary

Observed count Observed rate Hypothesised rate 95% exact CI Exact p-value
32 defects across 8 batches 4.00 per batch 3.00 per batch 2.74 to 5.65 0.102

The observed defect rate is 4.00 per batch, compared with the hypothesised rate of 3.00. The exact Poisson p-value is 0.102, so the data are broadly consistent with the expected defect rate. The 95% exact interval from 2.74 to 5.65 still includes 3.00.
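Base R's poisson.test() handles the exposure directly through its T argument:

```r
# 32 defects over 8 batches, against a benchmark rate of 3 per batch
res <- poisson.test(32, T = 8, r = 3)
round(res$p.value, 3)  # exact two-sided p-value, about 0.102
res$conf.int           # exact 95% CI for the rate, about 2.74 to 5.65
32 / 8                 # observed rate per batch, 4.00
```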

Permutation Tests

Permutation tests compare the observed statistic with the distribution obtained by rearranging group labels under the null hypothesis. When every possible reallocation is enumerated, the resulting p-value is exact under the null. When only a large random sample of reallocations is used, the result is a Monte Carlo approximation. For two groups of size n1 and n2, the number of reallocations is (n1 + n2)! / (n1! n2!), equivalent to choosing n1 observations from the pooled sample. If this number is computationally feasible, for example below 100,000, enumerate all permutations; otherwise use a large Monte Carlo sample such as B = 10,000 and report both B and the random seed.

The key assumption is exchangeability under the null hypothesis: if the null is true, the joint distribution of the data remains the same under any reassignment of labels. That makes permutation tests especially attractive for small-sample comparisons where a parametric model would be difficult to justify.

Example: Permutation Test for Difference in Means

We compare test scores between two teaching methods with small groups of 8 students each. The outcome is continuous, but with such a small sample it is reasonable to avoid normality assumptions and to test the mean difference by permutation.

Figure 10.2: Exact permutation distribution for the mean difference between teaching methods.

Across all 12,870 possible reallocations, the observed mean difference is 7.00 points and the exact two-sided permutation p-value is 0.0126. Four decimal places are shown because exact enumeration determines the p-value as a count out of 12,870, so the extra precision is meaningful rather than spurious. Because every possible label assignment is included, this result is exact rather than Monte Carlo. The figure makes the logic visible: the observed difference sits in the extreme tail of the permutation distribution.
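The chapter does not list the raw scores, so the sketch below enumerates the same 12,870 reallocations with hypothetical data for two groups of 8; the enumeration logic, not the specific numbers, is the point:

```r
# Hypothetical scores for two teaching methods (8 students each)
method_a <- c(78, 85, 90, 92, 76, 88, 81, 90)
method_b <- c(70, 74, 79, 82, 68, 77, 72, 80)

pooled   <- c(method_a, method_b)
obs_diff <- mean(method_a) - mean(method_b)

# Enumerate all choose(16, 8) = 12,870 ways to assign 8 of the 16
# pooled scores to the first group
idx <- combn(length(pooled), 8)
perm_diffs <- apply(idx, 2, function(i) mean(pooled[i]) - mean(pooled[-i]))

# Exact two-sided p-value: share of reallocations at least as extreme
# as the observed difference
p_exact <- mean(abs(perm_diffs) >= abs(obs_diff))
```

Because the observed assignment is itself one of the enumerated reallocations, p_exact can never be zero, which is a useful sanity check on any permutation implementation.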

Bootstrap Confidence Intervals

Bootstrap resampling constructs confidence intervals by repeatedly drawing samples with replacement from the observed data and recalculating the statistic of interest. The resulting bootstrap distribution approximates the sampling distribution of that statistic.

This chapter uses the nonparametric bootstrap, which resamples directly from the observed data. A parametric bootstrap simulates new datasets from a fitted parametric model, so its validity depends entirely on how well that model represents the data. For small-sample work, the nonparametric bootstrap is often the safer default when a model-free interval is desired.

Example: Bootstrap CI for the Median

We estimate the median recovery time for a small sample of patients and construct a 95% bootstrap confidence interval.

Table 10.5

Bootstrap summary for median recovery time

Statistic Estimate 95% percentile bootstrap CI 95% BCa bootstrap CI Resamples
Median recovery time 16 days 14 to 17 days 14 to 17 days 2,000

The sample median recovery time is 16 days, and both the 95% percentile and BCa bootstrap intervals run from 14 to 17 days. Both summarise uncertainty around the median without requiring a normal-theory standard error formula. Percentile intervals are simple; BCa intervals adjust for bias and acceleration in the bootstrap distribution (Efron and Tibshirani 1993). With n < 30, BCa intervals can be unstable or can fail outright when the statistic has many ties, so report the interval type and inspect the bootstrap distribution rather than treating BCa as automatically superior.
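The interval machinery is available through the boot package, which ships with R. The chapter's raw recovery times are not listed, so the sketch below uses hypothetical data with a median of 16 days:

```r
library(boot)  # recommended package, shipped with R
set.seed(42)   # report the seed so the resamples are reproducible

# Hypothetical recovery times in days (the chapter's raw data are not listed)
recovery <- c(12, 14, 14, 15, 16, 16, 17, 17, 18, 20, 22)

# Statistic function: boot() passes the resample indices as i
med_fun <- function(data, i) median(data[i])

bt <- boot(recovery, statistic = med_fun, R = 2000)
boot.ci(bt, conf = 0.95, type = c("perc", "bca"))
```

Reporting the seed, the number of resamples (R = 2000), and the interval types requested is exactly the disclosure this chapter recommends for any bootstrap analysis.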

Key Takeaways

Exact tests are most valuable when the data are sparse enough that large-sample approximations become questionable. Fisher’s exact test is a strong default for sparse 2×2 tables, but mid-p and unconditional exact procedures can serve as useful sensitivity checks when Fisher’s conditioning is stronger than the design requires. Exact binomial and exact Poisson tests extend the same logic to single proportions and sparse event rates, while permutation tests rely on exchangeability and can be exact when every possible reallocation is enumerated. Bootstrap intervals are especially useful for statistics such as medians that lack simple analytic standard errors, provided the resampling procedure is reported transparently.

Self-Assessment Quiz

Test your understanding of exact tests and resampling methods from Chapter 10.

Question 1. When is Fisher’s exact test especially appropriate?

Explanation.

Fisher’s exact test is designed for sparse 2×2 tables where exact inference is preferred to the chi-square approximation. The chapter introduces it as the standard conditional test when some expected cell counts are small.

Question 2. Why can Fisher’s exact test be conservative in some settings?

Explanation.

The chapter explains that Fisher’s test conditions on both margins. When the design leaves one margin free, that conditioning can make the test more conservative and reduce power.

Question 3. When might Barnard-style unconditional tests or a mid-p correction be considered?

Explanation.

The chapter presents mid-p and unconditional exact procedures as less conservative alternatives or sensitivity checks for sparse 2×2 tables, especially when Fisher’s conditioning may be overly restrictive.

Question 4. What does the exact binomial test evaluate?

Explanation.

The exact binomial test is for a single-sample binary outcome. It compares the observed number of successes with a fixed benchmark proportion without relying on large-sample approximations.

Question 5. What is the exact Poisson test used for in this chapter?

Explanation.

The chapter uses the exact Poisson test for sparse count data when the interest is whether an observed count or event rate differs from a known benchmark.

Question 6. What key assumption makes a permutation test valid under the null hypothesis?

Explanation.

Permutation tests rely on exchangeability under the null hypothesis. If the joint distribution of the data remains the same under any reassignment of labels, the permutation reference distribution is valid.

Question 7. Why is the bootstrap especially useful in small-sample work?

Explanation.

The bootstrap is useful when the statistic of interest does not have a simple analytic standard error or when a model-free interval is preferred. The chapter’s median example illustrates this directly.

Question 8. Which reporting detail is most important to include for resampling procedures?

Explanation.

The chapter states that resampling analyses should report the statistic being resampled, the number of permutations or bootstrap resamples, the random seed, and whether the result is exact or Monte Carlo.