Chapter 11: Nonparametric Rank-Based Methods

Learning Objectives

By the end of this chapter, you will be able to explain why rank-based methods are useful for ordinal, skewed, or outlier-prone outcomes, choose among Mann–Whitney, Wilcoxon signed-rank, Kruskal–Wallis, Friedman, Spearman, and Kendall procedures, interpret rank-based tests as location shifts only when distributional assumptions permit, and report p-values alongside robust effect sizes and uncertainty.

When Rank-Based Methods Help

Rank-based methods replace raw observations with their ranks. That makes them less sensitive to extreme values and less dependent on normality than mean-based methods. The trade-off is that ranks discard some information about magnitude, so a t-test or linear model may be more efficient when assumptions are reasonable and the outcome scale is genuinely interval.

In small-sample work, rank-based tests are most useful for ordinal outcomes, visibly skewed continuous outcomes, and settings where one or two extreme observations would dominate a mean. They are not magic assumption-free substitutes for thinking about the design. Independence, pairing, similar distributional shape, and the substantive meaning of ranks still matter.

Mann–Whitney U Test

The Mann–Whitney U test, also called the Wilcoxon rank-sum test, compares two independent groups by ranking all observations together and evaluating whether one group tends to receive larger ranks. This chapter uses Mann–Whitney U for the two-sample procedure and refers to the Wilcoxon rank-sum statistic only when describing the statistic returned by R. When the two groups have similar distributional shapes, including similar variance and skew, the result can be described as evidence about a median or location shift. If the shapes differ markedly, the safer interpretation is stochastic dominance: a randomly selected observation from one group tends to be larger than a randomly selected observation from the other. Inspect histograms, density plots, or dotplots before choosing the interpretation.

Interpreting Mann–Whitney when shapes differ

Use this decision rule before writing the result. First inspect histograms, dotplots, or empirical cumulative distribution functions. If the two groups have broadly similar shape and spread, a location-shift interpretation is reasonable. If the shapes differ, do not describe the result as a simple median comparison. Report stochastic dominance using a probability-of-superiority measure such as Cliff’s delta, together with medians and IQRs for context.

Figure 11.1 shows why this distinction matters. The two groups have the same median, but the spread group has more observations in both tails. A rank test can be sensitive to this broader distributional difference, so the interpretation should be about relative ranks or stochastic dominance rather than a median shift.

Figure 11.1: Empirical cumulative distributions with identical medians but different shapes.

Table 11.1

Equal medians with different distributional shapes

Group	Median	IQR	Minimum	Maximum
Compact	5	0	3	7
Spread	5	4	1	9

Note. Both groups have median = 5. The spread group has a wider distribution, so a rank-test result would need a stochastic-dominance interpretation rather than a median-difference interpretation.
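As a rough check on Table 11.1, the sketch below recomputes the summaries in base R from the two illustrative score vectors behind Figure 11.1. The object names are chosen here for illustration, and `cliff_delta_sketch()` is a hypothetical helper rather than a packaged function.

```r
# Illustrative scores from the chapter's shape example:
# same median, different spread
compact <- c(3, 4, 5, 5, 5, 5, 5, 6, 7)
spread  <- c(1, 2, 3, 5, 5, 5, 7, 8, 9)

medians <- c(Compact = median(compact), Spread = median(spread))
iqrs    <- c(Compact = IQR(compact), Spread = IQR(spread))

# Hypothetical helper: P(x > y) - P(x < y) over all cross-group pairs
cliff_delta_sketch <- function(x, y) {
  d <- outer(x, y, "-")
  (sum(d > 0) - sum(d < 0)) / length(d)
}
shape_delta <- cliff_delta_sketch(compact, spread)

print(medians)      # both medians are 5
print(iqrs)         # 0 for Compact, 4 for Spread
print(shape_delta)  # 0 in this symmetric example
```

Because both illustrative groups are symmetric around 5, the delta is 0 here: the groups differ in spread, yet neither tends to be larger. A less symmetric pair of shapes could give a nonzero delta even with equal medians, which is why the interpretation has to follow the plotted shapes rather than the medians alone.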

In the wait-time example, Branch A has shorter waits than Branch B. Table 11.2 gives the descriptive context, and Table 11.3 gives the inferential summary.

Table 11.2

Wait-time descriptives by branch

Branch	n	Median	IQR	Minimum	Maximum
A	10	7.5	2.5	5	12
B	12	13.0	2.5	10	16

Note. Wait time is measured in minutes. The distributions should be inspected before interpreting the rank-sum test as a simple median comparison.

Table 11.3

Mann–Whitney test and Cliff's delta for the wait-time example

Test	W statistic	Hodges–Lehmann shift (A - B)	95% CI	p-value	Cliff's delta (A vs B)	Delta 95% CI
Mann–Whitney U	4.5	-5.0 minutes	-7.0 to -3.0	< 0.001	-0.925	-1.00 to -0.73

Note. The negative shift and negative Cliff's delta indicate that Branch A wait times tend to be lower than Branch B wait times. The bootstrap CI reflects effect-size uncertainty in this small sample.

The evidence is strong that the wait-time distributions differ. The estimated location shift is -5.0 minutes for Branch A minus Branch B, so Branch A tends to have shorter waits. Cliff’s delta is -0.925 with a bootstrap 95% CI from -1.00 to -0.73: in 114 of the 120 cross-branch pairs the Branch A wait is the shorter one, in 3 it is the longer one, and 3 pairs are tied. A randomly selected Branch A wait is therefore usually lower than a randomly selected Branch B wait, although the exact magnitude remains sample-dependent.
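To make the link between the W statistic and Cliff's delta concrete, the following sketch counts the cross-branch pairs directly. It uses the wait-time vectors from the chapter's setup code; the intermediate names (`pair_diffs`, `a_higher`, and so on) are introduced here only for illustration.

```r
branch_a_wait <- c(5, 7, 6, 8, 12, 7, 9, 6, 10, 8)
branch_b_wait <- c(10, 14, 11, 13, 15, 12, 16, 11, 14, 13, 15, 12)

# Every Branch A wait minus every Branch B wait (10 x 12 = 120 pairs)
pair_diffs <- outer(branch_a_wait, branch_b_wait, "-")

n_pairs  <- length(pair_diffs)
a_higher <- sum(pair_diffs > 0)   # pairs where A waited longer
a_lower  <- sum(pair_diffs < 0)   # pairs where A waited less
tied     <- sum(pair_diffs == 0)

# Cliff's delta: difference between the two proportions
delta <- (a_higher - a_lower) / n_pairs

# The W statistic that wilcox.test() reports for the first sample
# counts wins for A plus half of the ties
w_stat <- a_higher + 0.5 * tied

c(a_higher = a_higher, a_lower = a_lower, tied = tied,
  delta = delta, W = w_stat)
```

Counting pairs this way reproduces W = 4.5 and delta = -0.925 from Table 11.3, and it shows that both numbers summarise the same set of pairwise comparisons.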

Effect Sizes Can Be Unstable in Tiny Samples

Large effect estimates in small samples can arise from real separation, but they can also arise from ordinary sampling variation. Table 11.4 illustrates the second possibility with two groups of five drawn from the same normal distribution: the true effect is zero, yet the observed Cohen’s d is 0.24. An observed d carries meaning only when read alongside the sample size, p-value, confidence interval, and substantive plausibility.

Table 11.4

A nonzero effect estimate from two identical populations

Quantity	Value
Group A mean	51.9
Group B mean	49.6
Observed Cohen's d	0.24
Welch t-test p-value	0.719

Note. Both groups were generated from the same population with mean 50 and standard deviation 10. The example shows why effect sizes from n = 5 per group should be treated as provisional.
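A single draw cannot show how widely Cohen's d varies at this sample size. The sketch below repeats the Table 11.4 setup many times under the same assumptions (two groups of five from a normal population with mean 50 and standard deviation 10). The seed, the number of simulations, and the 0.8 cut-off for a conventionally "large" d are choices made here for illustration.

```r
set.seed(11)          # arbitrary seed, used only for reproducibility
n_per_group <- 5
n_sims <- 5000

# Cohen's d with the equal-n pooled SD, matching the chapter's setup code
cohens_d <- function(a, b) {
  (mean(a) - mean(b)) / sqrt((var(a) + var(b)) / 2)
}

# Both groups come from the same population, so the true effect is zero
null_d <- replicate(
  n_sims,
  cohens_d(rnorm(n_per_group, 50, 10), rnorm(n_per_group, 50, 10))
)

# Spread of d under the null, and the share of draws that look "large"
quantile(null_d, c(0.025, 0.5, 0.975))
prop_large <- mean(abs(null_d) >= 0.8)
prop_large
```

With only five observations per group, a sizeable share of null samples typically crosses the conventional threshold for a large effect. That is the reason to treat such estimates as provisional.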

Wilcoxon Signed-Rank Test

The Wilcoxon signed-rank test is the paired-sample counterpart to the Mann–Whitney test. It ranks the absolute paired differences and tests whether positive and negative ranks balance around zero. The usual location-shift interpretation assumes that the distribution of paired differences is roughly symmetric. The pseudomedian estimate equals the median paired difference only under that symmetry. With skewed paired differences, report the pseudomedian and confidence interval without calling it the median.

The pseudomedian is the median of all Walsh averages: each paired difference is averaged with itself and with every other paired difference, giving $n(n + 1)/2$ values. In this example, 12 paired differences produce 78 Walsh averages. That definition explains why the signed-rank estimate can differ from the ordinary sample median when the paired-difference distribution is skewed.
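The Walsh-average definition can be checked by hand. The sketch below builds the averages for the paired anxiety scores used in this chapter and takes their median. It is a direct computation of the sample pseudomedian, so it may differ slightly from the estimate that `wilcox.test()` reports when `exact = FALSE`.

```r
before <- c(65, 70, 68, 72, 75, 69, 71, 68, 74, 70, 73, 67)
after  <- c(60, 65, 64, 68, 70, 63, 66, 62, 69, 65, 68, 62)
d <- before - after               # positive values indicate improvement

# All pairwise averages (d_i + d_j) / 2, keeping i <= j only
walsh <- outer(d, d, "+") / 2
walsh_values <- walsh[upper.tri(walsh, diag = TRUE)]

n_walsh <- length(walsh_values)   # n(n + 1) / 2
pseudomedian <- median(walsh_values)

# Signed-rank statistic: sum of the ranks of |d| for positive differences
v_stat <- sum(rank(abs(d))[d > 0])

c(n_walsh = n_walsh, pseudomedian = pseudomedian, V = v_stat)
```

For these data the 12 differences give 78 Walsh averages with a median of 5. Because every difference is positive, V also reaches its maximum of 78.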

In the intervention example, anxiety scores decline after treatment. Table 11.5 shows the paired summary.

Table 11.5

Wilcoxon signed-rank summary for paired anxiety scores

Median before	Median after	Median improvement	V statistic	Pseudomedian shift	95% CI	p-value
70	65	5	78	5.0	4.5 to 5.5	0.002

Note. Differences are coded as before minus after, so positive values indicate improvement.

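The signed-rank test gives V = 78 and p = 0.002. V is at its maximum possible value, 12 × 13 / 2 = 78, because all 12 paired differences are positive. The estimated pseudomedian improvement is about 5.0 points, with a 95% CI from 4.5 to 5.5 that excludes zero.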

Kruskal–Wallis and Friedman Tests

Kruskal–Wallis extends the rank-sum idea to three or more independent groups. A significant result says that at least one group distribution differs, but it does not identify the pair responsible. Follow-up pairwise comparisons require a multiplicity adjustment.

The workflow is the same each time: rank all observations across groups, compute the omnibus Kruskal–Wallis statistic, estimate an effect size such as epsilon-squared, and then run adjusted pairwise comparisons only if the omnibus result is worth following up. In the ward-satisfaction example, the mean ranks show the direction of the pattern before the test is interpreted.

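# Assumes satisfaction_data (columns: score, ward) exists and dplyr is attached.
# Mean rank per ward shows the direction of the pattern.
satisfaction_data %>%
  mutate(rank = rank(score, ties.method = "average")) %>%
  group_by(ward) %>%
  summarise(mean_rank = mean(rank), .groups = "drop")

# Omnibus test across all wards.
kruskal.test(score ~ ward, data = satisfaction_data)

# Holm-adjusted pairwise follow-up (Dunn test).
rstatix::dunn_test(
  satisfaction_data,
  score ~ ward,
  p.adjust.method = "holm"
)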

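The effect size in Table 11.6 is computed as $(H - k + 1)/(n - k)$, where $H$ is the Kruskal–Wallis statistic, $k$ is the number of groups, and $n$ is the total sample size (Tomczak and Tomczak 2014); here $(12.25 - 3 + 1)/(18 - 3) \approx 0.68$. This chapter labels the quantity epsilon-squared, but that source presents the same formula as eta-squared based on $H$ and defines epsilon-squared separately as $H(n + 1)/(n^2 - 1)$, so state the formula used whenever the value is reported. It is a descriptive measure of how strongly ranks differ across groups, not a replacement for the design context.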
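The $(H - k + 1)/(n - k)$ calculation can be reproduced in a few lines of base R. The sketch below uses the ward-satisfaction scores from the chapter's setup code; the object names are illustrative.

```r
score <- c(7, 8, 6, 7, 9, 8,     # Red
           5, 6, 7, 5, 6, 5,     # Blue
           8, 9, 8, 9, 10, 9)    # Green
ward <- factor(rep(c("Red", "Blue", "Green"), each = 6))

# Mean rank per ward, with ties given average ranks
mean_ranks <- tapply(rank(score), ward, mean)

kw <- kruskal.test(score ~ ward)
h <- unname(kw$statistic)   # tie-corrected Kruskal-Wallis H
k <- nlevels(ward)          # number of groups
n <- length(score)          # total sample size

# Effect size from the formula in the text
effect_h <- (h - k + 1) / (n - k)

round(mean_ranks, 2)
round(c(H = h, effect = effect_h), 3)
```

The mean ranks order the wards as Blue, then Red, then Green, and the effect size rounds to the 0.68 shown in Table 11.6.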

Table 11.6

Kruskal–Wallis and adjusted pairwise comparisons

Result	Statistic	df	p-value	Effect
Kruskal–Wallis	12.25	2	0.002	0.68
Blue vs Green			0.001
Blue vs Red			0.102
Green vs Red			0.124

Note. Statistic is the Kruskal–Wallis H, and Effect is $(H - k + 1)/(n - k)$. Pairwise rows report Holm-adjusted Dunn test p-values.

Friedman’s test handles three or more related conditions. It ranks conditions within each participant rather than ranking all observations together. Kendall’s W rescales the Friedman statistic to an effect-size measure from 0 to 1, where larger values indicate stronger separation among the repeated conditions. If the omnibus test is followed up, paired rank comparisons should again be adjusted for multiplicity. For the task-condition example, Table 11.7 reports a large within-person rank effect, and Table 11.8 gives the adjusted follow-up comparisons.

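# Omnibus test: conditions are ranked within each participant (the blocking factor).
friedman.test(score ~ condition | participant, data = performance_long)

# Holm-adjusted paired follow-ups. Pairing is by position, so rows must be
# in the same participant order within every condition.
pairwise.wilcox.test(
  performance_long$score,
  performance_long$condition,
  paired = TRUE,
  p.adjust.method = "holm",
  exact = FALSE
)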

Table 11.7

Friedman test summary for repeated task conditions

Test	Chi-square	df	p-value	Kendall's W
Friedman	12.07	2	0.002	0.75

Note. Kendall's W is an effect-size measure for agreement or separation among repeated-measure ranks.
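The relationship between the Friedman statistic and Kendall's W can be sketched in base R. The code below uses the task-condition scores from the chapter's setup code in wide (matrix) form, where rows are participants and columns are conditions. The rescaling $W = \chi^2 / (n(k - 1))$ is the one used for Table 11.7.

```r
scores <- cbind(
  condition_1 = c(12, 14, 13, 15, 14, 13, 16, 14),
  condition_2 = c(14, 16, 15, 17, 16, 15, 18, 15),
  condition_3 = c(13, 15, 16, 16, 15, 15, 16, 16)
)

# Rank the three conditions within each participant (ties get average ranks)
within_ranks <- t(apply(scores, 1, rank))
rank_sums <- colSums(within_ranks)

# friedman.test() accepts a matrix: rows are blocks, columns are treatments
fr <- friedman.test(scores)
chi_sq <- unname(fr$statistic)

n_subjects   <- nrow(scores)
k_conditions <- ncol(scores)
kendall_w <- chi_sq / (n_subjects * (k_conditions - 1))

rank_sums
round(c(chi_sq = chi_sq, W = kendall_w), 3)
```

The within-participant rank sums (8.5, 21.5, and 18) show that condition 1 is ranked lowest by almost everyone, which is what drives W up to about 0.75.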

Table 11.8

Adjusted paired comparisons after the Friedman test

Comparison	Holm-adjusted p
condition_2 vs condition_1	0.025
condition_3 vs condition_1	0.040
condition_3 vs condition_2	0.240

Note. Pairwise rows report Holm-adjusted paired Wilcoxon p-values. These comparisons are descriptive follow-ups to the omnibus Friedman result.

Rank Correlations

Spearman’s rho and Kendall’s tau measure monotonic association without requiring a linear relationship or normally distributed variables. Spearman’s rho is the Pearson correlation of ranks. Kendall’s tau is based on concordant and discordant pairs, so it is often easier to interpret when tied ranks are common.

When there are no tied ranks, cor.test() can compute exact small-sample p-values for Spearman or Kendall by setting exact = TRUE. When ties are present, R uses approximate p-values. If exact inference is important with ties, use a permutation procedure and report it explicitly.

# x and y are placeholders for two numeric vectors of equal length with no ties.
cor.test(x, y, method = "spearman", exact = TRUE)
cor.test(x, y, method = "kendall", exact = TRUE)
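When ties rule out the exact calculation, a permutation procedure is one option. The sketch below is a Monte Carlo permutation test for Spearman's rho on the experience and satisfaction data from the chapter's setup code. It samples random permutations rather than enumerating all of them, so the p-value is an approximation whose resolution depends on `n_perm`. The seed and the number of permutations are arbitrary choices made here.

```r
experience   <- c(2, 5, 3, 8, 6, 4, 10, 7, 9, 3, 5, 6, 8, 4, 7)
satisfaction <- c(5, 7, 6, 8, 7, 6, 9, 8, 9, 5, 6, 7, 8, 6, 7)

set.seed(11)      # arbitrary seed, used only for reproducibility
n_perm <- 5000

# Observed rank correlation (ties receive average ranks)
rho_obs <- cor(experience, satisfaction, method = "spearman")

# Shuffling one variable breaks any association while keeping both margins
rho_perm <- replicate(
  n_perm,
  cor(experience, sample(satisfaction), method = "spearman")
)

# Two-sided p-value; the +1 terms count the observed arrangement itself
p_perm <- (1 + sum(abs(rho_perm) >= abs(rho_obs))) / (n_perm + 1)

c(rho = rho_obs, p_perm = p_perm)
```

If this approach is used, report the number of permutations along with the p-value, since the smallest attainable p-value is 1 / (n_perm + 1).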

Table 11.9

Rank-correlation summaries for experience and satisfaction

Statistic	Estimate	p-value	Interpretation
Spearman's rho	0.962	< 0.001	Strong monotonic association
Kendall's tau	0.911	< 0.001	Strong concordance between ranks

Note. Both tests use approximate p-values because the data contain ties. If exact small-sample p-values matter, use a permutation procedure or specialised software and report the method.

Lab Practical 11.1: Sales Performance Analysis

A retail company piloted two training programmes across 10 stores each and collected customer satisfaction scores on a 1 to 100 scale. The question is whether Training B produces higher satisfaction than Training A. Figure 11.2 shows nearly complete separation between the programmes, and Table 11.10 reports the descriptive and inferential summaries.

Figure 11.2: Customer satisfaction scores by training programme.

Table 11.10

Sales-training descriptives and rank-sum test

Measure	Value	Details
Programme A	n = 10; median = 71.5; IQR = 3.8	Range 68 to 76
Programme B	n = 10; median = 80.5; IQR = 3.8	Range 77 to 85
Mann–Whitney U	W = 0; p < 0.001	Two-sided rank-sum test
Hodges–Lehmann shift (A - B)	-9 points	95% CI: -12 to -6
Cliff's delta (A vs B)	-1.00	Bootstrap 95% CI: -1.00 to -1.00

Note. The negative shift and Cliff's delta occur because the comparison is coded as A minus B. Substantively, Training B is higher. The bootstrap delta interval is degenerate here because all observed Training B scores exceed all observed Training A scores.

The separation is as strong as the sample allows: W = 0, p < 0.001, and Cliff’s delta is -1.00. The bootstrap 95% CI collapses to -1.00 to -1.00 because every resample preserves the complete separation, so the interval should not be read as a sign that the effect is known without uncertainty. Because all Training B scores exceed all Training A scores, this is complete stochastic separation in the sample. The reporting language should still acknowledge the pilot design rather than claiming guaranteed future superiority.
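The degenerate bootstrap interval follows directly from the data, as the sketch below shows for the training scores in the chapter's setup code. The median of all pairwise differences is used here as the sample Hodges–Lehmann estimate. `wilcox.test()` with `exact = FALSE` obtains its estimate numerically, so its value can differ slightly before rounding.

```r
training_a <- c(72, 68, 75, 70, 74, 69, 73, 71, 76, 70)
training_b <- c(78, 82, 79, 84, 81, 77, 83, 80, 85, 79)

# Every A score minus every B score (10 x 10 = 100 pairs)
pair_diffs <- outer(training_a, training_b, "-")

# Complete separation: the highest A score is below the lowest B score
complete_separation <- max(training_a) < min(training_b)

delta    <- (sum(pair_diffs > 0) - sum(pair_diffs < 0)) / length(pair_diffs)
hl_shift <- median(pair_diffs)    # sample Hodges-Lehmann shift, A minus B

# Resampling within each group can never produce an A score above a B score,
# so every bootstrap delta equals -1
set.seed(2025)
boot_delta <- replicate(2000, {
  a <- sample(training_a, replace = TRUE)
  b <- sample(training_b, replace = TRUE)
  d <- outer(a, b, "-")
  (sum(d > 0) - sum(d < 0)) / length(d)
})

c(delta = delta, hl_shift = hl_shift)
range(boot_delta)
```

The shift of -9 points and the delta of -1.00 match Table 11.10. The bootstrap range confirms that the percentile interval carries no information beyond the observed separation.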

Choosing Among Rank-Based Methods

The methods in this chapter differ by design, not by which one seems most familiar. Start from the study structure: independent groups, paired observations, repeated conditions, or association between two ordered variables. Then report an effect size that matches the design rather than relying on the p-value alone.

Table 11.11

Rank-based method selection guide

Question	Test	Effect size	R function
Two independent groups	Mann–Whitney U	Cliff's delta or rank-biserial correlation	wilcox.test(); rstatix::wilcox_effsize()
Two paired measurements	Wilcoxon signed-rank	Rank-biserial correlation; pseudomedian shift with CI	wilcox.test(paired = TRUE)
Three or more independent groups	Kruskal–Wallis	Epsilon-squared plus adjusted pairwise contrasts	kruskal.test(); rstatix::dunn_test()
Three or more repeated conditions	Friedman	Kendall's W plus adjusted paired contrasts	friedman.test(); pairwise.wilcox.test(paired = TRUE)
Monotonic association	Spearman or Kendall rank correlation	rho or tau with exact or permutation p-value when feasible	cor.test(method = "spearman" or "kendall")

Note. Use visual checks and design knowledge before choosing the interpretation. A rank test is not automatically a median test when distributional shapes differ.

Reporting Rank-Based Results

A rank-based result should not be reported as a p-value alone. The reader needs to know the design, the outcome scale, the group summaries, the test statistic, the effect size, and the interpretation chosen. For independent groups, that usually means reporting medians and IQRs by group, the Mann–Whitney U result, the p-value, and a probability-style effect size such as Cliff’s delta or rank-biserial correlation. For paired designs, report the median paired change, the signed-rank statistic, the p-value, and the Hodges–Lehmann pseudomedian shift with its confidence interval where available. For more than two groups, report the omnibus statistic and effect size, and state whether follow-up comparisons were adjusted for multiplicity.

Use median-shift language only when the distributions have broadly similar shapes. If one group is more variable or more skewed, write the conclusion as stochastic dominance: observations from one group tended to be higher than observations from the other. This distinction is not cosmetic. It tells readers whether the analysis is about a typical location shift or about the ordering of observations across the full distributions.

A concise report might read: “Satisfaction scores were higher in Training B than Training A. Because the group shapes were similar, the rank-sum result was interpreted as a location shift, W = 0, p < 0.001, Hodges–Lehmann shift = -9 points (95% CI: -12 to -6), Cliff’s delta = -1.00. The negative sign reflects the A-minus-B coding. Substantively, every observed Training B score exceeded every observed Training A score.” If the shapes had differed, the same result should be framed as evidence that Training B scores tended to be higher, not as proof of a median difference.

Key Takeaways

Rank-based tests are useful when outcomes are ordinal, skewed, or vulnerable to outliers, but they still require attention to design and interpretation. Mann–Whitney and Wilcoxon signed-rank tests address independent and paired two-sample questions. Kruskal–Wallis and Friedman extend rank comparisons to three or more independent groups and three or more repeated conditions, respectively. Spearman’s rho and Kendall’s tau summarise monotonic association. In small samples, rank-based p-values should be reported with medians, IQRs, robust effect sizes, and enough context to distinguish a defensible pattern from sampling noise. Robust mean-based alternatives are also worth considering when the outcome scale remains meaningfully continuous (Mair and Wilcox 2020).

```{r} #| include: false suppressPackageStartupMessages(library(tidyverse)) suppressPackageStartupMessages(library(knitr)) source(normalizePath(file.path(dirname(knitr::current_input(dir = TRUE)), "..", "R", "chapter_table_helpers.R"), mustWork = TRUE)) format_p <- function(p, digits = 3) { threshold <- 10^(-digits) ifelse( p < threshold, paste0("< ", formatC(threshold, format = "f", digits = digits)), formatC(p, format = "f", digits = digits) ) } format_p_statement <- function(p, digits = 3) { threshold <- 10^(-digits) ifelse( p < threshold, paste0("p < ", formatC(threshold, format = "f", digits = digits)), paste0("p = ", formatC(p, format = "f", digits = digits)) ) } cliff_delta <- function(x, y) { comparisons <- outer(x, y, "-") (sum(comparisons > 0) - sum(comparisons < 0)) / (length(x) * length(y)) } rank_biserial_from_w <- function(w, n_x, n_y) { (2 * w) / (n_x * n_y) - 1 } branch_a_wait <- c(5, 7, 6, 8, 12, 7, 9, 6, 10, 8) branch_b_wait <- c(10, 14, 11, 13, 15, 12, 16, 11, 14, 13, 15, 12) wait_data <- tibble( wait_time = c(branch_a_wait, branch_b_wait), branch = rep(c("A", "B"), c(length(branch_a_wait), length(branch_b_wait))) ) wait_test <- wilcox.test(branch_a_wait, branch_b_wait, conf.int = TRUE, exact = FALSE) wait_delta <- cliff_delta(branch_a_wait, branch_b_wait) set.seed(2025) wait_delta_ci <- quantile( replicate( 2000, cliff_delta(sample(branch_a_wait, replace = TRUE), sample(branch_b_wait, replace = TRUE)) ), probs = c(0.025, 0.975), names = FALSE ) wait_rbc <- rank_biserial_from_w(unname(wait_test$statistic), length(branch_a_wait), length(branch_b_wait)) wait_summary <- wait_data %>% group_by(branch) %>% summarise( n = n(), Median = median(wait_time), IQR = IQR(wait_time), Minimum = min(wait_time), Maximum = max(wait_time), .groups = "drop" ) %>% rename(Branch = branch) wait_result_table <- tibble( Test = "Mann–Whitney U", `W statistic` = sprintf("%.1f", unname(wait_test$statistic)), `Hodges–Lehmann shift (A - B)` = sprintf("%.1f minutes", 
unname(wait_test$estimate)), `95% CI` = sprintf("%.1f to %.1f", wait_test$conf.int[1], wait_test$conf.int[2]), `p-value` = format_p(wait_test$p.value), `Cliff's delta (A vs B)` = sprintf("%.3f", wait_delta), `Delta 95% CI` = sprintf("%.2f to %.2f", wait_delta_ci[1], wait_delta_ci[2]) ) shape_example <- tibble( group = rep(c("Compact", "Spread"), each = 9), score = c(3, 4, 5, 5, 5, 5, 5, 6, 7, 1, 2, 3, 5, 5, 5, 7, 8, 9) ) shape_summary <- shape_example %>% group_by(group) %>% summarise( Median = median(score), IQR = IQR(score), Minimum = min(score), Maximum = max(score), .groups = "drop" ) shape_plot <- ggplot(shape_example, aes(x = score, colour = group)) + stat_ecdf(linewidth = 0.9) + geom_vline(xintercept = 5, linetype = "dashed", colour = "grey45") + scale_colour_manual(values = c(Compact = "#2E7D6E", Spread = "#A65E2E")) + labs( x = "Outcome score", y = "Empirical cumulative proportion", colour = "Group", title = "Equal medians do not guarantee similar distributional shapes" ) + theme_classic(base_size = 12) + theme(legend.position = "top", plot.title = element_text(size = 13)) set.seed(123) group_a <- rnorm(5, mean = 50, sd = 10) group_b <- rnorm(5, mean = 50, sd = 10) pooled_sd <- sqrt((var(group_a) + var(group_b)) / 2) chance_d <- (mean(group_a) - mean(group_b)) / pooled_sd chance_t <- t.test(group_a, group_b) chance_table <- tibble( Quantity = c("Group A mean", "Group B mean", "Observed Cohen's d", "Welch t-test p-value"), Value = c( sprintf("%.1f", mean(group_a)), sprintf("%.1f", mean(group_b)), sprintf("%.2f", chance_d), format_p(chance_t$p.value) ) ) anxiety_before <- c(65, 70, 68, 72, 75, 69, 71, 68, 74, 70, 73, 67) anxiety_after <- c(60, 65, 64, 68, 70, 63, 66, 62, 69, 65, 68, 62) anxiety_data <- tibble( participant = 1:12, before = anxiety_before, after = anxiety_after, difference = before - after ) wilcox_paired <- wilcox.test(anxiety_before, anxiety_after, paired = TRUE, conf.int = TRUE, exact = FALSE) wilcox_table <- tibble( `Median before` = 
median(anxiety_before), `Median after` = median(anxiety_after), `Median improvement` = median(anxiety_data$difference), `V statistic` = unname(wilcox_paired$statistic), `Pseudomedian shift` = sprintf("%.1f", unname(wilcox_paired$estimate)), `95% CI` = sprintf("%.1f to %.1f", wilcox_paired$conf.int[1], wilcox_paired$conf.int[2]), `p-value` = format_p(wilcox_paired$p.value) ) walsh_averages <- outer(anxiety_data$difference, anxiety_data$difference, "+") / 2 walsh_values <- walsh_averages[upper.tri(walsh_averages, diag = TRUE)] walsh_count <- length(walsh_values) ward_red <- c(7, 8, 6, 7, 9, 8) ward_blue <- c(5, 6, 7, 5, 6, 5) ward_green <- c(8, 9, 8, 9, 10, 9) satisfaction_data <- tibble( score = c(ward_red, ward_blue, ward_green), ward = rep(c("Red", "Blue", "Green"), each = 6) ) kw_result <- kruskal.test(score ~ ward, data = satisfaction_data) kw_epsilon <- (unname(kw_result$statistic) - length(unique(satisfaction_data$ward)) + 1) / (nrow(satisfaction_data) - length(unique(satisfaction_data$ward))) kw_rank_summary <- satisfaction_data %>% mutate(rank = rank(score, ties.method = "average")) %>% group_by(ward) %>% summarise( n = n(), `Median score` = median(score), `Mean rank` = mean(rank), .groups = "drop" ) pairwise_kw <- rstatix::dunn_test( satisfaction_data, score ~ ward, p.adjust.method = "holm" ) kw_table <- tibble( Test = "Kruskal–Wallis", `Chi-square` = sprintf("%.2f", unname(kw_result$statistic)), df = unname(kw_result$parameter), `p-value` = format_p(kw_result$p.value), `Epsilon-squared` = sprintf("%.2f", kw_epsilon) ) kw_pairwise_table <- pairwise_kw %>% transmute( Comparison = paste(group1, "vs", group2), `Holm-adjusted p` = format_p(p.adj) ) performance_data <- tibble( participant = 1:8, condition_1 = c(12, 14, 13, 15, 14, 13, 16, 14), condition_2 = c(14, 16, 15, 17, 16, 15, 18, 15), condition_3 = c(13, 15, 16, 16, 15, 15, 16, 16) ) performance_long <- performance_data %>% pivot_longer(starts_with("condition"), names_to = "condition", values_to = 
"score") friedman_result <- friedman.test(score ~ condition | participant, data = performance_long) n_subjects <- n_distinct(performance_long$participant) k_conditions <- n_distinct(performance_long$condition) friedman_w <- unname(friedman_result$statistic) / (n_subjects * (k_conditions - 1)) friedman_table <- tibble( Test = "Friedman", `Chi-square` = sprintf("%.2f", unname(friedman_result$statistic)), df = unname(friedman_result$parameter), `p-value` = format_p(friedman_result$p.value), `Kendall's W` = sprintf("%.2f", friedman_w) ) friedman_pairwise <- pairwise.wilcox.test( performance_long$score, performance_long$condition, paired = TRUE, p.adjust.method = "holm", exact = FALSE ) friedman_pairwise_raw <- as.data.frame(as.table(friedman_pairwise$p.value)) names(friedman_pairwise_raw) <- c("Comparison_1", "Comparison_2", "p_value") friedman_pairwise_table <- friedman_pairwise_raw %>% dplyr::filter(!is.na(.data$p_value)) %>% dplyr::transmute( Comparison = paste(.data$Comparison_1, "vs", .data$Comparison_2), `Holm-adjusted p` = format_p(.data$p_value) ) experience <- c(2, 5, 3, 8, 6, 4, 10, 7, 9, 3, 5, 6, 8, 4, 7) satisfaction <- c(5, 7, 6, 8, 7, 6, 9, 8, 9, 5, 6, 7, 8, 6, 7) spearman_result <- cor.test(experience, satisfaction, method = "spearman", exact = FALSE) kendall_result <- cor.test(experience, satisfaction, method = "kendall", exact = FALSE) correlation_table <- tibble( Statistic = c("Spearman's rho", "Kendall's tau"), Estimate = sprintf("%.3f", c(unname(spearman_result$estimate), unname(kendall_result$estimate))), `p-value` = c(format_p(spearman_result$p.value), format_p(kendall_result$p.value)), Interpretation = c("Strong monotonic association", "Strong concordance between ranks") ) sales_data <- tibble( training = rep(c("A", "B"), each = 10), satisfaction = c(72, 68, 75, 70, 74, 69, 73, 71, 76, 70, 78, 82, 79, 84, 81, 77, 83, 80, 85, 79) ) sales_summary <- sales_data %>% group_by(training) %>% summarise( n = n(), Median = median(satisfaction), IQR = 
IQR(satisfaction), Minimum = min(satisfaction), Maximum = max(satisfaction), .groups = "drop" ) %>% rename(`Training program` = training) sales_test <- wilcox.test(satisfaction ~ training, data = sales_data, exact = FALSE, conf.int = TRUE) sales_p_text <- sprintf( "p %s", ifelse( startsWith(format_p(sales_test$p.value), "<"), format_p(sales_test$p.value), paste0("=", format_p(sales_test$p.value)) ) ) sales_delta <- cliff_delta( sales_data$satisfaction[sales_data$training == "A"], sales_data$satisfaction[sales_data$training == "B"] ) set.seed(2025) sales_delta_ci <- quantile( replicate( 2000, cliff_delta( sample(sales_data$satisfaction[sales_data$training == "A"], replace = TRUE), sample(sales_data$satisfaction[sales_data$training == "B"], replace = TRUE) ) ), probs = c(0.025, 0.975), names = FALSE ) sales_test_table <- tibble( Test = "Mann–Whitney U", `W statistic` = sprintf("%.0f", unname(sales_test$statistic)), `Hodges–Lehmann shift (A - B)` = sprintf("%.0f points", unname(sales_test$estimate)), `95% CI` = sprintf("%.0f to %.0f", sales_test$conf.int[1], sales_test$conf.int[2]), `p-value` = format_p(sales_test$p.value), `Cliff's delta (A vs B)` = sprintf("%.2f", sales_delta), `Delta 95% CI` = sprintf("%.2f to %.2f", sales_delta_ci[1], sales_delta_ci[2]) ) sales_report_table <- bind_rows( sales_summary %>% transmute( Measure = paste("Programme", `Training program`), Value = sprintf("n = %d; median = %.1f; IQR = %.1f", n, Median, IQR), Details = sprintf("Range %d to %d", Minimum, Maximum) ), tibble( Measure = c("Mann–Whitney U", "Hodges–Lehmann shift (A - B)", "Cliff's delta (A vs B)"), Value = c( sprintf("W = %s; %s", sales_test_table$`W statistic`, sales_p_text), sales_test_table$`Hodges–Lehmann shift (A - B)`, sales_test_table$`Cliff's delta (A vs B)` ), Details = c( "Two-sided rank-sum test", paste0("95% CI: ", sales_test_table$`95% CI`), paste0("Bootstrap 95% CI: ", sales_test_table$`Delta 95% CI`) ) ) ) rank_method_table <- tibble( Question = c( "Two 
independent groups", "Two paired measurements", "Three or more independent groups", "Three or more repeated conditions", "Monotonic association" ), Test = c( "Mann–Whitney U", "Wilcoxon signed-rank", "Kruskal–Wallis", "Friedman", "Spearman or Kendall rank correlation" ), `Effect size` = c( "Cliff's delta or rank-biserial correlation", "Rank-biserial correlation; pseudomedian shift with CI", "Epsilon-squared plus adjusted pairwise contrasts", "Kendall's W plus adjusted paired contrasts", "rho or tau with exact or permutation p-value when feasible" ), `R function` = c( "wilcox.test(); rstatix::wilcox_effsize()", "wilcox.test(paired = TRUE)", "kruskal.test(); rstatix::dunn_test()", "friedman.test(); pairwise.wilcox.test(paired = TRUE)", "cor.test(method = \"spearman\" or \"kendall\")" ) ) sales_plot <- ggplot(sales_data, aes(x = training, y = satisfaction, fill = training)) + geom_boxplot(width = 0.45, alpha = 0.55, outlier.shape = NA) + geom_jitter(width = 0.08, height = 0, size = 2, alpha = 0.85) + scale_fill_manual(values = c("A" = "#86A6C8", "B" = "#D6A66E")) + labs( x = "Training programme", y = "Customer satisfaction score", title = "Training B shows consistently higher satisfaction" ) + theme_classic(base_size = 12) + theme(legend.position = "none", plot.title = element_text(size = 13)) ``` # Chapter 11: Nonparametric Rank-Based Methods ### Learning Objectives By the end of this chapter, you will be able to explain why rank-based methods are useful for ordinal, skewed, or outlier-prone outcomes, choose among Mann–Whitney, Wilcoxon signed-rank, Kruskal–Wallis, Friedman, Spearman, and Kendall procedures, interpret rank-based tests as location shifts only when distributional assumptions permit, and report p-values alongside robust effect sizes and uncertainty. ### When Rank-Based Methods Help Rank-based methods replace raw observations with their ranks. That makes them less sensitive to extreme values and less dependent on normality than mean-based methods. 
The trade-off is that ranks discard some information about magnitude, so a t-test or linear model may be more efficient when assumptions are reasonable and the outcome scale is genuinely interval. In small-sample work, rank-based tests are most useful for ordinal outcomes, visibly skewed continuous outcomes, and settings where one or two extreme observations would dominate a mean. They are not magic assumption-free substitutes for thinking about the design. Independence, pairing, similar distributional shape, and the substantive meaning of ranks still matter. ### Mann–Whitney U Test The Mann–Whitney U test, also called the Wilcoxon rank-sum test, compares two independent groups by ranking all observations together and evaluating whether one group tends to receive larger ranks. This chapter uses **Mann–Whitney U** for the two-sample procedure and refers to the Wilcoxon rank-sum statistic only when describing the statistic returned by R. When the two groups have similar distributional shapes, including similar variance and skew, the result can be described as evidence about a median or location shift. If the shapes differ markedly, the safer interpretation is stochastic dominance: a randomly selected observation from one group tends to be larger than a randomly selected observation from the other. Inspect histograms, density plots, or dotplots before choosing the interpretation. ::: {.callout-tip} ## Interpreting Mann–Whitney when shapes differ Use this decision rule before writing the result. First inspect histograms, dotplots, or empirical cumulative distribution functions. If the two groups have broadly similar shape and spread, a location-shift interpretation is reasonable. If the shapes differ, do not describe the result as a simple median comparison. Report stochastic dominance using a probability-of-superiority measure such as Cliff's delta, together with medians and IQRs for context. ::: Figure 11.1 shows why this distinction matters. 
The two groups have the same median, but the spread group has more observations in both tails. A rank test can be sensitive to this broader distributional difference, so the interpretation should be about relative ranks or stochastic dominance rather than a median shift. ```{r} #| label: qfig-ch11-shape-example #| echo: false #| fig-cap: "Figure 11.1: Empirical cumulative distributions with identical medians but different shapes." #| alt: "Empirical cumulative distributions with identical medians but different shapes." #| fig-align: "center" shape_plot ``` ```{r} #| label: tab-ch11-shape-example #| echo: false #| results: asis smallsamplelab_apa_table( "11.1", "Equal medians with different distributional shapes", shape_summary, note = "Both groups have median = 5. The spread group has a wider distribution, so a rank-test result would need a stochastic-dominance interpretation rather than a median-difference interpretation.", align = c("l", "r", "r", "r", "r") ) ``` In the wait-time example, Branch A has shorter waits than Branch B. Table 11.2 gives the descriptive context, and Table 11.3 gives the inferential summary. :::: {.content-visible when-format="html"} ::::: {.panel-tabset group="chapter-11-mann-whitney-descriptives"} #### Rendered Output ```{r} #| label: ch11-wait-descriptives-html #| echo: false #| results: asis smallsamplelab_apa_table( "11.2", "Wait-time descriptives by branch", wait_summary, note = "Wait time is measured in minutes. 
The distributions should be inspected before interpreting the rank-sum test as a simple median comparison.", align = c("l", "r", "r", "r", "r", "r") ) ``` #### Cell Code ```{webr-r} #| context: interactive branch_a_wait <- c(5, 7, 6, 8, 12, 7, 9, 6, 10, 8) branch_b_wait <- c(10, 14, 11, 13, 15, 12, 16, 11, 14, 13, 15, 12) wilcox.test(branch_a_wait, branch_b_wait, conf.int = TRUE, exact = FALSE) ``` ::::: :::: :::: {.content-visible unless-format="html"} ```{r} #| label: ch11-wait-descriptives #| echo: false #| results: asis smallsamplelab_apa_table( "11.2", "Wait-time descriptives by branch", wait_summary, note = "Wait time is measured in minutes. The distributions should be inspected before interpreting the rank-sum test as a simple median comparison.", align = c("l", "r", "r", "r", "r", "r") ) ``` :::: ```{r} #| label: ch11-mann-whitney-summary #| echo: false #| results: asis smallsamplelab_apa_table( "11.3", "Mann–Whitney test and Cliff's delta for the wait-time example", wait_result_table, note = "The negative shift and negative Cliff's delta indicate that Branch A wait times tend to be lower than Branch B wait times. The bootstrap CI reflects effect-size uncertainty in this small sample.", align = c("l", "r", "r", "l", "r", "r", "l") ) ``` The evidence is strong that the wait-time distributions differ. The estimated location shift is `r sprintf("%.1f", unname(wait_test$estimate))` minutes for Branch A minus Branch B, so Branch A tends to have shorter waits. Cliff's delta is `r sprintf("%.3f", wait_delta)` with a bootstrap 95% CI from `r sprintf("%.2f", wait_delta_ci[1])` to `r sprintf("%.2f", wait_delta_ci[2])`, meaning that a randomly selected Branch A wait is usually lower than a randomly selected Branch B wait but the exact magnitude remains sample-dependent. ### Effect Sizes Can Be Unstable in Tiny Samples Large effect estimates in small samples can arise from real separation, but they can also arise from ordinary sampling variation. 
Table 11.4 illustrates this with two groups generated from the same normal distribution. The observed Cohen's d carries meaning only when read alongside the sample size, p-value, confidence interval, and substantive plausibility.

```{r}
#| label: ch11-chance-effect
#| echo: false
#| results: asis
smallsamplelab_apa_table(
  "11.4",
  "A large-looking effect from two identical populations",
  chance_table,
  note = "Both groups were generated from the same population with mean 50 and standard deviation 10. The example shows why effect sizes from n = 5 per group should be treated as provisional.",
  align = c("l", "r")
)
```

### Wilcoxon Signed-Rank Test

The Wilcoxon signed-rank test is the paired-sample counterpart to the Mann–Whitney test. It ranks the absolute paired differences and tests whether positive and negative ranks balance around zero. The usual location-shift interpretation assumes that the distribution of paired differences is roughly symmetric. The pseudomedian estimate equals the median paired difference only under that symmetry. With skewed paired differences, report the pseudomedian and confidence interval without calling it the median.

The pseudomedian is the median of all Walsh averages: each paired difference is averaged with itself and with every other paired difference, giving $n(n + 1)/2$ values. In this example, `r nrow(anxiety_data)` paired differences produce `r walsh_count` Walsh averages. That definition explains why the signed-rank estimate can differ from the ordinary sample median when the paired-difference distribution is skewed.

In the intervention example, anxiety scores decline after treatment. Table 11.5 shows the paired summary.
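
Before turning to the table, the Walsh-average definition can be made concrete with a short base R sketch. The `paired_diff` vector is hypothetical illustration data, not the chapter's `anxiety_data`, and the variable names are chosen only for this example.

```r
# Hypothetical paired differences (before minus after); NOT the chapter's data
paired_diff <- c(4, 6, 2, 9, 5, 3, 12, 7)
n <- length(paired_diff)

# Average every difference with every difference (including itself),
# then keep the upper triangle plus diagonal so each pair is counted once
walsh_matrix <- outer(paired_diff, paired_diff, "+") / 2
walsh_averages <- walsh_matrix[upper.tri(walsh_matrix, diag = TRUE)]

# There should be n(n + 1)/2 Walsh averages
length(walsh_averages) == n * (n + 1) / 2

# Pseudomedian (median of Walsh averages) next to the ordinary sample median
pseudomedian <- median(walsh_averages)
sample_median <- median(paired_diff)
c(pseudomedian = pseudomedian, sample_median = sample_median)
```

When the differences contain no ties and no zeros, as in this toy vector, `wilcox.test(paired_diff, conf.int = TRUE)` should report the same pseudomedian, because the exact method takes the median of the same Walsh averages.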

```{r}
#| label: ch11-wilcoxon-summary
#| echo: false
#| results: asis
smallsamplelab_apa_table(
  "11.5",
  "Wilcoxon signed-rank summary for paired anxiety scores",
  wilcox_table,
  note = "Differences are coded as before minus after, so positive values indicate improvement.",
  align = c("r", "r", "r", "r", "r", "l", "r")
)
```

The signed-rank test gives V = `r unname(wilcox_paired$statistic)` and `r format_p_statement(wilcox_paired$p.value)`. The estimated pseudomedian improvement is about `r sprintf("%.1f", unname(wilcox_paired$estimate))` points, with a confidence interval that excludes zero.

### Kruskal–Wallis and Friedman Tests

Kruskal–Wallis extends the rank-sum idea to three or more independent groups. A significant result says that at least one group distribution differs, but it does not identify the pair responsible. Follow-up pairwise comparisons require a multiplicity adjustment.

The workflow is the same each time: rank all observations across groups, compute the omnibus Kruskal–Wallis statistic, estimate an effect size such as epsilon-squared, and then run adjusted pairwise comparisons only if the omnibus result is worth following up. In the ward-satisfaction example, the mean ranks show the direction of the pattern before the test is interpreted.

```{r}
#| label: ch11-kruskal-step-code
#| eval: false
library(dplyr)  # provides %>%, mutate(), group_by(), and summarise()

# Step 1: rank all observations together and summarise mean rank by ward
satisfaction_data %>%
  mutate(rank = rank(score, ties.method = "average")) %>%
  group_by(ward) %>%
  summarise(mean_rank = mean(rank), .groups = "drop")

# Step 2: omnibus Kruskal–Wallis test
kruskal.test(score ~ ward, data = satisfaction_data)

# Step 3: Holm-adjusted Dunn pairwise comparisons
rstatix::dunn_test(
  satisfaction_data,
  score ~ ward,
  p.adjust.method = "holm"
)
```

The effect size is computed as $(H - k + 1)/(n - k)$, where $H$ is the Kruskal–Wallis statistic, $k$ is the number of groups, and $n$ is the total sample size [@tomczak2014]. Strictly, that source labels this expression the $H$-based eta-squared and defines epsilon-squared as $H/(n - 1)$. The two can differ noticeably in small samples, so the reported label should match the formula actually used. Either way, it is a descriptive measure of how strongly ranks differ across groups, not a replacement for the design context.
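
As a minimal sketch of the effect-size arithmetic, the following base R code applies the expression above to a small made-up data set. The `ward_scores` data frame and all variable names are hypothetical and stand in for the chapter's `satisfaction_data`. For comparison, the sketch also computes $H/(n - 1)$, a closely related rank-based effect size.

```r
# Hypothetical three-ward satisfaction scores; NOT the chapter's data
ward_scores <- data.frame(
  ward = rep(c("A", "B", "C"), each = 5),
  score = c(3, 4, 2, 5, 4, 6, 7, 5, 8, 6, 4, 5, 6, 5, 7)
)

kw <- kruskal.test(score ~ ward, data = ward_scores)

h_stat <- unname(kw$statistic)          # Kruskal-Wallis H
k <- length(unique(ward_scores$ward))   # number of groups
n <- nrow(ward_scores)                  # total sample size

# The expression quoted in the text: (H - k + 1) / (n - k)
effect_from_text <- (h_stat - k + 1) / (n - k)

# H / (n - 1), shown only for comparison
h_over_n_minus_1 <- h_stat / (n - 1)

c(H = h_stat, effect_from_text = effect_from_text, h_over_n_minus_1 = h_over_n_minus_1)
```

Printing both values side by side makes it easy to see how much the choice of formula matters at a given sample size before deciding which one to report.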

```{r}
#| label: ch11-kruskal-summary
#| echo: false
#| results: asis
smallsamplelab_apa_table(
  "11.6",
  "Kruskal–Wallis and adjusted pairwise comparisons",
  bind_rows(
    kw_table %>%
      transmute(Result = Test, Statistic = `Chi-square`, df = as.character(df), `p-value`, Effect = `Epsilon-squared`),
    kw_pairwise_table %>%
      transmute(Result = Comparison, Statistic = "", df = "", `p-value` = `Holm-adjusted p`, Effect = "")
  ),
  note = "Pairwise rows report Holm-adjusted Dunn test p-values.",
  align = c("l", "r", "r", "r", "r")
)
```

Friedman's test handles three or more related conditions. For the task-condition example, Table 11.7 reports a large within-person rank effect. Friedman's test ranks conditions within each participant rather than ranking all observations together. Kendall's W rescales the Friedman statistic to an effect-size measure from 0 to 1, where larger values indicate stronger separation among the repeated conditions. If the omnibus test is followed up, paired rank comparisons should again be adjusted for multiplicity.

```{r}
#| label: ch11-friedman-step-code
#| eval: false
# Omnibus test: conditions are ranked within each participant
friedman.test(score ~ condition | participant, data = performance_long)

# paired = TRUE matches observations by position, so rows must be sorted
# by participant within each condition for the pairs to line up
pairwise.wilcox.test(
  performance_long$score,
  performance_long$condition,
  paired = TRUE,
  p.adjust.method = "holm",
  exact = FALSE
)
```

```{r}
#| label: ch11-friedman-summary
#| echo: false
#| results: asis
smallsamplelab_apa_table(
  "11.7",
  "Friedman test summary for repeated task conditions",
  friedman_table,
  note = "Kendall's W is an effect-size measure for agreement or separation among repeated-measure ranks.",
  align = c("l", "r", "r", "r", "r")
)
```

```{r}
#| label: ch11-friedman-pairwise
#| echo: false
#| results: asis
smallsamplelab_apa_table(
  "11.8",
  "Adjusted paired comparisons after the Friedman test",
  friedman_pairwise_table,
  note = "Pairwise rows report Holm-adjusted paired Wilcoxon p-values. These comparisons are descriptive follow-ups to the omnibus Friedman result.",
  align = c("l", "r")
)
```

### Rank Correlations

Spearman's rho and Kendall's tau measure monotonic association without requiring a linear relationship or normally distributed variables. Spearman's rho is the Pearson correlation of ranks. Kendall's tau is based on concordant and discordant pairs, so it is often easier to interpret when tied ranks are common.

When there are no tied ranks, `cor.test()` can compute exact small-sample p-values for Spearman or Kendall by setting `exact = TRUE`. When ties are present, R uses approximate p-values. If exact inference is important with ties, use a permutation procedure and report it explicitly.

```{r}
#| eval: false
#| label: ch11-exact-rank-correlation-code
cor.test(x, y, method = "spearman", exact = TRUE)
cor.test(x, y, method = "kendall", exact = TRUE)
```

```{r}
#| label: ch11-rank-correlations
#| echo: false
#| results: asis
smallsamplelab_apa_table(
  "11.9",
  "Rank-correlation summaries for experience and satisfaction",
  correlation_table,
  note = "Both tests use approximate p-values because the data contain ties. If exact small-sample p-values matter, use a permutation procedure or specialised software and report the method.",
  align = c("l", "r", "r", "l")
)
```

### Lab Practical 11.1: Sales Performance Analysis

A retail company piloted two training programmes across 10 stores each and collected customer satisfaction scores on a 1 to 100 scale. The question is whether Training B produces higher satisfaction than Training A. Figure 11.2 shows complete separation between the programmes in this sample, and Table 11.10 reports the descriptive and inferential summaries.

:::: {.content-visible when-format="html"}
::::: {.panel-tabset group="chapter-11-sales-figure"}

#### Rendered Output

```{r}
#| label: ch11-sales-figure-html
#| echo: false
#| fig-cap: "Figure 11.2: Customer satisfaction scores by training programme."
#| alt: "Customer satisfaction scores by training programme."
sales_plot
```

#### Cell Code

```{webr-r}
#| context: interactive
library(tibble)   # tibble()
library(ggplot2)  # ggplot(), geom_boxplot(), geom_jitter()

sales_data <- tibble(
  training = rep(c("A", "B"), each = 10),
  satisfaction = c(72, 68, 75, 70, 74, 69, 73, 71, 76, 70,
                   78, 82, 79, 84, 81, 77, 83, 80, 85, 79)
)

ggplot(sales_data, aes(x = training, y = satisfaction, fill = training)) +
  geom_boxplot(alpha = 0.75, width = 0.55) +
  geom_jitter(width = 0.08, alpha = 0.75, size = 2)
```

:::::
::::

:::: {.content-visible unless-format="html"}

```{r}
#| label: ch11-sales-figure
#| echo: false
#| fig-cap: "Figure 11.2: Customer satisfaction scores by training programme."
#| alt: "Customer satisfaction scores by training programme."
sales_plot
```

::::

```{r}
#| label: ch11-sales-summary
#| echo: false
#| results: asis
smallsamplelab_apa_table(
  "11.10",
  "Sales-training descriptives and rank-sum test",
  sales_report_table,
  note = "The negative shift and Cliff's delta occur because the comparison is coded as A minus B. Substantively, Training B is higher. The bootstrap delta interval is degenerate here because all observed Training B scores exceed all observed Training A scores.",
  align = c("l", "l", "l")
)
```

The result is decisive: W = `r sprintf("%.0f", unname(sales_test$statistic))`, `r sales_p_text`, and Cliff's delta is `r sprintf("%.2f", sales_delta)` with a bootstrap 95% CI from `r sprintf("%.2f", sales_delta_ci[1])` to `r sprintf("%.2f", sales_delta_ci[2])`. Because all Training B scores exceed all Training A scores, this is complete stochastic separation in the sample. The reporting language should still acknowledge the pilot design rather than claiming guaranteed future superiority.

### Choosing Among Rank-Based Methods

The methods in this chapter differ by design, not by which one seems most familiar. Start from the study structure: independent groups, paired observations, repeated conditions, or association between two ordered variables.
Then report an effect size that matches the design rather than relying on the p-value alone.

```{r}
#| label: ch11-rank-method-guide
#| echo: false
#| results: asis
smallsamplelab_apa_table(
  "11.11",
  "Rank-based method selection guide",
  rank_method_table,
  note = "Use visual checks and design knowledge before choosing the interpretation. A rank test is not automatically a median test when distributional shapes differ.",
  align = c("l", "l", "l", "l", "l")
)
```

### Reporting Rank-Based Results

A rank-based result should not be reported as a p-value alone. The reader needs to know the design, the outcome scale, the group summaries, the test statistic, the effect size, and the interpretation chosen. For independent groups, that usually means reporting medians and IQRs by group, the Mann–Whitney U result, the p-value, and a probability-style effect size such as Cliff's delta or rank-biserial correlation. For paired designs, report the median paired change, the signed-rank statistic, and the Hodges–Lehmann pseudomedian shift with its interval where available. For more than two groups, state whether follow-up comparisons were adjusted for multiplicity.

Use median-shift language only when the distributions have broadly similar shapes. If one group is more variable or more skewed, write the conclusion as stochastic dominance: observations from one group tended to be higher than observations from the other. This distinction is not cosmetic. It tells readers whether the analysis is about a typical location shift or about the ordering of observations across the full distributions.

A concise report might read: "Satisfaction scores were higher in Training B than Training A. Because the group shapes were similar, the rank-sum result was interpreted as a location shift, W = 0, p < .001, Hodges–Lehmann shift = -9.0 points, Cliff's delta = -1.00. The negative sign reflects the A-minus-B coding. Substantively, every observed Training B score exceeded every observed Training A score."
If the shapes had differed, the same result should be framed as evidence that Training B scores tended to be higher, not as proof of a median difference.

### Key Takeaways

Rank-based tests are useful when outcomes are ordinal, skewed, tied, or vulnerable to outliers, but they still require attention to design and interpretation. Mann–Whitney and Wilcoxon signed-rank tests address independent and paired two-sample questions. Kruskal–Wallis and Friedman extend rank comparisons to multiple independent or repeated groups. Spearman's rho and Kendall's tau summarise monotonic association. In small samples, rank-based p-values should be reported with medians, IQRs, robust effect sizes, and enough context to distinguish a defensible pattern from sampling noise. Robust mean-based alternatives are also worth considering when the outcome scale remains meaningfully continuous [@mair2020].

### Self-Assessment Quiz

Test your understanding of nonparametric tests and rank-based methods from Chapter 11.

```{r}
#| echo: false
#| results: asis
source(normalizePath(file.path(dirname(knitr::current_input(dir = TRUE)), "..", "R", "quiz_helpers.R"), mustWork = TRUE))
smallsamplelab_render_quiz(list(
  list(
    prompt = "The Mann–Whitney U test is most appropriate when:",
    options = c(
      "The outcome is a normally distributed continuous variable and means are the only estimand",
      "Two independent groups have ordinal or skewed continuous outcomes",
      "The same participants are measured under three conditions",
      "The goal is to estimate a Poisson event rate"
    ),
    answer = 2L,
    explanation = "Mann–Whitney is the independent two-group rank test. It is especially useful for ordinal outcomes or continuous outcomes where skewness or outliers make mean-based inference fragile."
  ),
  list(
    prompt = "A significant Mann–Whitney test can be interpreted as a median difference only when:",
    options = c(
      "Both groups have similar distributional shapes",
      "The p-value is below 0.001",
      "The sample size is larger than 200",
      "There are no repeated values"
    ),
    answer = 1L,
    explanation = "The median-shift interpretation depends on similar distributional shapes, including similar variance and skew. If the shapes differ, stochastic dominance is the safer interpretation."
  ),
  list(
    prompt = "The Wilcoxon signed-rank test is used for:",
    options = c(
      "Two independent groups",
      "Paired or matched observations",
      "Three or more independent groups",
      "A single count against a benchmark rate"
    ),
    answer = 2L,
    explanation = "The signed-rank test is the nonparametric paired-sample procedure. It ranks paired differences and tests whether they are centered around zero."
  ),
  list(
    prompt = "What does the Hodges–Lehmann estimate summarise in a rank-sum comparison?",
    options = c(
      "The ordinary mean difference",
      "The median of all pairwise differences",
      "The chi-square statistic",
      "The number of tied ranks"
    ),
    answer = 2L,
    explanation = "For a two-sample rank-sum comparison, the Hodges–Lehmann estimate is the median of all pairwise differences and is interpreted as a robust location shift."
  ),
  list(
    prompt = "After a significant Kruskal–Wallis test, the next step is usually to:",
    options = c(
      "Assume every pair differs",
      "Use adjusted pairwise rank comparisons to identify where the difference lies",
      "Switch to Pearson correlation",
      "Ignore the result because it is nonparametric"
    ),
    answer = 2L,
    explanation = "The omnibus Kruskal–Wallis test says at least one distribution differs. Pairwise comparisons with multiplicity correction are needed to identify the specific contrasts."
  ),
  list(
    prompt = "Friedman's test is the rank-based alternative to:",
    options = c(
      "Repeated-measures ANOVA",
      "One-sample t-test",
      "Poisson regression",
      "Fisher's exact test"
    ),
    answer = 1L,
    explanation = "Friedman's test compares three or more related or repeated-measure conditions using within-participant ranks."
  ),
  list(
    prompt = "Kendall's tau is often useful when:",
    options = c(
      "The outcome is a sparse event count",
      "There are many tied ranks or a probability-based rank interpretation is helpful",
      "The model has many predictors",
      "A bootstrap median interval is required"
    ),
    answer = 2L,
    explanation = "Kendall's tau is based on concordant and discordant pairs, making it useful when tied ranks are common and when that probability interpretation helps readers."
  ),
  list(
    prompt = "A large Cliff's delta from a tiny sample should be interpreted:",
    options = c(
      "As definitive proof of a population effect",
      "Together with the sample size, confidence interval, p-value, and substantive plausibility",
      "As invalid because rank tests cannot have effect sizes",
      "As equivalent to a mean difference in standard deviations"
    ),
    answer = 2L,
    explanation = "With small samples, large effect estimates can be real or can arise by chance. The effect estimate needs uncertainty, context, and preferably replication."
  )
))
```