Chapter 15: Effect Sizes and Confidence Intervals over P-Values

Learning Objectives

By the end of this chapter, you will be able to explain why p-values alone are insufficient for small-sample inference, compute and interpret common effect-size measures, use confidence intervals to judge magnitude and precision, and report results in language that separates statistical evidence from practical importance.

Why P-Values Are Not Enough

A p-value answers a narrow question: how unusual the data would be if the null hypothesis and modelling assumptions were true. It does not tell the reader whether an effect is large, precise, clinically important, educationally meaningful, or worth acting on. In small studies this limitation is especially visible: estimates are imprecise, so even a practically important effect can yield a p-value above 0.05, with a confidence interval wide enough to include both zero and large effects.

Effect sizes and confidence intervals restore the missing information. The effect size describes magnitude, while the interval describes the range of values that remain compatible with the data. A transparent report should therefore make the p-value secondary to the estimate, its uncertainty, and its practical interpretation.

Common Effect-Size Metrics

Cohen’s d expresses a mean difference in pooled standard-deviation units. Odds ratios compare the odds of a binary outcome between groups. Correlations such as Pearson’s r, Spearman’s rho, and Kendall’s tau describe association. Variance-accounted-for measures such as eta-squared and epsilon-squared are useful for omnibus comparisons. These metrics are not interchangeable; the correct choice depends on the outcome scale, design, and question.

For ANOVA-style summaries, eta-squared is often biased upward in small samples because it attributes sample-specific noise to the effect. Epsilon-squared and omega-squared are usually more cautious alternatives. For a Kruskal–Wallis test, Tomczak and Tomczak (2014) give two rank-based estimates: an eta-squared based on the test statistic, \(\eta^2_H = (H - k + 1)/(n - k)\), and an epsilon-squared, \(\epsilon^2 = H/(n - 1)\), where \(H\) is the test statistic, \(k\) is the number of groups, and \(n\) is the total sample size.
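As a rough illustration of the small-sample bias described above, the following sketch computes eta-squared, epsilon-squared, and omega-squared for a one-way layout, together with the Kruskal–Wallis formula \((H - k + 1)/(n - k)\). The three groups and the value \(H = 7.5\) are hypothetical, and the function names are illustrative rather than taken from any package.

```python
from statistics import mean


def anova_effect_sizes(groups):
    """Return (eta2, epsilon2, omega2) for a one-way layout.

    `groups` is a list of lists of observations (hypothetical data here).
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = mean(x for g in groups for x in g)

    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    ss_total = ss_between + ss_within
    ms_within = ss_within / (n - k)

    eta2 = ss_between / ss_total
    # Both corrections subtract the between-group variation expected from noise.
    adjusted = ss_between - (k - 1) * ms_within
    epsilon2 = adjusted / ss_total
    omega2 = adjusted / (ss_total + ms_within)
    return eta2, epsilon2, omega2


def kruskal_wallis_effect_size(h, k, n):
    """Apply (H - k + 1) / (n - k); can be negative when H < k - 1."""
    if n <= k:
        raise ValueError("total sample size must exceed the number of groups")
    return (h - k + 1) / (n - k)


# Hypothetical scores for three small groups.
groups = [[4, 5, 6, 5], [6, 7, 8, 7], [5, 6, 7, 9]]
eta2, epsilon2, omega2 = anova_effect_sizes(groups)
print(f"eta2={eta2:.3f}, epsilon2={epsilon2:.3f}, omega2={omega2:.3f}")

# Hypothetical Kruskal-Wallis statistic: H = 7.5 with k = 3 groups, n = 30.
print(f"KW effect size: {kruskal_wallis_effect_size(7.5, 3, 30):.3f}")
```

With only four observations per group, the uncorrected eta-squared is noticeably larger than the two corrected estimates, which is the pattern the paragraph above warns about.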

Cohen’s conventional benchmarks of d = 0.20, 0.50, and 0.80 are useful only as a starting vocabulary (Cohen 1988). Practical importance is domain-specific. In education, effects as small as d = 0.10 can be meaningful when scaled to large populations (Kraft 2020); in clinical trials, even d = 0.50 may be insufficient if the intervention carries substantial risk, cost, or burden.

Table 15.1

Context-specific effect-size benchmarks

Domain	Small benchmark	Moderate benchmark	Large benchmark	Source
Psychological interventions	d = 0.20	d = 0.50	d = 0.80	Cohen (1988); Lipsey and Wilson (1993)
Educational interventions	d < 0.05	d = 0.05 to < 0.20	d >= 0.20	Kraft (2020)
Medical treatments	OR = 1.68	OR = 3.47	OR = 6.71	Chen, Cohen, and Chen (2010), assuming a 1% baseline rate
Customer satisfaction	d = 0.20	d = 0.50	d = 0.80	Domain standard
Process improvements	5% change	10-20% change	>=25% change	Domain standard

Note. The values are interpretive anchors, not universal cut-offs. Use domain evidence and substantive consequences when judging magnitude.

Educational effects are often smaller than laboratory effects but can still matter when an intervention reaches many students (Kraft 2020). Medical odds ratios require clinical context because the same odds ratio can imply very different absolute risk changes depending on baseline risk (Chen, Cohen, and Chen 2010). Meta-analytic norms, such as those reviewed by Lipsey and Wilson (1993), are useful because they place a new estimate within an empirical distribution of related studies.

Mean Differences and Standardised Effects

The most interpretable effect size is often the unstandardised mean difference. In the teaching-method example below, Method A has a higher average score than Method B. The difference in original score units is the first result to report; Cohen’s d and Hedges’ g help compare the result with other studies that used different scales.
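A minimal sketch of the two quantities just described, computed from raw scores. The scores below are hypothetical and are not the data behind the chapter's example; the helper name `cohens_d` is illustrative.

```python
import math
from statistics import mean, variance


def cohens_d(a, b):
    """Return (mean difference, pooled SD, Cohen's d) for two independent groups."""
    n1, n2 = len(a), len(b)
    diff = mean(a) - mean(b)
    # Pooled SD weights each sample variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    pooled_sd = math.sqrt(pooled_var)
    return diff, pooled_sd, diff / pooled_sd


# Hypothetical test scores for two teaching methods.
method_a = [82, 85, 79, 88, 84, 80]
method_b = [78, 83, 77, 81, 80, 75]

diff, pooled_sd, d = cohens_d(method_a, method_b)
print(f"Mean difference: {diff:.1f} points")
print(f"Pooled SD: {pooled_sd:.2f}, Cohen's d: {d:.2f}")
```

Reporting `diff` first keeps the result in score units; `d` is the same difference rescaled by the pooled standard deviation.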

Table 15.2

Mean-difference and standardised effect-size summary

Quantity	Estimate	95% CI	Interpretation
Method A mean	81.5	--	Average score under Method A
Method B mean	79.8	--	Average score under Method B
Mean difference (A - B)	1.8	-1.0 to 4.5	Difference in original test-score units
Cohen's d	0.54	-0.32 to 1.40	Standardised difference using the pooled SD
Hedges' g	0.52	-0.31 to 1.35	Small-sample correction to Cohen's d

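Note. Two independent groups with n = 12 per group. Equal-variance t test: t(22) = 1.32, p = 0.200. The group means are rounded to one decimal place, so their difference (1.7) need not equal the mean difference computed from unrounded values (1.8).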

Students in Method A scored about 1.8 points higher than those in Method B, with a 95% CI from -1.0 to 4.5 points. On the standardised scale, Cohen’s d is about 0.54 and Hedges’ g is about 0.52. The reporting lesson is to give the reader both the original-unit difference and the standardised comparison, while recognising that the standardised interval remains wide with only 12 observations per group.
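The standardised values in Table 15.2 can be approximately recovered from the reported test statistic alone. The sketch below assumes the usual conversion \(d = t\sqrt{1/n_1 + 1/n_2}\), the common approximate correction factor \(1 - 3/(4\,df - 1)\) for Hedges' g, and one widely used large-sample standard error for d. The critical value 2.074 is the tabulated 97.5th percentile of the t distribution with 22 degrees of freedom; whether the table was produced this way is an assumption, and other interval methods (for example, those based on the noncentral t distribution) give somewhat different limits.

```python
import math


def d_from_t(t_stat, n1, n2):
    """Convert an independent-samples t statistic to Cohen's d."""
    return t_stat * math.sqrt(1 / n1 + 1 / n2)


def hedges_correction(df):
    """Approximate small-sample correction factor for Hedges' g."""
    return 1 - 3 / (4 * df - 1)


def d_interval(d, n1, n2, crit):
    """Approximate interval for d using a large-sample standard error."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - crit * se, d + crit * se


n1 = n2 = 12
t_stat = 1.32                 # reported in the table note
df = n1 + n2 - 2
t_crit = 2.074                # assumed: tabulated t(0.975, df = 22)

d = d_from_t(t_stat, n1, n2)
j = hedges_correction(df)
g = j * d
d_lo, d_hi = d_interval(d, n1, n2, t_crit)
g_lo, g_hi = j * d_lo, j * d_hi   # scale the d limits by the same factor

print(f"Cohen's d = {d:.2f}, 95% CI [{d_lo:.2f}, {d_hi:.2f}]")
print(f"Hedges' g = {g:.2f}, 95% CI [{g_lo:.2f}, {g_hi:.2f}]")
```

Under these assumptions the results agree with Table 15.2 to two decimal places, which also makes clear how little the small-sample correction changes the estimate compared with the width of the interval.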

Binary Effects: Odds Ratios, Risk Differences, and NNT

For binary outcomes, odds ratios are common but can be difficult to interpret. Odds ratios compare odds, not risks. When baseline risk is low, roughly below 10%, the odds ratio approximates the risk ratio; when baseline risk is higher, the odds ratio can substantially overstate the apparent effect. Risk differences are often more directly useful because they describe the absolute change in event probability. The number needed to treat (NNT) is the reciprocal of the absolute risk difference and answers how many people would need the intervention for one additional success relative to control.
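To make the gap between odds ratios and risks concrete, the sketch below holds a hypothetical odds ratio of 2.0 fixed and converts it to a treated-group risk at three assumed baseline risks. The function name and the chosen baseline values are illustrative only.

```python
def risk_from_odds_ratio(odds_ratio, baseline_risk):
    """Return the treated-group risk implied by an odds ratio and a baseline risk."""
    baseline_odds = baseline_risk / (1 - baseline_risk)
    treated_odds = odds_ratio * baseline_odds
    return treated_odds / (1 + treated_odds)


# Same odds ratio, three hypothetical baseline risks.
for p0 in (0.05, 0.20, 0.50):
    p1 = risk_from_odds_ratio(2.0, p0)
    print(
        f"baseline={p0:.2f}  treated={p1:.3f}  "
        f"risk ratio={p1 / p0:.2f}  risk difference={p1 - p0:.3f}"
    )
```

At a 5% baseline the risk ratio stays close to 2, while at a 50% baseline the same odds ratio corresponds to a risk ratio of about 1.33, so reading the odds ratio as "twice the risk" would overstate the effect.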

Table 15.3

Binary effect-size summaries for a small trial

Quantity	Estimate	95% CI	Interpretation
Treatment success rate	0.75	--	15 successes among 20 treated participants
Control success rate	0.50	--	10 successes among 20 control participants
Absolute risk difference	0.25	-0.04 to 0.54	Additional success probability under treatment
Odds ratio	2.92	0.66 to 14.50	Odds of success in treatment relative to control
Number needed to treat	4.0	Not bounded because the risk-difference CI crosses 0	Approximate patients treated for one additional success

Note. Fisher's exact test for the 2 x 2 table gives two-sided p = 0.191. The odds-ratio CI is exact; the odds ratio is the conditional maximum-likelihood estimate from Fisher's exact test and can differ slightly from the simple cross-product ratio. The risk-difference CI is an uncorrected large-sample interval used here for interpretation.

The treatment group has a higher observed success rate, but the confidence interval for the risk difference crosses zero and the exact odds-ratio interval is wide. Because NNT is the reciprocal of the risk difference, an interval that includes zero crosses infinity and produces an unbounded or discontinuous NNT interval. In such cases, report the risk difference and its interval as the primary inferential summary, and present the point-estimate NNT with a cautionary note rather than as a simple bounded interval.
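The simpler quantities in Table 15.3 can be sketched with the standard library alone. The code below computes the risk difference with an uncorrected large-sample (Wald) interval, the cross-product odds ratio, and the point-estimate NNT. It does not reproduce the exact conditional odds-ratio estimate or its exact interval, which require a dedicated routine; the function name and returned keys are illustrative, and the sketch assumes no cell count is zero.

```python
import math
from statistics import NormalDist


def binary_summary(x1, n1, x0, n0, level=0.95):
    """Summarise a 2 x 2 table: x1/n1 treated successes, x0/n0 control successes."""
    p1, p0 = x1 / n1, x0 / n0
    rd = p1 - p0

    # Uncorrected large-sample (Wald) interval for the risk difference.
    se = math.sqrt(p1 * (1 - p1) / n1 + p0 * (1 - p0) / n0)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    rd_ci = (rd - z * se, rd + z * se)

    # Simple cross-product odds ratio (not the conditional estimate).
    odds_ratio = (x1 * (n0 - x0)) / ((n1 - x1) * x0)

    # NNT is only bounded when the risk-difference interval excludes zero.
    nnt = 1 / rd if rd != 0 else math.inf
    nnt_bounded = not (rd_ci[0] <= 0 <= rd_ci[1])
    return {"rd": rd, "rd_ci": rd_ci, "odds_ratio": odds_ratio,
            "nnt": nnt, "nnt_bounded": nnt_bounded}


result = binary_summary(15, 20, 10, 20)
lo, hi = result["rd_ci"]
print(f"Risk difference = {result['rd']:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
print(f"Cross-product odds ratio = {result['odds_ratio']:.2f}")
print(f"NNT = {result['nnt']:.1f} (bounded interval: {result['nnt_bounded']})")
```

The cross-product odds ratio here is 3.00, slightly above the conditional estimate of 2.92 in the table, which is the difference the table note describes.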

Confidence Intervals as the Primary Summary

Confidence intervals are often more informative than p-values because they communicate magnitude and precision simultaneously. An interval that excludes the null value also corresponds to a statistically significant result at the matching alpha level, but its real value is broader: it shows whether the data are compatible with effects that would matter in practice.

Figure 15.1: Confidence intervals communicate both magnitude and precision.

In small-sample work, a non-significant result with a wide interval should usually be described as imprecise rather than negative. Conversely, a narrow interval that rules out practically important effects can support a more substantive conclusion even when p is above 0.05. Chapter 16 develops this distinction in detail.
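One way to make this reading habit explicit is to compare the interval with a practical threshold chosen in advance. The sketch below is a simplified decision aid, not a formal equivalence test; the function name, the wording of the labels, and the 3-point threshold are all hypothetical.

```python
def describe_interval(ci_low, ci_high, threshold, null=0.0):
    """Label an interval relative to the null and a practical threshold.

    `threshold` is the smallest difference from the null that would matter
    in practice (an assumption the analyst must justify).
    """
    excludes_null = ci_low > null or ci_high < null
    within_threshold = (null - threshold) < ci_low and ci_high < (null + threshold)

    if within_threshold:
        return "precise: rules out practically important effects"
    if excludes_null:
        return "non-null: judge magnitude against the threshold"
    return "imprecise: compatible with no effect and with important effects"


# Teaching-method interval from Table 15.2 with a hypothetical 3-point threshold.
print(describe_interval(-1.0, 4.5, threshold=3.0))

# A hypothetical narrow interval from a larger study.
print(describe_interval(-0.5, 1.2, threshold=3.0))
```

Both intervals include zero, yet only the second supports a substantive statement that any effect is smaller than the chosen threshold.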

Reporting Effect Sizes

A complete results sentence names the estimator, gives the point estimate, reports the confidence interval, and then interprets the magnitude in context. For the teaching-method example, a concise report would be: “Method A scores were higher than Method B scores by 1.8 points, 95% CI [-1.0, 4.5], corresponding to Hedges’ g = 0.52.” The practical interpretation should then explain whether a 1.8-point gain is educationally meaningful, not merely whether the p-value crossed a threshold.

Results Sentence Builder

Use this structure when writing small-sample results:

Estimator: Name the comparison or association in original units.

Magnitude: Report the point estimate and a suitable effect size.

Uncertainty: Report the confidence interval before the p-value if both are included.

Context: State the practical threshold, benchmark, or prior evidence used to judge importance.

Caution: If the interval is wide, say that the estimate is imprecise rather than over-interpreting the point estimate.

Example: “The intervention group scored 4.2 points higher than the comparison group, 95% CI [0.8, 7.6], Hedges’ g = 0.62. The interval includes values from a small educational benefit to a larger practically meaningful gain, so the estimate should be treated as promising but still imprecise.”
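The builder above can also be expressed as a small formatting helper so that every result in a report follows the same order. This is only a sketch; the function name and parameters are illustrative, and the contextual interpretation still has to be written by the analyst.

```python
def results_sentence(comparison, estimate, ci_low, ci_high, unit,
                     effect_name=None, effect_value=None):
    """Build a results sentence: estimator, magnitude, then uncertainty."""
    sentence = (
        f"{comparison} by {estimate:.1f} {unit}, "
        f"95% CI [{ci_low:.1f}, {ci_high:.1f}]"
    )
    # Append the standardised effect size only when one is supplied.
    if effect_name is not None and effect_value is not None:
        sentence += f", corresponding to {effect_name} = {effect_value:.2f}"
    return sentence + "."


print(results_sentence(
    "Method A scores were higher than Method B scores",
    1.8, -1.0, 4.5, "points",
    effect_name="Hedges' g", effect_value=0.52,
))
```

Keeping the interval directly after the point estimate makes it harder to report a magnitude without its uncertainty.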

When adapting the code examples, record the random seed for any simulation or bootstrap step and include package versions in reproducible reports. Stochastic procedures can otherwise produce slightly different intervals across machines.
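A minimal sketch of that practice, assuming a simple percentile bootstrap for a difference in means. The scores, the seed value, and the function name are hypothetical; the point is that the seed and the interpreter version are recorded alongside the interval.

```python
import platform
import random
from statistics import mean


def bootstrap_mean_diff_ci(a, b, seed, n_boot=2000, level=0.95):
    """Percentile bootstrap interval for mean(a) - mean(b) with a fixed seed."""
    rng = random.Random(seed)  # local generator, so the seed fully determines the result
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping the group sizes.
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        diffs.append(mean(resample_a) - mean(resample_b))
    diffs.sort()
    alpha = 1 - level
    low_index = round(alpha / 2 * n_boot)
    high_index = round((1 - alpha / 2) * n_boot) - 1
    return diffs[low_index], diffs[high_index]


# Hypothetical scores and seed.
method_a = [82, 85, 79, 88, 84, 80]
method_b = [78, 83, 77, 81, 80, 75]
seed = 20240501

ci_low, ci_high = bootstrap_mean_diff_ci(method_a, method_b, seed)
print(f"seed={seed}, python={platform.python_version()}")
print(f"Bootstrap 95% CI for the mean difference: [{ci_low:.2f}, {ci_high:.2f}]")
```

Rerunning with the same seed and Python version should return the same interval, whereas changing the seed will usually shift the limits slightly.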

Key Takeaways

P-values alone are too thin for small-sample interpretation. Effect sizes describe how large a result is, confidence intervals describe how precisely it has been estimated, and original-unit summaries often communicate practical importance better than standardised metrics. A strong report gives the estimate, its uncertainty, and the substantive benchmark used to judge whether the effect matters.

References

Ayre, Colin, and Andrew J. Scally. 2014. ‘Critical Values for Lawshe’s Content Validity Ratio: Revisiting the Original Methods of Calculation’. Measurement and Evaluation in Counseling and Development 47 (1): 79–86. https://doi.org/10.1177/0748175613513808.

Briggs, Steven R., and Jonathan M. Cheek. 1986. ‘The Role of Factor Analysis in the Development and Evaluation of Personality Scales’. Journal of Personality 54 (1): 106–48.

Buuren, Stef van. 2018. Flexible Imputation of Missing Data. 2nd ed. Boca Raton, FL: Chapman & Hall/CRC.

Cameron, A. Colin, and Pravin K. Trivedi. 2013. Regression Analysis of Count Data. 2nd ed. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9781139013567.

Chen, Henian, Patricia Cohen, and Sophie Chen. 2010. ‘How Big Is a Big Odds Ratio? Interpreting the Magnitudes of Odds Ratios in Epidemiological Studies’. Communications in Statistics - Simulation and Computation 39 (4): 860–64. https://doi.org/10.1080/03610911003650383.

Cicchetti, Domenic V. 1994. ‘Guidelines, Criteria, and Rules of Thumb for Evaluating Normed and Standardized Assessment Instruments in Psychology’. Psychological Assessment 6 (4): 284–90. https://doi.org/10.1037/1040-3590.6.4.284.

Clark, Lee Anna, and David Watson. 1995. ‘Constructing Validity: Basic Issues in Objective Scale Development’. Psychological Assessment 7 (3): 309–19.

Cohen, Jacob. 1988. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates.

Conover, W. J. 1999. Practical Nonparametric Statistics. 3rd ed. New York: John Wiley & Sons.

Costello, Anna B., and Jason Osborne. 2005. ‘Best Practices in Exploratory Factor Analysis: Four Recommendations for Getting the Most from Your Analysis’. Practical Assessment, Research, and Evaluation 10 (1): 1–9. https://doi.org/10.7275/jyj1-4868.

Davison, A. C., and D. V. Hinkley. 1997. Bootstrap Methods and Their Application. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge: Cambridge University Press.

DeVellis, Robert F., and Carolyn T. Thorpe. 2021. Scale Development: Theory and Applications. 5th ed. Thousand Oaks, CA: SAGE Publications.

Dienes, Zoltan. 2014. ‘Using Bayes to Get the Most Out of Non-Significant Results’. Frontiers in Psychology 5: 781. https://doi.org/10.3389/fpsyg.2014.00781.

Efron, Bradley, and Robert J. Tibshirani. 1993. An Introduction to the Bootstrap. New York, NY: Chapman & Hall/CRC.

Elm, Erik von, Douglas G. Altman, Matthias Egger, Stuart J. Pocock, Peter C. Gøtzsche, and Jan P. Vandenbroucke. 2007. ‘The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) Statement: Guidelines for Reporting Observational Studies’. PLoS Medicine 4 (10): e296. https://doi.org/10.1371/journal.pmed.0040296.

Fay, Michael P. 2010. ‘Confidence Intervals That Match Fisher’s Exact or Blaker’s Exact Tests’. Biostatistics 11 (2): 373–74. https://doi.org/10.1093/biostatistics/kxp050.

Firth, David. 1993. ‘Bias Reduction of Maximum Likelihood Estimates’. Biometrika 80 (1): 27–38. https://doi.org/10.1093/biomet/80.1.27.

Gelman, Andrew, Daniel Simpson, and Michael Betancourt. 2017. ‘The Prior Can Often Only Be Understood in the Context of the Likelihood’. Entropy 19 (10): 555. https://doi.org/10.3390/e19100555.

Good, Phillip I. 2005. Permutation, Parametric, and Bootstrap Tests of Hypotheses. 3rd ed. New York: Springer.

Graham, John W. 2009. ‘Missing Data Analysis: Making It Work in the Real World’. Annual Review of Psychology 60: 549–76. https://doi.org/10.1146/annurev.psych.58.110405.085530.

Hair, Joseph F., G. Tomas M. Hult, Christian M. Ringle, and Marko Sarstedt. 2017. A Primer on Partial Least Squares Structural Equation Modeling (PLS-SEM). 2nd ed. Thousand Oaks, CA: SAGE Publications.

Harrell, Frank E. 2015. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. 2nd ed. Springer Series in Statistics. New York: Springer. https://doi.org/10.1007/978-3-319-19425-7.

Heinze, Georg, and Michael Schemper. 2002. ‘A Solution to the Problem of Separation in Logistic Regression’. Statistics in Medicine 21 (16): 2409–19. https://doi.org/10.1002/sim.1047.

Hodges, Jr., J. L., and E. L. Lehmann. 1963. ‘Estimates of Location Based on Rank Tests’. The Annals of Mathematical Statistics 34 (2): 598–611. https://doi.org/10.1214/aoms/1177704172.

Hoenig, John M., and Dennis M. Heisey. 2001. ‘The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis’. The American Statistician 55 (1): 19–24.

Hosmer, David W., Stanley Lemeshow, and Rodney X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Wiley Series in Probability and Statistics. Hoboken, NJ: John Wiley & Sons. https://doi.org/10.1002/9781118548387.

Hu, Li-tze, and Peter M. Bentler. 1999. ‘Cutoff Criteria for Fit Indexes in Covariance Structure Analysis: Conventional Criteria Versus New Alternatives’. Structural Equation Modeling 6 (1): 1–55. https://doi.org/10.1080/10705519909540118.

Huberty, Carl J., and Stephen Olejnik. 2006. Applied MANOVA and Discriminant Analysis. 2nd ed. Hoboken, NJ: Wiley. https://doi.org/10.1002/047178947X.

Hwang, Ching-Lai, and Kwangsun Yoon. 1981. Multiple Attribute Decision Making: Methods and Applications. Berlin: Springer. https://doi.org/10.1007/978-3-642-48318-9.

Hyndman, Rob J., and George Athanasopoulos. 2021. Forecasting: Principles and Practice. 3rd ed. Melbourne, Australia: OTexts. https://otexts.com/fpp3/.

Koo, Terry K., and Mae Y. Li. 2016. ‘A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research’. Journal of Chiropractic Medicine 15 (2): 155–63. https://doi.org/10.1016/j.jcm.2016.02.012.

Kraft, Matthew A. 2020. ‘Interpreting Effect Sizes of Education Interventions’. Educational Researcher 49 (4): 241–53. https://doi.org/10.3102/0013189X20912798.

Lakens, Daniël. 2013. ‘Calculating and Reporting Effect Sizes to Facilitate Cumulative Science: A Practical Primer for t-Tests and ANOVAs’. Frontiers in Psychology 4: 863. https://doi.org/10.3389/fpsyg.2013.00863.

Lakens, Daniël, Anne M. Scheel, and Peder M. Isager. 2018. ‘Equivalence Testing for Psychological Research: A Tutorial’. Advances in Methods and Practices in Psychological Science 1 (2): 259–69. https://doi.org/10.1177/2515245918770963.

Lawshe, Charles H. 1975. ‘A Quantitative Approach to Content Validity’. Personnel Psychology 28 (4): 563–75. https://doi.org/10.1111/j.1744-6570.1975.tb01393.x.

Lipsey, Mark W., and David B. Wilson. 1993. ‘The Efficacy of Psychological, Educational, and Behavioral Treatment: Confirmation from Meta-Analysis’. American Psychologist 48 (12): 1181–1209. https://doi.org/10.1037/0003-066X.48.12.1181.

Little, Roderick J. A. 1988. ‘A Test of Missing Completely at Random for Multivariate Data with Missing Values’. Journal of the American Statistical Association 83 (404): 1198–1202. https://doi.org/10.1080/01621459.1988.10478722.

MacKinnon, David P., Jennifer L. Krull, and Chondra M. Lockwood. 2000. ‘Equivalence of the Mediation, Confounding and Suppression Effect’. Prevention Science 1 (4): 173–81.

Mair, Patrick, and Rand R. Wilcox. 2020. ‘Robust Statistical Methods in R Using the WRS2 Package’. Behavior Research Methods 52 (2): 464–88. https://doi.org/10.3758/s13428-019-01246-w.

Mann, Henry B., and Donald R. Whitney. 1947. ‘On a Test of Whether One of Two Random Variables Is Stochastically Larger Than the Other’. The Annals of Mathematical Statistics 18 (1): 50–60. https://doi.org/10.1214/aoms/1177730491.

McDonald, Roderick P. 1999. Test Theory: A Unified Treatment. Mahwah, NJ: Lawrence Erlbaum Associates.

Morey, Richard D., and Jeffrey N. Rouder. 2011. ‘Bayes Factor Approaches for Testing Interval Null Hypotheses’. Psychological Methods 16 (4): 406–19. https://doi.org/10.1037/a0024377.

Nielsen, Jakob. 1993. Usability Engineering. San Francisco, CA: Morgan Kaufmann.

Olsson, Ulf. 1979. ‘Maximum Likelihood Estimation of the Polychoric Correlation Coefficient’. Psychometrika 44 (4): 443–60.

Opricovic, Serafim, and Gwo-Hshiung Tzeng. 2004. ‘Compromise Solution by MCDM Methods: A Comparative Analysis of VIKOR and TOPSIS’. European Journal of Operational Research 156 (2): 445–55. https://doi.org/10.1016/S0377-2217(03)00020-1.

Page, Matthew J., Joanne E. McKenzie, Patrick M. Bossuyt, Isabelle Boutron, Tammy C. Hoffmann, Cynthia D. Mulrow, Larissa Shamseer, et al. 2021. ‘The PRISMA 2020 Statement: An Updated Guideline for Reporting Systematic Reviews’. BMJ 372: n71. https://doi.org/10.1136/bmj.n71.

R Core Team. 2026. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Rubin, Donald B. 1987. Multiple Imputation for Nonresponse in Surveys. New York, NY: Wiley.

Saaty, Thomas L. 1980. The Analytic Hierarchy Process. New York, NY: McGraw-Hill.

Schulz, Kenneth F., Douglas G. Altman, and David Moher. 2010. ‘CONSORT 2010 Statement: Updated Guidelines for Reporting Parallel Group Randomised Trials’. BMJ 340: c332. https://doi.org/10.1136/bmj.c332.

Shan, Guogen. 2018. Exact Statistical Inference for Categorical Data. London: Academic Press.

Sterne, Jonathan A. C., Ian R. White, John B. Carlin, Michael Spratt, Patrick Royston, Michael G. Kenward, Angela M. Wood, and James R. Carpenter. 2009. ‘Multiple Imputation for Missing Data in Epidemiological and Clinical Research: Potential and Pitfalls’. BMJ 338: b2393. https://doi.org/10.1136/bmj.b2393.

Teare, M. Dawn, Munyaradzi Dimairo, Neil Shephard, Amanda Hayman, Amy Whitehead, and Stephen J. Walters. 2014. ‘Sample Size Requirements to Estimate Key Design Parameters from External Pilot Randomised Controlled Trials: A Simulation Study’. Trials 15: 264. https://doi.org/10.1186/1745-6215-15-264.

Tomczak, Maciej, and Ewa Tomczak. 2014. ‘The Need to Report Effect Size Estimates Revisited: An Overview of Some Recommended Measures of Effect Size’. Trends in Sport Sciences 21 (1): 19–25.

Trizano-Hermosilla, Italo, and Jose M. Alvarado. 2016. ‘Best Alternatives to Cronbach’s Alpha Reliability in Realistic Conditions: Congeneric and Asymmetrical Measurements’. Frontiers in Psychology 7: 769. https://doi.org/10.3389/fpsyg.2016.00769.

Van de Schoot, Rens, and Milica Miočević. 2020. Small Sample Size Solutions: A Guide for Applied Researchers and Practitioners. London: Routledge.

Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. ‘Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC’. Statistics and Computing 27 (5): 1413–32. https://doi.org/10.1007/s11222-016-9696-4.

Wagenmakers, Eric-Jan, Jonathon Love, Maarten Marsman, Tahira Jamil, Alexander Ly, Josine Verhagen, Ravi Selker, et al. 2018. ‘Bayesian Inference for Psychology. Part II: Example Applications with JASP’. Psychonomic Bulletin & Review 25 (1): 58–76. https://doi.org/10.3758/s13423-017-1323-7.

White, Ian R., Patrick Royston, and Angela M. Wood. 2011. ‘Multiple Imputation Using Chained Equations: Issues and Guidance for Practice’. Statistics in Medicine 30 (4): 377–99. https://doi.org/10.1002/sim.4067.