Chapter 16: Interpreting Non-Significant Results

Learning Objectives

By the end of this chapter, you will be able to distinguish “no evidence of effect” from “evidence of no meaningful effect,” interpret non-significant results using confidence intervals and power, recognise when a study is inconclusive, and use equivalence or non-inferiority logic when the research question is about similarity rather than difference.

What a Non-Significant Result Means

A non-significant result means that the observed data were not sufficiently inconsistent with the null hypothesis to reject it at the chosen alpha level. It does not prove the null hypothesis, and it does not show that an intervention is ineffective. With small samples, non-significant results are common because power is limited; a study with 30% power to detect a given effect will fail to reject the null 70% of the time when that effect is real.

The phrase “absence of evidence is not evidence of absence” is useful here, but it should be qualified by power. If a design has only 30% power to detect a meaningful effect, a non-significant result provides little information either way. If a study reports p = 0.15, the safest conclusion is that the data do not provide strong evidence against the null. Whether the result also rules out practically meaningful effects depends on the confidence interval.
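To make the role of power concrete, the following sketch uses base R's power.t.test() and a small simulation. The design values (12 per group, a true standardised effect of 0.5, 2,000 simulated studies) are illustrative assumptions chosen for this example, not figures from a real study.

```r
# Illustrative assumptions: 12 per group, true standardised effect d = 0.5.
n_per_group <- 12
true_d <- 0.5

# Analytic power of a two-sided, two-sample t-test at alpha = 0.05.
analytic_power <- power.t.test(n = n_per_group, delta = true_d, sd = 1,
                               sig.level = 0.05)$power

# Simulation: how often does a study of this size miss a real effect?
set.seed(16)
p_values <- replicate(2000, {
  control <- rnorm(n_per_group, mean = 0, sd = 1)
  treated <- rnorm(n_per_group, mean = true_d, sd = 1)
  t.test(treated, control, var.equal = TRUE)$p.value
})
share_non_significant <- mean(p_values >= 0.05)

round(c(analytic_power = analytic_power,
        share_non_significant = share_non_significant), 2)
```

Under these assumptions most simulated studies are non-significant even though the effect is real, which is why a single non-significant result from such a design says little about whether the effect exists.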

Reading the Confidence Interval

The confidence interval is the main tool for interpreting a non-significant result. A narrow interval that excludes effects large enough to matter can support a claim that any remaining effect is likely trivial. A wide interval that includes both trivial and important values means the study is inconclusive.

Table 16.1

Confidence-interval interpretations for non-significant results

Scenario	n per group	Mean difference	95% CI	p-value	Interpretation
Narrow non-significant interval	100	1	-0.5 to 2.5	0.190	If differences below 3 points are trivial, this interval rules out a meaningful benefit.
Wide non-significant interval	12	2	-3.0 to 7.0	0.416	The interval includes harmful, trivial, and beneficial values; the study is inconclusive.
Equivalence not established	12	2	-3.0 to 7.0	0.416	The interval extends outside the +/-5 equivalence margin, so equivalence is not demonstrated.

Note. All examples use a mean-difference scale where positive values favour the intervention.

Figure 16.1: Non-significant results can be precise, inconclusive, or insufficient for equivalence.

The first interval rules out effects above a practical threshold of 3 points, so a cautious “no meaningful difference” interpretation may be reasonable if that threshold was defined in advance. The second interval is wider and includes values that could change the substantive conclusion. The third example applies a +/-5 equivalence margin to the same interval and shows why a small point estimate alone is not enough to claim equivalence.
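The comparison of an interval with a practical threshold can be sketched as a small helper. The function name classify_interval() and its labels are hypothetical, and the sketch assumes a symmetric threshold (the same size of difference matters in either direction); the threshold of 3 for the second row is likewise an assumption made for illustration.

```r
# Hypothetical helper: compare a confidence interval with a prespecified
# practical threshold (smallest difference that would matter, either direction).
classify_interval <- function(lower, upper, threshold) {
  stopifnot(lower <= upper, threshold > 0)
  if (lower > -threshold && upper < threshold) {
    "rules out meaningful effects"
  } else if (lower >= threshold || upper <= -threshold) {
    "indicates a meaningful effect"
  } else {
    "inconclusive"
  }
}

classify_interval(-0.5, 2.5, threshold = 3)  # first row of Table 16.1
classify_interval(-3.0, 7.0, threshold = 3)  # second row (assumed threshold)
classify_interval(-3.0, 7.0, threshold = 5)  # third row, +/-5 equivalence margin
```

The helper only automates the comparison. The threshold itself has to come from subject-matter judgement and should be fixed before the data are seen.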

Power and Minimum Detectable Effects

Small samples can fail to detect effects that would matter in practice. A useful post hoc descriptive check is the minimum detectable effect (MDE): the smallest effect size the planned sample would have had 80% power to detect under conventional assumptions. The purpose is to communicate what the design was capable of detecting, which is a different question from salvaging an observed p-value.

Table 16.2

Minimum detectable standardised effects under conventional two-sample testing

n per group	Rounded MDE	Interpretation
12	1.20	Only very large effects are detectable with 80% power.
20	0.91	Large effects remain the main detectable target.
50	0.57	Moderate-to-large effects become detectable.
100	0.40	Moderate effects are detectable with conventional power.

Note. Values use two-sided alpha = 0.05, 80% power, equal group sizes, and sd = 1. They should be treated as planning summaries rather than guarantees. The 100-per-group row is shown only as a comparison point and exceeds the book's primary small-sample scope.

With 12 participants per group, the design reaches 80% power only for standardised effects of about d = 1.20 or larger. If the smallest meaningful effect is d = 0.40, a non-significant result from that design cannot rule it out. The correct conclusion is that the study was too imprecise to decide the question.
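The values in Table 16.2 can be approximated with base R's power.t.test(), which solves for the effect size when it is the one quantity left unspecified. The wrapper name mde() is hypothetical; the settings (two-sided alpha = 0.05, 80% power, sd = 1, equal groups) follow the table note.

```r
# Minimum detectable standardised effect (sd = 1) for a two-sample t-test,
# two-sided alpha = 0.05 and 80% power, solved with power.t.test().
mde <- function(n_per_group, power = 0.80, alpha = 0.05) {
  power.t.test(n = n_per_group, power = power, sd = 1,
               sig.level = alpha)$delta
}

n_values <- c(12, 20, 50, 100)
mde_table <- data.frame(n_per_group = n_values,
                        mde = round(sapply(n_values, mde), 2))
mde_table
```

Because the calculation uses only planned design inputs, it can be run before any outcome data exist.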

This design-based MDE check is different from post-hoc power calculated from the observed effect size. MDE uses the planned design inputs: sample size, alpha, and a target power level. Post-hoc power based on the observed effect is circular because it is mostly a re-expression of the p-value; it does not add information after the study has been analysed and is not recommended (Hoenig and Heisey 2001).
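The circularity can be illustrated with a normal-approximation sketch in which "observed power" is computed from nothing but the p-value. The function name observed_power() is hypothetical, and the z-test approximation is a simplifying assumption rather than the exact t-based calculation.

```r
# Normal-approximation sketch: "observed power" depends only on the p-value.
observed_power <- function(p, alpha = 0.05) {
  z_obs  <- qnorm(1 - p / 2)      # |z| implied by the two-sided p-value
  z_crit <- qnorm(1 - alpha / 2)  # critical value of the test
  # Probability of rejecting if the true effect equalled the observed one.
  pnorm(z_obs - z_crit) + pnorm(-z_obs - z_crit)
}

p_grid <- c(0.05, 0.15, 0.35, 0.70)
round(observed_power(p_grid), 2)
```

A p-value of exactly 0.05 maps to observed power of about 50%, and every larger p-value maps to lower observed power, so reporting low post-hoc power for a non-significant result only restates that the result was non-significant.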

Equivalence and Non-Inferiority

If the scientific question is whether two treatments are similar enough, ordinary null-hypothesis testing is the wrong framework. Equivalence testing starts by defining a margin: the largest difference that would still be practically negligible. The two one-sided tests (TOST) procedure then asks whether the 100(1 - 2 alpha)% confidence interval, a 90% interval when alpha = 0.05, lies entirely inside that margin (Lakens, Scheel, and Isager 2018). Non-inferiority testing uses a one-sided version when the goal is to show that a new treatment is not unacceptably worse than a standard option.

In the anxiety-score example below, the observed difference is 2 points on a 0-100 scale. The prespecified equivalence margin is +/-5 points. The ordinary p-value is non-significant, but the 95% CI extends from -3 to 7, so values above the +5 margin remain plausible.
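Because TOST at alpha = 0.05 corresponds to a 90% confidence interval, it can help to convert the reported 95% interval before comparing it with the margin. The sketch below does this from the summary values of the example, assuming df = 22 (12 per group, equal variances); the variable names are illustrative.

```r
# Summary values from the worked example; df = n1 + n2 - 2 with 12 per group.
estimate <- 2
ci95 <- c(-3, 7)
df <- 22
margin <- 5

# Standard error implied by the reported 95% CI.
se <- diff(ci95) / (2 * qt(0.975, df))

# 90% CI, which corresponds to TOST at alpha = 0.05.
ci90 <- estimate + c(-1, 1) * qt(0.95, df) * se
round(ci90, 2)

# Equivalence requires the whole 90% CI to sit inside (-margin, +margin).
inside_margin <- ci90[1] > -margin && ci90[2] < margin
inside_margin
```

The 90% interval is narrower than the 95% interval, but its upper limit still lies above +5, so the conclusion does not depend on which of the two intervals is inspected.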

The TOST statistics are derived from the confidence interval and margin. With df = 22, the standard error implied by the 95% CI is SE = (7 - (-3)) / (2 * qt(0.975, 22)) = 2.41. The lower-margin statistic is (2 - (-5)) / SE = 2.90; the upper-margin statistic is (2 - 5) / SE = -1.24. The same calculation can be reproduced with a dedicated function such as TOSTER::TOSTtwo.raw() when the group means, standard deviations and sample sizes are available.

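# The values below reproduce the worked example:
# group mean difference = 2, n = 12 per group, SD = 5.91 in each group,
# so that SE = 5.91 * sqrt(2 / 12) = 2.41.
# Newer TOSTER releases document tsum_TOST() as the successor to
# TOSTtwo.raw(); check which function the installed version recommends.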
library(TOSTER)

TOSTtwo.raw(
  m1 = 52,
  m2 = 50,
  sd1 = 5.91,
  sd2 = 5.91,
  n1 = 12,
  n2 = 12,
  low_eqbound = -5,
  high_eqbound = 5,
  alpha = 0.05,
  var.equal = TRUE
)

When TOSTER is not available, the manual calculation in Table 16.3 is sufficient: compute the standard error, test the lower margin with the upper-tail probability, test the upper margin with the lower-tail probability, and conclude equivalence only if both one-sided p-values are below alpha.

Table 16.3

Two one-sided tests for the equivalence example

Test	Null	Statistic	p-value	Interpretation
Lower-margin test	Difference <= -5 points	t = 2.90	0.004	The data reject differences at or below -5 points.
Upper-margin test	Difference >= +5 points	t = -1.24	0.113	The data do not reject differences at or above +5 points.
Overall TOST decision	Both one-sided tests must reject	--	max p = 0.113	Equivalence is not established because the upper-margin test is not significant.

Note. TOST evaluates H0: Delta <= -delta and H0: Delta >= +delta, where delta is the equivalence margin. Equivalence is concluded only if both one-sided tests reject at alpha, which is equivalent to the 100(1 - 2 alpha)% CI lying inside (-delta, +delta).
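The manual calculation summarised in Table 16.3 can be sketched in base R from the summary values alone. The variable names are illustrative, and df = 22 assumes 12 participants per group with equal variances.

```r
# Manual TOST from summary values (worked example; 12 per group, df = 22).
estimate <- 2
margin <- 5
df <- 22
alpha <- 0.05
se <- (7 - (-3)) / (2 * qt(0.975, df))  # SE implied by the 95% CI

# Lower-margin test: H0 difference <= -margin, rejected by a large positive t.
t_lower <- (estimate - (-margin)) / se
p_lower <- pt(t_lower, df, lower.tail = FALSE)

# Upper-margin test: H0 difference >= +margin, rejected by a large negative t.
t_upper <- (estimate - margin) / se
p_upper <- pt(t_upper, df, lower.tail = TRUE)

# Equivalence is concluded only if both one-sided p-values are below alpha.
equivalent <- max(p_lower, p_upper) < alpha

round(c(t_lower = t_lower, p_lower = p_lower,
        t_upper = t_upper, p_upper = p_upper), 3)
equivalent
```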

The lower-margin test rules out differences worse than -5 points, but the upper-margin test does not rule out differences of +5 points or more. The result should therefore be reported as non-significant and not equivalent. A larger or more precise study would be needed to support an equivalence claim.

Reporting Non-Significant Results

Responsible reporting avoids definitive language unless the design and interval justify it. The report should state the estimate, confidence interval, p-value, practical threshold, and design limitation. If the result is inconclusive, say so directly.

Limitations Paragraph Template

Use a limitations paragraph to explain what the study could and could not rule out:

“This study estimated [effect/comparison] with [sample size/design]. The point estimate was [estimate] and the confidence interval ranged from [lower] to [upper]. Because the interval [does/does not] include the prespecified practically important threshold of [threshold], the result should be interpreted as [inconclusive / compatible with no meaningful effect / insufficient for equivalence]. With this sample size, the design had 80% power only for effects of approximately [MDE] or larger, so smaller effects remain unresolved. Future work should [replicate with larger n / improve measurement precision / use a paired design / pre-specify equivalence margins].”

Table 16.4

Safer language for reporting non-significant results

Weak wording	Stronger wording
There was no effect.	No statistically significant difference was observed, and the 95% CI should be used to judge plausible effects.
The treatments were equivalent.	The observed difference was small, but the confidence interval did not stay within the prespecified equivalence margin.
The study proved the null hypothesis.	The data did not provide strong evidence against the null hypothesis; they do not prove that the effect is zero.
The result was non-significant, so smaller effects do not matter.	With n = 12 per group, the study had 80% power only for effects of about d = 1.20 or larger; smaller effects could not be ruled out.

Note. Claims of equivalence or no meaningful effect require a prespecified margin and a sufficiently precise interval.

Bayes factors can also quantify relative evidence for a null model versus an alternative model, but they are sensitive to the prior distribution for the effect size under the alternative (Morey and Rouder 2011; Dienes 2014; Wagenmakers et al. 2018). Report the prior distribution and, when feasible, conduct a prior-sensitivity analysis showing how the Bayes factor changes across plausible specifications. In introductory small-sample reporting, a confidence-interval and equivalence-margin approach is usually more transparent for readers.
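A prior-sensitivity check of this kind might look like the sketch below, which assumes the BayesFactor package is installed and uses its ttestBF() and extractBF() functions. The data are simulated purely for illustration, and the three Cauchy prior scales are the values that package labels "medium", "wide", and "ultrawide"; a real analysis should justify its own set of plausible priors.

```r
# Simulated, illustrative data only (not from a real study).
set.seed(16)
control <- rnorm(12, mean = 50, sd = 6)
treated <- rnorm(12, mean = 52, sd = 6)

# Candidate Cauchy prior scales for the standardised effect under H1.
prior_scales <- c(medium = sqrt(2) / 2, wide = 1, ultrawide = sqrt(2))

if (requireNamespace("BayesFactor", quietly = TRUE)) {
  bf10 <- sapply(prior_scales, function(r) {
    bf <- BayesFactor::ttestBF(x = treated, y = control, rscale = r)
    BayesFactor::extractBF(bf)$bf  # evidence for H1 relative to H0
  })
  # Report BF01 (null over alternative) for each prior scale.
  print(round(1 / bf10, 2))
}
```

If the Bayes factors point in the same direction with similar strength across the scales, the conclusion is robust to the prior; if they diverge, the report should say so.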

Key Takeaways

Non-significant results are not automatically negative findings. A narrow interval can rule out effects large enough to matter, whereas a wide interval leaves the study inconclusive. Equivalence and non-inferiority claims require a prespecified practical margin and evidence that the interval stays within that margin. In small-sample research, the strongest reporting pairs the p-value with the estimate, confidence interval, minimum detectable effect, and a clear statement of what remains unresolved.

Self-Assessment Quiz

References

Ayre, Colin, and Andrew J. Scally. 2014. ‘Critical Values for Lawshe’s Content Validity Ratio: Revisiting the Original Methods of Calculation’. Measurement and Evaluation in Counseling and Development 47 (1): 79–86. https://doi.org/10.1177/0748175613513808.

Briggs, Steven R., and Jonathan M. Cheek. 1986. ‘The Role of Factor Analysis in the Development and Evaluation of Personality Scales’. Journal of Personality 54 (1): 106–48.

Buuren, Stef van. 2018. Flexible Imputation of Missing Data. 2nd ed. Boca Raton, FL: Chapman & Hall/CRC.

Cameron, A. Colin, and Pravin K. Trivedi. 2013. Regression Analysis of Count Data. 2nd ed. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9781139013567.

Chen, Henian, Patricia Cohen, and Sophie Chen. 2010. ‘How Big Is a Big Odds Ratio? Interpreting the Magnitudes of Odds Ratios in Epidemiological Studies’. Communications in Statistics - Simulation and Computation 39 (4): 860–64. https://doi.org/10.1080/03610911003650383.

Cicchetti, Domenic V. 1994. ‘Guidelines, Criteria, and Rules of Thumb for Evaluating Normed and Standardized Assessment Instruments in Psychology’. Psychological Assessment 6 (4): 284–90. https://doi.org/10.1037/1040-3590.6.4.284.

Clark, Lee Anna, and David Watson. 1995. ‘Constructing Validity: Basic Issues in Objective Scale Development’. Psychological Assessment 7 (3): 309–19.

Cohen, Jacob. 1988. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates.

Conover, W. J. 1999. Practical Nonparametric Statistics. 3rd ed. New York: John Wiley & Sons.

Costello, Anna B., and Jason Osborne. 2005. ‘Best Practices in Exploratory Factor Analysis: Four Recommendations for Getting the Most from Your Analysis’. Practical Assessment, Research, and Evaluation 10 (1): 1–9. https://doi.org/10.7275/jyj1-4868.

Davison, A. C., and D. V. Hinkley. 1997. Bootstrap Methods and Their Application. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge: Cambridge University Press.

DeVellis, Robert F., and Carolyn T. Thorpe. 2021. Scale Development: Theory and Applications. 5th ed. Thousand Oaks, CA: SAGE Publications.

Dienes, Zoltan. 2014. ‘Using Bayes to Get the Most Out of Non-Significant Results’. Frontiers in Psychology 5: 781. https://doi.org/10.3389/fpsyg.2014.00781.

Efron, Bradley, and Robert J. Tibshirani. 1993. An Introduction to the Bootstrap. New York, NY: Chapman & Hall/CRC.

Elm, Erik von, Douglas G. Altman, Matthias Egger, Stuart J. Pocock, Peter C. Gøtzsche, and Jan P. Vandenbroucke. 2007. ‘The Strengthening the Reporting of Observational Studies in Epidemiology (STROBE) Statement: Guidelines for Reporting Observational Studies’. PLoS Medicine 4 (10): e296. https://doi.org/10.1371/journal.pmed.0040296.

Fay, Michael P. 2010. ‘Confidence Intervals That Match Fisher’s Exact or Blaker’s Exact Tests’. Biostatistics 11 (2): 373–74. https://doi.org/10.1093/biostatistics/kxp050.

Firth, David. 1993. ‘Bias Reduction of Maximum Likelihood Estimates’. Biometrika 80 (1): 27–38. https://doi.org/10.1093/biomet/80.1.27.

Gelman, Andrew, Daniel Simpson, and Michael Betancourt. 2017. ‘The Prior Can Often Only Be Understood in the Context of the Likelihood’. Entropy 19 (10): 555. https://doi.org/10.3390/e19100555.

Good, Phillip I. 2005. Permutation, Parametric, and Bootstrap Tests of Hypotheses. 3rd ed. New York: Springer.

Graham, John W. 2009. ‘Missing Data Analysis: Making It Work in the Real World’. Annual Review of Psychology 60: 549–76. https://doi.org/10.1146/annurev.psych.58.110405.085530.

Hair, Joseph F., G. Tomas M. Hult, Christian M. Ringle, and Marko Sarstedt. 2017. A Primer on Partial Least Squares Structural Equation Modeling (PLS-SEM). 2nd ed. Thousand Oaks, CA: SAGE Publications.

Harrell, Frank E. 2015. Regression Modeling Strategies: With Applications to Linear Models, Logistic and Ordinal Regression, and Survival Analysis. 2nd ed. Springer Series in Statistics. New York: Springer. https://doi.org/10.1007/978-3-319-19425-7.

Heinze, Georg, and Michael Schemper. 2002. ‘A Solution to the Problem of Separation in Logistic Regression’. Statistics in Medicine 21 (16): 2409–19. https://doi.org/10.1002/sim.1047.

Hodges, Jr., J. L., and E. L. Lehmann. 1963. ‘Estimates of Location Based on Rank Tests’. The Annals of Mathematical Statistics 34 (2): 598–611. https://doi.org/10.1214/aoms/1177704172.

Hoenig, John M., and Dennis M. Heisey. 2001. ‘The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis’. The American Statistician 55 (1): 19–24.

Hosmer, David W., Stanley Lemeshow, and Rodney X. Sturdivant. 2013. Applied Logistic Regression. 3rd ed. Wiley Series in Probability and Statistics. Hoboken, NJ: John Wiley & Sons. https://doi.org/10.1002/9781118548387.

Hu, Li-tze, and Peter M. Bentler. 1999. ‘Cutoff Criteria for Fit Indexes in Covariance Structure Analysis: Conventional Criteria Versus New Alternatives’. Structural Equation Modeling 6 (1): 1–55. https://doi.org/10.1080/10705519909540118.

Huberty, Carl J., and Stephen Olejnik. 2006. Applied MANOVA and Discriminant Analysis. 2nd ed. Hoboken, NJ: Wiley. https://doi.org/10.1002/047178947X.

Hwang, Ching-Lai, and Kwangsun Yoon. 1981. Multiple Attribute Decision Making: Methods and Applications. Berlin: Springer. https://doi.org/10.1007/978-3-642-48318-9.

Hyndman, Rob J., and George Athanasopoulos. 2021. Forecasting: Principles and Practice. 3rd ed. Melbourne, Australia: OTexts. https://otexts.com/fpp3/.

Koo, Terry K., and Mae Y. Li. 2016. ‘A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research’. Journal of Chiropractic Medicine 15 (2): 155–63. https://doi.org/10.1016/j.jcm.2016.02.012.

Kraft, Matthew A. 2020. ‘Interpreting Effect Sizes of Education Interventions’. Educational Researcher 49 (4): 241–53. https://doi.org/10.3102/0013189X20912798.

Lakens, Daniël. 2013. ‘Calculating and Reporting Effect Sizes to Facilitate Cumulative Science: A Practical Primer for t-Tests and ANOVAs’. Frontiers in Psychology 4: 863. https://doi.org/10.3389/fpsyg.2013.00863.

Lakens, Daniël, Anne M. Scheel, and Peder M. Isager. 2018. ‘Equivalence Testing for Psychological Research: A Tutorial’. Advances in Methods and Practices in Psychological Science 1 (2): 259–69. https://doi.org/10.1177/2515245918770963.

Lawshe, Charles H. 1975. ‘A Quantitative Approach to Content Validity’. Personnel Psychology 28 (4): 563–75. https://doi.org/10.1111/j.1744-6570.1975.tb01393.x.

Lipsey, Mark W., and David B. Wilson. 1993. ‘The Efficacy of Psychological, Educational, and Behavioral Treatment: Confirmation from Meta-Analysis’. American Psychologist 48 (12): 1181–1209. https://doi.org/10.1037/0003-066X.48.12.1181.

Little, Roderick J. A. 1988. ‘A Test of Missing Completely at Random for Multivariate Data with Missing Values’. Journal of the American Statistical Association 83 (404): 1198–1202. https://doi.org/10.1080/01621459.1988.10478722.

MacKinnon, David P., Jennifer L. Krull, and Chondra M. Lockwood. 2000. ‘Equivalence of the Mediation, Confounding and Suppression Effect’. Prevention Science 1 (4): 173–81.

Mair, Patrick, and Rand R. Wilcox. 2020. ‘Robust Statistical Methods in R Using the WRS2 Package’. Behavior Research Methods 52 (2): 464–88. https://doi.org/10.3758/s13428-019-01246-w.

Mann, Henry B., and Donald R. Whitney. 1947. ‘On a Test of Whether One of Two Random Variables Is Stochastically Larger Than the Other’. The Annals of Mathematical Statistics 18 (1): 50–60. https://doi.org/10.1214/aoms/1177730491.

McDonald, Roderick P. 1999. Test Theory: A Unified Treatment. Mahwah, NJ: Lawrence Erlbaum Associates.

Morey, Richard D., and Jeffrey N. Rouder. 2011. ‘Bayes Factor Approaches for Testing Interval Null Hypotheses’. Psychological Methods 16 (4): 406–19. https://doi.org/10.1037/a0024377.

Nielsen, Jakob. 1993. Usability Engineering. San Francisco, CA: Morgan Kaufmann.

Olsson, Ulf. 1979. ‘Maximum Likelihood Estimation of the Polychoric Correlation Coefficient’. Psychometrika 44 (4): 443–60.

Opricovic, Serafim, and Gwo-Hshiung Tzeng. 2004. ‘Compromise Solution by MCDM Methods: A Comparative Analysis of VIKOR and TOPSIS’. European Journal of Operational Research 156 (2): 445–55. https://doi.org/10.1016/S0377-2217(03)00020-1.

Page, Matthew J., Joanne E. McKenzie, Patrick M. Bossuyt, Isabelle Boutron, Tammy C. Hoffmann, Cynthia D. Mulrow, Larissa Shamseer, et al. 2021. ‘The PRISMA 2020 Statement: An Updated Guideline for Reporting Systematic Reviews’. BMJ 372: n71. https://doi.org/10.1136/bmj.n71.

R Core Team. 2026. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.

Rubin, Donald B. 1987. Multiple Imputation for Nonresponse in Surveys. New York, NY: Wiley.

Saaty, Thomas L. 1980. The Analytic Hierarchy Process. New York, NY: McGraw-Hill.

Schulz, Kenneth F., Douglas G. Altman, and David Moher. 2010. ‘CONSORT 2010 Statement: Updated Guidelines for Reporting Parallel Group Randomised Trials’. BMJ 340: c332. https://doi.org/10.1136/bmj.c332.

Shan, Guogen. 2018. Exact Statistical Inference for Categorical Data. London: Academic Press.

Sterne, Jonathan A. C., Ian R. White, John B. Carlin, Michael Spratt, Patrick Royston, Michael G. Kenward, Angela M. Wood, and James R. Carpenter. 2009. ‘Multiple Imputation for Missing Data in Epidemiological and Clinical Research: Potential and Pitfalls’. BMJ 338: b2393. https://doi.org/10.1136/bmj.b2393.

Teare, M. Dawn, Munyaradzi Dimairo, Neil Shephard, Amanda Hayman, Amy Whitehead, and Stephen J. Walters. 2014. ‘Sample Size Requirements to Estimate Key Design Parameters from External Pilot Randomised Controlled Trials: A Simulation Study’. Trials 15: 264. https://doi.org/10.1186/1745-6215-15-264.

Tomczak, Maciej, and Ewa Tomczak. 2014. ‘The Need to Report Effect Size Estimates Revisited: An Overview of Some Recommended Measures of Effect Size’. Trends in Sport Sciences 21 (1): 19–25.

Trizano-Hermosilla, Italo, and Jose M. Alvarado. 2016. ‘Best Alternatives to Cronbach’s Alpha Reliability in Realistic Conditions: Congeneric and Asymmetrical Measurements’. Frontiers in Psychology 7: 769. https://doi.org/10.3389/fpsyg.2016.00769.

Van de Schoot, Rens, and Milica Miočević. 2020. Small Sample Size Solutions: A Guide for Applied Researchers and Practitioners. London: Routledge.

Vehtari, Aki, Andrew Gelman, and Jonah Gabry. 2017. ‘Practical Bayesian Model Evaluation Using Leave-One-Out Cross-Validation and WAIC’. Statistics and Computing 27 (5): 1413–32. https://doi.org/10.1007/s11222-016-9696-4.

Wagenmakers, Eric-Jan, Jonathon Love, Maarten Marsman, Tahira Jamil, Alexander Ly, Josine Verhagen, Ravi Selker, et al. 2018. ‘Bayesian Inference for Psychology. Part II: Example Applications with JASP’. Psychonomic Bulletin & Review 25 (1): 58–76. https://doi.org/10.3758/s13423-017-1323-7.

White, Ian R., Patrick Royston, and Angela M. Wood. 2011. ‘Multiple Imputation Using Chained Equations: Issues and Guidance for Practice’. Statistics in Medicine 30 (4): 377–99. https://doi.org/10.1002/sim.4067.