Chapter 6: Developing Short Scales for Small Samples

Learning Objectives

By the end of this chapter, you will be able to explain why short-scale development with small samples requires staged validation rather than one definitive study, match psychometric tools to the sample sizes they can realistically support, use item-level diagnostics to refine candidate items, and report scale-development evidence transparently without overstating what a pilot can show.

The Scale Development Lifecycle

Scale development is inherently a multi-stage process. With small samples, researchers must be strategic about which psychometric analyses to conduct at each stage and which to reserve for later validation.

The small helper functions used below (for example, chapter6_simulate_scale() and chapter6_problem_item_diagnostics()) are defined in R/chapter_helpers.R and sourced in the setup chunk. Readers who want to run individual chunks can first run source("R/chapter_helpers.R") from the project root.

The Iterative Process

Stage 1: Item Generation (n = 5–10 cognitive interviews)

A first round of about 5 to 10 cognitive interviews usually surfaces the clearest wording and comprehension problems in an initial item pool (Nielsen 1993).

Goal: Generate a pool of 2–3× your target number of items and ensure they are comprehensible.

Methods:

  • Literature review: Identify existing scales and adapt items
  • Expert consultation: Subject matter experts suggest relevant content
  • Cognitive interviews: Think-aloud protocols with 5–10 participants from the target population

Example:

Table 6.1

Candidate items flagged during cognitive interviewing.

Item ID | Candidate Item                        | Interview Note
2       | I bounce back quickly from setbacks   | Ambiguous: what counts as 'quickly'?
7       | I rebound after difficult situations  | Too similar to item 2
13      | I seek support when needed            | Double-barreled: 'seek' and 'support'

Note. Items flagged during think-aloud work should be revised before any pilot administration.

Key Point: At this stage, do NOT collect quantitative data. Focus on qualitative feedback about item clarity, relevance, and comprehensiveness.

Stage 2: Pilot Testing (n = 20–30)

Goal: Identify problematic items before committing to a larger study.

Methods:

  • Administer all items to a small pilot sample
  • Compute item-total correlations (r.cor)
  • Check for ceiling/floor effects (> 80% at extreme response)
  • Examine item means and SDs (avoid items with no variance)

What you CAN do with n = 20–30:

At this stage, item-total correlations are screening tools rather than definitive item-deletion criteria. With n = 20–30, values near the usual 0.30 heuristic have substantial sampling variability, so a result like 0.28 versus 0.32 should be treated as a provisional flag for review rather than a mechanical keep-or-drop rule. The mean inter-item correlation guidance of Briggs and Cheek (1986) is useful here, but it is still a heuristic: for n = 25, a sample correlation of r = 0.30 has an approximate Fisher-z 95% CI from about -0.11 to 0.62. That interval is too wide to support automatic item deletion based on a single pilot correlation.
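As a quick check, that interval can be reproduced with a few lines of base R. This is a minimal sketch: the values 0.30 and 25 are the example figures from the paragraph above, not output from real pilot data.

# Approximate Fisher-z 95% CI for a pilot correlation of r = 0.30 with n = 25
r <- 0.30
n <- 25
z  <- atanh(r)                       # Fisher z-transform of r
se <- 1 / sqrt(n - 3)                # standard error on the z scale
ci <- tanh(z + c(-1.96, 1.96) * se)  # back-transform to the correlation scale
round(ci, 2)                         # roughly -0.11 and 0.62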

Table 6.2

Pilot-stage item diagnostics for the 12-item candidate scale.

Item  | Item-total r | Mean | SD   | Extreme % | Flag
WRS5  | NA           | 5.00 | 0.00 | 100       | Ceiling/floor effect
WRS8  | 0.10         | 1.64 | 0.49 | 36        | Weak item-total correlation
WRS11 | 0.32         | 3.84 | 0.37 | 0         | Low variance

Note. The pilot flags one ceiling item, one weak item-total correlation, and one low-variance item for revision or removal.

Interpretation:

  • WRS5: Ceiling effect (100% of responses at the maximum). Remove or reword.
  • WRS8: Weak item-total correlation of 0.10. Consider dropping or rewriting.
  • WRS11: Low variance (SD = 0.37) and weak discrimination. Reword for clarity or replace it.
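A minimal sketch of how diagnostics like these could be computed, assuming the pilot responses sit in a data frame called pilot_data with items scored 1–5 (both the object name and the scoring are assumptions; the book's own helper for this step is chapter6_problem_item_diagnostics()):

# Item means, SDs, and extreme-response percentages (assuming a 1-5 response scale)
item_summary <- data.frame(
  item    = names(pilot_data),
  mean    = round(colMeans(pilot_data, na.rm = TRUE), 2),
  sd      = round(apply(pilot_data, 2, sd, na.rm = TRUE), 2),
  extreme = round(100 * colMeans(pilot_data == 1 | pilot_data == 5, na.rm = TRUE))
)

# Corrected item-total correlations: each item against the sum of the other items
# (psych::alpha(pilot_data)$item.stats reports a related column, r.cor);
# a zero-variance item such as WRS5 will return NA here with a warning
total <- rowSums(pilot_data, na.rm = TRUE)
item_summary$item_total_r <- round(
  sapply(seq_along(pilot_data),
         function(j) cor(pilot_data[[j]], total - pilot_data[[j]], use = "complete.obs")),
  2
)
item_summary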

What you CANNOT do with n = 20–30:

Warning: Do NOT Overinterpret Alpha with n < 30

Cronbach’s alpha estimates are highly unstable with n < 30. The 95% confidence interval will usually be very wide. For example, when the observed alpha is around 0.65, an approximate interval of about [0.40, 0.85] would not be unusual. The exact width depends on both the observed alpha value and the sample size, so use the psych::alpha() output to report the interval from your own data.

Instead: Focus on item-level diagnostics (means, SDs, ceiling or floor patterns, and item-total correlations) to refine your scale. If software computes alpha while extracting those diagnostics, do not treat the pilot alpha as a stable reliability result. Defer formal reliability reporting to Stage 3.

Stage 3: Refinement (n = 50–100)

Goal: Estimate reliability and assess dimensionality.

Methods:

  • Cronbach’s alpha with confidence intervals
  • McDonald’s omega (if you suspect multidimensionality)
  • Split-half reliability as a robustness check
  • Exploratory Factor Analysis (EFA) if n ≥ 100 and you suspect subscales

Example:

Table 6.3

Refinement-stage reliability summary for the 8-item scale.

Metric                         | Value
Cronbach's alpha               | 0.833
Bootstrap 95% CI lower         | 0.747
Bootstrap 95% CI upper         | 0.885
Average inter-item correlation | 0.383
Mean split-half reliability    | 0.832
Minimum split-half reliability | 0.779
Maximum split-half reliability | 0.901

Note. The confidence interval is a percentile bootstrap interval from row resampling. Split-half values are Spearman-Brown-adjusted reliability coefficients from psych::splitHalf().

Interpretation:

  • Here the scale shows good research reliability (alpha = 0.833) with a bootstrap 95% CI of [0.747, 0.885].
  • The mean split-half reliability is 0.832, with adjusted split-half values ranging from 0.779 to 0.901, which supports the same conclusion from a second perspective.
  • With n = 60, uncertainty is still present, but the interval is now narrow enough to support cautious reporting.
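A sketch of how numbers like those in Table 6.3 could be produced, assuming the refined 8-item responses are in a data frame called refinement_data (a hypothetical name; the chapter's setup code generates the simulated version):

library(psych)

# Cronbach's alpha point estimate
psych::alpha(refinement_data)$total$raw_alpha

# Percentile bootstrap 95% CI from row resampling
set.seed(123)
boot_alpha <- replicate(2000, {
  idx <- sample(nrow(refinement_data), replace = TRUE)
  psych::alpha(refinement_data[idx, ], warnings = FALSE)$total$raw_alpha
})
quantile(boot_alpha, c(0.025, 0.975))

# Spearman-Brown-adjusted split-half reliabilities (mean, minimum, maximum)
psych::splitHalf(refinement_data)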

Exploratory Factor Analysis (EFA) with n = 50–100:

Figure 6.1: Parallel analysis for the candidate short scale.

Figure 6.1 shows the parallel-analysis result for the simulated 8-item scale. In this example, the factor solution is the relevant guide: one factor is retained, and the fitted one-factor model explains about 43.9% of the total variance. Some software also prints a separate “components” recommendation. For early scale development, the factor recommendation is the substantively relevant result (Costello and Osborne 2005).

Caution: Even when parallel analysis suggests one dominant factor in a simulated example like this, n = 100 still only supports preliminary guidance. Treat EFA results as provisional until a larger validation sample can confirm the structure. For early psychological scale development, a first factor explaining roughly 40% to 60% of the variance is often acceptable as an initial signal rather than a final validation result (Costello and Osborne 2005).
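For completeness, a hedged sketch of the parallel-analysis and one-factor EFA step, again using the hypothetical refinement_data object:

library(psych)

# Parallel analysis; the factor (not component) suggestion is the relevant one here
pa <- psych::fa.parallel(refinement_data, fa = "both", n.iter = 100)
pa$nfact

# One-factor EFA to inspect loadings and the proportion of variance explained
efa_one <- psych::fa(refinement_data, nfactors = 1, fm = "minres")
print(efa_one$loadings, cutoff = 0.30)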

Stage 4: Validation (n = 150+)

Goal: Confirm scale structure and establish validity.

In this chapter, SEM refers to structural equation modelling. In Chapter 5, SEM denoted the standard error of measurement. The context distinguishes the two uses.

Methods:

  • Confirmatory Factor Analysis (CFA): Test hypothesized factor structure
  • Test-retest reliability: Administer scale twice (2–4 weeks apart)
  • Convergent validity: Correlate with theoretically related measures
  • Discriminant validity: Show low correlation with unrelated constructs
  • Known-groups validity: Scale discriminates between relevant groups

Example:

# CFA requires lavaan package and n ≥ 150
library(lavaan)

# Define 1-factor model
model <- '
  resilience =~ WRS1 + WRS2 + WRS3 + WRS4 + WRS5 + WRS6 + WRS7 + WRS8
'

# Fit model
cfa_result <- cfa(model, data = validation_data)
summary(cfa_result, fit.measures = TRUE, standardized = TRUE)

# Fit indices to assess model adequacy (conventional heuristics):
# - CFI >= 0.90 (acceptable), >= 0.95 (good)
# - RMSEA <= 0.08 (acceptable), <= 0.06 (good)
# - SRMR <= 0.08 (acceptable)
# Evaluate fit holistically rather than treating cutoffs as automatic rules.

Use these fit indices as heuristics rather than absolute pass-fail rules. CFI compares the hypothesised model with a baseline model in which the items are treated as unrelated. Larger values indicate better relative fit. RMSEA estimates lack of fit per model degree of freedom. Smaller values indicate closer approximate fit. SRMR summarises the average standardised residual discrepancy between the observed and model-implied correlations. Conventional thresholds such as CFI >= 0.95, RMSEA <= 0.06, and SRMR <= 0.08 are useful vocabulary, but they were developed mostly for larger samples and should not be applied mechanically in small validation studies (Hu and Bentler 1999). If fit is poor, report the indices, inspect the pattern of residuals and loadings, and explain the limitation. Do not repeatedly refit the model until the cutoffs are met.
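The specific indices named above can be pulled from the fitted lavaan object in one call. This continues the sketch above and reuses the assumed cfa_result object:

# Extract the fit indices discussed above from the fitted model
lavaan::fitMeasures(cfa_result, c("cfi", "tli", "rmsea", "rmsea.ci.lower",
                                  "rmsea.ci.upper", "srmr"))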

Some short scales also contain both a broad general construct and narrower item clusters. In that setting, a bifactor model can separate the general factor from group factors and can support omega-hierarchical as an estimate of reliability for the general factor. This is a larger-sample validation tool, not a pilot-stage shortcut.
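As a hedged sketch of that larger-sample option, psych::omega() fits a bifactor-style (Schmid-Leiman) solution and reports omega-hierarchical; validation_items is an assumed data frame holding the scale's item responses.

library(psych)

# General factor plus group factors; omega_h estimates general-factor reliability
omega_fit <- psych::omega(validation_items, nfactors = 3, fm = "minres", plot = FALSE)
omega_fit$omega_h     # omega-hierarchical
omega_fit$omega.tot   # omega total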

Test-retest reliability:

# Compute scale scores at Time 1 and Time 2
validation_data <- validation_data %>%
  dplyr::mutate(
    resilience_t1 = rowMeans(dplyr::select(., WRS1_t1:WRS8_t1), na.rm = TRUE),
    resilience_t2 = rowMeans(dplyr::select(., WRS1_t2:WRS8_t2), na.rm = TRUE)
  )

# Intraclass correlation coefficient (ICC)
# Specify the model explicitly because ICC values depend on the chosen form.
library(irr)
icc_result <- icc(
  cbind(validation_data$resilience_t1, validation_data$resilience_t2),
  model = "twoway",
  type = "agreement",
  unit = "single"
)
print(icc_result)

# ICC > 0.75 is often treated as good test-retest reliability,
# but always report the ICC model and confidence interval.

When reporting ICC, specify the model and interpretation rule directly in the text. A two-way agreement ICC above about 0.75 is often treated as good evidence of test-retest reliability, but the exact value depends on the ICC form chosen (Cicchetti 1994; Koo and Li 2016).

Special Considerations for n < 50

What You CANNOT Do

Warning: Analyses That Require Larger Samples

With n < 50, the following analyses are not feasible or will produce unreliable results:

  1. Exploratory Factor Analysis (EFA): Rule of thumb is n ≥ 100 or 5–10 participants per item
  2. Confirmatory Factor Analysis (CFA): Requires n ≥ 150–200 for stable parameter estimates
  3. Measurement Invariance Testing: Requires n ≥ 200 per group
  4. Structural Equation Modelling (SEM): Complex models need n ≥ 200–400
  5. PLS-SEM (Partial Least Squares SEM): Despite “small-sample” marketing claims, stable path estimates still usually need at least n ≈ 100–150
  6. Item Response Theory (IRT): Most models require n ≥ 250–500
  7. Reliable Cronbach’s alpha: With n < 30, alpha estimates have 95% CIs spanning 0.3–0.4 units

Reserve these methods for larger studies; forcing them onto very small samples produces misleading results.


Question: “Can I use SEM, CFA, or PLS-SEM with my small sample (n < 100)?”

Short Answer: No. SEM-based methods require substantially larger samples than this book’s target range (n = 10–100).

Minimum Sample Size Requirements

Method                              | Minimum n | Realistic n | Why?
Confirmatory Factor Analysis (CFA)  | 150       | 200-300     | Stable factor loadings, fit indices
Structural Equation Modelling (SEM) | 200       | 300-500     | Complex path models, multiple latent variables
PLS-SEM                             | 100       | 150-300     | Despite “small-sample” marketing claims, stable path estimates usually need at least 100-150 cases
Multi-Group SEM                     | 200/group | 300/group   | Measurement invariance testing

Rule of Thumb: 10-20 observations per estimated parameter is a common starting heuristic (e.g., 5 indicators + 3 paths = 8 parameters → need 80-160 observations), but actual requirements vary with model complexity, indicator quality, and estimation method.

That is why the table above should be read as planning guidance rather than a universal rulebook. Even methods sometimes marketed as “small-sample friendly,” such as PLS-SEM, still need enough observations for stable path estimates and standard errors (Hair et al. 2017).

What Happens If You Ignore These Requirements?

With n < 100, SEM/CFA/PLS-SEM tends to produce unstable parameter estimates. Factor loadings can fluctuate sharply after very small data changes, path coefficients often carry very large standard errors, and conclusions may change when only a handful of observations are added or removed.

Small samples also increase the risk of non-convergent or improper solutions. Maximum-likelihood estimation may fail to converge, Heywood cases such as negative variances or loadings above 1.0 become more likely, and analysts can end up imposing arbitrary constraints simply to force the model to run.

Even when the software returns output, the fit statistics are difficult to trust. With small n, χ², CFI, TLI, and RMSEA can look reassuring for the wrong reasons, modification indices often suggest spurious changes, and a model may appear to fit the sample well only because it has overfit noise that will not replicate in new data.

The final danger is false confidence. SEM software will still print parameter estimates, p-values, confidence intervals, and polished path diagrams, but that appearance of technical completeness does not make the results trustworthy. Reviewers and readers will rightly question claims built on latent-variable models that the sample size cannot support.

What Should You Do Instead? (For n < 100)

Use the methods in THIS book:

SEM Goal                   | Small-Sample Alternative                       | Chapter          | Minimum n
Assess scale reliability   | Cronbach’s α, McDonald’s ω, split-half         | Ch 6             | 30-50
Validate items             | Item-total correlations, alpha-if-deleted      | Ch 6             | 30-50
Reduce dimensionality      | Sum/mean composite scores                      | Ch 6             | 20+
Test relationships (X → Y) | Regression with composite scores               | Ch 5             | 30-50
Multiple predictors        | Penalized regression (ridge/lasso/elastic net) | Ch 13            | 50-100
Mediation (X → M → Y)      | Simple mediation with bootstrap CIs            | Part E Project 5 | 80-100
Latent correlations        | Polychoric correlations (exploratory)          | Ch 6             | 50+
Measurement precision      | Standard Error of Measurement (SEM statistic)  | Ch 6             | 30+

Key Principle: Composite scores are your friend.

  • Sum or average your scale items to create observed composite variables
  • Use these composites in regression, t-tests, ANOVA
  • Acknowledge measurement error in the limitations section
  • Plan a larger validation study (n ≥ 200) for future CFA/SEM

Example: Replacing SEM with Composite-Score Analysis

Proposed SEM Model (n = 60):

Job_Satisfaction (5 items) → Turnover_Intent (3 items)
     ↑
Performance (4 items)

Small-Sample Alternative:
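One reasonable translation of the diagram into a composite-score analysis, assuming hypothetical item columns js1:js5 (job satisfaction), perf1:perf4 (performance), and ti1:ti3 (turnover intent) in a data frame called survey_data:

library(dplyr)

# Build observed composite scores by averaging each item set
survey_data <- survey_data %>%
  dplyr::mutate(
    job_satisfaction = rowMeans(dplyr::select(., js1:js5), na.rm = TRUE),
    performance      = rowMeans(dplyr::select(., perf1:perf4), na.rm = TRUE),
    turnover_intent  = rowMeans(dplyr::select(., ti1:ti3), na.rm = TRUE)
  )

# Ordinary regression on the composites in place of a latent-variable model
fit <- lm(turnover_intent ~ job_satisfaction + performance, data = survey_data)
summary(fit)
confint(fit)  # report interval estimates alongside coefficients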

Advantages: This composite-score approach works with n = 60, remains interpretable because coefficients refer to changes in averaged scale scores, and is usually more robust than forcing a latent-variable model onto limited data. It is also more honest, because it acknowledges that observed composites are being analysed directly rather than pretending the sample is large enough for stable latent-variable estimation.

Limitations to Acknowledge: Composite scores still contain measurement error, they cannot test complex factor structures, and they do not separate within-item from between-item variance. Missing-data rules also need to be stated explicitly so readers know when partial composites were allowed and when cases were excluded.

When to Pursue SEM: Collect n ≥ 200 in a follow-up study. Then:

  1. Use EFA to explore factor structure (if theory is unclear)
  2. Use CFA to confirm the measurement model
  3. Test structural paths with latent variables
  4. Assess model fit rigorously

Software Will Let You Do Bad Things

Warning: SmartPLS, AMOS, Mplus, and other SEM software will happily run with n = 50. They will produce:

  • Parameter estimates
  • p-values
  • Fit indices
  • Pretty path diagrams

This does NOT mean the results are trustworthy. Software cannot judge whether your sample size is adequate—you must.

Bottom Line

For n = 10–100, the methods in this book are the defensible default because they are appropriate for small samples, robust to common assumption problems, honest about uncertainty, and interpretable for substantive readers and reviewers. Save SEM for a later study with n ≥ 200. Until then, composite scores combined with transparent regression-style analyses will usually serve the research question much better.


What You CAN Do

With n = 20–50, focus on these feasible and informative analyses:

  1. Content validity. Use expert review, cognitive interviews, and explicit links to the theoretical framework to decide whether the item pool is clear, relevant, and broad enough before you rely on any statistics.

  2. Item-level diagnostics. Examine item means, standard deviations, skewness, corrected item-total correlations, floor or ceiling effects, and the inter-item correlation matrix to flag items that are obviously misbehaving.

  3. Preliminary reliability (n ≥ 30). Report Cronbach’s alpha with its 95% confidence interval, add split-half reliability as a robustness check, and inspect the mean inter-item correlation to see whether the scale is too loose or too redundant.

  4. Known-groups validity. Compare scale scores between groups that theory predicts should differ, and use nonparametric methods such as the Mann–Whitney U test if the scale scores are skewed or ordinal (see the sketch after this list).

  5. Preparation for a larger validation study. Document the item-generation process, report pilot decisions transparently including which items were dropped and why, and use the pilot to specify the hypotheses that a later CFA or validation study will test.
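A minimal sketch of the known-groups comparison from item 4 above, assuming a composite scale_score and a two-level grouping variable group in a hypothetical pilot_data frame:

# Known-groups validity: do groups that theory says should differ actually differ?
# The Wilcoxon rank-sum (Mann-Whitney U) test is robust to skewed or ordinal scores
wilcox.test(scale_score ~ group, data = pilot_data, conf.int = TRUE)

# Group medians for reporting alongside the test
aggregate(scale_score ~ group, data = pilot_data, FUN = median)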

Example: Documenting a Small-Sample Scale Development

Table 6.4

Illustrative staged reporting summary for a short-scale development project.

Development Stage | Sample Size                  | Analyses Conducted                                             | Key Findings
Item Generation   | n = 8 (cognitive interviews) | Think-aloud protocols; expert review (CVI = 0.88)              | Generated 15 items; 2 flagged as ambiguous
Pilot Testing     | n = 25                       | Item-total correlations; ceiling/floor checks; 3 items dropped | Dropped WRS5 (ceiling), WRS8 (r.cor = 0.10), WRS11 (low variance)
Refinement        | n = 60                       | Alpha = 0.83 [95% CI: 0.75, 0.89]; split-half = 0.78-0.90      | 8-item scale shows acceptable reliability for research use
Validation        | Planned: n = 200             | CFA, test-retest ICC, convergent validity (planned)            | Pending larger validation sample

Note. Report the actual pilot sample sizes, item deletions, and planned validation targets transparently.

Reporting Guidelines for Small-Sample Scale Development

When publishing or reporting scale development work with n < 50:

  1. Acknowledge limitations explicitly:

    • “With n = 25, we could not reliably estimate Cronbach’s alpha. Instead, we focused on item-total correlations to identify weak items.”
    • “Exploratory factor analysis was not feasible (n = 60); we plan CFA with a larger sample (target n = 200).”
  2. Report what you did (not what you wish you could do): Avoid presenting alpha as a definitive reliability estimate when n < 30. If a journal requires it, report the confidence interval and describe the result as preliminary screening information. Do not force EFA or CFA with inadequate samples. Instead, explain the theoretical rationale for the proposed item grouping and reserve the structural test for a later study.

  3. Frame as a preliminary/pilot study: State plainly that the work is a pilot or refinement study and explain what that means for interpretation. For example: “This pilot study (n = 35) was designed to refine item wording and identify problematic items before a larger validation study,” or “Results are preliminary and should be interpreted with caution pending validation with n ≥ 150.”

  4. Provide detailed item-level information:

    • Publish item means, SDs, item-total correlations
    • Report which items were dropped and why
    • Share cognitive interview feedback (qualitative)
  5. Plan and fund the validation study:

    • Use pilot data to justify sample size for validation (power analysis for CFA)
    • Secure resources for n ≥ 150–200 before claiming a “validated” scale

Key Takeaways

Scale development with small samples is best understood as a staged process rather than a single psychometric event. Very small pilots are most useful for qualitative refinement and item-level diagnostics, somewhat larger samples can support cautious preliminary reliability work, and only much larger studies can justify structural validation through EFA, CFA, and broader validity testing. Across all of those stages, transparency about what was and was not feasible is what makes the evidence credible.


Self-Assessment Quiz

Question 1. Why is scale development with small samples best treated as a multi-stage process?

Explanation.

The chapter presents scale development as an iterative process. Early stages focus on item generation and pilot diagnostics, while later stages support reliability estimation, factor analysis, and validation. Small samples can support some of these steps, but not all of them at once.

Question 2. At the item-generation stage, what is the main priority?

Explanation.

Stage 1 emphasises item generation and comprehension. The chapter explicitly states that this stage should focus on literature review, expert consultation, and cognitive interviews rather than quantitative psychometric estimation.

Question 3. With a pilot sample of about 20 to 30 participants, which analysis is most appropriate?

Explanation.

The chapter recommends item-level diagnostics at the pilot stage: item-total correlations, ceiling and floor checks, and basic descriptive statistics. These analyses help identify weak items before larger validation work.

Question 4. Why does the chapter warn against relying on Cronbach’s alpha with n < 30?

Explanation.

The warning callout explains that alpha estimates are highly unstable with n < 30, making the confidence interval so wide that the estimate is not very informative for decision-making.

Question 5. What should you do when an item shows a ceiling effect in the pilot data?

Explanation.

The chapter’s pilot example flags ceiling effects as a sign that an item may not discriminate well. The recommended response is to remove or reword the item.

Question 6. At roughly what stage does the chapter consider exploratory factor analysis potentially feasible?

Explanation.

The chapter states that EFA is more defensible at the refinement stage and ideally with n around 100 or more. With smaller samples, the results are described as exploratory and unstable.

Question 7. What is the main goal of the validation stage?

Explanation.

Stage 4 is the validation phase. It is where the chapter places confirmatory factor analysis, test-retest reliability, convergent validity, discriminant validity, and known-groups validity.

Question 8. Which reporting practice best fits a pilot scale-development study with n = 35?

Explanation.

The reporting guidelines emphasise honesty about what the study could and could not show. With a pilot sample, the chapter recommends framing findings as preliminary and stating the need for a larger validation study.

Question 9. Why is transparency especially important in small-sample scale development?

Explanation.

The chapter repeatedly stresses transparency: report what you did, what you did not do, and why. Small samples can support useful refinement decisions, but they do not justify overclaiming full validation.

Question 10. Which conclusion is most consistent with the chapter’s overall message?

Explanation.

Small samples require staged, realistic claims. Early work can be rigorous when researchers match their analyses to the evidence their sample can support, even if full validation must wait for a larger dataset.