```{r}
#| include: false
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(htmltools))
chapter4_measurement_emit_table_styles <- function() {
if (isTRUE(getOption("smallsamplelab.chapter4_measurement_table_styles_emitted"))) {
return(invisible(NULL))
}
cat(
"<style>
.chapter4-measurement-apa-table-block {
font-family: 'Times New Roman', Georgia, serif;
color: #111;
max-width: 52rem;
margin-bottom: 1rem;
}
.chapter4-measurement-apa-table-number {
margin: 0 0 0.1rem 0;
font-weight: 700;
}
.chapter4-measurement-apa-table-title {
margin: 0 0 0.55rem 0;
font-style: italic;
}
.chapter4-measurement-apa-table {
width: 100%;
border-collapse: collapse;
font-size: 0.98rem;
line-height: 1.35;
}
.chapter4-measurement-apa-table th,
.chapter4-measurement-apa-table td {
padding: 0.35rem 0.5rem;
border-left: none !important;
border-right: none !important;
background: transparent !important;
vertical-align: top;
}
.chapter4-measurement-apa-table thead th {
border-top: 2px solid #000;
border-bottom: 1px solid #000;
font-weight: 600;
}
.chapter4-measurement-apa-table tbody tr:last-child td {
border-bottom: 2px solid #000;
}
.chapter4-measurement-apa-table-note {
margin: 0.45rem 0 0 0;
font-size: 0.92rem;
line-height: 1.35;
}
</style>\n",
sep = ""
)
options(smallsamplelab.chapter4_measurement_table_styles_emitted = TRUE)
invisible(NULL)
}
chapter4_measurement_html_table <- function(number, title, data, note = NULL,
align = rep("l", ncol(data)),
col.names = names(data)) {
chapter4_measurement_emit_table_styles()
table_view <- tags$div(
class = "chapter4-measurement-apa-table-block",
tags$p(class = "chapter4-measurement-apa-table-number", paste("Table", number)),
tags$p(class = "chapter4-measurement-apa-table-title", title),
HTML(
knitr::kable(
data,
format = "html",
align = align,
col.names = col.names,
escape = FALSE,
table.attr = "class='chapter4-measurement-apa-table'"
)
),
if (!is.null(note)) {
tags$p(class = "chapter4-measurement-apa-table-note", HTML(note))
}
)
cat(as.character(table_view))
invisible(NULL)
}
```
:::: {.content-visible when-format="html"}
```{webr-r}
#| context: setup
#| include: false
#| echo: false
library(dplyr)
library(tidyr)
library(tibble)
library(purrr)
library(psych)
chapter4_measurement_html_table <- function(number, title, data, note = NULL,
align = rep("l", ncol(data)),
col.names = names(data)) {
print(data)
invisible(data)
}
```
::::
# Chapter 4: Measurement Quality and Scale Development
### Learning Objectives
By the end of this chapter, you will be able to:

- Explain the distinctions between content, construct, and criterion validity
- Pilot and refine short scales with limited samples
- Compute and interpret basic item-level diagnostics in R
- Report measurement evidence transparently when full psychometric validation is not yet feasible
### The Challenge of Measurement in Small Studies
Many small-sample studies rely on brief, custom-developed measurement instruments. Standard scale development protocols (large pilot studies, factor analysis, item response theory) require hundreds of observations. With small samples, researchers must balance the need for reliable, valid measurement with practical constraints.
Short scales (3–5 items) can be internally consistent and valid if items are carefully chosen. Pilot testing with qualitative feedback (cognitive interviews, think-aloud protocols) can identify ambiguous wording, response biases, and cultural appropriateness. Quantitative pilot data (even with n of about 20 to 30) can reveal extreme floor or ceiling effects, items with no variance, and obvious inconsistencies.
### Content and Face Validity
Content validity refers to whether items comprehensively and appropriately represent the construct being measured. Face validity refers to whether items appear relevant and appropriate to respondents. Face validity is not a psychometric property in its own right and therefore carries limited evidentiary weight; it should complement, not substitute for, stronger content or construct validity evidence. Both are assessed through expert review and respondent feedback during scale development, before quantitative pilot testing, typically with input from both domain experts and representatives of the target population.
#### Content Validity Ratio (CVR)
Lawshe's Content Validity Ratio provides a simple index of expert agreement on whether an item is "essential" to a construct. With $N$ experts and $n_e$ rating an item as essential, the CVR is:
$$
\operatorname{CVR} = \frac{n_e - N/2}{N/2}
$$
CVR ranges from −1 to +1. Positive values indicate majority agreement that the item is essential, but retention thresholds depend on panel size. Under the conventional one-tailed $\alpha = 0.05$ rule, a panel of eight experts requires a CVR of 0.75 or higher, which means at least seven experts must judge the item essential [@lawshe1975; @ayre2014]. These critical values correspond to exact binomial probabilities and were re-examined by Ayre and Scally [@ayre2014]. Use CVR alongside qualitative feedback to decide which items to retain.
:::: {.content-visible when-format="html"}
::::: {.panel-tabset group="part-b-data-collection-chapter-10-measurement-quality-and-scale-development-cell-1"}
#### Rendered Output
```{r}
#| label: part-b-chunk-05-html
#| echo: false
# CVR example: 8 experts, 6 judge the item essential
n_experts <- 8
n_essential <- 6
cvr <- (n_essential - n_experts / 2) / (n_experts / 2)
cvr
```
#### Cell Code
```{webr-r}
#| context: interactive
# CVR example: 8 experts, 6 judge the item essential
n_experts <- 8
n_essential <- 6
cvr <- (n_essential - n_experts / 2) / (n_experts / 2)
cvr
```
:::::
::::
:::: {.content-visible unless-format="html"}
```{r}
#| label: part-b-chunk-05
# CVR example: 8 experts, 6 judge the item essential
n_experts <- 8
n_essential <- 6
cvr <- (n_essential - n_experts / 2) / (n_experts / 2)
cvr
```
::::
Interpretation: A CVR of 0.50 reflects that 6 of 8 experts (75%) rated the item essential. However, for an 8-member panel the usual retention threshold is 0.75, which requires agreement from at least 7 experts [@lawshe1975; @ayre2014]. Because 0.50 is below 0.75, this item would not satisfy Lawshe's criterion for statistically significant content validity. Researchers should consult verified critical-value tables when using CVR for item-retention decisions.
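Those critical values can be reproduced from first principles. The sketch below is a minimal illustration, assuming a one-tailed test at $\alpha = 0.05$ against a chance endorsement probability of 0.5: for a given panel size it finds the smallest number of "essential" ratings whose exact binomial tail probability falls below $\alpha$ and converts that count into the corresponding CVR threshold.

```r
# Minimal sketch: exact binomial critical values for Lawshe's CVR
# under a one-tailed test at alpha = .05 with chance endorsement p = .5.
cvr_critical <- function(n_experts, alpha = 0.05) {
  candidates <- 0:n_experts
  # P(X >= k) when each expert independently says "essential" with p = .5
  p_upper <- pbinom(candidates - 1, n_experts, 0.5, lower.tail = FALSE)
  n_e <- min(candidates[p_upper <= alpha])
  c(n_essential = n_e, cvr_threshold = (n_e - n_experts / 2) / (n_experts / 2))
}
cvr_critical(8)   # expect n_essential = 7 and a CVR threshold of 0.75
```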
#### Content Validity Index (CVI)
The Content Validity Index is a more descriptive companion to CVR and is common in health, education, and applied measurement work. For an item-level CVI (I-CVI), experts rate each item for relevance, often on a 1–4 scale, and the analyst computes the proportion of experts rating the item as either 3 or 4. A scale-level average CVI (S-CVI/Ave) is then the mean of the item-level CVIs across the candidate item set. CVI does not replace qualitative review, but it gives a transparent summary of expert agreement before a small quantitative pilot begins.
For example, suppose five experts rated six candidate items on a four-point relevance scale. Ratings of 3 or 4 are counted as content-valid endorsements.
| Item | Experts rating 3 or 4 | I-CVI | Pilot decision |
|---|---:|---:|---|
| Item 1 | 5 of 5 | 1.00 | Retain |
| Item 2 | 4 of 5 | 0.80 | Retain, check wording |
| Item 3 | 3 of 5 | 0.60 | Revise before pilot |
| Item 4 | 5 of 5 | 1.00 | Retain |
| Item 5 | 2 of 5 | 0.40 | Drop or rewrite |
| Item 6 | 4 of 5 | 0.80 | Retain, check redundancy |
The S-CVI/Ave for this set is 0.77, calculated as the average of the six I-CVI values. The table is a discussion tool, not a mechanical keep/drop decision rule. Items 3 and 5 need substantive review because several experts did not judge them clearly relevant. Items 2 and 6 may be acceptable, but the qualitative comments should be checked for ambiguity or overlap.
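The same arithmetic is straightforward to script. The sketch below uses a hypothetical 6 × 5 ratings matrix constructed to match the endorsement counts in the table above (the individual rating values are invented for illustration, not collected data) and computes the item-level and scale-level indices.

```r
# Minimal sketch: item-level CVI (I-CVI) and scale-level S-CVI/Ave.
# `ratings` is a hypothetical 6-item x 5-expert matrix of 1-4 relevance
# ratings chosen to match the endorsement counts in the table above.
ratings <- rbind(
  item_1 = c(4, 4, 3, 4, 4),
  item_2 = c(4, 3, 3, 4, 2),
  item_3 = c(3, 2, 4, 2, 3),
  item_4 = c(4, 4, 4, 3, 4),
  item_5 = c(2, 3, 2, 4, 1),
  item_6 = c(3, 4, 2, 4, 4)
)
i_cvi <- rowMeans(ratings >= 3)   # proportion of experts rating 3 or 4
s_cvi_ave <- mean(i_cvi)          # average of the item-level CVIs
round(i_cvi, 2)
round(s_cvi_ave, 2)               # 0.77 for this item set
```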
### Construct and Criterion Validity in Small Samples
Content validity is only one part of measurement quality. Researchers also need to consider whether scores behave as theory predicts, which is the core of **construct validity**, and whether they relate to a meaningful external benchmark, which is the core of **criterion validity**. In small studies, these checks should usually be modest and pre-specified. Rather than claiming a full psychometric validation from n = 20–40, look for tentative supporting evidence by comparing groups that theory predicts should differ on the construct, examining correlations with one or two closely related measures while reporting effect sizes and confidence intervals, or checking whether scores predict a practically relevant external outcome. Report effect sizes and uncertainty, avoid post-hoc fishing for significant associations, and avoid strong claims from unstable estimates.
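As a minimal illustration of such a pre-specified check, the sketch below simulates scale scores for two small groups that theory predicts should differ and reports the mean difference with a confidence interval and a standardized effect size. The group sizes, means, and seed are arbitrary choices for illustration, not chapter data.

```r
# Minimal sketch of a pre-specified known-groups check in a small pilot.
# The scores are simulated for illustration only.
set.seed(123)
scores_a <- rnorm(12, mean = 4.2, sd = 1.0)   # group theory predicts lower
scores_b <- rnorm(13, mean = 5.0, sd = 1.0)   # group theory predicts higher

# Welch t-test gives the mean difference with a confidence interval
welch <- t.test(scores_b, scores_a)
welch$estimate
welch$conf.int

# Simple standardized difference (Cohen's d with a pooled SD)
pooled_sd <- sqrt(((length(scores_a) - 1) * var(scores_a) +
                   (length(scores_b) - 1) * var(scores_b)) /
                  (length(scores_a) + length(scores_b) - 2))
(mean(scores_b) - mean(scores_a)) / pooled_sd
```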
### Reliability, Validity, and Measurement Error
Reliability concerns **consistency**; validity concerns whether the instrument measures the **intended construct**. A scale can be reliable without being valid. In an early pilot, item statistics and internal consistency estimates are best treated as screening tools that flag obvious weaknesses, not as definitive proof that the scale is ready for high-stakes use. Chapter 5 extends this discussion for short scales.
If items are scored or coded by human raters, agreement among raters becomes part of measurement quality. With two raters, report a chance-corrected agreement statistic such as Cohen's kappa alongside the raw agreement percentage; with more than two raters, use a multi-rater extension such as Fleiss' kappa or an intraclass correlation coefficient when scores are numeric. In small samples these estimates can be unstable, so describe the coding protocol, rater training, and disagreement-resolution process rather than reporting a coefficient alone.
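A minimal sketch of the two-rater case follows, using `psych::cohen.kappa()` on hypothetical category codes (the rating vectors are invented for illustration); it reports raw agreement alongside the chance-corrected coefficient.

```r
# Minimal sketch: two raters assign each of 20 responses to one of
# three categories. These codes are hypothetical, not chapter data.
rater1 <- c(1, 2, 2, 3, 1, 1, 2, 3, 3, 2, 1, 2, 2, 1, 3, 2, 1, 3, 2, 2)
rater2 <- c(1, 2, 3, 3, 1, 1, 2, 3, 2, 2, 1, 2, 2, 1, 3, 2, 2, 3, 2, 2)

mean(rater1 == rater2)                      # raw percentage agreement
psych::cohen.kappa(cbind(rater1, rater2))   # chance-corrected agreement
```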
### Steps for Scale Development with Small Samples
1. **Define the construct clearly.** Start by specifying what you are measuring and which dimensions or facets belong inside that construct.
2. **Generate candidate items.** Write more items than you expect to keep so that weak or redundant items can be removed later without leaving gaps in content coverage.
3. **Expert review.** Ask domain experts to rate each candidate item for relevance, clarity, and representativeness, and use their feedback to flag wording or content problems early.
4. **Cognitive interviews.** Ask a small number of respondents (often n = 5–10) to complete the scale while thinking aloud so you can hear how they interpret the wording and response options.
5. **Quantitative pilot.** Administer the revised scale to a small sample (often n = 20–40) and compute descriptive item statistics to identify obvious weaknesses.
6. **Item analysis.** Review patterns such as low variance, weak corrected item-total correlations, and floor or ceiling effects to decide which items need revision.
7. **Refine and re-test.** Remove or revise problematic items and, if resources permit, test the revised version again because scale development is iterative rather than strictly linear [@devellis2021].
### Example: Item Analysis for a Pilot Scale
We pilot a 5-item job satisfaction scale with n = 25 employees. Each item uses a 1–7 Likert response.
:::: {.content-visible when-format="html"}
::::: {.panel-tabset group="part-b-data-collection-chapter-10-measurement-quality-and-scale-development-cell-2"}
#### Rendered Output
```{r}
#| label: part-b-chunk-06-html
#| echo: false
#| results: asis
library(tidyverse)
library(psych)
set.seed(2025)
n <- 25
latent_satisfaction <- rnorm(n)
# Simulated pilot data: 25 respondents, 5 items
# clear_communication, staff_courtesy, and overall_satisfied track the same construct.
# wait_time_ok has a restricted high-end range; problem_resolved discriminates poorly.
lkrt7 <- function(z) {
as.integer(cut(z, breaks = c(-Inf, -1.5, -0.9, -0.3, 0.3, 0.9, 1.5, Inf), labels = FALSE))
}
pilot_data <- tibble(
respondent = 1:n,
clear_communication = lkrt7(latent_satisfaction + rnorm(n, 0, 0.8)),
staff_courtesy = lkrt7(latent_satisfaction + rnorm(n, 0, 0.8)),
wait_time_ok = pmax(4L, lkrt7(0.25 * latent_satisfaction + 1.3 + rnorm(n, 0, 0.55))),
problem_resolved = lkrt7(0.10 * latent_satisfaction + rnorm(n, 0, 1.2)),
overall_satisfied = lkrt7(latent_satisfaction + rnorm(n, 0, 0.8))
)
# Item descriptive statistics
item_stats <- pilot_data %>%
select(clear_communication, staff_courtesy, wait_time_ok, problem_resolved, overall_satisfied) %>%
summarise(across(everything(), list(
mean = mean,
sd = sd,
min = min,
max = max
))) %>%
pivot_longer(everything(), names_to = c("item", ".value"), names_pattern = "^(.+)_(mean|sd|min|max)$")
# Inter-item correlations
items_only <- select(pilot_data, clear_communication, staff_courtesy, wait_time_ok, problem_resolved, overall_satisfied)
cor_matrix <- cor(items_only)
# Item-total correlations (corrected for item overlap)
invisible(capture.output(
alpha_full <- suppressWarnings(psych::alpha(items_only))
))
item_total <- alpha_full$item.stats %>%
as.data.frame() %>%
rownames_to_column("item") %>%
transmute(
item,
r_drop = round(r.drop, 2),
mean = round(mean, 2),
sd = round(sd, 2)
)
item_stats_display <- item_stats %>%
mutate(
mean = formatC(mean, format = "f", digits = 2),
sd = formatC(sd, format = "f", digits = 2),
min = formatC(min, format = "f", digits = 0),
max = formatC(max, format = "f", digits = 0)
) %>%
rename(Item = item, Mean = mean, SD = sd, Min = min, Max = max)
cor_display <- round(cor_matrix, 2) %>%
as.data.frame() %>%
rownames_to_column("Item") %>%
mutate(across(-Item, ~ formatC(.x, format = "f", digits = 2)))
item_total_display <- item_total %>%
mutate(
r_drop = formatC(r_drop, format = "f", digits = 2),
mean = formatC(mean, format = "f", digits = 2),
sd = formatC(sd, format = "f", digits = 2)
) %>%
rename(
Item = item,
`Corrected item-total correlation` = r_drop,
Mean = mean,
SD = sd
)
if (knitr::is_html_output()) {
chapter4_measurement_html_table(
"4.1",
"Item descriptive statistics for the pilot scale",
item_stats_display,
note = "<em>Note.</em> Item 3 has the narrowest spread and the highest mean, which is consistent with a possible ceiling effect.",
align = c("l", "r", "r", "r", "r")
)
chapter4_measurement_html_table(
"4.2",
"Inter-item correlation matrix for the pilot scale",
cor_display,
note = "<em>Note.</em> Positive values indicate items moving in the same direction; negative values flag possible misfit, reverse coding, or a different latent facet.",
align = c("l", "r", "r", "r", "r", "r")
)
chapter4_measurement_html_table(
"4.3",
"Corrected item-total correlations for the pilot scale",
item_total_display,
note = sprintf(
"<em>Note.</em> Items 3 and 4 fall below the usual screening threshold of about 0.30. Cronbach's alpha for the initial 5-item scale was %.3f.",
alpha_full$total$raw_alpha
),
align = c("l", "r", "r", "r")
)
}
```
#### Cell Code
```{webr-r}
#| context: interactive
library(dplyr)
library(ggplot2)
library(tidyr)
library(tibble)
library(purrr)
library(readr)
library(psych)
set.seed(2025)
n <- 25
latent_satisfaction <- rnorm(n)
# Simulated pilot data: 25 respondents, 5 items
# clear_communication, staff_courtesy, and overall_satisfied track the same construct.
# wait_time_ok has a restricted high-end range; problem_resolved discriminates poorly.
lkrt7 <- function(z) {
as.integer(cut(z, breaks = c(-Inf, -1.5, -0.9, -0.3, 0.3, 0.9, 1.5, Inf), labels = FALSE))
}
pilot_data <- tibble(
respondent = 1:n,
clear_communication = lkrt7(latent_satisfaction + rnorm(n, 0, 0.8)),
staff_courtesy = lkrt7(latent_satisfaction + rnorm(n, 0, 0.8)),
wait_time_ok = pmax(4L, lkrt7(0.25 * latent_satisfaction + 1.3 + rnorm(n, 0, 0.55))),
problem_resolved = lkrt7(0.10 * latent_satisfaction + rnorm(n, 0, 1.2)),
overall_satisfied = lkrt7(latent_satisfaction + rnorm(n, 0, 0.8))
)
# Item descriptive statistics
item_stats <- pilot_data %>%
select(clear_communication, staff_courtesy, wait_time_ok, problem_resolved, overall_satisfied) %>%
summarise(across(everything(), list(
mean = mean,
sd = sd,
min = min,
max = max
))) %>%
pivot_longer(everything(), names_to = c("item", ".value"), names_pattern = "^(.+)_(mean|sd|min|max)$")
# Inter-item correlations
items_only <- select(pilot_data, clear_communication, staff_courtesy, wait_time_ok, problem_resolved, overall_satisfied)
cor_matrix <- cor(items_only)
# Item-total correlations (corrected for item overlap)
invisible(capture.output(
alpha_full <- suppressWarnings(psych::alpha(items_only))
))
item_total <- alpha_full$item.stats %>%
as.data.frame() %>%
rownames_to_column("item") %>%
transmute(
item,
r_drop = round(r.drop, 2),
mean = round(mean, 2),
sd = round(sd, 2)
)
item_stats_display <- item_stats %>%
mutate(
mean = formatC(mean, format = "f", digits = 2),
sd = formatC(sd, format = "f", digits = 2),
min = formatC(min, format = "f", digits = 0),
max = formatC(max, format = "f", digits = 0)
) %>%
rename(Item = item, Mean = mean, SD = sd, Min = min, Max = max)
cor_display <- round(cor_matrix, 2) %>%
as.data.frame() %>%
rownames_to_column("Item") %>%
mutate(across(-Item, ~ formatC(.x, format = "f", digits = 2)))
item_total_display <- item_total %>%
mutate(
r_drop = formatC(r_drop, format = "f", digits = 2),
mean = formatC(mean, format = "f", digits = 2),
sd = formatC(sd, format = "f", digits = 2)
) %>%
rename(
Item = item,
`Corrected item-total correlation` = r_drop,
Mean = mean,
SD = sd
)
chapter4_measurement_html_table(
"4.1",
"Item descriptive statistics for the pilot scale",
item_stats_display,
note = "<em>Note.</em> Item 3 has the narrowest spread and the highest mean, which is consistent with a possible ceiling effect.",
align = c("l", "r", "r", "r", "r")
)
chapter4_measurement_html_table(
"4.2",
"Inter-item correlation matrix for the pilot scale",
cor_display,
note = "<em>Note.</em> Positive values indicate items moving in the same direction; negative values flag possible misfit, reverse coding, or a different latent facet.",
align = c("l", "r", "r", "r", "r", "r")
)
chapter4_measurement_html_table(
"4.3",
"Corrected item-total correlations for the pilot scale",
item_total_display,
note = sprintf(
"<em>Note.</em> Items 3 and 4 fall below the usual screening threshold of about 0.30. Cronbach's alpha for the initial 5-item scale was %.3f.",
alpha_full$total$raw_alpha
),
align = c("l", "r", "r", "r")
)
```
:::::
::::
:::: {.content-visible unless-format="html"}
```{r}
#| label: part-b-chunk-06
#| results: asis
library(tidyverse)
library(psych)
set.seed(2025)
n <- 25
latent_satisfaction <- rnorm(n)
# Simulated pilot data: 25 respondents, 5 items
# clear_communication, staff_courtesy, and overall_satisfied track the same construct.
# wait_time_ok has a restricted high-end range; problem_resolved discriminates poorly.
lkrt7 <- function(z) {
as.integer(cut(z, breaks = c(-Inf, -1.5, -0.9, -0.3, 0.3, 0.9, 1.5, Inf), labels = FALSE))
}
pilot_data <- tibble(
respondent = 1:n,
clear_communication = lkrt7(latent_satisfaction + rnorm(n, 0, 0.8)),
staff_courtesy = lkrt7(latent_satisfaction + rnorm(n, 0, 0.8)),
wait_time_ok = pmax(4L, lkrt7(0.25 * latent_satisfaction + 1.3 + rnorm(n, 0, 0.55))),
problem_resolved = lkrt7(0.10 * latent_satisfaction + rnorm(n, 0, 1.2)),
overall_satisfied = lkrt7(latent_satisfaction + rnorm(n, 0, 0.8))
)
# Item descriptive statistics
item_stats <- pilot_data %>%
select(clear_communication, staff_courtesy, wait_time_ok, problem_resolved, overall_satisfied) %>%
summarise(across(everything(), list(
mean = mean,
sd = sd,
min = min,
max = max
))) %>%
pivot_longer(everything(), names_to = c("item", ".value"), names_pattern = "^(.+)_(mean|sd|min|max)$")
# Inter-item correlations
items_only <- select(pilot_data, clear_communication, staff_courtesy, wait_time_ok, problem_resolved, overall_satisfied)
cor_matrix <- cor(items_only)
# Item-total correlations (corrected for item overlap)
invisible(capture.output(
alpha_full <- suppressWarnings(psych::alpha(items_only))
))
item_total <- alpha_full$item.stats %>%
as.data.frame() %>%
rownames_to_column("item") %>%
transmute(
item,
r_drop = round(r.drop, 2),
mean = round(mean, 2),
sd = round(sd, 2)
)
item_stats_display <- item_stats %>%
rename(Item = item, Mean = mean, SD = sd, Min = min, Max = max)
cor_display <- round(cor_matrix, 2) %>%
as.data.frame() %>%
rownames_to_column("Item")
item_total_display <- item_total %>%
rename(
Item = item,
`Corrected item-total correlation` = r_drop,
Mean = mean,
SD = sd
)
print(knitr::kable(
item_stats_display,
align = c("l", "r", "r", "r", "r"),
digits = 2,
booktabs = TRUE,
caption = "Item descriptive statistics for the pilot scale."
))
print(knitr::kable(
cor_display,
align = c("l", "r", "r", "r", "r", "r"),
booktabs = TRUE,
caption = "Inter-item correlation matrix for the pilot scale."
))
print(knitr::kable(
item_total_display,
align = c("l", "r", "r", "r"),
booktabs = TRUE,
caption = "Corrected item-total correlations for the pilot scale."
))
```
::::
Reproducibility note: When adapting this code for your own pilot, record your `set.seed()` value and package versions so the same simulated item patterns can be regenerated.
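A lightweight way to do this is to print the versions of the key packages and the full session record at the end of the analysis script, as in the brief sketch below.

```r
# Record the seed used for any simulation plus the package versions in
# effect when the pilot analysis was run, so the results can be regenerated.
set.seed(2025)
packageVersion("psych")   # query key packages individually...
sessionInfo()             # ...or capture the full session record
```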
Cronbach's alpha for the initial 5-item scale was `r sprintf("%.3f", alpha_full$total$raw_alpha)`. Because this pilot uses only 25 respondents, the estimate should be treated cautiously: alpha can be unstable with small samples, so in applied work it is sensible either to report confidence intervals or to note the sampling uncertainty around the point estimate.
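One way to convey that uncertainty, assuming the `pilot_data` object from the example above is still available, is to request bootstrapped confidence limits from `psych::alpha()`; the number of bootstrap iterations below is an arbitrary illustrative choice.

```r
# Minimal sketch: bootstrapped 95% CI for Cronbach's alpha on the pilot
# items, assuming `pilot_data` from the example above is available.
items_only <- dplyr::select(pilot_data, clear_communication, staff_courtesy,
                            wait_time_ok, problem_resolved, overall_satisfied)
alpha_boot <- suppressWarnings(
  psych::alpha(items_only, n.iter = 1000)   # 1000 iterations is illustrative
)
alpha_boot$boot.ci   # lower, median, and upper bootstrap limits for alpha
```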
Interpretation: Look for a broadly coherent pattern rather than demanding that every statistic clear a rigid threshold. In this pilot, Items 1, 2, and 5 move together reasonably well. Item 3 has a restricted high-end range and a negative corrected item-total correlation, suggesting that it is not functioning like the rest of the scale. Item 4 contributes very little to the total score, which makes it a candidate for revision or removal as well. The corrected item-total correlation (`r.drop`) is especially useful here: values below approximately 0.30 often indicate weak discrimination and deserve closer review, but the threshold is context-dependent. Very short scales may legitimately show lower average inter-item correlations, and substantively important facets should not be dropped solely on statistical grounds.
### Identifying Problematic Items
Problematic items usually reveal themselves through a combination of weak descriptive and correlational signals. If an item has very low variance, most respondents are giving essentially the same answer, which often means the wording is too obvious, too extreme, or too narrow. If the corrected item-total correlation is weak, the item may be discriminating poorly or may be tapping a different construct. Floor or ceiling effects limit an item's ability to distinguish among respondents because most answers cluster at one end of the scale. Negative correlations are especially important warning signs because they often indicate incorrect reverse coding or an item working against the rest of the scale. Treat these patterns as heuristics rather than automatic deletion rules: a statistically weak item may still capture an important facet of the construct and be worth revising rather than dropping.
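These heuristics can be gathered into a single screening table. The sketch below assumes the `pilot_data` and `alpha_full` objects from the earlier example are available; the cut-offs are illustrative starting points for review, not deletion rules.

```r
# Minimal sketch: collect the screening heuristics into one table, assuming
# `pilot_data` and `alpha_full` from the earlier chunk are in the workspace.
library(dplyr)
library(tidyr)
library(tibble)

item_flags <- pilot_data %>%
  select(-respondent) %>%
  pivot_longer(everything(), names_to = "item", values_to = "response") %>%
  group_by(item) %>%
  summarise(
    mean = mean(response),
    sd = sd(response),
    min = min(response),
    max = max(response),
    .groups = "drop"
  ) %>%
  left_join(
    alpha_full$item.stats %>%
      rownames_to_column("item") %>%
      select(item, r_drop = r.drop),
    by = "item"
  ) %>%
  mutate(
    low_variance = sd < 0.70,            # most respondents answer alike
    near_ceiling = mean > 6.5,           # bunched at the top of the 1-7 scale
    near_floor = mean < 1.5,             # bunched at the bottom of the scale
    weak_discrimination = r_drop < 0.30  # weak corrected item-total correlation
  )
item_flags
```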
### Refining the Scale
Based on item analysis, revise or remove problematic items. For example, if Item 3 shows a ceiling effect and Item 4 has weak item-total correlation, consider removing them. Compute alpha for the revised scale.
:::: {.content-visible when-format="html"}
::::: {.panel-tabset group="part-b-data-collection-chapter-10-measurement-quality-and-scale-development-cell-3"}
#### Rendered Output
```{r}
#| label: part-b-chunk-07-html
#| echo: false
#| results: asis
# Revised scale: keep the three items that cohere best
invisible(capture.output(
alpha_full <- suppressWarnings(psych::alpha(select(pilot_data, clear_communication, staff_courtesy, wait_time_ok, problem_resolved, overall_satisfied)))
))
revised_items <- select(pilot_data, clear_communication, staff_courtesy, overall_satisfied)
invisible(capture.output(
alpha_revised <- suppressWarnings(psych::alpha(revised_items))
))
alpha_comparison <- tibble(
scale = c("Initial 5-item scale", "Revised 3-item scale"),
alpha = c(alpha_full$total$raw_alpha, alpha_revised$total$raw_alpha),
average_inter_item_r = c(alpha_full$total$average_r, alpha_revised$total$average_r)
) %>%
mutate(across(where(is.numeric), ~ formatC(.x, format = "f", digits = 3))) %>%
rename(
Scale = scale,
`Cronbach's alpha` = alpha,
`Average inter-item correlation` = average_inter_item_r
)
if (knitr::is_html_output()) {
chapter4_measurement_html_table(
"4.4",
"Internal consistency before and after item refinement",
alpha_comparison,
note = "<em>Note.</em> Removing wait_time_ok and problem_resolved improves both alpha and the average inter-item correlation in this simulated pilot example.",
align = c("l", "r", "r")
)
}
```
#### Cell Code
```{webr-r}
#| context: interactive
# Revised scale: keep the three items that cohere best
invisible(capture.output(
alpha_full <- suppressWarnings(psych::alpha(select(pilot_data, clear_communication, staff_courtesy, wait_time_ok, problem_resolved, overall_satisfied)))
))
revised_items <- select(pilot_data, clear_communication, staff_courtesy, overall_satisfied)
invisible(capture.output(
alpha_revised <- suppressWarnings(psych::alpha(revised_items))
))
alpha_comparison <- tibble(
scale = c("Initial 5-item scale", "Revised 3-item scale"),
alpha = c(alpha_full$total$raw_alpha, alpha_revised$total$raw_alpha),
average_inter_item_r = c(alpha_full$total$average_r, alpha_revised$total$average_r)
) %>%
mutate(across(where(is.numeric), ~ formatC(.x, format = "f", digits = 3))) %>%
rename(
Scale = scale,
`Cronbach's alpha` = alpha,
`Average inter-item correlation` = average_inter_item_r
)
chapter4_measurement_html_table(
"4.4",
"Internal consistency before and after item refinement",
alpha_comparison,
note = "<em>Note.</em> Removing wait_time_ok and problem_resolved improves both alpha and the average inter-item correlation in this simulated pilot example.",
align = c("l", "r", "r")
)
```
:::::
::::
:::: {.content-visible unless-format="html"}
```{r}
#| label: part-b-chunk-07
#| results: asis
# Revised scale: keep the three items that cohere best
invisible(capture.output(
alpha_full <- suppressWarnings(psych::alpha(select(pilot_data, clear_communication, staff_courtesy, wait_time_ok, problem_resolved, overall_satisfied)))
))
revised_items <- select(pilot_data, clear_communication, staff_courtesy, overall_satisfied)
invisible(capture.output(
alpha_revised <- suppressWarnings(psych::alpha(revised_items))
))
alpha_comparison <- tibble(
scale = c("Initial 5-item scale", "Revised 3-item scale"),
alpha = c(alpha_full$total$raw_alpha, alpha_revised$total$raw_alpha),
average_inter_item_r = c(alpha_full$total$average_r, alpha_revised$total$average_r)
) %>%
mutate(across(where(is.numeric), ~ round(.x, 3))) %>%
rename(
Scale = scale,
`Cronbach's alpha` = alpha,
`Average inter-item correlation` = average_inter_item_r
)
print(knitr::kable(
alpha_comparison,
align = c("l", "r", "r"),
booktabs = TRUE,
caption = "Internal consistency before and after item refinement."
))
```
::::
Interpretation: In this example, dropping Items 3 and 4 raises alpha relative to the initial 5-item scale. That does **not** mean researchers should always delete items with low statistics. Item removal should also reflect the construct definition, expert review, and respondent feedback. If Item 3 captures an important facet of job satisfaction, revising its wording may be preferable to dropping it entirely. For very short scales, it is also useful to inspect the average inter-item correlation rather than relying on alpha alone [@devellis2021].
### Qualitative Feedback and Cognitive Interviews
With very small samples (n < 20), quantitative item analysis is unreliable. Qualitative methods (cognitive interviews, focus groups) are more informative. Ask respondents:
- What does each item mean to you?
- Were any items confusing, ambiguous, or difficult to answer?
- Are the response options appropriate?
- Are any items culturally inappropriate or offensive?
This feedback can prevent major problems before larger-scale data collection.
### Key Takeaways
Measurement quality matters especially in small-sample research because unreliable measures reduce power and can distort substantive conclusions. In practice, strong small-sample scale development depends less on claiming full psychometric validation and more on combining expert review, respondent feedback, and cautious item-level diagnostics to identify obvious weaknesses. Iterative refinement, rather than one definitive pilot, is what gradually improves the scale when resources are limited.
### Self-Assessment Quiz
```{r}
#| echo: false
#| results: asis
source(normalizePath(file.path(dirname(knitr::current_input(dir = TRUE)), "..", "R", "quiz_helpers.R"), mustWork = TRUE))
smallsamplelab_render_quiz(list(
list(
prompt = "What is the primary advantage of using qualitative methods (cognitive interviews) over quantitative methods when pilot testing scales with very small samples (n < 20)?",
options = c("Cognitive interviews provide more statistical power", "Qualitative feedback can identify ambiguous wording and cultural issues without requiring statistical reliability", "Quantitative item analysis is too expensive", "Cognitive interviews automatically calculate Cronbach's alpha"),
answer = 2L,
explanation = "\"With very small samples (n < 20), quantitative item analysis is unreliable. Qualitative methods (cognitive interviews, focus groups) are more informative.\" Cognitive interviews reveal ambiguous wording and cultural issues without requiring the large samples needed for statistical reliability indices."
),
list(
prompt = "In Lawshe's Content Validity Ratio (CVR), if 8 experts are consulted and 6 judge an item as \"essential,\" what is the CVR?",
options = c("0.25", "0.50", "0.75", "1.00"),
answer = 2L,
explanation = "CVR = (6 - 8/2) / (8/2) = (6 - 4) / 4 = 2/4 = 0.50. That means 75% of experts judged the item essential. However, with N = 8 the usual Lawshe threshold is 0.75, so a CVR of 0.50 would not satisfy the standard retention criterion."
),
list(
prompt = "What does a \"ceiling effect\" in item analysis indicate?",
options = c("Most respondents give the lowest possible response", "Most respondents give the highest possible response", "The item has perfect reliability", "The item correlates negatively with the total score"),
answer = 2L,
explanation = "\"If most responses cluster at the low or high end, the item cannot differentiate among respondents.\" A ceiling effect occurs when most respondents give the highest possible response, limiting discrimination."
),
list(
prompt = "Which corrected item-total correlation threshold typically indicates that an item discriminates poorly and should be considered for removal?",
options = c("Above 0.7", "Between 0.3 and 0.7", "Below 0.3", "Exactly 0.5"),
answer = 3L,
explanation = "\"The corrected item-total correlation (r.drop) indicates how well each item correlates with the total score excluding itself. Values below about 0.3 often suggest weak items that deserve closer review.\" Such items may discriminate poorly, but final decisions should still consider scale length, construct coverage, and respondent feedback."
),
list(
prompt = "What does content validity assess?",
options = c("Whether items comprehensively and appropriately represent the construct being measured", "Whether the scale has high Cronbach's alpha", "Whether factor analysis confirms a one-dimensional structure", "Whether the scale predicts future behavior"),
answer = 1L,
explanation = "\"Content validity refers to whether items comprehensively and appropriately represent the construct being measured.\" Content validity is assessed through expert review, not statistical tests."
),
list(
prompt = "In a small pilot study, what would count as preliminary evidence for construct validity?",
options = c("A single high Cronbach's alpha", "Scores relate to other variables or known groups in the direction theory predicts", "Respondents say the scale looks professional", "A factor analysis with n = 25 produces one factor"),
answer = 2L,
explanation = "Construct validity concerns whether scores behave in ways theory predicts. In small studies, this usually means modest, pre-specified checks such as known-groups comparisons or correlations with closely related measures, reported with effect sizes and uncertainty rather than post-hoc fishing for significant associations."
),
list(
prompt = "In pilot testing with small samples (n of about 20 to 30), what is the primary limitation of conducting factor analysis?",
options = c("Factor analysis requires specialized software", "Factor analysis usually needs much larger samples for stable loadings; n of about 20 to 30 is typically too small", "Factor analysis only works with 7-point Likert scales", "Factor analysis cannot handle missing data"),
answer = 2L,
explanation = "Factor analysis usually needs substantially larger samples than a small pilot can provide. Common rules of thumb suggest about 5 to 10 participants per item or an absolute N of at least 100, although the exact requirement depends on communalities, item quality, and model complexity. A pilot sample of 20 to 30 is therefore usually too small for stable factor loadings, though it can still support preliminary item screening."
),
list(
prompt = "If an item correlates negatively with the total score and with other items, what is the most likely explanation?",
options = c("The item has high content validity", "The item may be reverse-coded incorrectly or measuring an opposite construct", "The sample size is too large", "The item has a ceiling effect"),
answer = 2L,
explanation = "\"If an item correlates negatively with the total or with other items, it may be reverse-coded incorrectly or measuring an opposite construct.\" Negative correlations suggest coding errors or conceptual misalignment with the scale."
),
list(
prompt = "What does criterion validity ask?",
options = c("Whether scale scores align with an external benchmark or relevant outcome", "Whether experts agree an item is essential", "Whether all items load on one factor", "Whether a scale has no missing values"),
answer = 1L,
explanation = "Criterion validity asks whether scores relate to an external benchmark, either at the same time (concurrent validity) or later (predictive validity). In small studies, this evidence should be reported cautiously with effect sizes and uncertainty."
),
list(
prompt = "In the example of the 5-item job satisfaction scale, Item 3 had a restricted range (responses only from 4–7 on a 1–7 scale). What does this suggest?",
options = c("Item 3 has perfect reliability", "Item 3 may have a ceiling effect, limiting its ability to differentiate among respondents", "Item 3 should be kept because high scores are desirable", "The sample size should be increased to 1,000"),
answer = 2L,
explanation = "\"Item 3 has a restricted range (4–7), which may indicate a ceiling effect.\" Restricted ranges (especially at the high end) indicate ceiling effects that limit the item's ability to differentiate among respondents."
)
))
```