Are You Measuring What You Think? — Validity & Measurement in Quantitative Research

Levi Cheptora

Tue, 16 Dec 2025

Abstract

Quantitative research rests on a deceptively simple but crucial question: Are you measuring what you think you’re measuring? Measurement is not automatic; scores are meaningful only if the instrument validly and reliably represents the construct of interest. This article presents an expanded, practical framework for designing, adapting, testing, analysing, and reporting measurement instruments in multi-language and multi-country research. Key topics include operational definitions; instrument selection and adaptation; reliability (internal consistency, test–retest); multiple facets of validity (content, criterion-related, construct); cross-cultural adaptation steps (translation/back-translation, cognitive interviews); and statistical tests such as exploratory and confirmatory factor analysis, measurement invariance (configural, metric, scalar), and differential item functioning. The piece ends with actionable checklists, a worked vignette, common pitfalls, and an APA-formatted reference list to guide researchers, evaluators, and practitioners working across cultural contexts.


Introduction — the single most important question

In quantitative research, the most consequential decision often happens before data collection: defining what we want to measure. Reliable data collection and sophisticated analysis are wasted if the instrument does not capture the intended construct. Flawed measurement leads to biased estimates, invalid inferences, wasted resources, and—when decisions affect health, education, or policy—potential harm (Messick, 1995; Cronbach & Meehl, 1955). This article synthesizes conceptual guidance and practical steps for researchers working across languages and cultures.


1. Start here: define the construct precisely

Before writing items or selecting a scale, write a concise operational definition (1–3 sentences) that describes:

  • The theoretical domain (what the construct is), and

  • The observable behaviors, symptoms, or responses that indicate the construct.

Example: “Depressive symptom severity — the frequency and intensity of affective, cognitive, and somatic symptoms of depression experienced over the past two weeks, as expressed in daily functioning and mood.” This definition should guide item selection and content coverage (Cronbach & Meehl, 1955; AERA et al., 2014).


2. Choose or adapt instruments intentionally

Prefer validated instruments that match your construct and context. If adaptation is necessary, follow a structured process:

  • Review existing measures and their psychometric evidence.

  • Translation / back-translation with bilingual experts.

  • Committee review to reconcile linguistic/semantic differences.

  • Cognitive interviews with members of each target population to check comprehension and cultural relevance.

  • Pilot testing in each language/group before large-scale administration (Streiner & Norman, 2008; van de Vijver & Leung, 1997).


3. Reliability — consistency is necessary but not sufficient

Establish reliability to ensure scores are consistent:

  • Internal consistency (e.g., Cronbach’s α, complemented by McDonald’s ω where possible). Cronbach’s α depends on scale length and assumes tau-equivalence, so interpret it cautiously (a brief computational sketch follows this list).

  • Test–retest reliability for constructs expected to be stable over the retest interval.

  • Interrater reliability when human raters score responses.

High reliability is a prerequisite for validity, but it does not guarantee that the instrument measures the intended construct (Lord & Novick, 1968).
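A minimal sketch of these checks in Python, assuming item responses sit as columns of a pandas DataFrame; pingouin and factor_analyzer are illustrative package choices, the data are simulated, and the ω value is an approximation from a single-factor model of standardized items.

```python
import numpy as np
import pandas as pd
import pingouin as pg
from factor_analyzer import FactorAnalyzer

# Hypothetical data: six Likert-style items driven by one latent trait
rng = np.random.default_rng(42)
latent = rng.normal(size=300)
items = pd.DataFrame(
    {f"item{i}": latent + rng.normal(scale=1.0, size=300) for i in range(1, 7)}
)

# Cronbach's alpha with a 95% confidence interval
alpha, ci = pg.cronbach_alpha(data=items)
print(f"alpha = {alpha:.2f}, 95% CI = {ci}")

# Approximate McDonald's omega (total) from a single-factor model on standardized items
fa = FactorAnalyzer(n_factors=1, rotation=None)
fa.fit((items - items.mean()) / items.std())
loadings = fa.loadings_.flatten()
uniquenesses = 1 - loadings ** 2
omega = loadings.sum() ** 2 / (loadings.sum() ** 2 + uniquenesses.sum())
print(f"omega (approx.) = {omega:.2f}")
```

In practice, report both estimates separately for each language or group, alongside test–retest correlations where the design allows.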


4. Validity — multiple sources of evidence

Validity is a unified concept composed of multiple evidentiary strands (Messick, 1995; AERA et al., 2014):

  • Content validity: Do items represent the construct domain? Use expert panels and content mapping.

  • Criterion-related validity: Do scores predict relevant outcomes (concurrent or predictive)? Establish correlations with gold-standard measures when available.

  • Construct validity: Include convergent and discriminant evidence (Campbell & Fiske, 1959). Use multitrait–multimethod matrices where feasible (a toy correlation sketch follows this list).

  • Consequential validity: Consider the implications and uses of scores, especially across cultures (Messick, 1995).
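As a toy illustration of convergent and discriminant evidence, one can correlate a new scale’s total score with an established measure of the same construct and with a measure of a theoretically distinct construct. The variable names and simulated data below are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
n = 250
gold_standard = rng.normal(size=n)                            # established measure of the same construct
new_scale = 0.7 * gold_standard + rng.normal(scale=0.7, size=n)
unrelated = rng.normal(size=n)                                # theoretically distinct construct

scores = pd.DataFrame({"new_scale": new_scale,
                       "gold_standard": gold_standard,
                       "unrelated": unrelated})

# Convergent evidence: a substantial correlation with the gold standard.
# Discriminant evidence: a near-zero correlation with the unrelated construct.
print(scores.corr().round(2))
```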


5. Cross-cultural adaptation and equivalence

Working across languages/cultures requires more than literal translation (van de Vijver & Leung, 1997). Steps include:

  1. Pre-translation concept mapping — ensure the construct has similar conceptual meaning across contexts.

  2. Forward and backward translation with independent translators.

  3. Committee reconciliation to resolve discrepancies.

  4. Cognitive interviews in each language to test comprehension, relevance, and cultural norms.

  5. Pilot administration and early psychometric checks.

When constructs have different cultural salience or form, you may need modified items or locally validated scales rather than forced direct comparisons.


6. Statistical tools for evaluating measurement

Key analyses (with pragmatic guidance):

Exploratory Factor Analysis (EFA)

Use EFA during instrument development to reveal underlying factor structure when theory is tentative. Sample size guidance: several recommendations exist (common rule: 5–10 participants per item), but focus on factor loadings and communalities rather than rules of thumb alone (Streiner & Norman, 2008).
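A minimal EFA sketch in Python using factor_analyzer; the simulated two-factor data and the choice of oblique rotation are illustrative assumptions, not recommendations for any particular instrument.

```python
import numpy as np
import pandas as pd
from factor_analyzer import FactorAnalyzer
from factor_analyzer.factor_analyzer import calculate_bartlett_sphericity, calculate_kmo

# Hypothetical data: eight items, the first four loading on factor 1, the rest on factor 2
rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 2))
weights = np.zeros((2, 8))
weights[0, :4] = rng.uniform(0.6, 0.9, size=4)
weights[1, 4:] = rng.uniform(0.6, 0.9, size=4)
items = pd.DataFrame(latent @ weights + rng.normal(scale=0.7, size=(300, 8)),
                     columns=[f"item{i}" for i in range(1, 9)])

# Suitability checks before factoring
chi2, p_value = calculate_bartlett_sphericity(items)
kmo_per_item, kmo_total = calculate_kmo(items)
print(f"Bartlett p = {p_value:.3f}, overall KMO = {kmo_total:.2f}")

# Extract two factors with an oblique rotation; inspect loadings and communalities
efa = FactorAnalyzer(n_factors=2, rotation="oblimin")
efa.fit(items)
print("Loadings:\n", efa.loadings_.round(2))
print("Communalities:", efa.get_communalities().round(2))
```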

Confirmatory Factor Analysis (CFA) & Structural Equation Modeling (SEM)

CFA tests hypothesized measurement models; SEM integrates measurement and structural relations (Kline, 2016). Report model fit indices (χ², RMSEA, CFI, TLI, SRMR) and modification indices cautiously.
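A minimal CFA sketch using semopy, one Python option for SEM (lavaan in R is the more common choice). The two-factor model, factor names, and simulated data are illustrative assumptions.

```python
import numpy as np
import pandas as pd
import semopy

# Hypothetical data: six items, two latent factors
rng = np.random.default_rng(1)
f = rng.normal(size=(400, 2))
load = np.array([[0.8, 0.0], [0.7, 0.0], [0.6, 0.0],
                 [0.0, 0.8], [0.0, 0.7], [0.0, 0.6]])
items = pd.DataFrame(f @ load.T + rng.normal(scale=0.6, size=(400, 6)),
                     columns=[f"item{i}" for i in range(1, 7)])

# Hypothesized measurement model in lavaan-style syntax
model_desc = """
Access =~ item1 + item2 + item3
Cost   =~ item4 + item5 + item6
"""
cfa = semopy.Model(model_desc)
cfa.fit(items)

print(semopy.calc_stats(cfa).T)   # chi-square, CFI, TLI, RMSEA, etc.
print(cfa.inspect())              # parameter estimates, including loadings
```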

Measurement invariance (multi-group CFA)

To compare scores across groups (e.g., languages, countries), sequentially test:

  • Configural invariance — same factor structure across groups.

  • Metric invariance — equal factor loadings; supports comparison of relationships (e.g., correlations, regressions) across groups.

  • Scalar invariance — equal item intercepts; required for comparing latent means.

If full invariance fails, consider partial invariance methods or group-specific calibration, and always report what level was achieved (Vandenberg & Lance, 2000; Byrne, 2012).
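In R, these nested models are typically fitted with lavaan’s group and group.equal arguments; in Python, multi-group equality constraints are less turnkey. The sketch below, built on simulated data, only illustrates the group-wise (configural-style) step with semopy: fitting the same model in each group and informally comparing fit and loadings. Formal metric and scalar tests require equality constraints across groups.

```python
import numpy as np
import pandas as pd
import semopy

rng = np.random.default_rng(2)

def simulate_group(n, noise):
    """Simulate four items driven by a single latent factor (hypothetical data)."""
    f = rng.normal(size=(n, 1))
    load = np.array([[0.8, 0.7, 0.6, 0.7]])
    return pd.DataFrame(f @ load + rng.normal(scale=noise, size=(n, 4)),
                        columns=[f"item{i}" for i in range(1, 5)])

groups = {"country_A": simulate_group(300, 0.6),
          "country_B": simulate_group(300, 0.8)}

desc = "Barriers =~ item1 + item2 + item3 + item4"

for name, data in groups.items():
    model = semopy.Model(desc)
    model.fit(data)
    print(f"--- {name} ---")
    print(semopy.calc_stats(model).T)   # does the same structure fit acceptably in each group?
    print(model.inspect())              # compare loadings informally across groups
```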

Differential Item Functioning (DIF)

Use item response theory (IRT) or Mantel–Haenszel / logistic regression methods to identify items that perform differently across groups, controlling for overall trait level. Address DIF by rewording or removing biased items.
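A hedged sketch of the logistic-regression approach for a single binary item using statsmodels; the data are simulated with built-in uniform DIF, and "score" stands in for the matching variable (total or rest score).

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n = 600
group = rng.integers(0, 2, size=n)       # 0 = reference group, 1 = focal group
theta = rng.normal(size=n)               # underlying trait level
# Simulate an item that is harder for the focal group at the same trait level (uniform DIF)
prob = 1 / (1 + np.exp(-(1.2 * theta - 0.8 * group)))
df = pd.DataFrame({
    "item": rng.binomial(1, prob),
    "group": group,
    "score": theta + rng.normal(scale=0.4, size=n),   # stand-in for the rest score
})

# Uniform DIF: a group effect after controlling for overall score.
# Non-uniform DIF: a significant group x score interaction.
fit = smf.logit("item ~ score + group + score:group", data=df).fit(disp=0)
print(fit.summary())
```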


7. Piloting and cognitive interviewing — the qualitative bridge

Cognitive interviews (think-aloud, probing) help detect misunderstandings, idioms, or culturally inappropriate content. A recommended approach:

  • 5–15 interviews per language for initial checks.

  • Combine probing questions (“What does this phrase mean to you?”) with observations of response behaviour.

Use findings to revise items before larger pilots.


8. When scores “don’t behave” — troubleshooting

If analyses show poor psychometrics:

  • Re-examine item wording and translation.

  • Consult cognitive interview notes for cultural misfit or misinterpretation.

  • Check for floor/ceiling effects and lack of variance (a quick screening sketch follows this list).

  • Look for method effects (e.g., wording direction, response scales).

  • Consider separate norms or scoring for groups if invariance is untenable — but report this transparently.
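For the floor/ceiling and variance checks above, a few lines of pandas suffice; the 15% cut-off and the 1–5 response range used here are common but arbitrary assumptions.

```python
import pandas as pd

def screen_items(items: pd.DataFrame, minimum=1, maximum=5, threshold=0.15):
    """Flag items where more than `threshold` of responses sit at the floor or ceiling, and report SDs."""
    report = pd.DataFrame({
        "floor_pct": (items == minimum).mean(),
        "ceiling_pct": (items == maximum).mean(),
        "sd": items.std(),
    })
    report["flagged"] = (report["floor_pct"] > threshold) | (report["ceiling_pct"] > threshold)
    return report.round(2)

# Toy usage with made-up responses
toy = pd.DataFrame({"item1": [1, 1, 1, 2, 1, 1, 1, 3],
                    "item2": [2, 3, 4, 3, 2, 5, 4, 3]})
print(screen_items(toy))
```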


9. Transparent reporting — what reviewers and users expect

Include in manuscripts and reports:

  • Clear construct definition and theoretical rationale.

  • Full description of item development/adaptation (who, how, languages).

  • Sample characteristics per group (age, sex/gender, education, language).

  • Reliability statistics for each group (α, ω, test–retest).

  • Factor analysis procedures and fit indices.

  • Measurement invariance testing results and DIF analyses.

  • Limitations regarding generalizability and cross-group comparisons (AERA et al., 2014).


10. Practical workflows (quick templates)

Workflow A — Adapting an existing scale across 3 languages

  1. Map construct and select candidate instrument.

  2. Forward/back-translation and committee reconciliation.

  3. 10 cognitive interviews per language.

  4. Pilot N ≈ 150 per language — run EFA/CFA, reliability.

  5. Run multi-group CFA for invariance; inspect DIF.

  6. Revise items; re-pilot if major changes.

Workflow B — Creating a short new measure (8–12 items)

  1. Define construct and specify subdomains.

  2. Draft 2–3 items per subdomain with SME input.

  3. Cognitive interviews (5–10 participants).

  4. Pilot N ≈ 200; run EFA, estimate reliability.

  5. CFA in new sample to confirm structure.


Short case vignette — worked example

A multinational public-health team wanted to measure “healthcare access barriers” in three countries. They selected an English-language instrument originally developed in Country A. After forward/back-translation and cognitive interviews in Countries B and C, they found that an item referencing “telehealth apps” was meaningless where smartphone penetration was low. The team replaced that item with a contextually relevant one about “availability of clinic transportation,” re-piloted, and then ran CFA. Multi-group CFA showed configural and metric invariance but not scalar invariance; DIF analysis flagged two items. The team reported these limitations, compared relations (not means) across countries, and planned follow-up qualitative work to unpack cultural differences.


Common pitfalls (and how to avoid them)

  • Assuming literal translation equals equivalence. Avoid — test and adapt.

  • Relying on Cronbach’s α alone. Complement with other reliability estimates.

  • Skipping cognitive interviews to save time. False economy — costly later.

  • Forcing mean comparisons without scalar invariance. Don’t. Report appropriately.

  • Neglecting sample composition differences. Control or stratify analyses.


Actionable checklist (one-page summary)

  1. Write a 1–3 sentence operational definition.

  2. Search for validated instruments; document evidence.

  3. Plan translation/back-translation and cognitive interviews if needed.

  4. Pilot (qualitative then quantitative) in each language/group.

  5. Compute reliability (α, ω, test–retest).

  6. Run EFA (development) and CFA (confirmation).

  7. Test measurement invariance (configural → metric → scalar).

  8. Check DIF on flagged items.

  9. If invariance fails, use partial invariance or group-specific strategies with transparent reporting.

  10. Include full psychometric appendix in reports.


Conclusion

Measurement is the foundation of quantitative inference. International research adds complexity: concepts may not map neatly across cultures, translations can change nuance, and psychometric properties can differ by group. By combining careful construct definition, qualitative adaptation (cognitive interviewing), rigorous piloting, and appropriate statistical testing (CFA, invariance, DIF), researchers can produce more trustworthy measures and safer, more accurate conclusions (Messick, 1995; AERA et al., 2014). Transparent reporting of what works — and what doesn’t — enables the field to learn and iterate.


References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (2014). Standards for educational and psychological testing. American Educational Research Association.

Byrne, B. M. (2012). Structural equation modeling with Mplus: Basic concepts, applications, and programming (2nd ed.). Routledge.

Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81–105.

Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302.

Kline, R. B. (2016). Principles and practice of structural equation modeling (4th ed.). Guilford Press.

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Addison-Wesley.

Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741–749. https://doi.org/10.1037/0003-066X.50.9.741

Streiner, D. L., & Norman, G. R. (2008). Health measurement scales: A practical guide to their development and use (4th ed.). Oxford University Press.

van de Vijver, F., & Leung, K. (1997). Methods and data analysis for cross-cultural research. Sage.

Vandenberg, R. J., & Lance, C. E. (2000). A review and synthesis of the measurement invariance literature: Suggestions, practices, and recommendations for organizational research. Organizational Research Methods, 3(1), 4–70.
