Human Expertise and Large Language Model Embeddings in the Content Validity Assessment of Personality Tests.

Impact Factor: 2.3 · Tier 3 (Psychology) · JCR Q2 (Mathematics, Interdisciplinary Applications)
Nicola Milano, Michela Ponticorvo, Davide Marocco
{"title":"Human Expertise and Large Language Model Embeddings in the Content Validity Assessment of Personality Tests.","authors":"Nicola Milano, Michela Ponticorvo, Davide Marocco","doi":"10.1177/00131644251355485","DOIUrl":null,"url":null,"abstract":"<p><p>In this article, we explore the application of Large Language Models (LLMs) in assessing the content validity of psychometric instruments, focusing on the Big Five Questionnaire (BFQ) and Big Five Inventory (BFI). Content validity, a cornerstone of test construction, ensures that psychological measures adequately cover their intended constructs. Using both human expert evaluations and advanced LLMs, we compared the accuracy of semantic item-construct alignment. Graduate psychology students employed the Content Validity Ratio to rate test items, forming the human baseline. In parallel, state-of-the-art LLMs, including multilingual and fine-tuned models, analyzed item embeddings to predict construct mappings. The results reveal distinct strengths and limitations of human and AI approaches. Human validators excelled in aligning the behaviorally rich BFQ items, while LLMs performed better with the linguistically concise BFI items. Training strategies significantly influenced LLM performance, with models tailored for lexical relationships outperforming general-purpose LLMs. Here we highlight the complementary potential of hybrid validation systems that integrate human expertise and AI precision. The findings underscore the transformative role of LLMs in psychological assessment, paving the way for scalable, objective, and robust test development methodologies.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251355485"},"PeriodicalIF":2.3000,"publicationDate":"2025-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12356817/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Educational and Psychological Measurement","FirstCategoryId":"102","ListUrlMain":"https://doi.org/10.1177/00131644251355485","RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MATHEMATICS, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
引用次数: 0

Abstract

In this article, we explore the application of Large Language Models (LLMs) in assessing the content validity of psychometric instruments, focusing on the Big Five Questionnaire (BFQ) and Big Five Inventory (BFI). Content validity, a cornerstone of test construction, ensures that psychological measures adequately cover their intended constructs. Using both human expert evaluations and advanced LLMs, we compared the accuracy of semantic item-construct alignment. Graduate psychology students employed the Content Validity Ratio to rate test items, forming the human baseline. In parallel, state-of-the-art LLMs, including multilingual and fine-tuned models, analyzed item embeddings to predict construct mappings. The results reveal distinct strengths and limitations of human and AI approaches. Human validators excelled in aligning the behaviorally rich BFQ items, while LLMs performed better with the linguistically concise BFI items. Training strategies significantly influenced LLM performance, with models tailored for lexical relationships outperforming general-purpose LLMs. Here we highlight the complementary potential of hybrid validation systems that integrate human expertise and AI precision. The findings underscore the transformative role of LLMs in psychological assessment, paving the way for scalable, objective, and robust test development methodologies.
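The abstract describes two alignment procedures: human raters scoring items with the Content Validity Ratio (CVR), and LLM embeddings used to map each item to its most semantically similar construct. The sketch below illustrates both ideas under stated assumptions; it is not the authors' pipeline. The CVR formula is Lawshe's standard definition, while the embedding model name, the toy item texts, and the helper function are illustrative placeholders (the paper's multilingual and fine-tuned models are not specified here).

```python
# Minimal sketch, assuming Lawshe's CVR and a generic sentence-embedding model.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed dependency

# --- 1. Human baseline: Content Validity Ratio ---
def content_validity_ratio(n_essential: int, n_panelists: int) -> float:
    """CVR = (n_e - N/2) / (N/2); ranges from -1 to +1."""
    return (n_essential - n_panelists / 2) / (n_panelists / 2)

# e.g., 9 of 12 raters judge an item essential to its intended construct
print(content_validity_ratio(9, 12))  # 0.5

# --- 2. Embedding-based item-construct assignment ---
constructs = ["Extraversion", "Agreeableness", "Conscientiousness",
              "Neuroticism", "Openness to experience"]
items = {  # toy items with their keyed constructs; not taken from the BFQ/BFI
    "I see myself as someone who is talkative.": "Extraversion",
    "I see myself as someone who worries a lot.": "Neuroticism",
}

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice
construct_vecs = model.encode(constructs, normalize_embeddings=True)
item_vecs = model.encode(list(items), normalize_embeddings=True)

# Cosine similarity reduces to a dot product on normalized vectors.
similarity = item_vecs @ construct_vecs.T
predicted = [constructs[i] for i in similarity.argmax(axis=1)]

# Accuracy of the predicted construct mapping against the keyed constructs.
accuracy = np.mean([pred == true for pred, true in zip(predicted, items.values())])
print(predicted, accuracy)
```

In this toy setup, an item is "correctly aligned" when its nearest construct embedding matches the construct it was written to measure, which mirrors the item-construct mapping accuracy the abstract compares against the human CVR baseline.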

Source Journal
Educational and Psychological Measurement
CiteScore: 5.50
Self-citation rate: 7.40%
Articles per year: 49
Review time: 6-12 weeks
Journal description: Educational and Psychological Measurement (EPM) publishes refereed scholarly work from all academic disciplines interested in the study of measurement theory, problems, and issues. Theoretical articles address new developments and techniques, and applied articles deal with innovative applications.