Human Expertise and Large Language Model Embeddings in the Content Validity Assessment of Personality Tests.

Impact Factor: 2.3 · Tier 3 (Psychology) · JCR Q2 (Mathematics, Interdisciplinary Applications)
Nicola Milano, Michela Ponticorvo, Davide Marocco
{"title":"Human Expertise and Large Language Model Embeddings in the Content Validity Assessment of Personality Tests.","authors":"Nicola Milano, Michela Ponticorvo, Davide Marocco","doi":"10.1177/00131644251355485","DOIUrl":null,"url":null,"abstract":"<p><p>In this article, we explore the application of Large Language Models (LLMs) in assessing the content validity of psychometric instruments, focusing on the Big Five Questionnaire (BFQ) and Big Five Inventory (BFI). Content validity, a cornerstone of test construction, ensures that psychological measures adequately cover their intended constructs. Using both human expert evaluations and advanced LLMs, we compared the accuracy of semantic item-construct alignment. Graduate psychology students employed the Content Validity Ratio to rate test items, forming the human baseline. In parallel, state-of-the-art LLMs, including multilingual and fine-tuned models, analyzed item embeddings to predict construct mappings. The results reveal distinct strengths and limitations of human and AI approaches. Human validators excelled in aligning the behaviorally rich BFQ items, while LLMs performed better with the linguistically concise BFI items. Training strategies significantly influenced LLM performance, with models tailored for lexical relationships outperforming general-purpose LLMs. Here we highlight the complementary potential of hybrid validation systems that integrate human expertise and AI precision. The findings underscore the transformative role of LLMs in psychological assessment, paving the way for scalable, objective, and robust test development methodologies.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251355485"},"PeriodicalIF":2.3000,"publicationDate":"2025-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12356817/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Educational and Psychological Measurement","FirstCategoryId":"102","ListUrlMain":"https://doi.org/10.1177/00131644251355485","RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MATHEMATICS, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
引用次数: 0

Abstract

In this article, we explore the application of Large Language Models (LLMs) in assessing the content validity of psychometric instruments, focusing on the Big Five Questionnaire (BFQ) and Big Five Inventory (BFI). Content validity, a cornerstone of test construction, ensures that psychological measures adequately cover their intended constructs. Using both human expert evaluations and advanced LLMs, we compared the accuracy of semantic item-construct alignment. Graduate psychology students employed the Content Validity Ratio to rate test items, forming the human baseline. In parallel, state-of-the-art LLMs, including multilingual and fine-tuned models, analyzed item embeddings to predict construct mappings. The results reveal distinct strengths and limitations of human and AI approaches. Human validators excelled in aligning the behaviorally rich BFQ items, while LLMs performed better with the linguistically concise BFI items. Training strategies significantly influenced LLM performance, with models tailored for lexical relationships outperforming general-purpose LLMs. Here we highlight the complementary potential of hybrid validation systems that integrate human expertise and AI precision. The findings underscore the transformative role of LLMs in psychological assessment, paving the way for scalable, objective, and robust test development methodologies.
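The abstract describes two alignment procedures: human raters scoring items with the Content Validity Ratio (CVR), and LLM embeddings used to map each item to its most semantically similar construct. The sketch below illustrates both ideas under stated assumptions; it is not the authors' pipeline. The CVR formula is Lawshe's standard definition, while the embedding model name, the toy item texts, and the helper function are illustrative placeholders (the paper's multilingual and fine-tuned models are not specified here).

```python
# Minimal sketch, assuming Lawshe's CVR and a generic sentence-embedding model.
import numpy as np
from sentence_transformers import SentenceTransformer  # assumed dependency

# --- 1. Human baseline: Content Validity Ratio ---
def content_validity_ratio(n_essential: int, n_panelists: int) -> float:
    """CVR = (n_e - N/2) / (N/2); ranges from -1 to +1."""
    return (n_essential - n_panelists / 2) / (n_panelists / 2)

# e.g., 9 of 12 raters judge an item essential to its intended construct
print(content_validity_ratio(9, 12))  # 0.5

# --- 2. Embedding-based item-construct assignment ---
constructs = ["Extraversion", "Agreeableness", "Conscientiousness",
              "Neuroticism", "Openness to experience"]
items = {  # toy items with their keyed constructs; not taken from the BFQ/BFI
    "I see myself as someone who is talkative.": "Extraversion",
    "I see myself as someone who worries a lot.": "Neuroticism",
}

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model choice
construct_vecs = model.encode(constructs, normalize_embeddings=True)
item_vecs = model.encode(list(items), normalize_embeddings=True)

# Cosine similarity reduces to a dot product on normalized vectors.
similarity = item_vecs @ construct_vecs.T
predicted = [constructs[i] for i in similarity.argmax(axis=1)]

# Accuracy of the predicted construct mapping against the keyed constructs.
accuracy = np.mean([pred == true for pred, true in zip(predicted, items.values())])
print(predicted, accuracy)
```

In this toy setup, an item is "correctly aligned" when its nearest construct embedding matches the construct it was written to measure, which mirrors the item-construct mapping accuracy the abstract compares against the human CVR baseline.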

Source Journal
Educational and Psychological Measurement
CiteScore: 5.50
Self-citation rate: 7.40%
Articles per year: 49
Review time: 6-12 weeks
Journal description: Educational and Psychological Measurement (EPM) publishes refereed scholarly work from all academic disciplines interested in the study of measurement theory, problems, and issues. Theoretical articles address new developments and techniques, and applied articles deal with innovative applications.