Item Difficulty Modeling Using Fine-tuned Small and Large Language Models.

Impact Factor 2.1 · CAS Tier 3 (Psychology) · JCR Q2 (Mathematics, Interdisciplinary Applications)
Ming Li, Hong Jiao, Tianyi Zhou, Nan Zhang, Sydney Peters, Robert W Lissitz
{"title":"Item Difficulty Modeling Using Fine-tuned Small and Large Language Models.","authors":"Ming Li, Hong Jiao, Tianyi Zhou, Nan Zhang, Sydney Peters, Robert W Lissitz","doi":"10.1177/00131644251344973","DOIUrl":null,"url":null,"abstract":"<p><p>This study investigates methods for item difficulty modeling in large-scale assessments using both small and large language models (LLMs). We introduce novel data augmentation strategies, including augmentation on the fly and distribution balancing, that surpass benchmark performances, demonstrating their effectiveness in mitigating data imbalance and improving model performance. Our results showed that fine-tuned small language models (SLMs) such as Bidirectional Encoder Representations from Transformers (BERT) and RoBERTa yielded lower root mean squared error than the first-place model in the BEA 2024 Shared Task competition, whereas domain-specific models like BioClinicalBERT and PubMedBERT did not provide significant improvements due to distributional gaps. Majority voting among SLMs enhanced prediction accuracy, reinforcing the benefits of ensemble learning. LLMs, such as GPT-4, exhibited strong generalization capabilities but struggled with item difficulty prediction, likely due to limited training data and the absence of explicit difficulty-related context. Chain-of-thought prompting and rationale generation approaches were explored but did not yield substantial improvements, suggesting that additional training data or more sophisticated reasoning techniques may be necessary. Embedding-based methods, particularly using NV-Embed-v2, showed promise but did not outperform our best augmentation strategies, indicating that capturing nuanced difficulty-related features remains a challenge.</p>","PeriodicalId":11502,"journal":{"name":"Educational and Psychological Measurement","volume":" ","pages":"00131644251344973"},"PeriodicalIF":2.1000,"publicationDate":"2025-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12230038/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Educational and Psychological Measurement","FirstCategoryId":"102","ListUrlMain":"https://doi.org/10.1177/00131644251344973","RegionNum":3,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"MATHEMATICS, INTERDISCIPLINARY APPLICATIONS","Score":null,"Total":0}
引用次数: 0

Abstract

This study investigates methods for item difficulty modeling in large-scale assessments using both small and large language models (LLMs). We introduce novel data augmentation strategies, including augmentation on the fly and distribution balancing, that surpass benchmark performances, demonstrating their effectiveness in mitigating data imbalance and improving model performance. Our results showed that fine-tuned small language models (SLMs) such as Bidirectional Encoder Representations from Transformers (BERT) and RoBERTa yielded lower root mean squared error than the first-place model in the BEA 2024 Shared Task competition, whereas domain-specific models like BioClinicalBERT and PubMedBERT did not provide significant improvements due to distributional gaps. Majority voting among SLMs enhanced prediction accuracy, reinforcing the benefits of ensemble learning. LLMs, such as GPT-4, exhibited strong generalization capabilities but struggled with item difficulty prediction, likely due to limited training data and the absence of explicit difficulty-related context. Chain-of-thought prompting and rationale generation approaches were explored but did not yield substantial improvements, suggesting that additional training data or more sophisticated reasoning techniques may be necessary. Embedding-based methods, particularly using NV-Embed-v2, showed promise but did not outperform our best augmentation strategies, indicating that capturing nuanced difficulty-related features remains a challenge.
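
As a concrete illustration of the fine-tuned SLM approach the abstract describes, the sketch below fine-tunes RoBERTa with a single-output regression head via Hugging Face transformers. The column names, toy items, and hyperparameters are illustrative assumptions, not the authors' reported configuration.

```python
# Minimal sketch: fine-tune RoBERTa as an item-difficulty regressor.
# Assumed (not from the paper): column names, toy data, hyperparameters.
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
# num_labels=1 with problem_type="regression" gives a scalar head and MSE loss.
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=1, problem_type="regression"
)

def tokenize(batch):
    enc = tokenizer(batch["item_text"], truncation=True, max_length=512)
    enc["labels"] = [float(y) for y in batch["difficulty"]]
    return enc

# Hypothetical stand-in for the BEA 2024 Shared Task item pool.
train = Dataset.from_dict({
    "item_text": [
        "A 45-year-old patient presents with chest pain. Which test ...",
        "Which of the following best explains the observed finding?",
    ],
    "difficulty": [-0.42, 1.13],
}).map(tokenize, batched=True, remove_columns=["item_text", "difficulty"])

args = TrainingArguments(
    output_dir="difficulty-roberta",
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    num_train_epochs=3,
    logging_steps=10,
)
Trainer(model=model, args=args, train_dataset=train, tokenizer=tokenizer).train()
```

Predictions from the trained model can then be scored with root mean squared error against operational difficulty estimates, the metric the study reports.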

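The majority-voting result is also easy to picture. For a continuous target, voting is commonly realized either as a simple mean or median of model predictions, or as a modal vote over binned difficulties; the sketch below shows both variants with made-up predictions, since the paper's exact scheme is not reproduced here.

```python
# Minimal sketch: combine difficulty predictions from several SLMs.
# Prediction values and bin edges are made up for illustration.
import numpy as np

preds = np.array([
    [0.10, 0.85, -0.30, 1.20, 0.05],  # e.g., fine-tuned BERT
    [0.15, 0.80, -0.25, 1.10, 0.00],  # e.g., fine-tuned RoBERTa
    [0.05, 0.95, -0.40, 1.30, 0.10],  # e.g., a third checkpoint
])

ensemble_mean = preds.mean(axis=0)         # averaging variant
ensemble_median = np.median(preds, axis=0)

# Voting variant: discretize each prediction into a difficulty bin,
# then take the most frequent bin per item.
bin_edges = [-0.5, 0.0, 0.5, 1.0]
binned = np.digitize(preds, bin_edges)             # shape: (models, items)
modal_bin = np.array([np.bincount(col).argmax() for col in binned.T])

print(ensemble_mean, modal_bin)
```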
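Finally, the embedding-based alternative follows a two-stage pattern: encode each item with a frozen embedding model, then fit a lightweight regressor on the vectors. Loading NV-Embed-v2 itself requires trust_remote_code and substantial GPU memory, so a small sentence-transformers encoder stands in below; the pipeline, not the specific encoder, is the point.

```python
# Minimal sketch: frozen text embeddings + ridge regression for difficulty.
# "all-MiniLM-L6-v2" is a lightweight stand-in for NV-Embed-v2.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import Ridge

encoder = SentenceTransformer("all-MiniLM-L6-v2")

train_texts = [
    "A 45-year-old patient presents with chest pain. Which test ...",
    "Which of the following best explains the observed finding?",
]
train_difficulty = [-0.42, 1.13]  # made-up calibrated difficulties
test_texts = ["Select the best initial management for this presentation."]

X_train = encoder.encode(train_texts)  # shape: (n_items, embedding_dim)
X_test = encoder.encode(test_texts)

reg = Ridge(alpha=1.0).fit(X_train, train_difficulty)
print(reg.predict(X_test))
```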
Source journal: Educational and Psychological Measurement (Medicine · Mathematics, Interdisciplinary Applications)
CiteScore: 5.50
Self-citation rate: 7.40%
Articles per year: 49
Review time: 6-12 weeks
Journal description: Educational and Psychological Measurement (EPM) publishes refereed scholarly work from all academic disciplines interested in the study of measurement theory, problems, and issues. Theoretical articles address new developments and techniques, and applied articles deal with innovative applications.