Analysis of student understanding in short-answer explanations to concept questions using a human-centered AI approach
Harpreet Auby, Namrata Shivagunde, Vijeta Deshpande, Anna Rumshisky, Milo D. Koretsky
Journal of Engineering Education, 114(4), 2025. DOI: 10.1002/jee.70032
Background
Analyzing student short-answer written justifications to conceptually challenging questions has proven helpful for understanding student thinking and improving conceptual understanding. However, qualitative analyses are limited by the burden of analyzing large amounts of text.
Purpose
We apply dense and sparse Large Language Models (LLMs) to explore how machine learning can automate coding for responses in engineering mechanics and thermodynamics.
Design/Method
We first identify the cognitive resources students use through human coding of responses to seven questions. We then compare the performance of four dense LLMs and a sparse Mixture of Experts (Mixtral) model in automating this coding. Finally, we investigate the extent to which domain-specific training is necessary for accurate coding.
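As a rough sketch of what such automated coding can look like in practice (the study's actual prompts, codebook, and model settings are not given in this abstract; the codes and example responses below are hypothetical placeholders), a few-shot prompt to one of the compared models might be assembled as follows:

    # Hypothetical few-shot coding of a student justification with an LLM.
    # The codebook entries and example responses are placeholders, not the
    # study's actual coding scheme.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    CODEBOOK = {
        "molecular": "reasons from molecular-level behavior",
        "macroscopic": "reasons from bulk properties such as pressure or temperature",
        "equation-based": "invokes a governing equation or formula",
    }

    FEW_SHOT = [
        ("The molecules hit the walls more often, so the pressure rises.", "molecular"),
        ("PV = nRT, so at constant volume a higher T gives a higher P.", "equation-based"),
    ]

    def code_response(response: str) -> str:
        """Assign one cognitive-resource code to a short-answer justification."""
        labels = "\n".join(f"- {name}: {desc}" for name, desc in CODEBOOK.items())
        examples = "\n".join(f'Response: "{r}"\nCode: {c}' for r, c in FEW_SHOT)
        prompt = (
            "You are coding student explanations to a thermodynamics concept question.\n"
            f"Assign exactly one code from this codebook:\n{labels}\n\n"
            f'{examples}\n\nResponse: "{response}"\nCode:'
        )
        out = client.chat.completions.create(
            model="gpt-4o-mini",  # one of the models compared in the study
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return out.choices[0].message.content.strip()

A fine-tuned open-source model such as Mixtral or Llama-3 could be substituted for the hosted API call in the same role.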
Findings
In a sample question, we analyze 904 responses to identify 48 unique cognitive resources, which we then organize into six themes. In contrast to recommendations in the literature, students who activated molecular resources were less likely to answer correctly. This example illustrates the usefulness of qualitatively analyzing large datasets. Of the LLMs, Mixtral and Llama-3 performed best on within-dataset, in-domain coding tasks, especially as the training set size increased. Phi-3.5-mini, while effective in mechanics, shows inconsistent improvements with additional data and struggles in thermodynamics. In contrast, GPT-4 and GPT-4o-mini stand out for their robust generalization across in-domain and cross-domain tasks.
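Model performance on coding tasks like these is typically judged by agreement with human coders; the abstract does not state which metrics the study used, so the sketch below simply illustrates two standard choices on made-up labels:

    # Illustrative comparison of model-assigned codes against human codes.
    # The label sequences are fabricated placeholders for demonstration only.
    from sklearn.metrics import cohen_kappa_score, f1_score

    human_codes = ["molecular", "equation-based", "macroscopic", "molecular"]
    model_codes = ["molecular", "equation-based", "molecular", "molecular"]

    print("Cohen's kappa:", cohen_kappa_score(human_codes, model_codes))
    print("Macro F1:", f1_score(human_codes, model_codes, average="macro"))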
Conclusions
Open-source models like Mixtral have the potential to perform well when coding short-answer justifications to challenging concept questions. However, further fine-tuning is needed before they are robust enough to be used with a resources-based framing.
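The fine-tuning this conclusion calls for is often done parameter-efficiently; as a minimal sketch (the hyperparameters and target modules below are illustrative assumptions, not the study's configuration), a LoRA setup for Mixtral with the Hugging Face peft library might look like:

    # Sketch of parameter-efficient (LoRA) fine-tuning for the coding task.
    # All hyperparameters are illustrative defaults, not the study's settings.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import LoraConfig, get_peft_model

    base = "mistralai/Mixtral-8x7B-Instruct-v0.1"
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

    lora = LoraConfig(
        r=16,                                 # adapter rank
        lora_alpha=32,
        target_modules=["q_proj", "v_proj"],  # attention projections
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora)
    model.print_trainable_parameters()        # only adapter weights train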