Fine-Tuning Large Language Models to Enhance Programmatic Assessment in Graduate Medical Education

Gregory J Booth, Thomas Hauert, Mike Mynes, John Hodgson, Elizabeth Slama, Ashton Goldman, Jeffrey Moore
{"title":"微调大型语言模型,加强医学研究生教育的项目评估。","authors":"Gregory J Booth, Thomas Hauert, Mike Mynes, John Hodgson, Elizabeth Slama, Ashton Goldman, Jeffrey Moore","doi":"10.46374/VolXXVI_Issue3_Moore","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Natural language processing is a collection of techniques designed to empower computer systems to comprehend and/or produce human language. The purpose of this investigation was to train several large language models (LLMs) to explore the tradeoff between model complexity and performance while classifying narrative feedback on trainees into the Accreditation Council for Graduate Medical Education subcompetencies. We hypothesized that classification accuracy would increase with model complexity.</p><p><strong>Methods: </strong>The authors fine-tuned several transformer-based LLMs (Bidirectional Encoder Representations from Transformers [BERT]-base, BERT-medium, BERT-small, BERT-mini, BERT-tiny, and SciBERT) to predict Accreditation Council for Graduate Medical Education subcompetencies on a curated dataset of 10 218 feedback comments. Performance was compared with the authors' previous work, which trained a FastText model on the same dataset. Performance metrics included F1 score for global model performance and area under the receiver operating characteristic curve for each competency.</p><p><strong>Results: </strong>No models were superior to FastText. Only BERT-tiny performed worse than FastText. The smallest model with comparable performance to FastText, BERT-mini, was 94% smaller. Area under the receiver operating characteristic curve for each competency was similar on BERT-mini and FastText with the exceptions of Patient Care 7 (Situational Awareness and Crisis Management) and Systems-Based Practice.</p><p><strong>Discussion: </strong>Transformer-based LLMs were fine-tuned to understand anesthesiology graduate medical education language. Complex LLMs did not outperform FastText. However, equivalent performance was achieved with a model that was 94% smaller, which may allow model deployment on personal devices to enhance speed and data privacy. This work advances our understanding of best practices when integrating LLMs into graduate medical education.</p>","PeriodicalId":75067,"journal":{"name":"The journal of education in perioperative medicine : JEPM","volume":"26 3","pages":"E729"},"PeriodicalIF":0.0000,"publicationDate":"2024-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11441632/pdf/","citationCount":"0","resultStr":"{\"title\":\"Fine-Tuning Large Language Models to Enhance Programmatic Assessment in Graduate Medical Education.\",\"authors\":\"Gregory J Booth, Thomas Hauert, Mike Mynes, John Hodgson, Elizabeth Slama, Ashton Goldman, Jeffrey Moore\",\"doi\":\"10.46374/VolXXVI_Issue3_Moore\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Natural language processing is a collection of techniques designed to empower computer systems to comprehend and/or produce human language. The purpose of this investigation was to train several large language models (LLMs) to explore the tradeoff between model complexity and performance while classifying narrative feedback on trainees into the Accreditation Council for Graduate Medical Education subcompetencies. 
We hypothesized that classification accuracy would increase with model complexity.</p><p><strong>Methods: </strong>The authors fine-tuned several transformer-based LLMs (Bidirectional Encoder Representations from Transformers [BERT]-base, BERT-medium, BERT-small, BERT-mini, BERT-tiny, and SciBERT) to predict Accreditation Council for Graduate Medical Education subcompetencies on a curated dataset of 10 218 feedback comments. Performance was compared with the authors' previous work, which trained a FastText model on the same dataset. Performance metrics included F1 score for global model performance and area under the receiver operating characteristic curve for each competency.</p><p><strong>Results: </strong>No models were superior to FastText. Only BERT-tiny performed worse than FastText. The smallest model with comparable performance to FastText, BERT-mini, was 94% smaller. Area under the receiver operating characteristic curve for each competency was similar on BERT-mini and FastText with the exceptions of Patient Care 7 (Situational Awareness and Crisis Management) and Systems-Based Practice.</p><p><strong>Discussion: </strong>Transformer-based LLMs were fine-tuned to understand anesthesiology graduate medical education language. Complex LLMs did not outperform FastText. However, equivalent performance was achieved with a model that was 94% smaller, which may allow model deployment on personal devices to enhance speed and data privacy. This work advances our understanding of best practices when integrating LLMs into graduate medical education.</p>\",\"PeriodicalId\":75067,\"journal\":{\"name\":\"The journal of education in perioperative medicine : JEPM\",\"volume\":\"26 3\",\"pages\":\"E729\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11441632/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"The journal of education in perioperative medicine : JEPM\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.46374/VolXXVI_Issue3_Moore\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/7/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"The journal of education in perioperative medicine : JEPM","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.46374/VolXXVI_Issue3_Moore","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/7/1 0:00:00","PubModel":"eCollection","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Background: Natural language processing is a collection of techniques designed to empower computer systems to comprehend and/or produce human language. The purpose of this investigation was to train several large language models (LLMs) to explore the tradeoff between model complexity and performance while classifying narrative feedback on trainees into the Accreditation Council for Graduate Medical Education subcompetencies. We hypothesized that classification accuracy would increase with model complexity.

Methods: The authors fine-tuned several transformer-based LLMs (Bidirectional Encoder Representations from Transformers [BERT]-base, BERT-medium, BERT-small, BERT-mini, BERT-tiny, and SciBERT) to predict Accreditation Council for Graduate Medical Education subcompetencies on a curated dataset of 10,218 feedback comments. Performance was compared with the authors' previous work, which trained a FastText model on the same dataset. Performance metrics included F1 score for global model performance and area under the receiver operating characteristic curve for each competency.
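
The paper itself does not include code. A minimal sketch of this kind of fine-tuning with the Hugging Face transformers library might look like the following; the checkpoint name (prajjwal1/bert-mini, a public BERT-mini release), the two-label subset, and the toy comments are all illustrative assumptions, since the curated dataset is not public.

```python
# Minimal sketch: fine-tune a small BERT variant to classify narrative
# feedback into ACGME subcompetencies. Checkpoint, labels, and toy data
# are illustrative assumptions, not the authors' exact setup.
import torch
from torch.utils.data import DataLoader, Dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

LABELS = ["PatientCare7", "SystemsBasedPractice"]  # hypothetical two-class subset

class FeedbackDataset(Dataset):
    """Tokenized (comment, subcompetency-index) pairs."""
    def __init__(self, comments, labels, tokenizer):
        self.enc = tokenizer(comments, truncation=True, padding=True,
                             max_length=128, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]
        return item

tokenizer = AutoTokenizer.from_pretrained("prajjwal1/bert-mini")
model = AutoModelForSequenceClassification.from_pretrained(
    "prajjwal1/bert-mini", num_labels=len(LABELS))

# Toy examples standing in for the curated 10,218-comment dataset.
comments = ["Remained calm and decisive during an unexpected difficult airway.",
            "Coordinated effectively with the PACU team on disposition."]
targets = [0, 1]
loader = DataLoader(FeedbackDataset(comments, targets, tokenizer), batch_size=2)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
for epoch in range(3):
    for batch in loader:
        loss = model(**batch).loss  # cross-entropy over subcompetency classes
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```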

Results: No model was superior to FastText. Only BERT-tiny performed worse than FastText. The smallest model with performance comparable to FastText, BERT-mini, was 94% smaller. Area under the receiver operating characteristic curve for each competency was similar for BERT-mini and FastText, with the exceptions of Patient Care 7 (Situational Awareness and Crisis Management) and Systems-Based Practice.
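
For concreteness, the two reported metrics could be computed as in the sketch below with scikit-learn; micro-averaged F1 and one-vs-rest AUROC are assumed choices here, as the abstract does not state the exact averaging scheme.

```python
# Sketch of the two reported metrics: a global F1 score and a per-competency
# AUROC (each class treated as positive, all others as negative). The
# averaging choices are assumptions.
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

y_true = np.array([0, 1, 1, 0, 1])      # true subcompetency indices
probs = np.array([[0.8, 0.2],           # predicted class probabilities
                  [0.3, 0.7],
                  [0.4, 0.6],
                  [0.9, 0.1],
                  [0.2, 0.8]])
y_pred = probs.argmax(axis=1)

global_f1 = f1_score(y_true, y_pred, average="micro")
per_class_auc = {cls: roc_auc_score((y_true == cls).astype(int), probs[:, cls])
                 for cls in range(probs.shape[1])}
print(f"F1: {global_f1:.2f}, per-competency AUROC: {per_class_auc}")
```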

Discussion: Transformer-based LLMs were fine-tuned to understand anesthesiology graduate medical education language. Complex LLMs did not outperform FastText. However, equivalent performance was achieved with a model that was 94% smaller, which may allow model deployment on personal devices to enhance speed and data privacy. This work advances our understanding of best practices when integrating LLMs into graduate medical education.
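
The deployment argument rests on model footprint; the 94% figure is relative to the FastText baseline and is not reproduced here. A quick way to see the scale gap among the BERT variants themselves is to count parameters, as in this sketch (public checkpoint names assumed):

```python
# Sketch: compare the footprint of BERT-base against a public BERT-mini
# mirror by counting parameters.
from transformers import AutoModel

for name in ["bert-base-uncased", "prajjwal1/bert-mini"]:
    model = AutoModel.from_pretrained(name)
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")
```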
