Ana M Maitin, Alberto Nogales, Sergio Fernández-Rincón, Enrique Aranguren, Emilio Cervera-Barba, Sophia Denizon-Arranz, Alonso Mateos-Rodríguez, Álvaro J García-Tejedor
{"title":"Application of large language models in clinical record correction: a comprehensive study on various retraining methods.","authors":"Ana M Maitin, Alberto Nogales, Sergio Fernández-Rincón, Enrique Aranguren, Emilio Cervera-Barba, Sophia Denizon-Arranz, Alonso Mateos-Rodríguez, Álvaro J García-Tejedor","doi":"10.1093/jamia/ocae302","DOIUrl":null,"url":null,"abstract":"<p><strong>Objectives: </strong>We evaluate the effectiveness of large language models (LLMs), specifically GPT-based (GPT-3.5 and GPT-4) and Llama-2 models (13B and 7B architectures), in autonomously assessing clinical records (CRs) to enhance medical education and diagnostic skills.</p><p><strong>Materials and methods: </strong>Various techniques, including prompt engineering, fine-tuning (FT), and low-rank adaptation (LoRA), were implemented and compared on Llama-2 7B. These methods were assessed using prompts in both English and Spanish to determine their adaptability to different languages. Performance was benchmarked against GPT-3.5, GPT-4, and Llama-2 13B.</p><p><strong>Results: </strong>GPT-based models, particularly GPT-4, demonstrated promising performance closely aligned with specialist evaluations. Application of FT on Llama-2 7B improved text comprehension in Spanish, equating its performance to that of Llama-2 13B with English prompts. Low-rank adaptation significantly enhanced performance, surpassing GPT-3.5 results when combined with FT. This indicates LoRA's effectiveness in adapting open-source models for specific tasks.</p><p><strong>Discussion: </strong>While GPT-4 showed superior performance, FT and LoRA on Llama-2 7B proved crucial in improving language comprehension and task-specific accuracy. Identified limitations highlight the need for further research.</p><p><strong>Conclusion: </strong>This study underscores the potential of LLMs in medical education, providing an innovative, effective approach to CR correction. Low-rank adaptation emerged as the most effective technique, enabling open-source models to perform on par with proprietary models. Future research should focus on overcoming current limitations to further improve model performance.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.7000,"publicationDate":"2024-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of the American Medical Informatics Association","FirstCategoryId":"91","ListUrlMain":"https://doi.org/10.1093/jamia/ocae302","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
Objectives: We evaluate the effectiveness of large language models (LLMs), specifically GPT-based (GPT-3.5 and GPT-4) and Llama-2 models (13B and 7B architectures), in autonomously assessing clinical records (CRs) to enhance medical education and diagnostic skills.
Materials and methods: Various techniques, including prompt engineering, fine-tuning (FT), and low-rank adaptation (LoRA), were implemented and compared on Llama-2 7B. These methods were assessed using prompts in both English and Spanish to determine their adaptability to different languages. Performance was benchmarked against GPT-3.5, GPT-4, and Llama-2 13B.
Results: GPT-based models, particularly GPT-4, demonstrated promising performance closely aligned with specialist evaluations. Application of FT on Llama-2 7B improved text comprehension in Spanish, equating its performance to that of Llama-2 13B with English prompts. Low-rank adaptation significantly enhanced performance, surpassing GPT-3.5 results when combined with FT. This indicates LoRA's effectiveness in adapting open-source models for specific tasks.
Discussion: While GPT-4 showed superior performance, FT and LoRA on Llama-2 7B proved crucial in improving language comprehension and task-specific accuracy. Identified limitations highlight the need for further research.
Conclusion: This study underscores the potential of LLMs in medical education, providing an innovative, effective approach to CR correction. Low-rank adaptation emerged as the most effective technique, enabling open-source models to perform on par with proprietary models. Future research should focus on overcoming current limitations to further improve model performance.
期刊介绍:
JAMIA is AMIA''s premier peer-reviewed journal for biomedical and health informatics. Covering the full spectrum of activities in the field, JAMIA includes informatics articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy. JAMIA''s articles describe innovative informatics research and systems that help to advance biomedical science and to promote health. Case reports, perspectives and reviews also help readers stay connected with the most important informatics developments in implementation, policy and education.