Evaluating large language models on medical, lay-language, and self-reported descriptions of genetic conditions
Kendall A Flaharty, Ping Hu, Suzanna Ledgister Hanchard, Molly E Ripper, Dat Duong, Rebekah L Waikel, Benjamin D Solomon
American Journal of Human Genetics, DOI: 10.1016/j.ajhg.2024.07.011, Epub 2024-08-14, published 2024-09-05
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11393706/pdf/
Citations: 0
Abstract
Large language models (LLMs) are generating interest in medical settings. For example, LLMs can respond coherently to medical queries by providing plausible differential diagnoses based on clinical notes. However, many questions remain to be explored, such as differences between open- and closed-source LLMs and LLM performance on queries from both medical and non-medical users. In this study, we assessed multiple LLMs, including Llama-2-chat, Vicuna, Medllama2, Bard/Gemini, Claude, ChatGPT-3.5, and ChatGPT-4, as well as non-LLM approaches (Google search and Phenomizer), on their ability to identify genetic conditions from textbook-like clinician questions and corresponding layperson translations covering 63 genetic conditions. Among open-source LLMs, larger models were more accurate than smaller ones: models with 7b, 13b, and more than 33b parameters achieved accuracies of 21%-49%, 41%-51%, and 54%-68%, respectively. Closed-source LLMs outperformed open-source LLMs, with ChatGPT-4 performing best (89%-90%). Three of the 11 LLMs, as well as Google search, showed significant performance gaps between clinician and layperson prompts. We also evaluated how in-context prompting and keyword removal affected open-source LLM performance. Models were given two types of in-context prompts: list-type prompts, which improved LLM performance, and definition-type prompts, which did not. We further analyzed the removal of rare terms from descriptions, which decreased accuracy for 5 of the 7 evaluated LLMs. Finally, we observed much lower performance on real individuals' self-reported descriptions, which LLMs answered with a maximum accuracy of 21%.
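As a rough illustration of the evaluation protocol the abstract describes, the Python sketch below scores a model on condition-identification prompts, optionally prepending a "list-type" in-context prompt (a list of candidate conditions). This is a minimal sketch under assumed details, not the authors' actual harness: query_model, the prompt templates, and the toy case are hypothetical placeholders.

# Minimal sketch (hypothetical, not the authors' code) of the evaluation
# loop: present each description to a model, optionally with a list-type
# in-context prompt, and count the response correct if it names the true
# condition. Replace demo_model with a real LLM API call.
from __future__ import annotations

from typing import Callable

LIST_PROMPT = (
    "The condition described below is one of the following: {candidates}.\n"
    "Description: {description}\nWhich condition is it?"
)
PLAIN_PROMPT = "Which genetic condition does this describe?\n{description}"

def score(
    cases: list[tuple[str, str]],        # (description, true condition)
    query_model: Callable[[str], str],   # wraps a call to the LLM under test
    candidates: list[str] | None = None, # supply a list for list-type prompting
) -> float:
    """Return the fraction of cases where the model names the true condition."""
    correct = 0
    for description, truth in cases:
        if candidates is not None:
            prompt = LIST_PROMPT.format(
                candidates=", ".join(candidates), description=description
            )
        else:
            prompt = PLAIN_PROMPT.format(description=description)
        response = query_model(prompt)
        # Exact-substring scoring; real grading may need synonym handling.
        correct += truth.lower() in response.lower()
    return correct / len(cases)

def demo_model(prompt: str) -> str:
    # Toy stand-in for an LLM call so the sketch runs end to end.
    return "This presentation is most consistent with Marfan syndrome."

if __name__ == "__main__":
    cases = [("Tall stature, arachnodactyly, and lens dislocation.", "Marfan syndrome")]
    print(score(cases, demo_model, candidates=["Marfan syndrome", "Noonan syndrome"]))

Exact-substring matching is the simplest scoring choice for a sketch like this; a realistic harness would likely need to handle synonyms and alternate condition names before counting a response as correct.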
Journal Introduction
The American Journal of Human Genetics (AJHG) is a monthly journal published by Cell Press and, as of January 2008, the premier publication of The American Society of Human Genetics (ASHG). AJHG is Cell Press's first society-owned journal, and both ASHG and Cell Press anticipate significant synergies between AJHG content and that of other Cell Press titles.