Assessing large language model performance related to aging in genetic conditions.

Impact factor: 6.0 | JCR: Q2, Geriatrics & Gerontology

npj aging Pub Date : 2025-05-03 DOI:10.1038/s41514-025-00226-z

Amna A Othman, Kendall A Flaharty, Suzanna E Ledgister Hanchard, Ping Hu, Dat Duong, Rebekah L Waikel, Benjamin D Solomon

Citation: npj aging 11(1):33. Publication type: Journal Article. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12049513/pdf/

Citation count: 0

Abstract

Most genetic conditions are described in pediatric populations, leaving a gap in understanding their clinical progression and management in adulthood. Motivated by other applications of large language models (LLMs), we evaluated whether Llama-2-70b-chat (70b) and GPT-3.5 (GPT) could generate plausible medical vignettes, patient-geneticist dialogues, and management plans for hypothetical child and adult patients across 282 genetic conditions (selected by prevalence and categorized based on age-related characteristics). Results showed that LLMs provided appropriate age-based responses in both child and adult outputs, based on Correctness and Completeness scores graded by clinicians. Sub-analysis of metabolic conditions, including those that typically present neonatally with crisis, also showed age-appropriate LLM responses. However, 70b and GPT obtained low Correctness and Completeness scores at producing plausible management plans (55-66% for 70b and a wider range, 50-90%, for GPT). This suggests that LLMs still have some limitations in clinical applications.


Source journal

npj aging

CiteScore

8.90

Self-citation rate

0.00%

Number of articles published