Reliability of Large Language Model Knowledge Across Brand and Generic Cancer Drug Names.

IF 3.3, JCR Q2 (Oncology)
JCO Clinical Cancer Informatics. Pub Date: 2025-06-01. Epub Date: 2025-06-16. DOI: 10.1200/CCI-24-00257
Jack Gallifant, Shan Chen, Sandeep K Jain, Pedro Moreira, Umit Topaloglu, Hugo J W L Aerts, Jeremy L Warner, William G La Cava, Danielle S Bitterman
Citations: 0

Abstract

Purpose: To evaluate the performance and consistency of large language models (LLMs) across brand and generic oncology drug names in various clinical tasks, addressing concerns about potential fluctuations in LLM performance because of subtle phrasing differences that could affect patient care.

Methods: This study evaluated three LLMs (GPT-3.5-turbo-0125, GPT-4-turbo, and GPT-4o) using drug names from the HemOnc ontology. The assessment included 367 generic-to-brand and 2,516 brand-to-generic pairs, 1,000 synthetic patient cases for drug-drug interaction (DDI), and 2,438 immune-related adverse event (irAE) cases. LLMs were tested on drug name recognition, word association, DDI detection, and irAE diagnosis using both brand and generic drug names.
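
The brand-versus-generic comparison described in the Methods can be sketched as a name-swapping step that produces two otherwise identical prompts. This is a minimal illustration, not the authors' pipeline: the mapping `GENERIC_TO_BRAND`, the helpers `swap_drug_names` and `paired_prompts`, and the example sentence are all assumptions introduced here. The study itself drew its pairs from the HemOnc ontology.

```python
import re

# Hypothetical, hand-written mapping for illustration only; the study drew
# its pairs from the HemOnc ontology, which is not reproduced here.
GENERIC_TO_BRAND = {
    "imatinib": "Gleevec",
    "pembrolizumab": "Keytruda",
    "trastuzumab": "Herceptin",
}


def swap_drug_names(text: str, mapping: dict[str, str]) -> str:
    """Replace each whole-word key found in `text` with its mapped name.

    Matching is case-insensitive; the replacement uses the mapped spelling.
    """
    if not mapping:
        return text
    # Try longer names first so a long name wins over a shorter one it contains.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    lookup = {name.lower(): replacement for name, replacement in mapping.items()}
    return pattern.sub(lambda match: lookup[match.group(0).lower()], text)


def paired_prompts(template: str, mapping: dict[str, str]) -> tuple[str, str]:
    """Return (generic_version, brand_version) of the same prompt."""
    return template, swap_drug_names(template, mapping)


# Illustrative sentence, not taken from the study's cases.
generic_prompt, brand_prompt = paired_prompts(
    "The patient started pembrolizumab last month and now reports diarrhea.",
    GENERIC_TO_BRAND,
)
```

Sending both prompts to the same model and comparing the answers gives one consistency check per pair. Because only the drug name differs, any change in the output can be attributed to the nomenclature.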

Results: LLMs demonstrated high accuracy in matching brand and generic names (GPT-4o: 97.38% for brand, 94.71% for generic, P < .01). However, they showed significant inconsistencies in word association tasks. GPT-3.5-turbo-0125 exhibited biases favoring brand names for effectiveness (odds ratio [OR], 1.43, P < .05) and being side-effect-free (OR, 1.76, P < .05). DDI detection accuracy was poor across all models (<26%), with no significant differences between brand and generic names. Sentiment analysis revealed significant differences, particularly in GPT-3.5-turbo-0125 (brand mean 0.67, generic mean 0.95, P < .01). Consistency in irAE diagnosis varied across models.
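
For readers unfamiliar with the odds ratios reported above, the sketch below shows how an OR and a Wald 95% confidence interval are computed from a 2x2 table. The abstract does not state which estimation method the authors used, so this is an assumption. The function `odds_ratio_wald` is a hypothetical helper, and the counts are invented purely for illustration; they are not data from the study.

```python
import math


def odds_ratio_wald(a: int, b: int, c: int, d: int) -> tuple[float, float, float]:
    """Odds ratio and Wald 95% CI for the 2x2 table [[a, b], [c, d]].

    Rows: brand name, generic name.
    Columns: attribute chosen, attribute not chosen.
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells must be positive for the Wald interval")
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(OR) under the usual large-sample approximation.
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
    upper = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)
    return odds_ratio, lower, upper


# Invented counts for illustration only (NOT from the study):
# brand name linked to "effective" in 60 of 100 trials, generic in 50 of 100.
or_value, ci_lower, ci_upper = odds_ratio_wald(60, 40, 50, 50)
```

An OR above 1 in this layout means the attribute was chosen more often for the brand name than for the generic name. A confidence interval that excludes 1 corresponds to a statistically significant difference at roughly the 5% level.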

Conclusion: Despite high proficiency in name-matching, LLMs exhibit inconsistencies when processing brand versus generic drug names in more complex tasks. These findings highlight the need for increased awareness, improved robustness assessment methods, and the development of more consistent systems for handling nomenclature variations in clinical applications of LLMs.

Source journal metrics: CiteScore 6.20; self-citation rate 4.80%; articles published 190