Jack Gallifant, Shan Chen, Sandeep K Jain, Pedro Moreira, Umit Topaloglu, Hugo J W L Aerts, Jeremy L Warner, William G La Cava, Danielle S Bitterman
{"title":"Reliability of Large Language Model Knowledge Across Brand and Generic Cancer Drug Names.","authors":"Jack Gallifant, Shan Chen, Sandeep K Jain, Pedro Moreira, Umit Topaloglu, Hugo J W L Aerts, Jeremy L Warner, William G La Cava, Danielle S Bitterman","doi":"10.1200/CCI-24-00257","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>To evaluate the performance and consistency of large language models (LLMs) across brand and generic oncology drug names in various clinical tasks, addressing concerns about potential fluctuations in LLM performance because of subtle phrasing differences that could affect patient care.</p><p><strong>Methods: </strong>This study evaluated three LLMs (GPT-3.5-turbo-0125, GPT-4-turbo, and GPT-4o) using drug names from HemOnc ontology. The assessment included 367 generic-to-brand and 2,516 brand-to-generic pairs, 1,000 drug-drug interaction (DDI) synthetic patient cases, and 2,438 immune-related adverse event (irAE) cases. LLMs were tested on drug name recognition, word association, DDI (DDI) detection, and irAE diagnosis using both brand and generic drug names.</p><p><strong>Results: </strong>LLMs demonstrated high accuracy in matching brand and generic names (GPT-4o: 97.38% for brand, 94.71% for generic, <i>P</i> < .01). However, they showed significant inconsistencies in word association tasks. GPT-3.5-turbo-0125 exhibited biases favoring brand names for effectiveness (odds ratio [OR], 1.43, <i>P</i> < .05) and being side-effect-free (OR, 1.76, <i>P</i> < .05). DDI detection accuracy was poor across all models (<26%), with no significant differences between brand and generic names. Sentiment analysis revealed significant differences, particularly in GPT-3.5-turbo-0125 (brand mean 0.67, generic mean 0.95, <i>P</i> < .01). 
Consistency in irAE diagnosis varied across models.</p><p><strong>Conclusion: </strong>Despite high proficiency in name-matching, LLMs exhibit inconsistencies when processing brand versus generic drug names in more complex tasks. These findings highlight the need for increased awareness, improved robustness assessment methods, and the development of more consistent systems for handling nomenclature variations in clinical applications of LLMs.</p>","PeriodicalId":51626,"journal":{"name":"JCO Clinical Cancer Informatics","volume":"9 ","pages":"e2400257"},"PeriodicalIF":3.3000,"publicationDate":"2025-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JCO Clinical Cancer Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1200/CCI-24-00257","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/6/16 0:00:00","PubModel":"Epub","JCR":"Q2","JCRName":"ONCOLOGY","Score":null,"Total":0}
Abstract
Purpose: To evaluate the performance and consistency of large language models (LLMs) across brand and generic oncology drug names in various clinical tasks, addressing concerns about potential fluctuations in LLM performance because of subtle phrasing differences that could affect patient care.
Methods: This study evaluated three LLMs (GPT-3.5-turbo-0125, GPT-4-turbo, and GPT-4o) using drug names from the HemOnc ontology. The assessment included 367 generic-to-brand and 2,516 brand-to-generic pairs, 1,000 synthetic drug-drug interaction (DDI) patient cases, and 2,438 immune-related adverse event (irAE) cases. LLMs were tested on drug name recognition, word association, DDI detection, and irAE diagnosis using both brand and generic drug names.
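The name-recognition evaluation described above can be sketched as a simple exact-match scoring loop. This is an illustrative sketch only, not the authors' code: the drug pairs shown and the `mock_model` stand-in for an LLM call are hypothetical.

```python
# Illustrative sketch (not the study's implementation): scoring a model's
# ability to map a brand name to its generic synonym, as in the
# brand-to-generic name-recognition task.
def score_name_matching(pairs, ask_model):
    """Return the fraction of (prompt_name, expected) pairs where the
    model's answer contains the expected synonym (case-insensitive)."""
    correct = 0
    for prompt_name, expected in pairs:
        answer = ask_model(f"What is the generic name of {prompt_name}?")
        if expected.lower() in answer.lower():
            correct += 1
    return correct / len(pairs)

# Hypothetical pairs; a trivial lookup stands in for a real LLM query.
brand_to_generic = [("Keytruda", "pembrolizumab"), ("Opdivo", "nivolumab")]
mock_model = lambda q: {"Keytruda": "pembrolizumab",
                        "Opdivo": "nivolumab"}[q.split()[-1].rstrip("?")]
print(score_name_matching(brand_to_generic, mock_model))  # → 1.0
```

The same loop, run once with brand-name prompts and once with generic-name prompts, yields the paired accuracies the study compares.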
Results: LLMs demonstrated high accuracy in matching brand and generic names (GPT-4o: 97.38% for brand, 94.71% for generic, P < .01). However, they showed significant inconsistencies in word association tasks. GPT-3.5-turbo-0125 exhibited biases favoring brand names for effectiveness (odds ratio [OR], 1.43, P < .05) and being side-effect-free (OR, 1.76, P < .05). DDI detection accuracy was poor across all models (<26%), with no significant differences between brand and generic names. Sentiment analysis revealed significant differences, particularly in GPT-3.5-turbo-0125 (brand mean 0.67, generic mean 0.95, P < .01). Consistency in irAE diagnosis varied across models.
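The bias statistics above are odds ratios over binary word-association choices. As a minimal sketch of how such an OR is computed from a 2x2 contingency table, the counts below are invented for illustration and are not the study's data:

```python
# Illustrative only: computing a brand-vs-generic bias odds ratio from a
# 2x2 table of word-association outcomes. All counts are hypothetical.
def odds_ratio(a, b, c, d):
    """OR for the table [[a, b], [c, d]] =
    [[brand associated, brand not associated],
     [generic associated, generic not associated]]."""
    return (a * d) / (b * c)

# Hypothetical: "effective" chosen for the brand name 60/100 times
# versus 51/100 times for the generic name.
print(round(odds_ratio(60, 40, 51, 49), 2))  # → 1.44
```

An OR above 1, as reported for GPT-3.5-turbo-0125 (1.43 for effectiveness, 1.76 for being side-effect-free), indicates the positive attribute was associated with the brand name more often than with the generic.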
Conclusion: Despite high proficiency in name-matching, LLMs exhibit inconsistencies when processing brand versus generic drug names in more complex tasks. These findings highlight the need for increased awareness, improved robustness assessment methods, and the development of more consistent systems for handling nomenclature variations in clinical applications of LLMs.