Comparative evaluation of artificial intelligence platforms and drug interaction screening databases using real-world patient data

Bálint Márk Domián, Amir Reza Ashraf, András Tamás Fittler, Mátyás Káplár, Róbert György Vida

Exploratory Research in Clinical and Social Pharmacy, Volume 20, Article 100655 (published 2025-09-08). DOI: 10.1016/j.rcsop.2025.100655
Abstract
Background
The use of multiple medications increases the risk of harmful drug-drug interactions (DDIs). Conventional DDI screening databases vary in coverage and often trigger low-relevance alerts, contributing to alert fatigue. Large language models (LLMs) have emerged as potential tools for DDI identification; however, their performance compared to established databases on real-world patient data remains underexplored.
Methods
In this exploratory study, we compared conventional database screening with LLM-based screening using anonymized medication lists from rheumatology patients. Lexicomp, Medscape and Drugs.com were used to compile a reference set of 204 clinically relevant interactions across 57 cases. Using identical prompts, we then queried ChatGPT, Google Gemini and Microsoft Copilot for interactions potentially requiring pharmacists' intervention. We calculated sensitivity, specificity, precision and F1 score.
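The abstract does not spell out how the four metrics were computed; the sketch below shows the conventional confusion-matrix definitions, with hypothetical counts for illustration only (the function name and example numbers are not from the study).

```python
# Minimal sketch of the standard screening metrics; counts are hypothetical.

def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Sensitivity, specificity, precision and F1 for DDI screening.

    tp: reference interactions the platform flagged
    fp: flagged interactions not in the reference set
    fn: reference interactions the platform missed
    tn: non-interacting pairs correctly left unflagged
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0   # recall
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    f1 = (2 * precision * sensitivity / (precision + sensitivity)
          if (precision + sensitivity) else 0.0)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "precision": precision, "f1": f1}

# Example with made-up counts:
print(screening_metrics(tp=120, fp=300, fn=84, tn=2000))
```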
Results
Against the reference set of 204 DDIs, ChatGPT flagged 439 potential interactions, Gemini 1556, and Copilot 1813. While Gemini achieved the highest sensitivity (0.697), ChatGPT demonstrated higher specificity (0.868). All three platforms showed low precision. Overall, ChatGPT achieved the highest F1 score (0.2520), followed by Gemini (0.1933) and Copilot (0.1153). Our results suggest that none of the AI systems assessed achieves the balance of precision and sensitivity required for reliable clinical decision-making in DDI screening.
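As a rough sanity check on why precision is necessarily low, the snippet below uses only the totals reported in the abstract: even if every one of the 204 reference DDIs were among a platform's flagged interactions, precision could not exceed 204 divided by the number it reported. True-positive counts are not given here, so the actual precision values are lower still; this is an illustrative upper bound, not the study's calculation.

```python
# Upper bound on precision from the reported totals only (illustrative).
REFERENCE_DDIS = 204
reported = {"ChatGPT": 439, "Gemini": 1556, "Copilot": 1813}

for platform, flagged in reported.items():
    max_precision = REFERENCE_DDIS / flagged
    print(f"{platform}: precision <= {max_precision:.3f}")
# ChatGPT: precision <= 0.465
# Gemini: precision <= 0.131
# Copilot: precision <= 0.113
```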
Conclusion
Although LLMs show promise as complementary tools in DDI screening, having proved effective at identifying true interactions, they also generate clinically inaccurate information due to hallucinations, which limits their reliability as standalone screening tools. Consequently, while LLMs could support clinical pharmacists in polypharmacy management, their outputs must always undergo professional validation to ensure patient safety.