Off-the-Shelf Large Language Models for Causality Assessment of Individual Case Safety Reports: A Proof-of-Concept with COVID-19 Vaccines

IF 4.0 | CAS Zone 2 (Medicine) | JCR Q1 | PHARMACOLOGY & PHARMACY
Andrea Abate, Elisa Poncato, Maria Antonietta Barbieri, Greg Powell, Andrea Rossi, Simay Peker, Anders Hviid, Andrew Bate, Maurizio Sessa
{"title":"用于个案安全报告因果关系评估的现成大型语言模型:COVID-19疫苗的概念验证","authors":"Andrea Abate, Elisa Poncato, Maria Antonietta Barbieri, Greg Powell, Andrea Rossi, Simay Peker, Anders Hviid, Andrew Bate, Maurizio Sessa","doi":"10.1007/s40264-025-01531-y","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>This study evaluated the feasibility of ChatGPT and Gemini, two off-the-shelf large language models (LLMs), to automate causality assessments, focusing on Adverse Events Following Immunizations (AEFIs) of myocarditis and pericarditis related to COVID-19 vaccines.</p><p><strong>Methods: </strong>We assessed 150 COVID-19-related cases of myocarditis and pericarditis reported to the Vaccine Adverse Event Reporting System (VAERS) in the United States of America (USA). Both LLMs and human experts conducted the World Health Organization (WHO) algorithm for vaccine causality assessments, and inter-rater agreement was measured using percentage agreement. Adherence to the WHO algorithm was evaluated by comparing LLM responses to the expected sequence of the algorithm. Statistical analyses, including descriptive statistics and Random Forest modeling, explored case complexity (e.g., string length measurements) and factors affecting LLM performance and adherence.</p><p><strong>Results: </strong>ChatGPT showed higher adherence to the WHO algorithm (34%) compared to Gemini (7%) and had moderate agreement (71%) with human experts, whereas Gemini had fair agreement (53%). Both LLMs often failed to recognize listed AEFIs, with ChatGPT and Gemini incorrectly identifying 6.7% and 13.3% of AEFIs, respectively. ChatGPT showed inconsistencies in 8.0% of cases and Gemini in 46.7%. For ChatGPT, adherence to the algorithm was associated with lower string complexity in prompt sections. The random forest analysis achieved an accuracy of 55% (95% confidence interval: 35.7-73.5) for predicting adherence to the WHO algorithm for ChatGPT.</p><p><strong>Conclusion: </strong>Notable limitations of ChatGPT and Gemini have been identified in their use for aiding causality assessments in vaccine safety. ChatGPT performed better, with higher adherence and agreement with human experts. In the investigated scenario, both models are better suited as complementary tools to human expertise.</p>","PeriodicalId":11382,"journal":{"name":"Drug Safety","volume":" ","pages":""},"PeriodicalIF":4.0000,"publicationDate":"2025-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Off-the-Shelf Large Language Models for Causality Assessment of Individual Case Safety Reports: A Proof-of-Concept with COVID-19 Vaccines.\",\"authors\":\"Andrea Abate, Elisa Poncato, Maria Antonietta Barbieri, Greg Powell, Andrea Rossi, Simay Peker, Anders Hviid, Andrew Bate, Maurizio Sessa\",\"doi\":\"10.1007/s40264-025-01531-y\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>This study evaluated the feasibility of ChatGPT and Gemini, two off-the-shelf large language models (LLMs), to automate causality assessments, focusing on Adverse Events Following Immunizations (AEFIs) of myocarditis and pericarditis related to COVID-19 vaccines.</p><p><strong>Methods: </strong>We assessed 150 COVID-19-related cases of myocarditis and pericarditis reported to the Vaccine Adverse Event Reporting System (VAERS) in the United States of America (USA). 
Both LLMs and human experts conducted the World Health Organization (WHO) algorithm for vaccine causality assessments, and inter-rater agreement was measured using percentage agreement. Adherence to the WHO algorithm was evaluated by comparing LLM responses to the expected sequence of the algorithm. Statistical analyses, including descriptive statistics and Random Forest modeling, explored case complexity (e.g., string length measurements) and factors affecting LLM performance and adherence.</p><p><strong>Results: </strong>ChatGPT showed higher adherence to the WHO algorithm (34%) compared to Gemini (7%) and had moderate agreement (71%) with human experts, whereas Gemini had fair agreement (53%). Both LLMs often failed to recognize listed AEFIs, with ChatGPT and Gemini incorrectly identifying 6.7% and 13.3% of AEFIs, respectively. ChatGPT showed inconsistencies in 8.0% of cases and Gemini in 46.7%. For ChatGPT, adherence to the algorithm was associated with lower string complexity in prompt sections. The random forest analysis achieved an accuracy of 55% (95% confidence interval: 35.7-73.5) for predicting adherence to the WHO algorithm for ChatGPT.</p><p><strong>Conclusion: </strong>Notable limitations of ChatGPT and Gemini have been identified in their use for aiding causality assessments in vaccine safety. ChatGPT performed better, with higher adherence and agreement with human experts. In the investigated scenario, both models are better suited as complementary tools to human expertise.</p>\",\"PeriodicalId\":11382,\"journal\":{\"name\":\"Drug Safety\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":4.0000,\"publicationDate\":\"2025-03-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Drug Safety\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1007/s40264-025-01531-y\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"PHARMACOLOGY & PHARMACY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Drug Safety","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1007/s40264-025-01531-y","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"PHARMACOLOGY & PHARMACY","Score":null,"Total":0}
引用次数: 0

Abstract

Background: This study evaluated the feasibility of using ChatGPT and Gemini, two off-the-shelf large language models (LLMs), to automate causality assessments, focusing on Adverse Events Following Immunization (AEFIs) of myocarditis and pericarditis related to COVID-19 vaccines.

Methods: We assessed 150 COVID-19-related cases of myocarditis and pericarditis reported to the Vaccine Adverse Event Reporting System (VAERS) in the United States of America (USA). Both the LLMs and human experts applied the World Health Organization (WHO) algorithm for vaccine causality assessment, and inter-rater agreement was measured using percentage agreement. Adherence to the WHO algorithm was evaluated by comparing each LLM's responses to the expected sequence of algorithm steps. Statistical analyses, including descriptive statistics and Random Forest modeling, explored case complexity (e.g., string length measurements) and factors affecting LLM performance and adherence.
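
To make the agreement and adherence metrics above concrete, the following minimal sketch shows how they could be computed. The labels, step names, and toy data are hypothetical illustrations, not the authors' actual pipeline or the official WHO step wording.

```python
# Minimal sketch of percentage agreement and WHO-algorithm adherence checks.
# Labels, step names, and data are hypothetical; not the authors' pipeline.

def percentage_agreement(llm_labels, expert_labels):
    """Share of cases where the LLM and the human expert assign the same causality category."""
    assert len(llm_labels) == len(expert_labels), "raters must assess the same cases"
    matches = sum(a == b for a, b in zip(llm_labels, expert_labels))
    return matches / len(llm_labels)

def adheres_to_algorithm(llm_steps, expected_steps):
    """True if the LLM walked through the expected algorithm steps, in order, with none skipped."""
    return list(llm_steps) == list(expected_steps)

# Toy example: three reports, simplified causality categories and a three-step sequence.
llm_out = ["consistent", "indeterminate", "consistent"]
expert_out = ["consistent", "inconsistent", "consistent"]
print(percentage_agreement(llm_out, expert_out))  # ~0.67

expected = ["eligibility", "checklist", "classification"]
print(adheres_to_algorithm(["eligibility", "checklist", "classification"], expected))  # True
print(adheres_to_algorithm(["checklist", "classification"], expected))                 # False
```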

Results: ChatGPT showed higher adherence to the WHO algorithm (34%) compared to Gemini (7%) and had moderate agreement (71%) with human experts, whereas Gemini had fair agreement (53%). Both LLMs often failed to recognize listed AEFIs, with ChatGPT and Gemini incorrectly identifying 6.7% and 13.3% of AEFIs, respectively. ChatGPT showed inconsistencies in 8.0% of cases and Gemini in 46.7%. For ChatGPT, adherence to the algorithm was associated with lower string complexity in prompt sections. The random forest analysis achieved an accuracy of 55% (95% confidence interval: 35.7-73.5) for predicting adherence to the WHO algorithm for ChatGPT.
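
As a hedged illustration of the Random Forest analysis described above, the sketch below trains a classifier to predict WHO-algorithm adherence from case-complexity features such as string lengths, and attaches a Wilson 95% confidence interval to the held-out accuracy. The feature names and synthetic data are assumptions for illustration only; scikit-learn, NumPy, and statsmodels are assumed to be available.

```python
# Hedged sketch: Random Forest predicting WHO-algorithm adherence from
# case-complexity features (e.g., string lengths of prompt sections).
# Features and data are synthetic stand-ins, not the study's dataset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from statsmodels.stats.proportion import proportion_confint

rng = np.random.default_rng(0)
n = 150  # same order of magnitude as the 150 VAERS reports analysed in the study

X = np.column_stack([
    rng.integers(200, 3000, n),  # character length of the case narrative
    rng.integers(50, 500, n),    # character length of the symptom section
    rng.integers(1, 10, n),      # number of reported AEFIs
])
y = rng.integers(0, 2, n)        # 1 = LLM adhered to the WHO algorithm, 0 = it did not

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)

correct = int((clf.predict(X_test) == y_test).sum())
accuracy = correct / len(y_test)
lo, hi = proportion_confint(correct, len(y_test), alpha=0.05, method="wilson")
print(f"accuracy = {accuracy:.2f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```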

Conclusion: Notable limitations of ChatGPT and Gemini have been identified in their use for aiding causality assessments in vaccine safety. ChatGPT performed better, with higher adherence and agreement with human experts. In the investigated scenario, both models are better suited as complementary tools to human expertise.

Source journal: Drug Safety (Medicine - Toxicology)
CiteScore: 7.60
Self-citation rate: 7.10%
Articles per year: 112
Review time: 6-12 weeks
Journal description: Drug Safety is the official journal of the International Society of Pharmacovigilance. The journal includes:
- Overviews of contentious or emerging issues.
- Comprehensive narrative reviews that provide an authoritative source of information on the epidemiology, clinical features, prevention and management of adverse effects of individual drugs and drug classes.
- In-depth benefit-risk assessments of adverse effect and efficacy data for a drug in a defined therapeutic area.
- Systematic reviews (with or without meta-analyses) that collate empirical evidence to answer a specific research question, using explicit, systematic methods as outlined by the PRISMA statement.
- Original research articles reporting the results of well-designed studies in disciplines such as pharmacoepidemiology, pharmacovigilance, pharmacology and toxicology, and pharmacogenomics.
- Editorials and commentaries on topical issues.
Additional digital features (including animated abstracts, video abstracts, slide decks, audio slides, instructional videos, infographics, podcasts and animations) can be published with articles; these are designed to increase the visibility, readership and educational value of the journal's content. In addition, articles published in Drug Safety may be accompanied by plain language summaries to assist readers who have some knowledge of, but not in-depth expertise in, the area to understand important medical advances.