Evaluating the Accuracy of Responses by Large Language Models for Information on Disease Epidemiology.

IF 2.4 | CAS Tier 4 (Medicine) | JCR Q3 PHARMACOLOGY & PHARMACY
Kexin Zhu, Jiajie Zhang, Anton Klishin, Mario Esser, William A Blumentals, Juhaeri Juhaeri, Corinne Jouquelet-Royer, Sarah-Jo Sinnott
{"title":"用大型语言模型评估疾病流行病学信息反应的准确性。","authors":"Kexin Zhu, Jiajie Zhang, Anton Klishin, Mario Esser, William A Blumentals, Juhaeri Juhaeri, Corinne Jouquelet-Royer, Sarah-Jo Sinnott","doi":"10.1002/pds.70111","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>Accurate background epidemiology of diseases are required in pharmacoepidemiologic research. We evaluated the performance of large language models (LLMs), including ChatGPT-3.5, ChatGPT-4, and Google Bard, when prompted with questions on disease frequency.</p><p><strong>Methods: </strong>A total of 21 questions on the prevalence and incidence of common and rare diseases were developed and submitted to each LLM twice on different dates. Benchmark data were obtained from literature searches targeting \"gold-standard\" references (e.g., government statistics, peer-reviewed articles). Accuracy was evaluated by comparing LLMs' responses to the benchmark data. Consistency was determined by comparing the responses to the same query submitted on different dates. The relevance and authenticity of references were evaluated.</p><p><strong>Results: </strong>Three LLMs generated 126 responses. In ChatGPT-4, 76.2% of responses were accurate, which was higher compared to 50.0% in Bard and 45.2% in ChatGPT-3.5. ChatGPT-4 exhibited higher consistency (71.4%) than Bard (57.9%) or ChatGPT-3.5 (46.7%). ChatGPT-4 provided 52 references with 27 (51.9%) providing relevant information, and all were authentic. Only 9.2% (10/109) of references from Bard were relevant. Of 65/109 unique references, 67.7% were authentic, 7.7% provided insufficient information for access, 10.8% provided inaccurate citation, and 13.8% were non-existent/fabricated. ChatGPT-3.5 did not provide any references.</p><p><strong>Conclusions: </strong>ChatGPT-4 outperformed in retrieving information on disease epidemiology compared to Bard and ChatGPT-3.5. However, all three LLMs presented inaccurate responses, including irrelevant, incomplete, or fabricated references. Such limitations preclude the utility of the current forms of LLMs in obtaining accurate disease epidemiology by researchers in the pharmaceutical industry, in academia, or in the regulatory setting.</p>","PeriodicalId":19782,"journal":{"name":"Pharmacoepidemiology and Drug Safety","volume":"34 2","pages":"e70111"},"PeriodicalIF":2.4000,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11791122/pdf/","citationCount":"0","resultStr":"{\"title\":\"Evaluating the Accuracy of Responses by Large Language Models for Information on Disease Epidemiology.\",\"authors\":\"Kexin Zhu, Jiajie Zhang, Anton Klishin, Mario Esser, William A Blumentals, Juhaeri Juhaeri, Corinne Jouquelet-Royer, Sarah-Jo Sinnott\",\"doi\":\"10.1002/pds.70111\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Purpose: </strong>Accurate background epidemiology of diseases are required in pharmacoepidemiologic research. We evaluated the performance of large language models (LLMs), including ChatGPT-3.5, ChatGPT-4, and Google Bard, when prompted with questions on disease frequency.</p><p><strong>Methods: </strong>A total of 21 questions on the prevalence and incidence of common and rare diseases were developed and submitted to each LLM twice on different dates. Benchmark data were obtained from literature searches targeting \\\"gold-standard\\\" references (e.g., government statistics, peer-reviewed articles). 
Accuracy was evaluated by comparing LLMs' responses to the benchmark data. Consistency was determined by comparing the responses to the same query submitted on different dates. The relevance and authenticity of references were evaluated.</p><p><strong>Results: </strong>Three LLMs generated 126 responses. In ChatGPT-4, 76.2% of responses were accurate, which was higher compared to 50.0% in Bard and 45.2% in ChatGPT-3.5. ChatGPT-4 exhibited higher consistency (71.4%) than Bard (57.9%) or ChatGPT-3.5 (46.7%). ChatGPT-4 provided 52 references with 27 (51.9%) providing relevant information, and all were authentic. Only 9.2% (10/109) of references from Bard were relevant. Of 65/109 unique references, 67.7% were authentic, 7.7% provided insufficient information for access, 10.8% provided inaccurate citation, and 13.8% were non-existent/fabricated. ChatGPT-3.5 did not provide any references.</p><p><strong>Conclusions: </strong>ChatGPT-4 outperformed in retrieving information on disease epidemiology compared to Bard and ChatGPT-3.5. However, all three LLMs presented inaccurate responses, including irrelevant, incomplete, or fabricated references. Such limitations preclude the utility of the current forms of LLMs in obtaining accurate disease epidemiology by researchers in the pharmaceutical industry, in academia, or in the regulatory setting.</p>\",\"PeriodicalId\":19782,\"journal\":{\"name\":\"Pharmacoepidemiology and Drug Safety\",\"volume\":\"34 2\",\"pages\":\"e70111\"},\"PeriodicalIF\":2.4000,\"publicationDate\":\"2025-02-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11791122/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Pharmacoepidemiology and Drug Safety\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1002/pds.70111\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"PHARMACOLOGY & PHARMACY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pharmacoepidemiology and Drug Safety","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1002/pds.70111","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"PHARMACOLOGY & PHARMACY","Score":null,"Total":0}
Citations: 0

Abstract


Purpose: Accurate background epidemiology of diseases is required in pharmacoepidemiologic research. We evaluated the performance of three large language models (LLMs), ChatGPT-3.5, ChatGPT-4, and Google Bard, when prompted with questions on disease frequency.

Methods: A total of 21 questions on the prevalence and incidence of common and rare diseases were developed and submitted to each LLM twice on different dates. Benchmark data were obtained from literature searches targeting "gold-standard" references (e.g., government statistics, peer-reviewed articles). Accuracy was evaluated by comparing LLMs' responses to the benchmark data. Consistency was determined by comparing the responses to the same query submitted on different dates. The relevance and authenticity of references were evaluated.
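To make the scoring concrete, here is a minimal sketch of how accuracy and consistency could be computed under this protocol. The authors publish no code with the abstract, so the Response structure, the relative tolerance band, and all values below are hypothetical illustrations rather than the study's actual criteria.

```python
# A minimal sketch of the accuracy/consistency scoring described in the
# Methods. NOT the authors' code: the Response structure, the +/-20%
# accuracy band, and all values below are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    question_id: int           # one of the 21 prevalence/incidence questions
    date: str                  # submission date (each query was sent twice)
    estimate: Optional[float]  # disease frequency parsed from the reply


def is_accurate(estimate: Optional[float], benchmark: float,
                tolerance: float = 0.2) -> bool:
    """Count a response as accurate if it falls within a relative tolerance
    of the gold-standard benchmark (the paper's exact criterion may differ)."""
    return estimate is not None and abs(estimate - benchmark) <= tolerance * benchmark


def accuracy_rate(responses: list[Response], benchmarks: dict[int, float]) -> float:
    """Share of responses matching the benchmark data."""
    hits = sum(is_accurate(r.estimate, benchmarks[r.question_id]) for r in responses)
    return hits / len(responses)


def consistency_rate(responses: list[Response]) -> float:
    """Share of questions whose two submissions (different dates)
    returned the same estimate."""
    by_question: dict[int, list[Optional[float]]] = {}
    for r in responses:
        by_question.setdefault(r.question_id, []).append(r.estimate)
    pairs = [v for v in by_question.values() if len(v) == 2]
    return sum(a == b for a, b in pairs) / len(pairs)


# Toy usage with fabricated numbers (prevalence as a proportion):
benchmarks = {1: 0.05, 2: 0.001}
responses = [
    Response(1, "2023-10-01", 0.048), Response(1, "2023-11-01", 0.048),
    Response(2, "2023-10-01", 0.002), Response(2, "2023-11-01", 0.0015),
]
print(f"accuracy:    {accuracy_rate(responses, benchmarks):.1%}")  # 50.0%
print(f"consistency: {consistency_rate(responses):.1%}")          # 50.0%
```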

Results: The three LLMs generated 126 responses in total. In ChatGPT-4, 76.2% of responses were accurate, compared with 50.0% in Bard and 45.2% in ChatGPT-3.5. ChatGPT-4 also exhibited higher consistency (71.4%) than Bard (57.9%) or ChatGPT-3.5 (46.7%). ChatGPT-4 provided 52 references, of which 27 (51.9%) contained relevant information; all 52 were authentic. Only 9.2% (10/109) of the references from Bard were relevant. Of Bard's 109 references, 65 were unique; of these, 67.7% were authentic, 7.7% provided insufficient information for retrieval, 10.8% contained inaccurate citation details, and 13.8% were non-existent/fabricated. ChatGPT-3.5 did not provide any references.
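As a quick arithmetic check (added here, not part of the original abstract), the reported percentages are internally consistent; the counts implied by the rounded percentages for Bard's 65 unique references are approximately:

```latex
% Reference-relevance rates reported in the abstract:
% ChatGPT-4: 27 relevant of 52 references; Bard: 10 relevant of 109.
\[
  \frac{27}{52} \approx 51.9\%, \qquad \frac{10}{109} \approx 9.2\%
\]
% Implied counts for Bard's 65 unique references (inferred from the
% rounded percentages; the paper itself reports only the percentages):
\[
  0.677 \times 65 \approx 44,\quad
  0.077 \times 65 \approx 5,\quad
  0.108 \times 65 \approx 7,\quad
  0.138 \times 65 \approx 9,\qquad 44 + 5 + 7 + 9 = 65.
\]
```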

Conclusions: ChatGPT-4 outperformed Bard and ChatGPT-3.5 in retrieving information on disease epidemiology. However, all three LLMs produced inaccurate responses, including irrelevant, incomplete, or fabricated references. Such limitations preclude researchers in the pharmaceutical industry, academia, or regulatory settings from relying on the current forms of these LLMs to obtain accurate disease epidemiology.

Source journal: Pharmacoepidemiology and Drug Safety
CiteScore: 4.80
Self-citation rate: 7.70%
Articles per year: 173
Review turnaround: 3 months
Journal introduction: The aim of Pharmacoepidemiology and Drug Safety is to provide an international forum for the communication and evaluation of data, methods and opinion in the discipline of pharmacoepidemiology. The Journal publishes peer-reviewed reports of original research, invited reviews and a variety of guest editorials and commentaries embracing scientific, medical, statistical, legal and economic aspects of pharmacoepidemiology and post-marketing surveillance of drug safety. Appropriate material in these categories may also be considered for publication as a Brief Report.

Particular areas of interest include: design, analysis, results, and interpretation of studies looking at the benefit or safety of specific pharmaceuticals, biologics, or medical devices, including studies in pharmacovigilance, postmarketing surveillance, pharmacoeconomics, patient safety, molecular pharmacoepidemiology, or any other study within the broad field of pharmacoepidemiology; comparative effectiveness research relating to pharmaceuticals, biologics, and medical devices, where comparative effectiveness research is the generation and synthesis of evidence that compares the benefits and harms of alternative methods to prevent, diagnose, treat, and monitor a clinical condition, as these methods are truly used in the real world; methodologic contributions of relevance to pharmacoepidemiology, whether original contributions, reviews of existing methods, or tutorials for how to apply the methods of pharmacoepidemiology; assessments of harm versus benefit in drug therapy; patterns of drug utilization; relationships between pharmacoepidemiology and the formulation and interpretation of regulatory guidelines; and evaluations of risk management plans and programmes relating to pharmaceuticals, biologics and medical devices.