Kexin Zhu, Jiajie Zhang, Anton Klishin, Mario Esser, William A Blumentals, Juhaeri Juhaeri, Corinne Jouquelet-Royer, Sarah-Jo Sinnott
{"title":"Evaluating the Accuracy of Responses by Large Language Models for Information on Disease Epidemiology.","authors":"Kexin Zhu, Jiajie Zhang, Anton Klishin, Mario Esser, William A Blumentals, Juhaeri Juhaeri, Corinne Jouquelet-Royer, Sarah-Jo Sinnott","doi":"10.1002/pds.70111","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>Accurate background epidemiology of diseases are required in pharmacoepidemiologic research. We evaluated the performance of large language models (LLMs), including ChatGPT-3.5, ChatGPT-4, and Google Bard, when prompted with questions on disease frequency.</p><p><strong>Methods: </strong>A total of 21 questions on the prevalence and incidence of common and rare diseases were developed and submitted to each LLM twice on different dates. Benchmark data were obtained from literature searches targeting \"gold-standard\" references (e.g., government statistics, peer-reviewed articles). Accuracy was evaluated by comparing LLMs' responses to the benchmark data. Consistency was determined by comparing the responses to the same query submitted on different dates. The relevance and authenticity of references were evaluated.</p><p><strong>Results: </strong>Three LLMs generated 126 responses. In ChatGPT-4, 76.2% of responses were accurate, which was higher compared to 50.0% in Bard and 45.2% in ChatGPT-3.5. ChatGPT-4 exhibited higher consistency (71.4%) than Bard (57.9%) or ChatGPT-3.5 (46.7%). ChatGPT-4 provided 52 references with 27 (51.9%) providing relevant information, and all were authentic. Only 9.2% (10/109) of references from Bard were relevant. Of 65/109 unique references, 67.7% were authentic, 7.7% provided insufficient information for access, 10.8% provided inaccurate citation, and 13.8% were non-existent/fabricated. ChatGPT-3.5 did not provide any references.</p><p><strong>Conclusions: </strong>ChatGPT-4 outperformed in retrieving information on disease epidemiology compared to Bard and ChatGPT-3.5. However, all three LLMs presented inaccurate responses, including irrelevant, incomplete, or fabricated references. Such limitations preclude the utility of the current forms of LLMs in obtaining accurate disease epidemiology by researchers in the pharmaceutical industry, in academia, or in the regulatory setting.</p>","PeriodicalId":19782,"journal":{"name":"Pharmacoepidemiology and Drug Safety","volume":"34 2","pages":"e70111"},"PeriodicalIF":2.4000,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11791122/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pharmacoepidemiology and Drug Safety","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1002/pds.70111","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"PHARMACOLOGY & PHARMACY","Score":null,"Total":0}
引用次数: 0
Abstract
Purpose: Accurate background epidemiology of diseases is required in pharmacoepidemiologic research. We evaluated the performance of large language models (LLMs), including ChatGPT-3.5, ChatGPT-4, and Google Bard, when prompted with questions on disease frequency.
Methods: A total of 21 questions on the prevalence and incidence of common and rare diseases were developed and submitted to each LLM twice on different dates. Benchmark data were obtained from literature searches targeting "gold-standard" references (e.g., government statistics, peer-reviewed articles). Accuracy was evaluated by comparing LLMs' responses to the benchmark data. Consistency was determined by comparing the responses to the same query submitted on different dates. The relevance and authenticity of references were evaluated.
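To make the accuracy and consistency measures concrete, the sketch below shows one way such calculations could be coded. The data layout, the numeric-tolerance matching rule, and all names are illustrative assumptions; the study itself judged responses against gold-standard references rather than with an automated rule.

# Illustrative sketch only: the matching rule (relative tolerance) and data
# structures are assumptions, not the paper's actual adjudication method.
from dataclasses import dataclass

@dataclass
class Response:
    question_id: int         # one of the 21 epidemiology questions
    submission_round: int    # 1 or 2 (same query submitted on two different dates)
    estimate: float | None   # numeric prevalence/incidence extracted from the LLM reply

def is_accurate(estimate: float | None, benchmark: float, rel_tol: float = 0.2) -> bool:
    # Hypothetical rule: accurate if within +/- rel_tol of the benchmark value.
    if estimate is None:
        return False
    return abs(estimate - benchmark) <= rel_tol * benchmark

def accuracy(responses: list[Response], benchmarks: dict[int, float]) -> float:
    # Share of all responses that match their gold-standard benchmark.
    hits = sum(is_accurate(r.estimate, benchmarks[r.question_id]) for r in responses)
    return hits / len(responses)

def consistency(responses: list[Response]) -> float:
    # Share of questions for which the two submissions gave the same answer.
    by_question: dict[int, list[float | None]] = {}
    for r in responses:
        by_question.setdefault(r.question_id, []).append(r.estimate)
    paired = [v for v in by_question.values() if len(v) == 2]
    same = sum(v[0] == v[1] for v in paired)
    return same / len(paired)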
Results: The three LLMs generated 126 responses. For ChatGPT-4, 76.2% of responses were accurate, compared with 50.0% for Bard and 45.2% for ChatGPT-3.5. ChatGPT-4 also exhibited higher consistency (71.4%) than Bard (57.9%) or ChatGPT-3.5 (46.7%). ChatGPT-4 provided 52 references, of which 27 (51.9%) contained relevant information, and all were authentic. Only 9.2% (10/109) of the references from Bard were relevant. Of Bard's 65 unique references (out of 109 in total), 67.7% were authentic, 7.7% provided insufficient information for retrieval, 10.8% gave inaccurate citations, and 13.8% were non-existent or fabricated. ChatGPT-3.5 did not provide any references.
Conclusions: ChatGPT-4 outperformed Bard and ChatGPT-3.5 in retrieving information on disease epidemiology. However, all three LLMs produced inaccurate responses, including irrelevant, incomplete, or fabricated references. Such limitations preclude the use of LLMs, in their current forms, for obtaining accurate disease epidemiology by researchers in the pharmaceutical industry, in academia, or in regulatory settings.
About the Journal:
The aim of Pharmacoepidemiology and Drug Safety is to provide an international forum for the communication and evaluation of data, methods and opinion in the discipline of pharmacoepidemiology. The Journal publishes peer-reviewed reports of original research, invited reviews and a variety of guest editorials and commentaries embracing scientific, medical, statistical, legal and economic aspects of pharmacoepidemiology and post-marketing surveillance of drug safety. Appropriate material in these categories may also be considered for publication as a Brief Report.
Particular areas of interest include:
design, analysis, results, and interpretation of studies looking at the benefit or safety of specific pharmaceuticals, biologics, or medical devices, including studies in pharmacovigilance, postmarketing surveillance, pharmacoeconomics, patient safety, molecular pharmacoepidemiology, or any other study within the broad field of pharmacoepidemiology;
comparative effectiveness research relating to pharmaceuticals, biologics, and medical devices. Comparative effectiveness research is the generation and synthesis of evidence that compares the benefits and harms of alternative methods to prevent, diagnose, treat, and monitor a clinical condition, as these methods are truly used in the real world;
methodologic contributions of relevance to pharmacoepidemiology, whether original contributions, reviews of existing methods, or tutorials for how to apply the methods of pharmacoepidemiology;
assessments of harm versus benefit in drug therapy;
patterns of drug utilization;
relationships between pharmacoepidemiology and the formulation and interpretation of regulatory guidelines;
evaluations of risk management plans and programmes relating to pharmaceuticals, biologics and medical devices.