Digesting Digital Health: A Study of Appropriateness and Readability of ChatGPT-Generated Gastroenterological Information
Avi Toiv, Zachary Saleh, Angela Ishak, Eva Alsheik, Deepak Venkat, Neilanjan Nandi, Tobias E Zuchelli
Clinical and Translational Gastroenterology, published 2024-08-30. DOI: 10.14309/ctg.0000000000000765 (https://doi.org/10.14309/ctg.0000000000000765)
Abstract
Introduction: The advent of artificial intelligence-powered large language models capable of generating interactive responses to intricate queries marks a groundbreaking development in how patients access medical information. Our aim was to evaluate the appropriateness and readability of gastroenterological information generated by Chat Generative Pretrained Transformer (ChatGPT).
Methods: We analyzed responses generated by ChatGPT to 16 dialog-based queries assessing symptoms and treatments for gastrointestinal conditions and 13 definition-based queries on prevalent topics in gastroenterology. Three board-certified gastroenterologists evaluated output appropriateness with a 5-point Likert-scale proxy measurement of currency, relevance, accuracy, comprehensiveness, clarity, and urgency/next steps. Outputs with a score of 4 or 5 in all 6 categories were designated as "appropriate." Output readability was assessed with the Flesch Reading Ease score, the Flesch-Kincaid Reading Level, and the Simple Measure of Gobbledygook (SMOG) score.
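The scoring described in the Methods can be sketched in a few lines of Python. This is an illustrative sketch only, not the authors' code: the function and category names are hypothetical, the abstract does not say how the three reviewers' ratings were combined (so the rule below takes a single set of six scores), and it does not name the tool used to count words, sentences, and syllables (so the readability functions take those counts as inputs). The formulas are the standard published forms of the three indices.

```python
import math

# Hypothetical identifiers for the six rated categories named in the Methods.
CATEGORIES = (
    "currency", "relevance", "accuracy",
    "comprehensiveness", "clarity", "urgency_next_steps",
)

def is_appropriate(scores):
    """Return True only if all six categories are rated 4 or 5 (5-point scale)."""
    return all(scores[c] >= 4 for c in CATEGORIES)

def flesch_reading_ease(words, sentences, syllables):
    """Standard Flesch Reading Ease; higher values mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Standard Flesch-Kincaid formula; the result is a US school grade level."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def smog_index(polysyllables, sentences):
    """Standard SMOG grade; polysyllables are words of three or more syllables."""
    return 1.0430 * math.sqrt(polysyllables * 30 / sentences) + 3.1291
```

Under this rule a single category rated 3 is enough to make an output not "appropriate," which is why the criterion is strict even when most ratings are high.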
Results: ChatGPT responses to 44% of the 16 dialog-based and 69% of the 13 definition-based questions were deemed appropriate, and the proportion of appropriate responses did not differ significantly between the 2 groups of questions (P = 0.17). Notably, none of ChatGPT's responses to questions related to gastrointestinal emergencies were designated appropriate. The mean readability scores showed that outputs were written at a college-level reading proficiency.
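The group comparison can be checked with a short calculation. The abstract does not name the statistical test or give raw counts, so two assumptions are made here: the counts are taken as 7 of 16 and 9 of 13 (the integers closest to the reported 44% and 69%), and the test is assumed to be an uncorrected Pearson chi-square test on the 2x2 table. Under those assumptions the result is consistent with the reported P value; a different test, such as Fisher's exact test or a continuity-corrected chi-square, would give a different number.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Uncorrected Pearson chi-square test for the 2x2 table [[a, b], [c, d]].

    Returns (statistic, p_value). With 1 degree of freedom the upper-tail
    probability is erfc(sqrt(statistic / 2)).
    """
    n = a + b + c + d
    statistic = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value

# Assumed counts (not stated in the abstract):
# dialog-based: 7 appropriate, 9 not; definition-based: 9 appropriate, 4 not.
statistic, p_value = chi_square_2x2(7, 9, 9, 4)
print(f"chi-square = {statistic:.2f}, p = {p_value:.2f}")
```

With only 29 questions in total, a 25-percentage-point difference is not enough to reach conventional significance, so the result should not be read as evidence that the two question types perform equally.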
Discussion: ChatGPT can produce generally fitting responses to gastroenterological medical queries, but responses were constrained in appropriateness and readability, which limits the current utility of this large language model. Substantial development is essential before these models can be unequivocally endorsed as reliable sources of medical information.
About the journal:
Clinical and Translational Gastroenterology (CTG), published on behalf of the American College of Gastroenterology (ACG), is a peer-reviewed open access online journal dedicated to innovative clinical work in the field of gastroenterology and hepatology. CTG hopes to fulfill an unmet need for clinicians and scientists by welcoming novel cohort studies, early-phase clinical trials, qualitative and quantitative epidemiologic research, hypothesis-generating research, studies of novel mechanisms and methodologies including public health interventions, and integration of approaches across organs and disciplines. CTG also welcomes hypothesis-generating small studies, methods papers, and translational research with clear applications to human physiology or disease.
Colon and small bowel
Endoscopy and novel diagnostics
Esophagus
Functional GI disorders
Immunology of the GI tract
Microbiology of the GI tract
Inflammatory bowel disease
Pancreas and biliary tract
Liver
Pathology
Pediatrics
Preventative medicine
Nutrition/obesity
Stomach