A Preliminary Checklist (METRICS) to Standardize the Design and Reporting of Studies on Generative Artificial Intelligence-Based Models in Health Care Education and Practice: Development Study Involving a Literature Review.

IF 1.9 Q3 MEDICINE, RESEARCH & EXPERIMENTAL

Interactive Journal of Medical Research Pub Date : 2024-02-15 DOI:10.2196/54704

Malik Sallam, Muna Barakat, Mohammed Sallam


Citation count: 0

Abstract

Background: Adherence to evidence-based practice is indispensable in health care. Recently, the utility of generative artificial intelligence (AI) models in health care has been evaluated extensively. However, the lack of consensus guidelines on the design and reporting of findings of these studies poses a challenge for the interpretation and synthesis of evidence.

Objective: This study aimed to develop a preliminary checklist to standardize the reporting of generative AI-based studies in health care education and practice.

Methods: A literature review was conducted in Scopus, PubMed, and Google Scholar. Published records with "ChatGPT," "Bing," or "Bard" in the title were retrieved. Careful examination of the methodologies employed in the included records was conducted to identify the common pertinent themes and the possible gaps in reporting. A panel discussion was held to establish a unified and thorough checklist for the reporting of AI studies in health care. The finalized checklist was used to evaluate the included records by 2 independent raters. Cohen κ was used as the method to evaluate the interrater reliability.
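For readers unfamiliar with the interrater statistic named in the Methods, the following is a minimal sketch of unweighted Cohen κ for two raters. The abstract does not say whether the authors used the unweighted or a weighted variant, so the unweighted form is an assumption here. The function name `cohen_kappa` and the example ratings are hypothetical and are not taken from the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters' categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement, computed from each rater's marginal label counts.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is
        # undefined (0/0), and this sketch reports full agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical scores from two raters on one checklist item across 10 records.
rater_1 = [5, 4, 4, 3, 2, 5, 1, 3, 4, 2]
rater_2 = [5, 4, 3, 3, 2, 5, 1, 3, 4, 1]
kappa = cohen_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.3f}")
```

In this made-up example the raters agree on 8 of 10 records (observed agreement 0.8) and the chance agreement is 0.2, so κ = (0.8 − 0.2) / (1 − 0.2) = 0.75. For ordinal scores, a weighted κ that gives partial credit to near-misses may be preferable.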

Results: The final data set that formed the basis for pertinent theme identification and analysis comprised a total of 34 records. The finalized checklist included 9 pertinent themes collectively referred to as METRICS (Model, Evaluation, Timing, Range/Randomization, Individual factors, Count, and Specificity of prompts and language). Their details are as follows: (1) Model used and its exact settings; (2) Evaluation approach for the generated content; (3) Timing of testing the model; (4) Transparency of the data source; (5) Range of tested topics; (6) Randomization of selecting the queries; (7) Individual factors in selecting the queries and interrater reliability; (8) Count of queries executed to test the model; and (9) Specificity of the prompts and language used. The overall mean METRICS score was 3.0 (SD 0.58). The tested METRICS score was acceptable, with the range of Cohen κ of 0.558 to 0.962 (P<.001 for the 9 tested items). With classification per item, the highest average METRICS score was recorded for the "Model" item, followed by the "Specificity" item, while the lowest scores were recorded for the "Randomization" item (classified as suboptimal) and "Individual factors" item (classified as satisfactory).
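As an illustration of how the nine items could be applied to a set of records, the sketch below scores each record as the mean of its item scores and then summarizes across records. The abstract does not state the scoring scale or how the overall score was aggregated. The 1 to 5 scale, the use of the sample SD, the helper name `record_score`, and all example scores are assumptions made for illustration only.

```python
from statistics import mean, stdev

# The nine items listed in the abstract.
METRICS_ITEMS = (
    "Model", "Evaluation", "Timing", "Transparency", "Range",
    "Randomization", "Individual factors", "Count", "Specificity",
)


def record_score(item_scores):
    """Mean item score for one record (assumed scale: 1 = worst, 5 = best)."""
    missing = [item for item in METRICS_ITEMS if item not in item_scores]
    if missing:
        raise ValueError(f"missing items: {missing}")
    for item in METRICS_ITEMS:
        if not 1 <= item_scores[item] <= 5:
            raise ValueError(f"score for {item!r} is outside the assumed 1-5 scale")
    return mean(item_scores[item] for item in METRICS_ITEMS)


# Hypothetical ratings for three records; these are not data from the study.
record_a = {item: 3 for item in METRICS_ITEMS}
record_b = dict(zip(METRICS_ITEMS, (5, 4, 4, 3, 4, 1, 2, 4, 5)))
record_c = {item: 2 for item in METRICS_ITEMS}

scores = [record_score(r) for r in (record_a, record_b, record_c)]
overall_mean = mean(scores)
overall_sd = stdev(scores)  # sample SD; the abstract does not say which SD was used
print(f"overall mean {overall_mean:.2f} (SD {overall_sd:.2f})")
```

Requiring all nine items before a score is computed mirrors the checklist's purpose: a record that omits an item should be flagged, not silently averaged over the remaining items.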

Conclusions: The METRICS checklist can facilitate the design of studies guiding researchers toward best practices in reporting results. The findings highlight the need for standardized reporting algorithms for generative AI-based studies in health care, considering the variability observed in methodologies and reporting. The proposed METRICS checklist could be a preliminary helpful base to establish a universally accepted approach to standardize the design and reporting of generative AI-based studies in health care, which is a swiftly evolving research topic.


Source journal

Interactive Journal of Medical Research (MEDICINE, RESEARCH & EXPERIMENTAL)

Self-citation rate

0.00%

Review time

12 weeks