Developing artificial intelligence tools for institutional review board pre-review: A pilot study on ChatGPT's accuracy and reproducibility.

PLOS Digital Health (IF 7.7)
Pub Date: 2025-06-30 · eCollection Date: 2025-06-01 · DOI: 10.1371/journal.pdig.0000695
Yasuko Fukataki, Wakako Hayashi, Naoki Nishimoto, Yoichi M Ito

Abstract

This pilot study is the first phase of a broader project aimed at developing an explainable artificial intelligence (AI) tool to support the ethical evaluation of Japanese-language clinical research documents. The tool is explicitly not intended to assist document drafting. We assessed the baseline performance of two generative AI models, Generative Pre-trained Transformer (GPT)-4 and GPT-4o, in analyzing clinical research protocols and informed consent forms (ICFs). The goal was to determine whether these models could accurately and consistently extract ethically relevant information, including the research objectives and background, the research design, and participant-related risks and benefits. First, we compared the performance of GPT-4 and GPT-4o using custom agents developed via OpenAI's Custom GPT functionality (hereafter "GPTs"). Then, using GPT-4o alone, we compared outputs generated by GPTs optimized with customized Japanese prompts against outputs generated with standard prompts. GPT-4o achieved 80% agreement in extracting research objectives and background and 100% agreement in extracting research design, and both models demonstrated high reproducibility across ten trials. GPTs with customized prompts produced more accurate and consistent outputs than standard prompts. This study suggests the potential utility of generative AI in pre-institutional review board (IRB) review tasks and provides foundational data for future validation and standardization efforts involving retrieval-augmented generation and fine-tuning. Importantly, the tool is intended not to automate ethical review but to support IRB decision-making. Limitations include the absence of gold-standard reference data, reliance on a single evaluator, the lack of convergence and inter-rater reliability analyses, and the inability of AI to substitute for in-person elements such as site visits.
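The abstract reports percent agreement (80% for objectives and background, 100% for research design) and reproducibility across ten repeated trials. A minimal sketch of how such metrics might be computed is shown below; the binary match/no-match scoring per trial and the modal-output consistency measure are assumptions for illustration, since the paper's exact scoring rubric is not given in the abstract.

```python
from collections import Counter

def percent_agreement(judgments: list[bool]) -> float:
    """Fraction of trials whose extracted output was judged to match
    the expected content (True = agreement with the evaluator)."""
    return sum(judgments) / len(judgments)

def reproducibility(outputs: list[str]) -> float:
    """Share of trials producing the modal (most frequent) output --
    a simple consistency measure across repeated runs of one prompt."""
    modal_count = Counter(outputs).most_common(1)[0][1]
    return modal_count / len(outputs)

# Hypothetical judgments over ten trials of one extraction task:
objectives_judgments = [True] * 8 + [False] * 2   # 80% agreement
design_judgments = [True] * 10                    # 100% agreement

print(percent_agreement(objectives_judgments))  # 0.8
print(percent_agreement(design_judgments))      # 1.0
```

Under this reading, "high reproducibility" would correspond to the same (or equivalent) output appearing in most or all of the ten trials for a given prompt.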
