Evaluating ‘Pair’: A Generative AI Chatbot for Standardizing Radiographic Protocols

Impact Factor 2.0 · Q3 · Radiology, Nuclear Medicine & Medical Imaging
Tan Eugene, Crystal Chin Jing, Celine Tan Ying Yi
{"title":"Evaluating ‘Pair’: A Generative AI Chatbot for Standardizing Radiographic Protocols","authors":"Tan Eugene,&nbsp;Crystal Chin Jing,&nbsp;Celine Tan Ying Yi","doi":"10.1016/j.jmir.2025.102053","DOIUrl":null,"url":null,"abstract":"<div><h3>Aim</h3><div>The “Pair” chatbot, introduced as a government AI assistant, marks a significant step forward in leveraging large language models (LLMs) in healthcare. This innovation has driven the exploration of generative AI (GenAI) to tackle challenges in radiography, such as fragmented information-sharing and an over-reliance on senior radiographers for clarifications. Discrepancies in protocol interpretation among senior radiographers further hinder the standardization of imaging procedures.</div><div>To address these challenges, the ““Pair”” chatbot is being piloted as a potential solution. However, deploying GenAI in high-stakes fields like radiography involves risks, as inaccurate guidance could jeopardize patient safety. Many GenAI models generate plausible yet potentially incorrect answers, underscoring the importance of rigorous validation and evaluation before clinical implementation.</div><div>This study seeks to rigorously evaluate the chatbot's performance in terms of accuracy, appropriateness, and consistency by analyzing its responses across various radiographic scenarios. The evaluation involves comparing its outputs to established expert consensus and assessing consistency across different query formulations. The ultimate goal is to ensure the chatbot delivers accurate, relevant, and contextually appropriate responses that align with clinical standards.</div></div><div><h3>Methods</h3><div>A dataset of 100 clinical questions, covering image acquisition, patient positioning, and protocol adherence, was developed to represent real-world radiographic scenarios. The chatbot’s performance was evaluated using three key metrics: accuracy, appropriateness, and semantic consistency. Accuracy was measured using the F1 score, which balances precision and recall. Appropriateness was assessed through the Intraclass Correlation Coefficient (ICC) and Fleiss' Kappa, evaluating consistency and inter-rater reliability. Semantic consistency examined the chatbot’s ability to provide consistent answers across rephrased questions, ensuring its adaptability to various question formulations in clinical practice.</div></div><div><h3>Results</h3><div>The chatbot's performance reflects a substantial level of accuracy, with an F1 score of 0.7013, although recall could be further optimized to address all aspects of the queries. The Intraclass Correlation Coefficient (ICC) of 0.5075 indicates moderate inter-rater reliability, while Fleiss' Kappa of 0.2424 suggests fair agreement among raters, highlighting the challenges in defining universal standards for radiographic practice. Notably, the chatbot demonstrated high semantic consistency, achieving 90.37%, which underscores its ability to provide consistent responses despite variations in question phrasing.</div></div><div><h3>Conclusion</h3><div>The evaluation of the chatbot’s performance reveals both strengths and areas for improvement. It demonstrated strong consistency, with a semantic consistency score of 90.37%, and was generally accurate, achieving an F1 score of 0.7013, making it a valuable tool for standardizing radiographic practices. However, there is room for improvement in capturing all relevant details, as indicated by the slightly lower recall and moderate F1 score. 
Additionally, variability in inter-rater reliability highlights the challenges of ensuring the chatbot meets diverse clinical criteria, pointing to the need for enhanced contextual appropriateness.</div><div>The variability in expert assessments further underscores the necessity for standardized evaluation criteria, particularly in specialized fields like radiography, to ensure consistent and reliable AI assessments. Despite these challenges, the chatbot’s ability to provide consistent and accurate guidance positions it as a promising tool for reducing reliance on senior radiographers and streamlining decision-making.</div><div>Future studies should focus on enhancing the chatbot’s recall, refining its contextual appropriateness in clinical settings, and ensuring alignment with ethical and safety standards. There is also potential for ““Pair”” to serve as an educational tool for training new staff and supporting clinical decision-making in high-pressure scenarios.</div></div>","PeriodicalId":46420,"journal":{"name":"Journal of Medical Imaging and Radiation Sciences","volume":"56 2","pages":"Article 102053"},"PeriodicalIF":2.0000,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Medical Imaging and Radiation Sciences","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1939865425002024","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}

Abstract

Aim

The “Pair” chatbot, introduced as a government AI assistant, marks a significant step forward in leveraging large language models (LLMs) in healthcare. This innovation has driven the exploration of generative AI (GenAI) to tackle challenges in radiography, such as fragmented information-sharing and an over-reliance on senior radiographers for clarifications. Discrepancies in protocol interpretation among senior radiographers further hinder the standardization of imaging procedures.
To address these challenges, the "Pair" chatbot is being piloted as a potential solution. However, deploying GenAI in high-stakes fields like radiography involves risk, as inaccurate guidance could jeopardize patient safety. Many GenAI models generate plausible yet potentially incorrect answers, underscoring the importance of rigorous validation and evaluation before clinical implementation.
This study seeks to rigorously evaluate the chatbot's performance in terms of accuracy, appropriateness, and consistency by analyzing its responses across various radiographic scenarios. The evaluation involves comparing its outputs to established expert consensus and assessing consistency across different query formulations. The ultimate goal is to ensure the chatbot delivers accurate, relevant, and contextually appropriate responses that align with clinical standards.

Methods

A dataset of 100 clinical questions, covering image acquisition, patient positioning, and protocol adherence, was developed to represent real-world radiographic scenarios. The chatbot’s performance was evaluated using three key metrics: accuracy, appropriateness, and semantic consistency. Accuracy was measured using the F1 score, which balances precision and recall. Appropriateness was assessed through the Intraclass Correlation Coefficient (ICC) and Fleiss' Kappa, evaluating consistency and inter-rater reliability. Semantic consistency examined the chatbot’s ability to provide consistent answers across rephrased questions, ensuring its adaptability to various question formulations in clinical practice.
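The abstract does not publish the scoring pipeline, but the three metrics named above can be illustrated with standard Python tooling. The sketch below is an assumption-laden illustration rather than the authors' code: token-overlap F1 against expert reference answers, ICC computed with the pingouin package, and Fleiss' Kappa computed with statsmodels. All answers and rater scores shown are hypothetical.

```python
# Minimal sketch of the three evaluation metrics; not the study's actual pipeline.
from collections import Counter

import pandas as pd
import pingouin as pg
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa


def token_f1(prediction: str, reference: str) -> float:
    """Rough token-overlap F1 between a chatbot answer and an expert reference answer."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)   # fraction of the answer that matches the reference
    recall = overlap / len(ref_tokens)       # fraction of the reference covered by the answer
    return 2 * precision * recall / (precision + recall)


# Accuracy: mean token-level F1 over the question set (answers here are hypothetical).
predictions = ["Use a horizontal beam lateral projection for suspected knee effusion."]
references = ["A horizontal beam lateral projection is performed when knee effusion is suspected."]
mean_f1 = sum(token_f1(p, r) for p, r in zip(predictions, references)) / len(predictions)

# Appropriateness: each expert rates each response, e.g. on a 1-5 scale (hypothetical scores).
ratings = pd.DataFrame({
    "question": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5],
    "rater":    ["A", "B", "C"] * 5,
    "score":    [4, 5, 4, 3, 4, 2, 5, 5, 4, 2, 3, 3, 4, 4, 5],
})

# Intraclass Correlation Coefficient: reliability of the ordinal appropriateness scores.
icc = pg.intraclass_corr(data=ratings, targets="question", raters="rater", ratings="score")

# Fleiss' Kappa: chance-corrected agreement on a subjects-by-categories count table.
wide = ratings.pivot(index="question", columns="rater", values="score").to_numpy()
counts, _ = aggregate_raters(wide)
kappa = fleiss_kappa(counts, method="fleiss")

print(f"Mean F1: {mean_f1:.4f}")
print(f"Fleiss' Kappa: {kappa:.4f}")
print(icc[["Type", "ICC"]])
```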

Results

The chatbot's performance reflects a substantial level of accuracy, with an F1 score of 0.7013, although recall could be further optimized to address all aspects of the queries. The Intraclass Correlation Coefficient (ICC) of 0.5075 indicates moderate inter-rater reliability, while Fleiss' Kappa of 0.2424 suggests fair agreement among raters, highlighting the challenges in defining universal standards for radiographic practice. Notably, the chatbot demonstrated high semantic consistency, achieving 90.37%, which underscores its ability to provide consistent responses despite variations in question phrasing.
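The 90.37% consistency figure is reported without a computation method. One common way such a semantic-consistency score can be derived, sketched below purely as an assumption, is to embed the chatbot's answers to rephrased versions of the same question and average their pairwise cosine similarity; the embedding model and the example responses are hypothetical.

```python
# Hedged illustration only: the abstract does not define how semantic consistency was scored.
from itertools import combinations

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

# Hypothetical responses to three rephrasings of the same protocol question.
responses = [
    "Use a horizontal beam lateral projection for a suspected knee effusion.",
    "A horizontal beam lateral is recommended when a knee effusion is suspected.",
    "For suspected effusion of the knee, perform a cross-table lateral view.",
]

embeddings = model.encode(responses, convert_to_tensor=True)

# Mean pairwise cosine similarity across rephrasings, expressed as a percentage.
pairs = list(combinations(range(len(responses)), 2))
consistency = sum(float(util.cos_sim(embeddings[i], embeddings[j]))
                  for i, j in pairs) / len(pairs) * 100
print(f"Semantic consistency: {consistency:.2f}%")
```

Under this reading, a score near 90% would mean the answers to paraphrased queries remain essentially interchangeable in meaning even when their wording differs.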

Conclusion

The evaluation of the chatbot’s performance reveals both strengths and areas for improvement. It demonstrated strong consistency, with a semantic consistency score of 90.37%, and was generally accurate, achieving an F1 score of 0.7013, making it a valuable tool for standardizing radiographic practices. However, there is room for improvement in capturing all relevant details, as indicated by the slightly lower recall and moderate F1 score. Additionally, variability in inter-rater reliability highlights the challenges of ensuring the chatbot meets diverse clinical criteria, pointing to the need for enhanced contextual appropriateness.
The variability in expert assessments further underscores the necessity for standardized evaluation criteria, particularly in specialized fields like radiography, to ensure consistent and reliable AI assessments. Despite these challenges, the chatbot’s ability to provide consistent and accurate guidance positions it as a promising tool for reducing reliance on senior radiographers and streamlining decision-making.
Future studies should focus on enhancing the chatbot's recall, refining its contextual appropriateness in clinical settings, and ensuring alignment with ethical and safety standards. There is also potential for "Pair" to serve as an educational tool for training new staff and supporting clinical decision-making in high-pressure scenarios.
Source journal
Journal of Medical Imaging and Radiation Sciences (Radiology, Nuclear Medicine & Medical Imaging)
CiteScore: 2.30
Self-citation rate: 11.10%
Articles published: 231
Average review time: 53 days
About the journal: Journal of Medical Imaging and Radiation Sciences is the official peer-reviewed journal of the Canadian Association of Medical Radiation Technologists. The journal is published four times a year and is circulated to approximately 11,000 medical radiation technologists, libraries and radiology departments throughout Canada, the United States and overseas. It publishes articles on recent research, new technology and techniques, professional practices, technologists' viewpoints, as well as relevant book reviews.