{"title":"Evaluating ‘Pair’: A Generative AI Chatbot for Standardizing Radiographic Protocols","authors":"Tan Eugene, Crystal Chin Jing, Celine Tan Ying Yi","doi":"10.1016/j.jmir.2025.102053","DOIUrl":null,"url":null,"abstract":"<div><h3>Aim</h3><div>The “Pair” chatbot, introduced as a government AI assistant, marks a significant step forward in leveraging large language models (LLMs) in healthcare. This innovation has driven the exploration of generative AI (GenAI) to tackle challenges in radiography, such as fragmented information-sharing and an over-reliance on senior radiographers for clarifications. Discrepancies in protocol interpretation among senior radiographers further hinder the standardization of imaging procedures.</div><div>To address these challenges, the ““Pair”” chatbot is being piloted as a potential solution. However, deploying GenAI in high-stakes fields like radiography involves risks, as inaccurate guidance could jeopardize patient safety. Many GenAI models generate plausible yet potentially incorrect answers, underscoring the importance of rigorous validation and evaluation before clinical implementation.</div><div>This study seeks to rigorously evaluate the chatbot's performance in terms of accuracy, appropriateness, and consistency by analyzing its responses across various radiographic scenarios. The evaluation involves comparing its outputs to established expert consensus and assessing consistency across different query formulations. The ultimate goal is to ensure the chatbot delivers accurate, relevant, and contextually appropriate responses that align with clinical standards.</div></div><div><h3>Methods</h3><div>A dataset of 100 clinical questions, covering image acquisition, patient positioning, and protocol adherence, was developed to represent real-world radiographic scenarios. The chatbot’s performance was evaluated using three key metrics: accuracy, appropriateness, and semantic consistency. Accuracy was measured using the F1 score, which balances precision and recall. Appropriateness was assessed through the Intraclass Correlation Coefficient (ICC) and Fleiss' Kappa, evaluating consistency and inter-rater reliability. Semantic consistency examined the chatbot’s ability to provide consistent answers across rephrased questions, ensuring its adaptability to various question formulations in clinical practice.</div></div><div><h3>Results</h3><div>The chatbot's performance reflects a substantial level of accuracy, with an F1 score of 0.7013, although recall could be further optimized to address all aspects of the queries. The Intraclass Correlation Coefficient (ICC) of 0.5075 indicates moderate inter-rater reliability, while Fleiss' Kappa of 0.2424 suggests fair agreement among raters, highlighting the challenges in defining universal standards for radiographic practice. Notably, the chatbot demonstrated high semantic consistency, achieving 90.37%, which underscores its ability to provide consistent responses despite variations in question phrasing.</div></div><div><h3>Conclusion</h3><div>The evaluation of the chatbot’s performance reveals both strengths and areas for improvement. It demonstrated strong consistency, with a semantic consistency score of 90.37%, and was generally accurate, achieving an F1 score of 0.7013, making it a valuable tool for standardizing radiographic practices. However, there is room for improvement in capturing all relevant details, as indicated by the slightly lower recall and moderate F1 score. 
Additionally, variability in inter-rater reliability highlights the challenges of ensuring the chatbot meets diverse clinical criteria, pointing to the need for enhanced contextual appropriateness.</div><div>The variability in expert assessments further underscores the necessity for standardized evaluation criteria, particularly in specialized fields like radiography, to ensure consistent and reliable AI assessments. Despite these challenges, the chatbot’s ability to provide consistent and accurate guidance positions it as a promising tool for reducing reliance on senior radiographers and streamlining decision-making.</div><div>Future studies should focus on enhancing the chatbot’s recall, refining its contextual appropriateness in clinical settings, and ensuring alignment with ethical and safety standards. There is also potential for ““Pair”” to serve as an educational tool for training new staff and supporting clinical decision-making in high-pressure scenarios.</div></div>","PeriodicalId":46420,"journal":{"name":"Journal of Medical Imaging and Radiation Sciences","volume":"56 2","pages":"Article 102053"},"PeriodicalIF":2.0000,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Medical Imaging and Radiation Sciences","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1939865425002024","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING","Score":null,"Total":0}
Abstract
Aim
The “Pair” chatbot, introduced as a government AI assistant, marks a significant step forward in leveraging large language models (LLMs) in healthcare. This innovation has driven the exploration of generative AI (GenAI) to tackle challenges in radiography, such as fragmented information-sharing and an over-reliance on senior radiographers for clarifications. Discrepancies in protocol interpretation among senior radiographers further hinder the standardization of imaging procedures.
To address these challenges, the “Pair” chatbot is being piloted as a potential solution. However, deploying GenAI in high-stakes fields like radiography involves risks, as inaccurate guidance could jeopardize patient safety. Many GenAI models generate plausible yet potentially incorrect answers, underscoring the importance of rigorous validation and evaluation before clinical implementation.
This study seeks to rigorously evaluate the chatbot's performance in terms of accuracy, appropriateness, and consistency by analyzing its responses across various radiographic scenarios. The evaluation involves comparing its outputs to established expert consensus and assessing consistency across different query formulations. The ultimate goal is to ensure the chatbot delivers accurate, relevant, and contextually appropriate responses that align with clinical standards.
Methods
A dataset of 100 clinical questions, covering image acquisition, patient positioning, and protocol adherence, was developed to represent real-world radiographic scenarios. The chatbot’s performance was evaluated on three metrics: accuracy, appropriateness, and semantic consistency. Accuracy was measured with the F1 score, which balances precision and recall. Appropriateness was rated by expert reviewers, with agreement among raters quantified using the Intraclass Correlation Coefficient (ICC) and Fleiss' Kappa. Semantic consistency measured the chatbot’s ability to give equivalent answers to rephrased questions, reflecting its robustness to the varied question formulations encountered in clinical practice.
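The abstract does not report the exact scoring pipeline, so the following minimal Python sketch only illustrates how these metrics could be computed. The key-point scoring scheme, the 1–5 appropriateness scale, and all data values are assumptions introduced here for demonstration, not details from the study.

```python
# Illustrative sketch only: the key-point F1 scoring, the 1-5 appropriateness
# scale, and the toy ratings below are assumptions, not the study's pipeline.
import numpy as np
import pandas as pd
import pingouin as pg                                    # ICC
from statsmodels.stats.inter_rater import fleiss_kappa   # Fleiss' Kappa


def f1_from_key_points(expected: set, returned: set) -> float:
    """F1 over the key points a response should contain (hypothetical scoring)."""
    tp = len(expected & returned)
    precision = tp / len(returned) if returned else 0.0
    recall = tp / len(expected) if expected else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: key points defined by expert consensus vs. points in the chatbot's answer.
expected = [{"erect", "PA", "full inspiration"}, {"supine", "AP"}, {"lateral", "horizontal beam"}]
returned = [{"erect", "PA"}, {"supine", "AP", "grid"}, {"lateral"}]
f1_scores = [f1_from_key_points(e, r) for e, r in zip(expected, returned)]
print("Mean F1:", np.mean(f1_scores))

# Appropriateness ratings: rows = questions, columns = raters, values on a 1-5 scale.
ratings = np.array([[4, 5, 4], [3, 3, 4], [5, 4, 4], [2, 3, 2]])

# ICC expects long format: one row per (question, rater) pair.
long = pd.DataFrame({
    "question": np.repeat(np.arange(ratings.shape[0]), ratings.shape[1]),
    "rater": np.tile(np.arange(ratings.shape[1]), ratings.shape[0]),
    "score": ratings.ravel(),
})
icc = pg.intraclass_corr(data=long, targets="question", raters="rater", ratings="score")
print(icc[["Type", "ICC"]])

# Fleiss' Kappa expects a subjects-by-categories count table.
categories = np.arange(1, 6)
counts = np.array([[(row == c).sum() for c in categories] for row in ratings])
print("Fleiss' kappa:", fleiss_kappa(counts))
```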
Results
The chatbot's performance reflects a substantial level of accuracy, with an F1 score of 0.7013, although recall could be further optimized to address all aspects of the queries. The Intraclass Correlation Coefficient (ICC) of 0.5075 indicates moderate inter-rater reliability, while Fleiss' Kappa of 0.2424 suggests fair agreement among raters, highlighting the challenges in defining universal standards for radiographic practice. Notably, the chatbot demonstrated high semantic consistency, achieving 90.37%, which underscores its ability to provide consistent responses despite variations in question phrasing.
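The abstract does not define how the 90.37% semantic consistency figure was computed. One plausible operationalization, sketched below, embeds the answers to rephrased questions and counts the share whose cosine similarity to the answer for the original phrasing clears a threshold; the embedding model and the 0.85 cutoff are illustrative assumptions only.

```python
# Hypothetical operationalization of "semantic consistency"; model choice and
# threshold are assumptions, not values reported in the study.
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model


def consistency_rate(reference_answer: str, rephrased_answers: list, threshold: float = 0.85) -> float:
    """Fraction of answers to rephrased questions judged semantically consistent."""
    embeddings = model.encode([reference_answer] + rephrased_answers)
    sims = cosine_similarity(embeddings[:1], embeddings[1:])[0]
    return float((sims >= threshold).mean())


rate = consistency_rate(
    "Use an erect PA projection on full inspiration.",
    [
        "An erect posteroanterior view taken at full inspiration is recommended.",
        "Position the patient supine and expose on expiration.",
    ],
)
print(f"Semantic consistency: {rate:.2%}")
```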
Conclusion
The evaluation of the chatbot’s performance reveals both strengths and areas for improvement. It demonstrated strong consistency, with a semantic consistency score of 90.37%, and was generally accurate, achieving an F1 score of 0.7013, making it a valuable tool for standardizing radiographic practices. However, there is room for improvement in capturing all relevant details, as indicated by the slightly lower recall and moderate F1 score. Additionally, variability in inter-rater reliability highlights the challenges of ensuring the chatbot meets diverse clinical criteria, pointing to the need for enhanced contextual appropriateness.
The variability in expert assessments further underscores the necessity for standardized evaluation criteria, particularly in specialized fields like radiography, to ensure consistent and reliable AI assessments. Despite these challenges, the chatbot’s ability to provide consistent and accurate guidance positions it as a promising tool for reducing reliance on senior radiographers and streamlining decision-making.
Future studies should focus on enhancing the chatbot’s recall, refining its contextual appropriateness in clinical settings, and ensuring alignment with ethical and safety standards. There is also potential for “Pair” to serve as an educational tool for training new staff and supporting clinical decision-making in high-pressure scenarios.
Journal Description
Journal of Medical Imaging and Radiation Sciences is the official peer-reviewed journal of the Canadian Association of Medical Radiation Technologists. The journal is published four times a year and is circulated to approximately 11,000 medical radiation technologists, libraries, and radiology departments throughout Canada, the United States, and overseas. It publishes articles on recent research, new technology and techniques, professional practices, and technologists' viewpoints, as well as relevant book reviews.