Leveraging Professional Radiologists' Expertise to Enhance LLMs' Evaluation for AI-generated Radiology Reports.

IEEE International Conference on Healthcare Informatics. Pub Date: 2024-06-01. Epub Date: 2024-08-22. DOI: 10.1109/ichi61247.2024.00058

Qingqing Zhu, Xiuying Chen, Qiao Jin, Benjamin Hou, Tejas Sudharshan Mathai, Pritam Mukherjee, Xin Gao, Ronald M Summers, Zhiyong Lu

Volume 2024, pages 402-411. Publication type: Journal Article. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11651630/pdf/


Abstract

In radiology, Artificial Intelligence (AI) has significantly advanced report generation, but automatic evaluation of these AI-produced reports remains challenging. Current metrics, such as conventional Natural Language Generation (NLG) metrics and Clinical Efficacy (CE) metrics, either fall short in capturing the semantic intricacies of clinical contexts or overemphasize clinical details, undermining report clarity. To overcome these issues, our proposed method combines the expertise of professional radiologists with Large Language Models (LLMs) such as GPT-3.5 and GPT-4. Using In-Context Instruction Learning (ICIL) and Chain of Thought (CoT) reasoning, our approach aligns LLM evaluations with radiologist standards, enabling detailed comparisons between human and AI-generated reports. This is further enhanced by a regression model that aggregates sentence evaluation scores. Experimental results show that our "Detailed GPT-4 (5-shot)" model achieves a correlation of 0.48, outperforming the METEOR metric by 0.19, while our "Regressed GPT-4" model shows even greater alignment (0.64) with expert evaluations, exceeding the best existing metric by a margin of 0.35. Moreover, the robustness of our explanations has been validated through a thorough iterative strategy. We plan to publicly release annotations from radiology experts, setting a new standard for accuracy in future assessments. This underscores the potential of our approach in enhancing the quality assessment of AI-driven medical reports.
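The regression step mentioned in the abstract can be pictured with a small sketch. The abstract does not specify the regression features, the fitting procedure, or which correlation coefficient is reported, so everything below is an assumption for illustration: the two per-report features (counts of sentences flagged with significant and insignificant errors), the synthetic data, ordinary least squares as the regressor, and Kendall's tau as the correlation are all hypothetical stand-ins, not the authors' implementation.

```python
from itertools import combinations

def fit_linear(X, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    rows = [[1.0] + list(r) for r in X]          # prepend intercept column
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(n):
        p = max(range(c, n), key=lambda k: abs(A[k][c]))
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for k in range(n):
            if k != c:
                f = A[k][c]
                A[k] = [vk - f * vc for vk, vc in zip(A[k], A[c])]
    return [A[i][n] for i in range(n)]           # [intercept, w1, w2, ...]

def predict(w, X):
    return [w[0] + sum(wi * xi for wi, xi in zip(w[1:], r)) for r in X]

def kendall_tau(a, b):
    """Kendall tau-a: (concordant - discordant pairs) / total pairs."""
    pairs = list(combinations(range(len(a)), 2))
    s = 0
    for i, j in pairs:
        prod = (a[i] - a[j]) * (b[i] - b[j])
        s += (prod > 0) - (prod < 0)
    return s / len(pairs)

# Synthetic, hypothetical data: one row per report,
# columns = [significant-error sentences, insignificant-error sentences]
# as judged sentence by sentence by an LLM evaluator.
sentence_counts = [[0, 1], [1, 0], [2, 1], [1, 3], [3, 2], [0, 0]]
# Hypothetical radiologist scores (constructed here as 2*sig + 0.5*insig).
expert_scores = [0.5, 2.0, 4.5, 3.5, 7.0, 0.0]

# Baseline aggregation: unweighted sum of sentence-level error counts.
summed = [sum(r) for r in sentence_counts]
tau_sum = kendall_tau(summed, expert_scores)

# Regression aggregation: learn a weight per error type from expert scores.
weights = fit_linear(sentence_counts, expert_scores)
regressed = predict(weights, sentence_counts)
tau_regressed = kendall_tau(regressed, expert_scores)

print("learned weights:", [round(w, 3) for w in weights])
print("tau (unweighted sum):", tau_sum)
print("tau (regressed):", tau_regressed)
```

In this toy setup the regression is fitted and scored on the same six reports, which overstates its benefit; a real evaluation would hold out reports for testing.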
