Visual-language artificial intelligence system for knee radiograph diagnosis and interpretation: a collaborative system with humans.

Radiology advances. Pub Date: 2025-08-07. eCollection Date: 2025-09-01. DOI: 10.1093/radadv/umaf027

Xingxin He, Zachary E Stewart, Nikitha Crasta, Varun Nukala, Albert Jang, Zhaoye Zhou, Richard Kijowski, Li Feng, Wei Peng, Rianne A van der Heijden, Kenneth S Lee, Shasha Li, Miho J Tanaka, Fang Liu

Radiology advances, vol. 2, no. 5, article umaf027. PMC full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12483153/pdf/


Abstract

Background: Large language models (LLMs) have shown promising abilities in text-based clinical tasks, but they do not inherently interpret medical images such as knee radiographs.

Purpose: To develop a human-artificial intelligence interactive diagnostic approach, named radiology generative pretrained transformer (RadGPT), aimed at assisting and synergizing with human users for the interpretation of knee radiological images.

Materials and methods: A total of 22 512 knee roentgen ray images and reports were retrieved from Massachusetts General Hospital; 80% were used for model training, and 10% each for model testing and validation. Fifteen diagnostic imaging features (eg, osteoarthritis, effusion, joint space narrowing, osteophyte) were selected to label the images, based on their high frequency and clinical relevance in the retrieved official reports. An area under the curve (AUC) score was calculated for each feature to assess diagnostic performance. To evaluate the quality of the generated medical text, historical clinical reports were used as the reference text, and several text-generation metrics were applied: BiLingual Evaluation Understudy (BLEU), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), Metric for Evaluation of Translation with Explicit Ordering (METEOR), and Semantic Propositional Image Caption Evaluation (SPICE).
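The data split and per-feature scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names (`split_80_10_10`, `auc`, `per_feature_auc`), the random shuffling, and the seed are all assumptions, and the abstract does not say how cases were assigned to each subset. The AUC is computed from its pairwise definition (the probability that a randomly chosen positive case scores higher than a randomly chosen negative one, with ties counted as half).

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle and split into train/validation/test subsets (80%/10%/10%).

    Hypothetical helper: the source only states the proportions.
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(0.8 * len(shuffled))
    n_val = int(0.1 * len(shuffled))
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def per_feature_auc(label_rows, score_rows, feature_names):
    """Compute one AUC per feature column of a multi-label table."""
    return {
        name: auc([row[j] for row in label_rows], [row[j] for row in score_rows])
        for j, name in enumerate(feature_names)
    }
```

With 15 features, `per_feature_auc` would return 15 values, one per diagnostic label. The quadratic pairwise loop is chosen for clarity; a rank-based formulation gives the same number faster on large test sets.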

Results: RadGPT, in collaboration with human users, achieved area under the curve scores ranging from 0.76 for osteonecrosis to 0.91 for arthroplasty across 15 diagnostic categories for knee conditions. For report generation, RadGPT scored 0.18 in BiLingual Evaluation Understudy, 0.30 in Recall-Oriented Understudy for Gisting Evaluation-L, 0.10 in Metric for Evaluation of Translation with Explicit Ordering, and 0.15 in Semantic Propositional Image Caption Evaluation. These scores were significantly higher than those of the baseline LLM method and indicate good linguistic overlap and clinical consistency with the reference reports.
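To give a sense of what a Recall-Oriented Understudy for Gisting Evaluation-L score measures, the sketch below computes an F1 variant of it from the longest common subsequence (LCS) of tokens shared by a reference report and a generated report. It is only an illustration under stated assumptions: whitespace tokenization, lowercasing, and the F1 weighting are choices made here, and the abstract does not say which tokenizer, weighting, or software the authors used. The function names and the example sentences are hypothetical.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """F1 form of an LCS-based overlap score between two texts."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    recall = lcs / len(ref)       # share of the reference that is covered
    precision = lcs / len(cand)   # share of the candidate that matches
    return 2 * precision * recall / (precision + recall)


# Made-up sentences for illustration only, not taken from the study data.
reference_report = "mild joint space narrowing with small effusion"
generated_report = "mild joint space narrowing without effusion"
score = rouge_l_f1(reference_report, generated_report)
```

Because the score rewards only in-order token overlap, two reports can describe the same finding in different words and still score low, which is one reason such metrics are usually read alongside clinically oriented measures.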

Conclusion: RadGPT has achieved advanced results in knee roentgen ray image feature recognition, illustrating the potential of LLMs in medical image interpretation. The study establishes a training protocol for developing artificial intelligence-assisted tools specifically focusing on the diagnosis and interpretation of knee radiological images.


