EndoChat: Grounded multimodal large language model for endoscopic surgery

Guankun Wang, Long Bai, Junyi Wang, Kun Yuan, Zhen Li, Tianxu Jiang, Xiting He, Jinlin Wu, Zhen Chen, Zhen Lei, Hongbin Liu, Jiazheng Wang, Fan Zhang, Nicolas Padoy, Nassir Navab, Hongliang Ren

Medical Image Analysis, Volume 107, Article 103789 (published 2025-08-31). DOI: 10.1016/j.media.2025.103789
Abstract
Recently, Multimodal Large Language Models (MLLMs) have demonstrated immense potential in computer-aided diagnosis and decision-making. In the context of robotic-assisted surgery, MLLMs can serve as effective tools for surgical training and guidance. However, there remains a shortage of MLLMs specialized in surgical scene understanding for endoscopic procedures. To this end, we present EndoChat, an MLLM tailored to address various dialogue paradigms and subtasks in understanding endoscopic procedures. To train EndoChat, we construct the Surg-396K dataset through a novel pipeline that systematically extracts surgical information and generates structured annotations based on large-scale endoscopic surgery datasets. Furthermore, we introduce a multi-scale visual token interaction mechanism and a visual contrast-based reasoning mechanism to enhance the model's representation learning and reasoning capabilities. Our model achieves state-of-the-art performance across five dialogue paradigms and seven surgical scene understanding tasks. Additionally, we conduct evaluations with professional surgeons, who provide positive feedback on the majority of conversation cases generated by EndoChat. Overall, these results demonstrate that EndoChat has the potential to advance training and automation in robotic-assisted surgery. Our dataset and model are publicly available at https://github.com/gkw0010/EndoChat.
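The abstract names a multi-scale visual token interaction mechanism but does not specify its implementation; see the paper and repository for the actual method. Purely as a hedged illustration of what such a mechanism could look like, the sketch below pools a ViT-style grid of visual tokens at several granularities and lets the fine-scale tokens attend across all scales. Every class name, shape, and design choice here is an assumption, not EndoChat's published code.

```python
# Illustrative sketch only: this is NOT EndoChat's implementation.
# Assumes square patch-token grids and mean pooling to build coarser scales.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleTokenInteraction(nn.Module):
    """Pools visual tokens at several scales; fine tokens attend to all scales."""
    def __init__(self, dim=768, scales=(1, 2, 4), num_heads=8):
        super().__init__()
        self.scales = scales
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):  # tokens: (B, N, D), N = H*W patch tokens
        B, N, D = tokens.shape
        side = int(N ** 0.5)  # assume a square token grid
        grid = tokens.transpose(1, 2).reshape(B, D, side, side)
        pooled = []
        for s in self.scales:
            p = F.adaptive_avg_pool2d(grid, side // s)  # coarser token map
            pooled.append(p.flatten(2).transpose(1, 2))
        multi = torch.cat(pooled, dim=1)  # tokens from all scales, concatenated
        out, _ = self.attn(tokens, multi, multi)  # queries = fine-scale tokens
        return self.norm(tokens + out)  # residual + layer norm

# Example: 24x24 = 576 patch tokens of width 768
x = torch.randn(2, 576, 768)
print(MultiScaleTokenInteraction()(x).shape)  # torch.Size([2, 576, 768])
```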
About the Journal
Medical Image Analysis serves as a platform for sharing new research findings in the realm of medical and biological image analysis, with a focus on applications of computer vision, virtual reality, and robotics to biomedical imaging challenges. The journal prioritizes the publication of high-quality, original papers contributing to the fundamental science of processing, analyzing, and utilizing medical and biological images. It welcomes approaches utilizing biomedical image datasets across all spatial scales, from molecular/cellular imaging to tissue/organ imaging.