Text-driven adaptation of foundation models for few-shot surgical workflow analysis.

IF 2.3 | Zone 3 (Medicine) | JCR Q3: ENGINEERING, BIOMEDICAL

International Journal of Computer Assisted Radiology and Surgery, Pub Date: 2025-04-17, DOI: 10.1007/s11548-025-03341-0

Tingxuan Chen, Kun Yuan, Vinkle Srivastav, Nassir Navab, Nicolas Padoy

Citations: 0

Abstract

Purpose: Surgical workflow analysis is crucial for improving surgical efficiency and safety. However, previous studies rely heavily on large-scale annotated datasets, posing challenges in cost, scalability, and reliance on expert annotations. To address this, we propose Surg-FTDA (Few-shot Text-driven Adaptation), designed to handle various surgical workflow analysis tasks with minimal paired image-label data.

Methods: Our approach has two key components. First, few-shot selection-based modality alignment selects a small subset of images and aligns their embeddings with text embeddings from the downstream task, bridging the modality gap. Second, text-driven adaptation leverages only text data to train a decoder, eliminating the need for paired image-text data. This decoder is then applied to aligned image embeddings, enabling image-related tasks without explicit image-text pairs.
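
As a rough illustration of the two components, the following Python sketch runs them on synthetic embeddings. The abstract does not specify the selection strategy, the alignment function, or the decoder, so random per-class selection, a least-squares affine map, and a nearest-centroid decoder are stand-ins chosen here for brevity. All names (`prototypes`, `align`, `decode`, `shots_per_class`) are hypothetical and do not come from the Surg-FTDA code.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Synthetic stand-ins for foundation-model embeddings (hypothetical data) ---
n_classes, dim, n_per_class = 3, 16, 200
prototypes = rng.normal(size=(n_classes, dim))  # one concept per class
labels = np.repeat(np.arange(n_classes), n_per_class)


def noisy(points):
    """Add small isotropic noise to a set of embeddings."""
    return points + 0.1 * rng.normal(size=points.shape)


# Text embeddings of the downstream-task labels or captions.
text_emb = noisy(prototypes[labels])

# Image embeddings sit in a rotated, shifted copy of the text space;
# this rotation plus offset plays the role of the modality gap.
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))
offset = rng.normal(size=dim)
image_emb = noisy(prototypes[labels]) @ rotation + offset

# --- Step 1: few-shot selection-based modality alignment (assumed form) ---
shots_per_class = 10  # assumed budget of labelled images per class
few_idx = np.concatenate([
    rng.choice(np.flatnonzero(labels == c), size=shots_per_class, replace=False)
    for c in range(n_classes)
])

# Text-side class centroids, computed from text embeddings only.
centroids = np.stack([text_emb[labels == c].mean(axis=0) for c in range(n_classes)])


def add_bias(x):
    """Append a constant column so the fitted map is affine."""
    return np.hstack([x, np.ones((x.shape[0], 1))])


# Least-squares affine map from the selected image embeddings to the
# text centroid of each selected image's label.
W, *_ = np.linalg.lstsq(
    add_bias(image_emb[few_idx]), centroids[labels[few_idx]], rcond=None
)


def align(x):
    """Project image embeddings into the text embedding space."""
    return add_bias(x) @ W


# --- Step 2: text-driven adaptation (assumed form) ---
def decode(emb):
    """Nearest-centroid decoder; the centroids were fitted on text alone."""
    dists = np.linalg.norm(emb[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)


# Evaluate on images that were not used for alignment.
held_out = np.setdiff1d(np.arange(labels.size), few_idx)
raw_acc = (decode(image_emb[held_out]) == labels[held_out]).mean()
aligned_acc = (decode(align(image_emb[held_out])) == labels[held_out]).mean()
print(f"raw image embeddings:     {raw_acc:.2f}")
print(f"aligned image embeddings: {aligned_acc:.2f}")
```

In this toy setup the decoder never sees an image while it is being fitted; the alignment step is the only place where image-label pairs are used, which mirrors the division of labour described in the Methods paragraph.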

Results: We evaluate our approach on generative tasks (image captioning) and discriminative tasks (triplet recognition and phase recognition). Results show that Surg-FTDA outperforms baselines and generalizes well across downstream tasks.

Conclusion: We propose a text-driven adaptation approach that mitigates the modality gap and handles multiple downstream tasks in surgical workflow analysis, with minimal reliance on large annotated datasets. The code and dataset will be released at https://github.com/CAMMApublic/Surg-FTDA.

Source journal

International Journal of Computer Assisted Radiology and Surgery (ENGINEERING, BIOMEDICAL; RADIOLOGY, NUCLEAR MEDICINE & MEDICAL IMAGING)

CiteScore: 5.90
Self-citation rate: 6.70%
Articles published: 243
Review time: 6-12 weeks

About the journal: The International Journal for Computer Assisted Radiology and Surgery (IJCARS) is a peer-reviewed journal that provides a platform for closing the gap between medical and technical disciplines, and encourages interdisciplinary research and development activities in an international environment.