MSBATN: Multi-Stage Boundary-Aware Transformer Network for action segmentation in untrimmed surgical videos

IF 3.5 · Zone 3, Computer Science · JCR Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding · Pub Date: 2025-10-06 · DOI: 10.1016/j.cviu.2025.104521

Rezowan Shuvo, M.S. Mekala, Eyad Elyan

Volume 261, Article 104521. Full text: https://www.sciencedirect.com/science/article/pii/S1077314225002449

Citations: 0

Abstract

Understanding actions within surgical workflows is critical for evaluating post-operative outcomes and for enhancing surgical training and efficiency. Capturing and analysing long sequences of actions in surgical settings is challenging because individual surgeons' approaches vary with their expertise and preferences. This variability complicates the identification and segmentation of distinct actions whose start and end boundaries are ambiguous. Traditional models such as MS-TCN, which rely on large receptive fields, tend to over-segment or under-segment, so that distinct actions are incorrectly aligned. To address these challenges, we propose the Multi-Stage Boundary-Aware Transformer Network (MSBATN), which uses hierarchical sliding-window attention to improve action segmentation. Our approach handles varying action durations and subtle transitions by accurately identifying the start and end boundaries of actions in untrimmed surgical videos. MSBATN introduces a novel unified loss function that optimises action classification and boundary detection as interconnected tasks. Unlike conventional binary boundary detection methods, our boundary weighting mechanism leverages contextual information to identify action boundaries precisely. Extensive experiments on three challenging surgical datasets show that MSBATN achieves state-of-the-art performance, with superior F1 scores at the 25% and 50% thresholds and competitive results on the other metrics.
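The abstract reports F1 scores at 25% and 50% thresholds but does not spell out the evaluation protocol. The sketch below assumes the segmental F1@k definition commonly used in the action-segmentation literature: a predicted segment counts as a true positive when its IoU with a same-label ground-truth segment reaches the threshold and that ground-truth segment has not already been matched. The function names `segments` and `segmental_f1` and the toy label sequences are illustrative, not taken from the paper.

```python
def segments(labels):
    """Collapse frame-wise labels into (label, start, end) runs; end is exclusive."""
    runs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            runs.append((labels[start], start, i))
            start = i
    return runs


def segmental_f1(pred, truth, iou_threshold):
    """Segmental F1 at one IoU threshold (assumed protocol, see lead-in)."""
    pred_segs, true_segs = segments(pred), segments(truth)
    matched = [False] * len(true_segs)
    tp = fp = 0
    for label, ps, pe in pred_segs:
        best_iou, best_j = 0.0, -1
        for j, (tl, ts, te) in enumerate(true_segs):
            if tl != label:
                continue
            inter = max(0, min(pe, te) - max(ps, ts))
            union = max(pe, te) - min(ps, ts)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= iou_threshold and not matched[best_j]:
            tp += 1
            matched[best_j] = True
        else:
            fp += 1  # poor overlap, or a duplicate hit on an already matched segment
    fn = matched.count(False)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy data: two ground-truth actions of 10 frames each.
truth = [0] * 10 + [1] * 10
shifted = [0] * 8 + [1] * 12                       # boundary off by two frames
fragmented = [0] * 4 + [1] * 2 + [0] * 4 + [1] * 10  # spurious 2-frame segment

for name, pred in [("shifted", shifted), ("fragmented", fragmented)]:
    print(name, [round(segmental_f1(pred, truth, k), 3) for k in (0.25, 0.5)])
```

In this toy case both predictions label 18 of 20 frames correctly, yet the fragmented one scores far lower, which is the over-segmentation penalty that segment-level metrics are meant to expose.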

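Source journal

Computer Vision and Image Understanding (Engineering & Technology: Engineering, Electrical & Electronic)

CiteScore: 7.80
Self-citation rate: 4.40%
Articles published: 112
Review time: 79 days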

About the journal: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views. Research areas include: • Theory • Early vision • Data structures and representations • Shape • Range • Motion • Matching and recognition • Architecture and languages • Vision systems