LoViT: Long Video Transformer for surgical phase recognition

IF 10.7; Zone 1 (Medicine); JCR Q1: Computer Science, Artificial Intelligence

Medical Image Analysis. Pub Date: 2024-10-05. DOI: 10.1016/j.media.2024.103366

Yang Liu, Maxence Boels, Luis C. Garcia-Peraza-Herrera, Tom Vercauteren, Prokar Dasgupta, Alejandro Granados, Sébastien Ourselin

Volume 99, Article 103366. Publisher page: https://www.sciencedirect.com/science/article/pii/S1361841524002913

Abstract

Online surgical phase recognition plays a significant role in building contextual tools that could quantify performance and oversee the execution of surgical workflows. Current approaches are limited in two ways. First, they train spatial feature extractors using frame-level supervision, which can lead to incorrect predictions because similar frames appear in different phases. Second, they fuse local and global features poorly because of computational constraints, which can affect the analysis of the long videos commonly encountered in surgical interventions. In this paper, we present a two-stage method, called Long Video Transformer (LoViT), that emphasizes the development of a temporally-rich spatial feature extractor and a phase transition map. The temporally-rich spatial feature extractor is designed to capture critical temporal information within the surgical video frames. The phase transition map provides essential insights into the dynamic transitions between different surgical phases. LoViT combines these innovations with a multi-scale temporal aggregator consisting of two cascaded L-Trans modules based on self-attention, followed by a G-Informer module based on ProbSparse self-attention for processing global temporal information. The multi-scale temporal head then leverages the temporally-rich spatial features and the phase transition map to classify surgical phases using phase transition-aware supervision. Our approach consistently outperforms state-of-the-art methods on the Cholec80 and AutoLaparo datasets. Compared to Trans-SVNet, LoViT achieves a 2.4 pp (percentage point) improvement in video-level accuracy on Cholec80 and a 3.1 pp improvement on AutoLaparo. Our results demonstrate the effectiveness of our approach in achieving state-of-the-art performance in surgical phase recognition on two datasets with different surgical procedures and temporal sequencing characteristics. The project page is available at https://github.com/MRUIL/LoViT.
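The abstract names ProbSparse self-attention as the basis of the G-Informer module but does not describe it. As a rough illustration only, the sketch below follows the general idea from the Informer literature: score each query by how far its largest attention logit sits above its mean logit, give full softmax attention to the top-u queries, and let the remaining queries fall back to the mean of the values. The function name `probsparse_attention`, the parameter `u`, and the simplifications (single head, no key sampling, no masking) are assumptions made for this example and are not taken from the LoViT code.

```python
import numpy as np

def probsparse_attention(Q, K, V, u):
    """Simplified single-head ProbSparse self-attention (illustrative sketch).

    Q, K: arrays of shape (L, d); V: array of shape (L, d_v).
    u: number of "active" queries that receive full attention.
    Assumption: no key sampling and no causal masking, unlike a full implementation.
    """
    L_q, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                 # scaled dot-product logits, (L_q, L_k)

    # Sparsity measure per query: max logit minus mean logit.
    sparsity = scores.max(axis=1) - scores.mean(axis=1)
    top = np.argsort(-sparsity)[:u]               # indices of the u most "active" queries

    # "Lazy" queries default to the mean of the values.
    out = np.tile(V.mean(axis=0), (L_q, 1))

    # Active queries get ordinary softmax attention (numerically stabilized).
    s = scores[top]
    s = s - s.max(axis=1, keepdims=True)
    w = np.exp(s)
    w /= w.sum(axis=1, keepdims=True)
    out[top] = w @ V
    return out, top

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
out, top = probsparse_attention(Q, K, V, u=4)
```

With `u` equal to the sequence length, this reduces to standard softmax attention; smaller `u` trades accuracy for fewer full attention rows, which is the property that makes this family of methods attractive for long sequences such as full-length surgical videos.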

Source journal: Medical Image Analysis (Engineering & Technology; Engineering: Biomedical)
CiteScore: 22.10
Self-citation rate: 6.40%
Articles published: 309
Review time: 6.6 months

Journal description: Medical Image Analysis serves as a platform for sharing new research findings in the realm of medical and biological image analysis, with a focus on applications of computer vision, virtual reality, and robotics to biomedical imaging challenges. The journal prioritizes the publication of high-quality, original papers contributing to the fundamental science of processing, analyzing, and utilizing medical and biological images. It welcomes approaches utilizing biomedical image datasets across all spatial scales, from molecular/cellular imaging to tissue/organ imaging.