LoViT: Long Video Transformer for surgical phase recognition

IF 10.7 | Zone 1 (Medicine) | JCR Q1: Computer Science, Artificial Intelligence
Yang Liu, Maxence Boels, Luis C. Garcia-Peraza-Herrera, Tom Vercauteren, Prokar Dasgupta, Alejandro Granados, Sébastien Ourselin
Medical Image Analysis, vol. 99, Article 103366. Published 2024-10-05. DOI: 10.1016/j.media.2024.103366
https://www.sciencedirect.com/science/article/pii/S1361841524002913
Citation count: 0

Abstract

Online surgical phase recognition plays a significant role in building contextual tools that could quantify performance and oversee the execution of surgical workflows. Current approaches have two limitations. First, they train spatial feature extractors with frame-level supervision, which can lead to incorrect predictions because similar frames appear in different phases. Second, computational constraints cause them to fuse local and global features poorly, which can affect the analysis of the long videos commonly encountered in surgical interventions. In this paper, we present a two-stage method, called Long Video Transformer (LoViT), built around a temporally-rich spatial feature extractor and a phase transition map. The temporally-rich spatial feature extractor is designed to capture critical temporal information within the surgical video frames. The phase transition map describes the dynamic transitions between different surgical phases. LoViT combines these innovations with a multi-scale temporal aggregator consisting of two cascaded L-Trans modules based on self-attention, followed by a G-Informer module based on ProbSparse self-attention for processing global temporal information. The multi-scale temporal head then uses the temporally-rich spatial features and the phase transition map to classify surgical phases under phase transition-aware supervision. Our approach consistently outperforms state-of-the-art methods on the Cholec80 and AutoLaparo datasets. Compared to Trans-SVNet, LoViT improves video-level accuracy by 2.4 percentage points (pp) on Cholec80 and by 3.1 pp on AutoLaparo. These results show that our approach achieves state-of-the-art surgical phase recognition on two datasets that differ in surgical procedure and temporal sequencing characteristics. The project page is available at https://github.com/MRUIL/LoViT.
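The abstract does not say how the phase transition map is built. As a rough illustration only, the sketch below assumes one plausible construction: a per-frame heatmap that peaks where the ground-truth phase label changes and decays with a Gaussian profile on either side. The function name `phase_transition_map`, the `sigma` parameter, and the max-merge of overlapping bumps are all hypothetical choices and are not taken from the paper or its repository.

```python
import math


def phase_transition_map(labels, sigma=2.0):
    """Return one value in [0, 1] per frame, peaking at phase boundaries.

    Assumption (not from the abstract): each boundary contributes a Gaussian
    bump of width `sigma` frames, and overlapping bumps are merged with max().
    """
    # A boundary is any frame whose label differs from the previous frame's.
    boundaries = [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]]
    heat = [0.0] * len(labels)
    for t in range(len(labels)):
        for b in boundaries:
            bump = math.exp(-((t - b) ** 2) / (2 * sigma ** 2))
            heat[t] = max(heat[t], bump)
    return heat


# Toy video: 4 frames of phase 0, 4 frames of phase 1, 2 frames of phase 2.
labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
heat = phase_transition_map(labels, sigma=1.0)
print([round(h, 2) for h in heat])
```

A target of this kind could serve as an auxiliary regression signal alongside the per-frame phase labels, which is one way to read "phase transition-aware supervision". The actual loss and kernel shape used by LoViT should be taken from the paper or the project page.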
Source journal
Medical Image Analysis (Engineering and Technology, Engineering: Biomedical)
CiteScore: 22.10
Self-citation rate: 6.40%
Publication volume: 309
Review time: 6.6 months
Journal description: Medical Image Analysis serves as a platform for sharing new research findings in the realm of medical and biological image analysis, with a focus on applications of computer vision, virtual reality, and robotics to biomedical imaging challenges. The journal prioritizes the publication of high-quality, original papers contributing to the fundamental science of processing, analyzing, and utilizing medical and biological images. It welcomes approaches utilizing biomedical image datasets across all spatial scales, from molecular/cellular imaging to tissue/organ imaging.