Triple-Supervised Convolutional Transformer Aggregation for Robust Monocular Endoscopic Dense Depth Estimation
Wenkang Fan; Wenjing Jiang; Hong Shi; Hui-Qing Zeng; Yinran Chen; Xiongbiao Luo
IEEE Transactions on Medical Robotics and Bionics, DOI: 10.1109/TMRB.2024.3407384
Abstract
Accurate deeply learned dense depth prediction remains a challenge for monocular vision reconstruction. Compared to monocular depth estimation from natural images, endoscopic dense depth prediction is even more challenging: it is difficult to annotate endoscopic video data for supervised learning, and endoscopic video images suffer from illumination variations (a limited light source, a limited field of view, and specular highlights) as well as smooth, textureless surfaces in complex surgical fields. This work explores a new deep learning framework of triple-supervised convolutional transformer aggregation (TSCTA) for monocular endoscopic dense depth recovery without annotating any data. Specifically, TSCTA creates convolutional transformer aggregation networks with a new hybrid encoder that combines dense convolutions and scalable transformers to extract local texture features and global spatial-temporal features in parallel, and it builds a local and global aggregation decoder to effectively aggregate global and local features from coarse to fine. Moreover, we develop a self-supervised learning framework with triple supervision, which integrates minimum photometric consistency and depth consistency with sparse depth self-supervision to train our model on unannotated data. We evaluated TSCTA on unannotated monocular endoscopic images collected from various surgical procedures; the experimental results show that our method achieves a more accurate depth range, a more complete depth distribution, more sufficient textures, and better qualitative and quantitative assessment results than state-of-the-art deeply learned monocular dense depth estimation methods.
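The abstract names the three supervision signals (minimum photometric consistency, depth consistency, and sparse depth self-supervision) but not their exact formulation. The sketch below is one plausible way such a triple-supervised loss could be combined in PyTorch; the SSIM+L1 photometric error, the relative depth-consistency term, the loss weights, and all function names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def photometric_error(pred, target, alpha=0.85):
    """Per-pixel photometric error: weighted SSIM + L1 (a common self-supervised choice)."""
    l1 = (pred - target).abs().mean(1, keepdim=True)
    # Simple 3x3 SSIM computed with average pooling.
    mu_x = F.avg_pool2d(pred, 3, 1, 1)
    mu_y = F.avg_pool2d(target, 3, 1, 1)
    sigma_x = F.avg_pool2d(pred ** 2, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(target ** 2, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(pred * target, 3, 1, 1) - mu_x * mu_y
    c1, c2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    ssim = ((1 - ssim.clamp(-1, 1)) / 2).mean(1, keepdim=True)
    return alpha * ssim + (1 - alpha) * l1

def triple_supervised_loss(target, warped_from_sources, pred_depth,
                           depths_from_sources, sparse_depth, sparse_mask,
                           w_photo=1.0, w_depth=0.1, w_sparse=0.5):
    """
    Hypothetical combination of the three supervision terms described in the abstract.

    target:              (B, 3, H, W) reference endoscopic frame
    warped_from_sources: list of (B, 3, H, W) source frames warped into the
                         reference view using the predicted depth and pose
    pred_depth:          (B, 1, H, W) predicted dense depth
    depths_from_sources: list of (B, 1, H, W) depths reprojected from source views
    sparse_depth:        (B, 1, H, W) sparse depth (e.g., from SfM triangulation)
    sparse_mask:         (B, 1, H, W) 1 where sparse depth is valid, 0 elsewhere
    """
    # 1) Minimum photometric consistency: take the per-pixel minimum over source
    #    views, which down-weights occlusions and view-dependent specularities.
    photo = torch.stack([photometric_error(w, target) for w in warped_from_sources])
    photo_loss = photo.min(dim=0).values.mean()

    # 2) Depth consistency between the predicted depth and depths reprojected
    #    from neighboring views (relative absolute difference).
    depth_loss = torch.stack(
        [(pred_depth - d).abs() / (pred_depth + d + 1e-7) for d in depths_from_sources]
    ).mean()

    # 3) Sparse depth self-supervision at pixels where sparse depth exists.
    sparse_loss = ((pred_depth - sparse_depth).abs() * sparse_mask).sum() / \
                  (sparse_mask.sum() + 1e-7)

    return w_photo * photo_loss + w_depth * depth_loss + w_sparse * sparse_loss
```

In this sketch the warped source frames and reprojected depths would be produced by a differentiable warping step from the predicted depth and relative camera poses, as in standard self-supervised monocular depth pipelines; the aggregation network architecture itself is not shown.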