Depth cue fusion for event-based stereo depth estimation

IF 14.7 · CAS Zone 1, Computer Science · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Information Fusion · Pub Date: 2024-12-24 · DOI: 10.1016/j.inffus.2024.102891

Dipon Kumar Ghosh, Yong Ju Jung

Citations: 0

Abstract

Inspired by the biological retina, event cameras use dynamic vision sensors to capture pixel intensity changes asynchronously. Event cameras offer numerous advantages, such as high dynamic range, high temporal resolution, reduced motion blur, and low power consumption. These features make event cameras particularly well suited to depth estimation, especially in challenging scenarios involving rapid motion and high-dynamic-range imaging conditions. The human visual system perceives scene depth by combining multiple depth cues, such as monocular pictorial depth, stereo depth, and motion parallax. However, most existing event-based depth estimation algorithms use only a single depth cue, either stereo depth or monocular depth. While it is feasible to estimate depth from a single cue, estimating dense disparity in challenging scenarios and lighting conditions remains difficult. Motivated by this, we conduct extensive experiments to explore various methods for depth cue fusion. Based on the experimental results, we propose a fusion architecture that systematically incorporates multiple depth cues for event-based stereo depth estimation. Specifically, we propose a depth cue fusion (DCF) network that fuses multiple depth cues using a novel fusion method called SpadeFormer. SpadeFormer is a fully context-aware fusion mechanism that incorporates two modulation techniques, spatially adaptive denormalization (Spade) and cross-attention, for depth cue fusion in a transformer block. The adaptive denormalization modulates both input features by adjusting the global statistics of the features in a cross manner, and the modulated features are further fused by cross-attention. Experiments conducted on a real-world dataset show that our method reduces the one-pixel error rate by at least 47.63% (3.708 for the best existing method vs. 1.942 for ours) and the mean absolute error by 40.07% (0.302 for the best existing method vs. 0.181 for ours). The results show that the depth cue fusion method outperforms state-of-the-art methods by significant margins and produces better disparity maps.
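
The reported percentages are relative reductions, and they can be checked from the raw figures quoted in the abstract. The sketch below does that arithmetic and also shows how the two disparity metrics are commonly computed. The abstract does not define the metrics, so the definitions here are assumptions: one-pixel error as the percentage of valid pixels whose absolute disparity error exceeds 1 px, and mean absolute error as the average absolute disparity error over valid pixels. The function names, the `valid` mask, and the toy arrays are hypothetical and are not taken from the paper.

```python
import numpy as np


def relative_reduction(baseline: float, ours: float) -> float:
    """Relative error reduction in percent: (baseline - ours) / baseline * 100."""
    return (baseline - ours) / baseline * 100.0


def _valid_abs_error(pred, gt, valid=None) -> np.ndarray:
    """Absolute disparity error over valid pixels (assumed: finite ground truth)."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    if valid is None:
        valid = np.isfinite(gt)  # assumption: missing ground truth is stored as NaN
    return np.abs(pred - gt)[valid]


def one_pixel_error(pred, gt, valid=None, threshold: float = 1.0) -> float:
    """Assumed definition: percentage of valid pixels with |error| > threshold."""
    err = _valid_abs_error(pred, gt, valid)
    return float((err > threshold).mean() * 100.0)


def mean_absolute_error(pred, gt, valid=None) -> float:
    """Assumed definition: mean |error| over valid pixels, in pixels of disparity."""
    return float(_valid_abs_error(pred, gt, valid).mean())


# Figures quoted in the abstract (best existing method vs. the proposed method).
one_pe_gain = relative_reduction(3.708, 1.942)
mae_gain = relative_reduction(0.302, 0.181)
print(f"one-pixel error reduction: {one_pe_gain:.2f}%")
print(f"mean absolute error reduction: {mae_gain:.2f}%")

# Toy 2x2 disparity maps (made-up values) with one invalid ground-truth pixel.
toy_gt = np.array([[10.0, 20.0], [30.0, np.nan]])
toy_pred = np.array([[10.5, 22.0], [30.2, 5.0]])
toy_1pe = one_pixel_error(toy_pred, toy_gt)
toy_mae = mean_absolute_error(toy_pred, toy_gt)
```

Both quoted percentages follow from the raw values under this relative-reduction reading, which suggests the abstract's figures are internally consistent.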

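Source journal

Information Fusion (Engineering and Technology; Computer Science: Theory and Methods)

CiteScore: 33.20

Self-citation rate: 4.30%

Articles published: 161

Review time: 7.9 months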

About the journal: Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.