A novel energy-efficient spike transformer network for depth estimation from event cameras via cross-modality knowledge distillation

IF 6.5 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Xin Zhang, Liangxiu Han, Sergio Davies, Tam Sobeih, Lianghao Han, Darren Dancey
{"title":"基于跨模态知识精馏的事件相机深度估计新型高效尖峰变压器网络","authors":"Xin Zhang ,&nbsp;Liangxiu Han ,&nbsp;Sergio Davies ,&nbsp;Tam Sobeih ,&nbsp;Lianghao Han ,&nbsp;Darren Dancey","doi":"10.1016/j.neucom.2025.131745","DOIUrl":null,"url":null,"abstract":"<div><div>Depth estimation is a critical task in computer vision, with applications in autonomous navigation, robotics, and augmented reality. Event cameras, which encode temporal changes in light intensity as asynchronous binary spikes, offer unique advantages such as low latency, high dynamic range, and energy efficiency. However, their unconventional spiking output and the scarcity of labeled datasets pose significant challenges to traditional image-based depth estimation methods. To address these challenges, we propose a novel energy-efficient Spike-Driven Transformer Network (SDT) for depth estimation, leveraging the unique properties of spiking data. The proposed SDT introduces three key innovations: (1) a purely spike-driven transformer architecture that incorporates spike-based attention and residual mechanisms, enabling precise depth estimation with minimal energy consumption; (2) a fusion depth estimation head that combines multi-stage features for fine-grained depth prediction while ensuring computational efficiency; and (3) a cross-modality knowledge distillation framework that utilises a pre-trained vision foundation model (DINOv2) to enhance the training of the spiking network despite limited data availability. Experimental evaluations on synthetic and real-world event datasets demonstrate the superiority of our approach, with substantial improvements in Absolute Relative Error (49 % reduction) and Square Relative Error (39.77 % reduction) compared to existing models. The SDT also achieves a 70.2 % reduction in energy consumption (12.43 mJ vs. 41.77 mJ per inference) and reduces model parameters by 42.4 % (20.55 M vs. 35.68 M), making it highly suitable for resource-constrained environments. This work represents the first exploration of transformer-based spiking neural networks for depth estimation, providing a significant step forward in energy-efficient neuromorphic computing for real-world vision applications.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"658 ","pages":"Article 131745"},"PeriodicalIF":6.5000,"publicationDate":"2025-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A novel energy-efficient spike transformer network for depth estimation from event cameras via cross-modality knowledge distillation\",\"authors\":\"Xin Zhang ,&nbsp;Liangxiu Han ,&nbsp;Sergio Davies ,&nbsp;Tam Sobeih ,&nbsp;Lianghao Han ,&nbsp;Darren Dancey\",\"doi\":\"10.1016/j.neucom.2025.131745\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Depth estimation is a critical task in computer vision, with applications in autonomous navigation, robotics, and augmented reality. Event cameras, which encode temporal changes in light intensity as asynchronous binary spikes, offer unique advantages such as low latency, high dynamic range, and energy efficiency. However, their unconventional spiking output and the scarcity of labeled datasets pose significant challenges to traditional image-based depth estimation methods. To address these challenges, we propose a novel energy-efficient Spike-Driven Transformer Network (SDT) for depth estimation, leveraging the unique properties of spiking data. 
The proposed SDT introduces three key innovations: (1) a purely spike-driven transformer architecture that incorporates spike-based attention and residual mechanisms, enabling precise depth estimation with minimal energy consumption; (2) a fusion depth estimation head that combines multi-stage features for fine-grained depth prediction while ensuring computational efficiency; and (3) a cross-modality knowledge distillation framework that utilises a pre-trained vision foundation model (DINOv2) to enhance the training of the spiking network despite limited data availability. Experimental evaluations on synthetic and real-world event datasets demonstrate the superiority of our approach, with substantial improvements in Absolute Relative Error (49 % reduction) and Square Relative Error (39.77 % reduction) compared to existing models. The SDT also achieves a 70.2 % reduction in energy consumption (12.43 mJ vs. 41.77 mJ per inference) and reduces model parameters by 42.4 % (20.55 M vs. 35.68 M), making it highly suitable for resource-constrained environments. This work represents the first exploration of transformer-based spiking neural networks for depth estimation, providing a significant step forward in energy-efficient neuromorphic computing for real-world vision applications.</div></div>\",\"PeriodicalId\":19268,\"journal\":{\"name\":\"Neurocomputing\",\"volume\":\"658 \",\"pages\":\"Article 131745\"},\"PeriodicalIF\":6.5000,\"publicationDate\":\"2025-10-09\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Neurocomputing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0925231225024178\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231225024178","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Cited by: 0

Abstract

Depth estimation is a critical task in computer vision, with applications in autonomous navigation, robotics, and augmented reality. Event cameras, which encode temporal changes in light intensity as asynchronous binary spikes, offer unique advantages such as low latency, high dynamic range, and energy efficiency. However, their unconventional spiking output and the scarcity of labeled datasets pose significant challenges to traditional image-based depth estimation methods. To address these challenges, we propose a novel energy-efficient Spike-Driven Transformer Network (SDT) for depth estimation, leveraging the unique properties of spiking data. The proposed SDT introduces three key innovations: (1) a purely spike-driven transformer architecture that incorporates spike-based attention and residual mechanisms, enabling precise depth estimation with minimal energy consumption; (2) a fusion depth estimation head that combines multi-stage features for fine-grained depth prediction while ensuring computational efficiency; and (3) a cross-modality knowledge distillation framework that utilises a pre-trained vision foundation model (DINOv2) to enhance the training of the spiking network despite limited data availability. Experimental evaluations on synthetic and real-world event datasets demonstrate the superiority of our approach, with substantial improvements in Absolute Relative Error (49 % reduction) and Square Relative Error (39.77 % reduction) compared to existing models. The SDT also achieves a 70.2 % reduction in energy consumption (12.43 mJ vs. 41.77 mJ per inference) and reduces model parameters by 42.4 % (20.55 M vs. 35.68 M), making it highly suitable for resource-constrained environments. This work represents the first exploration of transformer-based spiking neural networks for depth estimation, providing a significant step forward in energy-efficient neuromorphic computing for real-world vision applications.
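The abstract describes the spike-based attention and the DINOv2 distillation only at a high level. As a rough illustration, below is a minimal PyTorch sketch of how a spike-driven attention block and a feature-level cross-modality distillation loss could be wired together. Every name, shape, and threshold in it (SpikeNeuron, SpikeSelfAttention, distillation_loss, the 384-dimensional tokens) is a hypothetical stand-in assumed for the example, not the authors' released implementation.

```python
# Minimal sketch of the two mechanisms named in the abstract:
# (1) a spike-driven self-attention block with binary Q/K/V and a linear,
#     softmax-free aggregation, and (2) a feature-level distillation loss
#     against a frozen DINOv2-style teacher. All names, shapes, and
#     hyper-parameters are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpikeNeuron(nn.Module):
    """Hypothetical spiking unit: Heaviside threshold in the forward pass,
    straight-through (identity) gradient in the backward pass."""

    def __init__(self, threshold: float = 1.0):
        super().__init__()
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        spikes = (x >= self.threshold).float()
        return x + (spikes - x).detach()  # forward: spikes; backward: identity


class SpikeSelfAttention(nn.Module):
    """Sketch of spike-based attention: Q, K, V are binarised, and the
    softmax is replaced by a linear mask-and-sum, so token mixing reduces
    to sparse additions rather than dense floating-point products."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.proj = nn.Linear(dim, dim, bias=False)
        self.spike = SpikeNeuron()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim), assumed to be (near-)binary event features.
        q = self.spike(self.to_q(x))             # binary queries
        k = self.spike(self.to_k(x))             # binary keys
        v = self.spike(self.to_v(x))             # binary values
        kv = (k * v).sum(dim=1, keepdim=True)    # aggregate K.V over tokens
        out = self.spike(q * kv)                 # gate the summary with Q
        return x + self.proj(out)                # residual shortcut


def distillation_loss(student_feat, teacher_feat, projector):
    """Cross-modality feature distillation: map student features into the
    teacher's embedding space and penalise the cosine mismatch."""
    s = F.normalize(projector(student_feat), dim=-1)
    t = F.normalize(teacher_feat.detach(), dim=-1)  # teacher stays frozen
    return (1.0 - (s * t).sum(dim=-1)).mean()


if __name__ == "__main__":
    x = (torch.rand(2, 196, 384) > 0.9).float()     # stand-in event tokens
    attn = SpikeSelfAttention(dim=384)
    feats = attn(x)
    # A real teacher could be loaded with, e.g.:
    #   teacher = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
    teacher_feat = torch.randn(2, 196, 384)          # stand-in DINOv2 tokens
    projector = nn.Linear(384, 384)
    loss = distillation_loss(feats, teacher_feat, projector)
    loss.backward()
    print(feats.shape, float(loss))
```

The linear `(k * v).sum` aggregation stands in for the softmax-free attention the abstract alludes to: with binary Q, K, V, the Hadamard products and token sums can in principle be realised as masked additions, which is where spike-driven designs claim their energy savings.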
Source journal: Neurocomputing (Engineering & Technology / Computer Science: Artificial Intelligence)
CiteScore: 13.10
Self-citation rate: 10.00%
Articles per year: 1382
Review time: 70 days
About the journal: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.