Xin Zhang, Liangxiu Han, Sergio Davies, Tam Sobeih, Lianghao Han, Darren Dancey
Neurocomputing, Volume 658, Article 131745. DOI: 10.1016/j.neucom.2025.131745. Published 2025-10-09.
A novel energy-efficient spike transformer network for depth estimation from event cameras via cross-modality knowledge distillation
Depth estimation is a critical task in computer vision, with applications in autonomous navigation, robotics, and augmented reality. Event cameras, which encode temporal changes in light intensity as asynchronous binary spikes, offer unique advantages such as low latency, high dynamic range, and energy efficiency. However, their unconventional spiking output and the scarcity of labeled datasets pose significant challenges to traditional image-based depth estimation methods.

To address these challenges, we propose a novel energy-efficient Spike-Driven Transformer Network (SDT) for depth estimation, leveraging the unique properties of spiking data. The proposed SDT introduces three key innovations: (1) a purely spike-driven transformer architecture that incorporates spike-based attention and residual mechanisms, enabling precise depth estimation with minimal energy consumption; (2) a fusion depth estimation head that combines multi-stage features for fine-grained depth prediction while ensuring computational efficiency; and (3) a cross-modality knowledge distillation framework that utilises a pre-trained vision foundation model (DINOv2) to enhance the training of the spiking network despite limited data availability.

Experimental evaluations on synthetic and real-world event datasets demonstrate the superiority of our approach, with substantial improvements in Absolute Relative Error (49% reduction) and Square Relative Error (39.77% reduction) compared to existing models. The SDT also achieves a 70.2% reduction in energy consumption (12.43 mJ vs. 41.77 mJ per inference) and reduces model parameters by 42.4% (20.55 M vs. 35.68 M), making it highly suitable for resource-constrained environments. This work represents the first exploration of transformer-based spiking neural networks for depth estimation, providing a significant step forward in energy-efficient neuromorphic computing for real-world vision applications.
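To give a flavour of the spike-driven attention the abstract describes, the sketch below shows the general idea behind such mechanisms: queries, keys, and values are binarized by a threshold (Heaviside) spiking nonlinearity, so the attention product reduces to spike-coincidence counting and additive accumulation rather than dense floating-point softmax attention. This is a minimal illustrative sketch, not the paper's actual architecture — the function names (`lif_spike`, `spike_attention`), the fixed threshold, and the absence of a temporal membrane dynamic are all our own simplifying assumptions.

```python
import numpy as np

def lif_spike(x, threshold=1.0):
    # Heaviside-style firing: emit a binary spike wherever the input
    # (a stand-in for membrane potential) crosses the threshold.
    return (x >= threshold).astype(np.float32)

def spike_attention(x, Wq, Wk, Wv, threshold=1.0):
    # Binarize query/key/value projections via spiking thresholds, then
    # attend with pure additive operations: q @ k.T counts spike
    # coincidences, and the result accumulates binary value spikes.
    # No softmax and no dense float multiplies in the attention product,
    # which is the source of the energy savings claimed for spike-driven
    # transformers.
    q = lif_spike(x @ Wq, threshold)
    k = lif_spike(x @ Wk, threshold)
    v = lif_spike(x @ Wv, threshold)
    scores = q @ k.T        # (T, T) spike-coincidence counts
    return scores @ v       # (T, d) accumulated value spikes

rng = np.random.default_rng(0)
T, d = 4, 8                          # tokens x feature dim (toy sizes)
x = rng.normal(size=(T, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = spike_attention(x, Wq, Wk, Wv)
```

Because every operand in the attention product is 0 or 1, hardware can implement it with accumulate-only (AC) operations instead of multiply-accumulate (MAC), which is where analyses of spiking transformers typically derive per-inference energy figures like those quoted above.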
Journal introduction:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Theory, practice, and applications are the essential topics covered.