Event Voxel Set Transformer for Spatiotemporal Representation Learning on Event Streams
Bochen Xie; Yongjian Deng; Zhanpeng Shao; Qingsong Xu; Youfu Li
IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 12, pp. 13427-13440
DOI: 10.1109/TCSVT.2024.3448615
Published: 2024-08-23
https://ieeexplore.ieee.org/document/10644034/
Citations: 0
Abstract
Event cameras are neuromorphic vision sensors that record a scene as sparse and asynchronous event streams. Most event-based methods project events into dense frames and process them using conventional vision models, resulting in high computational complexity. A recent trend is to develop point-based networks that achieve efficient event processing by learning sparse representations. However, existing works may lack robust local information aggregators and effective feature interaction operations, thus limiting their modeling capabilities. To this end, we propose an attention-aware model named Event Voxel Set Transformer (EVSTr) for efficient spatiotemporal representation learning on event streams. It first converts the event stream into voxel sets and then hierarchically aggregates voxel features to obtain robust representations. The core of EVSTr is an event voxel transformer encoder that consists of two well-designed components: the Multi-Scale Neighbor Embedding Layer (MNEL) for local information aggregation and the Voxel Self-Attention Layer (VSAL) for global feature interaction. To enable the network to incorporate long-range temporal structure, we introduce a segment modeling strategy (S2TM) to learn motion patterns from a sequence of segmented voxel sets. The proposed model is evaluated on two recognition tasks: object classification and action recognition. To provide a convincing model evaluation, we present a new event-based action recognition dataset (NeuroHAR) recorded in challenging scenarios. Comprehensive experiments show that EVSTr achieves state-of-the-art performance while maintaining low model complexity.
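To make the described pipeline concrete, below is a minimal PyTorch sketch of an EVSTr-style model based only on this abstract. The class names MNEL, VSAL, and EVSTrSketch mirror the paper's terminology, but their internals (k-nearest-neighbor grouping at two scales, a standard multi-head self-attention block, and mean pooling over segments as a stand-in for S2TM) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MNEL(nn.Module):
    """Multi-Scale Neighbor Embedding Layer (assumed form): aggregate features
    from the k nearest voxels at several neighborhood sizes, then fuse them."""

    def __init__(self, in_dim, out_dim, ks=(4, 8)):
        super().__init__()
        self.ks = ks
        self.mlps = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, out_dim), nn.ReLU()) for _ in ks])
        self.fuse = nn.Linear(out_dim * len(ks), out_dim)

    def forward(self, coords, feats):
        # coords: (B, N, 3) voxel centroids (x, y, t); feats: (B, N, C)
        dist = torch.cdist(coords, coords)                     # (B, N, N)
        scale_feats = []
        for k, mlp in zip(self.ks, self.mlps):
            idx = dist.topk(k, largest=False).indices          # (B, N, k) neighbor ids
            nbr = torch.gather(
                feats.unsqueeze(1).expand(-1, feats.size(1), -1, -1), 2,
                idx.unsqueeze(-1).expand(-1, -1, -1, feats.size(-1)))
            scale_feats.append(mlp(nbr).max(dim=2).values)     # pool over neighbors
        return self.fuse(torch.cat(scale_feats, dim=-1))       # (B, N, out_dim)


class VSAL(nn.Module):
    """Voxel Self-Attention Layer (assumed form): global interaction among voxels."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)


class EVSTrSketch(nn.Module):
    """Local aggregation -> global attention -> classifier. The paper's segment
    modeling strategy (S2TM) is replaced here by simple averaging over segments."""

    def __init__(self, in_dim=1, dim=64, num_classes=10):
        super().__init__()
        self.mnel = MNEL(in_dim, dim)
        self.vsal = VSAL(dim)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, segments):
        # segments: list of (coords, feats) pairs, one voxel set per time segment
        tokens = []
        for coords, feats in segments:
            x = self.vsal(self.mnel(coords, feats))
            tokens.append(x.mean(dim=1))                       # pool voxels per segment
        return self.head(torch.stack(tokens, dim=1).mean(dim=1))


# Example: a recording split into 3 segments, each voxelized into 128 voxels
segments = [(torch.rand(2, 128, 3), torch.rand(2, 128, 1)) for _ in range(3)]
logits = EVSTrSketch()(segments)                               # shape: (2, 10)
```

The sketch only conveys the data flow the abstract describes (voxelization, local neighbor embedding, global self-attention, segment-level fusion); the actual layer designs, positional encodings, and training details are given in the full paper.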
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.