{"title":"MST: A Modified Sparse Transformer with depth-aware attention for multi-modal camera–LiDAR fusion in autonomous vehicles","authors":"Badri Raj Lamichhane , Bibek Paudel , Sushant Paudel , Gun Srijuntongsiri , Teerayut Horanont","doi":"10.1016/j.trip.2025.101571","DOIUrl":null,"url":null,"abstract":"<div><div>Sensor fusion plays a pivotal role in enhancing the accuracy, safety, and decision-making capabilities of autonomous vehicles by integrating camera and LiDAR data. Cameras provide rich semantic information, while LiDAR offers precise depth estimation; their fusion is crucial for robust perception in complex driving scenarios. Transformer-based models have emerged as effective tools for multimodal fusion by leveraging self-attention to capture intricate relationships between sensor data. However, traditional transformers face computational efficiency challenges with long input sequences and sparse data. To address these limitations, we propose the Modified Sparse Transformer (MST) for camera–LiDAR fusion. The MST reduces attention matrix complexity, enabling faster processing while maintaining high performance with fewer parameters. Key innovations include depth-aware attention mechanisms, cross-modal feature alignment, and dynamic instance interaction modules. These collectively enhance object detection accuracy in challenging conditions such as low visibility and dense traffic. Experiments on benchmark datasets demonstrate significant improvements in both accuracy and efficiency compared to existing methods.</div></div>","PeriodicalId":36621,"journal":{"name":"Transportation Research Interdisciplinary Perspectives","volume":"34 ","pages":"Article 101571"},"PeriodicalIF":3.8000,"publicationDate":"2025-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Transportation Research Interdisciplinary Perspectives","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2590198225002507","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"TRANSPORTATION","Score":null,"Total":0}
Abstract
Sensor fusion plays a pivotal role in enhancing the accuracy, safety, and decision-making capabilities of autonomous vehicles by integrating camera and LiDAR data. Cameras provide rich semantic information, while LiDAR offers precise depth estimation; their fusion is crucial for robust perception in complex driving scenarios. Transformer-based models have emerged as effective tools for multimodal fusion by leveraging self-attention to capture intricate relationships across sensor modalities. However, traditional transformers face computational efficiency challenges with long input sequences and sparse data. To address these limitations, we propose the Modified Sparse Transformer (MST) for camera–LiDAR fusion. The MST reduces the complexity of the attention matrix, enabling faster processing while maintaining high performance with fewer parameters. Key innovations include depth-aware attention mechanisms, cross-modal feature alignment, and dynamic instance interaction modules. These collectively enhance object detection accuracy in challenging conditions such as low visibility and dense traffic. Experiments on benchmark datasets demonstrate significant improvements in both accuracy and efficiency compared to existing methods.
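To make the depth-aware attention idea concrete, the sketch below shows one plausible formulation in PyTorch, where attention logits are penalised by the absolute depth difference between tokens so that camera and LiDAR features at similar depths interact more strongly. The module name, bias form, tensor shapes, and hyperparameters are illustrative assumptions for exposition only; they are not the MST implementation described in the paper.

```python
# Illustrative sketch only: a minimal depth-aware attention layer in PyTorch.
# The bias formulation and all hyperparameters below are assumptions, not the
# authors' MST architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DepthAwareAttention(nn.Module):
    """Scaled dot-product attention whose logits are biased by per-token depth
    differences, so tokens at similar depths attend to each other more strongly."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        # Learnable scale for how strongly depth disagreement suppresses attention.
        self.depth_scale = nn.Parameter(torch.tensor(1.0))

    def forward(self, x: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) fused token features; depth: (B, N) per-token depth in metres.
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape each projection to (B, heads, N, head_dim).
        q, k, v = (t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
                   for t in (q, k, v))

        logits = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5  # (B, H, N, N)
        # Depth bias: penalise attention between tokens with very different depths.
        depth_diff = (depth.unsqueeze(-1) - depth.unsqueeze(-2)).abs()  # (B, N, N)
        logits = logits - self.depth_scale * depth_diff.unsqueeze(1)

        attn = F.softmax(logits, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        return self.proj(out)


if __name__ == "__main__":
    layer = DepthAwareAttention(dim=64, num_heads=4)
    feats = torch.randn(2, 128, 64)      # 128 hypothetical fused camera/LiDAR tokens
    depths = torch.rand(2, 128) * 50.0   # hypothetical depths in [0, 50) metres
    print(layer(feats, depths).shape)    # torch.Size([2, 128, 64])
```

In this sketch the depth bias is a simple learnable linear penalty; the paper's actual mechanism, and how it combines with the sparse attention pattern and cross-modal alignment modules, is described in the full text.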