Zhen Lan;Zixing Li;Chao Yan;Xiaojia Xiang;Dengqing Tang;Han Zhou;Jun Lai
{"title":"基于注意力的多模态融合自适应知识蒸馏鲁棒弱小目标检测","authors":"Zhen Lan;Zixing Li;Chao Yan;Xiaojia Xiang;Dengqing Tang;Han Zhou;Jun Lai","doi":"10.1109/TMM.2024.3521793","DOIUrl":null,"url":null,"abstract":"Automated object detection in aerial images is crucial in both civil and military applications. Existing computer vision-based object detection methods are not robust enough to precisely detect dim objects in aerial images due to the cluttered backgrounds, various observing angles, small object scales, and severe occlusions. Recently, electroencephalography (EEG)-based object detection methods have received increasing attention owing to the advanced cognitive capabilities of human vision. However, how to combine the human intelligence with computer intelligence to achieve robust dim object detection is still an open question. In this paper, we propose a novel approach to efficiently fuse and exploit the properties of multi-modal data for dim object detection. Specifically, we first design a brain-computer interface (BCI) paradigm called eye-tracking-based slow serial visual presentation (ESSVP) to simultaneously collect the paired EEG and image data when subjects search for the dim objects in aerial images. Then, we develop an attention-based multi-modal fusion network to selectively aggregate the learned features of EEG and image modalities. Furthermore, we propose an adaptive multi-teacher knowledge distillation method to efficiently train the multi-modal dim object detector for better performance. To evaluate the effectiveness of our method, we conduct extensive experiments on the collected dataset in subject-dependent and subject-independent tasks. The experimental results demonstrate that the proposed dim object detection method exhibits superior effectiveness and robustness compared to the baselines and the state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2083-2096"},"PeriodicalIF":8.4000,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Adaptive Knowledge Distillation With Attention-Based Multi-Modal Fusion for Robust Dim Object Detection\",\"authors\":\"Zhen Lan;Zixing Li;Chao Yan;Xiaojia Xiang;Dengqing Tang;Han Zhou;Jun Lai\",\"doi\":\"10.1109/TMM.2024.3521793\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Automated object detection in aerial images is crucial in both civil and military applications. Existing computer vision-based object detection methods are not robust enough to precisely detect dim objects in aerial images due to the cluttered backgrounds, various observing angles, small object scales, and severe occlusions. Recently, electroencephalography (EEG)-based object detection methods have received increasing attention owing to the advanced cognitive capabilities of human vision. However, how to combine the human intelligence with computer intelligence to achieve robust dim object detection is still an open question. In this paper, we propose a novel approach to efficiently fuse and exploit the properties of multi-modal data for dim object detection. Specifically, we first design a brain-computer interface (BCI) paradigm called eye-tracking-based slow serial visual presentation (ESSVP) to simultaneously collect the paired EEG and image data when subjects search for the dim objects in aerial images. 
Then, we develop an attention-based multi-modal fusion network to selectively aggregate the learned features of EEG and image modalities. Furthermore, we propose an adaptive multi-teacher knowledge distillation method to efficiently train the multi-modal dim object detector for better performance. To evaluate the effectiveness of our method, we conduct extensive experiments on the collected dataset in subject-dependent and subject-independent tasks. The experimental results demonstrate that the proposed dim object detection method exhibits superior effectiveness and robustness compared to the baselines and the state-of-the-art methods.\",\"PeriodicalId\":13273,\"journal\":{\"name\":\"IEEE Transactions on Multimedia\",\"volume\":\"27 \",\"pages\":\"2083-2096\"},\"PeriodicalIF\":8.4000,\"publicationDate\":\"2024-12-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Multimedia\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10814704/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10814704/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Adaptive Knowledge Distillation With Attention-Based Multi-Modal Fusion for Robust Dim Object Detection
Automated object detection in aerial images is crucial in both civil and military applications. Existing computer vision-based object detection methods are not robust enough to precisely detect dim objects in aerial images due to cluttered backgrounds, varied observation angles, small object scales, and severe occlusions. Recently, electroencephalography (EEG)-based object detection methods have received increasing attention owing to the advanced cognitive capabilities of human vision. However, how to combine human intelligence with computer intelligence to achieve robust dim object detection is still an open question. In this paper, we propose a novel approach to efficiently fuse and exploit the properties of multi-modal data for dim object detection. Specifically, we first design a brain-computer interface (BCI) paradigm called eye-tracking-based slow serial visual presentation (ESSVP) to simultaneously collect paired EEG and image data while subjects search for dim objects in aerial images. Then, we develop an attention-based multi-modal fusion network to selectively aggregate the learned features of the EEG and image modalities. Furthermore, we propose an adaptive multi-teacher knowledge distillation method to efficiently train the multi-modal dim object detector for better performance. To evaluate the effectiveness of our method, we conduct extensive experiments on the collected dataset in subject-dependent and subject-independent tasks. The experimental results demonstrate that the proposed dim object detection method exhibits superior effectiveness and robustness compared to baseline and state-of-the-art methods.
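The abstract names two techniques: attention-based fusion of EEG and image features, and adaptive multi-teacher knowledge distillation. The PyTorch sketch below illustrates one plausible reading of each under stated assumptions; all module names, feature dimensions, and the confidence-based teacher weighting are illustrative choices, not the authors' implementation.

```python
# Hedged sketch of (1) attention-based EEG/image feature fusion and
# (2) an adaptively weighted multi-teacher distillation loss.
# Dimensions, names, and the weighting rule are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionFusion(nn.Module):
    """Selectively aggregate EEG and image features via learned per-modality attention."""

    def __init__(self, eeg_dim: int, img_dim: int, fused_dim: int = 256):
        super().__init__()
        self.eeg_proj = nn.Linear(eeg_dim, fused_dim)
        self.img_proj = nn.Linear(img_dim, fused_dim)
        # Produces one attention logit per modality from the concatenated projections.
        self.attn = nn.Linear(2 * fused_dim, 2)

    def forward(self, eeg_feat: torch.Tensor, img_feat: torch.Tensor) -> torch.Tensor:
        eeg = self.eeg_proj(eeg_feat)  # (B, fused_dim)
        img = self.img_proj(img_feat)  # (B, fused_dim)
        weights = torch.softmax(self.attn(torch.cat([eeg, img], dim=-1)), dim=-1)
        # Weighted sum emphasizes the more informative modality for each sample.
        return weights[:, 0:1] * eeg + weights[:, 1:2] * img


def adaptive_multi_teacher_kd_loss(student_logits, teacher_logits_list, labels, T=4.0):
    """Cross-entropy plus KL distillation from several teachers, where each teacher is
    weighted by its average confidence on the ground-truth class (one plausible
    'adaptive' weighting scheme, assumed here for illustration)."""
    ce = F.cross_entropy(student_logits, labels)
    # Average confidence of each teacher on the true labels -> softmax over teachers.
    confidences = torch.stack(
        [t.softmax(dim=-1).gather(1, labels.unsqueeze(1)).squeeze(1).mean()
         for t in teacher_logits_list]
    )
    teacher_weights = torch.softmax(confidences, dim=0)
    kd = sum(
        w * F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(t / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        for w, t in zip(teacher_weights, teacher_logits_list)
    )
    return ce + kd


if __name__ == "__main__":
    fusion = AttentionFusion(eeg_dim=128, img_dim=512)
    fused = fusion(torch.randn(8, 128), torch.randn(8, 512))  # (8, 256)
    student = torch.randn(8, 2, requires_grad=True)           # object vs. background logits
    teachers = [torch.randn(8, 2), torch.randn(8, 2)]
    loss = adaptive_multi_teacher_kd_loss(student, teachers, torch.randint(0, 2, (8,)))
    loss.backward()
    print(fused.shape, loss.item())
```

In this sketch the fusion weights are computed per sample, so a noisy EEG trial can be down-weighted in favor of the image features; similarly, teachers that are more confident on the true class contribute more to the distillation term. The paper's actual fusion network and adaptive weighting may differ.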
Journal Introduction:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.