AV-FDTI: Audio-visual fusion for drone threat identification

Journal of Automation and Intelligence. Pub Date: 2024-09-01. DOI: 10.1016/j.jai.2024.06.002

Yizhuo Yang, Shenghai Yuan, Jianfei Yang, Thien Hoang Nguyen, Muqing Cao, Thien-Minh Nguyen, Han Wang, Lihua Xie

Volume 3, Issue 3, Pages 144-151. Full text: https://www.sciencedirect.com/science/article/pii/S2949855424000285

Abstract

In response to the evolving challenges posed by small unmanned aerial vehicles (UAVs), which can transport harmful payloads or cause significant damage, we present AV-FDTI, an Audio-Visual Fusion system designed for Drone Threat Identification. AV-FDTI fuses audio features with omnidirectional camera features to improve the precision and resilience of drone classification and 3D localization. Specifically, AV-FDTI employs a CRNN (convolutional recurrent neural network) to capture temporal dynamics in the audio domain and a pretrained ResNet50 model for image feature extraction. Furthermore, we adopt a mechanism based on visual information entropy and cross-attention to enhance the fusion of visual and audio data. Notably, our system is trained on automated Leica tracking annotations, which provide ground truth with millimeter-level accuracy. Comprehensive comparative evaluations demonstrate the superiority of our solution over existing systems. To advance this field, we will release this work as open-source code together with a wearable AV-FDTI design, contributing valuable resources to the research community.
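The abstract names two fusion ingredients, visual information entropy and cross-attention, without giving their exact form. The following is a minimal sketch of how the two could be combined, not the authors' implementation. It assumes that the entropy is the Shannon entropy of 8-bit pixel intensities, that a single audio feature vector acts as the attention query over a few visual tokens, and that the normalized entropy simply gates the attended visual feature. All function names and the gating rule are hypothetical; in a real system the features would come from the CRNN and ResNet50 branches.

```python
import math
from collections import Counter


def shannon_entropy(pixels):
    """Shannon entropy in bits of a flat list of integer pixel intensities."""
    total = len(pixels)
    counts = Counter(pixels)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention of one query over several tokens."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    attended = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return attended, weights


def fuse(audio_feat, visual_tokens, pixels, max_entropy=8.0):
    """Hypothetical fusion rule (an assumption, not taken from the paper).

    The audio feature queries the visual tokens, and the attended visual
    feature is scaled by the image entropy normalized to [0, 1] (8 bits is
    the maximum for 8-bit intensities), so a low-information frame such as
    a dark scene contributes less. The result is the concatenation of the
    audio feature and the gated visual feature.
    """
    gate = shannon_entropy(pixels) / max_entropy
    attended, _ = cross_attention(audio_feat, visual_tokens, visual_tokens)
    return audio_feat + [gate * a for a in attended]
```

For example, a frame whose pixels are all identical has zero entropy, so under this sketch `fuse` returns the audio feature followed by zeros, and the downstream heads would rely on audio alone.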
