{"title":"A dynamic cross-modal learning framework for joint text-to-audio grounding and acoustic scene classification in smart city environments","authors":"Yige Zhang, Menglong Wu, Xichang Cai","doi":"10.1016/j.dsp.2025.105444","DOIUrl":null,"url":null,"abstract":"<div><div>As two fundamental components of smart city acoustic perception frameworks, Text-to-Audio Grounding (TAG) and Acoustic Scene Classification (ASC) demonstrate essential capabilities in enabling robust environmental monitoring and anomaly detection. However, existing methods typically treat these tasks independently, leading to increased system complexity and overlooking potential synergies between tasks. Although there has been progress in multi-task joint learning research, these methods are primarily limited to single audio modality and predefined event category libraries, lacking the ability to utilize multimodal information and struggling to meet the diversity requirements of complex acoustic scenes in open environments. This paper presents the first multimodal joint learning framework that integrates TAG with ASC, effectively addressing three significant challenges: cross-modal feature heterogeneity, global-local objective conflicts, and modal-task feature coupling, thereby achieving deep task collaboration. The core contributions of this work include designing an Adaptive Transformer with Scene-aware Fusion (ATSF) that optimizes audio-text cross-modal interaction through dual-modal feature decoupling and scene-adaptive recombination mechanisms; constructing a Multimodal Progressive Layered Expert Network (PLE) that suppresses negative transfer in multi-task learning through task-specific and shared knowledge separation strategies; and proposing a dynamic gradient-balanced joint optimization strategy to support efficient cross-modal multi-objective training. Experiments on the extended AudioGrounding dataset demonstrate that our framework significantly improves performance compared to single-task baseline models, with TAG task PSDS value increasing from 14.7 % to 36.83 % and ASC classification accuracy reaching 79.46 %. The proposed ATSF-PLE framework provides an efficient and precise solution for intelligent urban acoustic perception systems, demonstrating substantial application value in intelligent security, traffic management, and other scenarios.</div></div>","PeriodicalId":51011,"journal":{"name":"Digital Signal Processing","volume":"167 ","pages":"Article 105444"},"PeriodicalIF":2.9000,"publicationDate":"2025-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Digital Signal Processing","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S105120042500466X","RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Text-to-Audio Grounding (TAG) and Acoustic Scene Classification (ASC) are two fundamental components of smart-city acoustic perception frameworks, enabling robust environmental monitoring and anomaly detection. However, existing methods typically treat the two tasks independently, which increases system complexity and overlooks potential synergies between them. Although multi-task joint learning has made progress, existing approaches remain largely confined to a single audio modality and predefined event-category libraries, unable to exploit multimodal information and ill-suited to the diversity of complex acoustic scenes in open environments. This paper presents the first multimodal joint learning framework that integrates TAG with ASC, addressing three significant challenges: cross-modal feature heterogeneity, global-local objective conflicts, and modal-task feature coupling, thereby achieving deep task collaboration. The core contributions of this work are: an Adaptive Transformer with Scene-aware Fusion (ATSF) that optimizes audio-text cross-modal interaction through dual-modal feature decoupling and scene-adaptive recombination; a Multimodal Progressive Layered Expert Network (PLE) that suppresses negative transfer in multi-task learning by separating task-specific and shared knowledge; and a dynamic gradient-balanced joint optimization strategy that supports efficient cross-modal multi-objective training. Experiments on the extended AudioGrounding dataset show that the framework significantly outperforms single-task baselines, raising the TAG PSDS from 14.7% to 36.83% and reaching an ASC classification accuracy of 79.46%. The proposed ATSF-PLE framework offers an efficient and precise solution for intelligent urban acoustic perception systems, with substantial application value in intelligent security, traffic management, and other scenarios.
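To make the ATSF idea concrete, below is a minimal PyTorch sketch of scene-aware audio-text fusion: text tokens attend over audio frames, and a scene-conditioned gate recombines the attended features with the originals. All module names, dimensions, and the gating design are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of scene-aware audio-text cross-modal fusion in the
# spirit of ATSF. Names, dimensions, and the gate design are hypothetical.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        # Text tokens query the audio frame sequence.
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # A scene-conditioned gate blends attended and original features
        # (one possible reading of "scene-adaptive recombination").
        self.scene_gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, text, audio, scene_ctx):
        # text: (B, T_txt, dim), audio: (B, T_aud, dim), scene_ctx: (B, dim)
        attended, _ = self.attn(query=text, key=audio, value=audio)
        gate_in = torch.cat(
            [attended, scene_ctx.unsqueeze(1).expand_as(attended)], dim=-1
        )
        g = self.scene_gate(gate_in)          # per-token gate in [0, 1]
        return g * attended + (1 - g) * text  # scene-adaptive recombination
```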
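The PLE component, as described, follows the general Progressive Layered Extraction pattern: each layer holds shared experts serving both tasks plus private experts per task, with per-task gates mixing them. The sketch below shows one such layer under assumed expert counts and dimensions; the paper's actual configuration may differ.

```python
# One PLE-style layer with shared and task-specific experts plus per-task
# gating. Expert counts and dimensions are assumptions for illustration.
import torch
import torch.nn as nn


class PLELayer(nn.Module):
    def __init__(self, dim=256, n_shared=2, n_task=2, n_tasks=2):
        super().__init__()
        # Shared experts serve every task; task experts are private.
        self.shared = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(n_shared)]
        )
        self.task_experts = nn.ModuleList(
            [
                nn.ModuleList(
                    [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(n_task)]
                )
                for _ in range(n_tasks)
            ]
        )
        # Each task's gate selects over its own experts plus the shared ones.
        self.gates = nn.ModuleList(
            [nn.Linear(dim, n_task + n_shared) for _ in range(n_tasks)]
        )

    def forward(self, x):
        # x: (B, dim) fused audio-text features; returns one tensor per task.
        shared_out = [e(x) for e in self.shared]
        outputs = []
        for t, gate in enumerate(self.gates):
            experts = [e(x) for e in self.task_experts[t]] + shared_out
            stacked = torch.stack(experts, dim=1)     # (B, n_experts, dim)
            weights = torch.softmax(gate(x), dim=-1)  # (B, n_experts)
            outputs.append((weights.unsqueeze(-1) * stacked).sum(dim=1))
        return outputs  # e.g. [TAG features, ASC features]
```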
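For the joint optimization, one simple way to realize dynamic gradient balancing is to weight each task loss inversely to its gradient norm on shared parameters, so that neither task's gradients dominate training. The helper below is a hypothetical sketch of that idea; the paper's exact update rule is not specified here.

```python
# Hypothetical gradient-balanced loss weighting: weights are inversely
# proportional to each task's gradient norm on a shared parameter tensor.
import torch


def gradient_balanced_weights(losses, shared_param, eps=1e-8):
    norms = []
    for loss in losses:
        # Gradient of this task's loss w.r.t. the shared parameters.
        (g,) = torch.autograd.grad(loss, shared_param, retain_graph=True)
        norms.append(g.norm() + eps)
    inv = torch.stack([1.0 / n for n in norms])
    # Normalize so the weights sum to the number of tasks; detach so the
    # weights themselves receive no gradient.
    return (len(losses) * inv / inv.sum()).detach()


# Usage (hypothetical names): with loss_tag and loss_asc computed on a batch,
#   w = gradient_balanced_weights([loss_tag, loss_asc], encoder.weight)
#   total = w[0] * loss_tag + w[1] * loss_asc
#   total.backward()
```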
Journal Introduction
Digital Signal Processing: A Review Journal is one of the oldest and most established journals in the field of signal processing, yet it aims to be the most innovative. The Journal invites top-quality research articles at the frontiers of research in all aspects of signal processing. Our objective is to provide a platform for the publication of ground-breaking research in signal processing with both academic and industrial appeal.
The journal has a special emphasis on statistical signal processing methodology such as Bayesian signal processing, and encourages articles on emerging applications of signal processing such as:
• big data
• machine learning
• internet of things
• information security
• systems biology and computational biology
• financial time series analysis
• autonomous vehicles
• quantum computing
• neuromorphic engineering
• human-computer interaction and intelligent user interfaces
• environmental signal processing
• geophysical signal processing, including seismic signal processing
• chemoinformatics and bioinformatics
• audio, visual and performance arts
• disaster management and prevention
• renewable energy