{"title":"A dynamic cross-modal learning framework for joint text-to-audio grounding and acoustic scene classification in smart city environments","authors":"Yige Zhang, Menglong Wu, Xichang Cai","doi":"10.1016/j.dsp.2025.105444","DOIUrl":null,"url":null,"abstract":"<div><div>As two fundamental components of smart city acoustic perception frameworks, Text-to-Audio Grounding (TAG) and Acoustic Scene Classification (ASC) demonstrate essential capabilities in enabling robust environmental monitoring and anomaly detection. However, existing methods typically treat these tasks independently, leading to increased system complexity and overlooking potential synergies between tasks. Although there has been progress in multi-task joint learning research, these methods are primarily limited to single audio modality and predefined event category libraries, lacking the ability to utilize multimodal information and struggling to meet the diversity requirements of complex acoustic scenes in open environments. This paper presents the first multimodal joint learning framework that integrates TAG with ASC, effectively addressing three significant challenges: cross-modal feature heterogeneity, global-local objective conflicts, and modal-task feature coupling, thereby achieving deep task collaboration. The core contributions of this work include designing an Adaptive Transformer with Scene-aware Fusion (ATSF) that optimizes audio-text cross-modal interaction through dual-modal feature decoupling and scene-adaptive recombination mechanisms; constructing a Multimodal Progressive Layered Expert Network (PLE) that suppresses negative transfer in multi-task learning through task-specific and shared knowledge separation strategies; and proposing a dynamic gradient-balanced joint optimization strategy to support efficient cross-modal multi-objective training. Experiments on the extended AudioGrounding dataset demonstrate that our framework significantly improves performance compared to single-task baseline models, with TAG task PSDS value increasing from 14.7 % to 36.83 % and ASC classification accuracy reaching 79.46 %. The proposed ATSF-PLE framework provides an efficient and precise solution for intelligent urban acoustic perception systems, demonstrating substantial application value in intelligent security, traffic management, and other scenarios.</div></div>","PeriodicalId":51011,"journal":{"name":"Digital Signal Processing","volume":"167 ","pages":"Article 105444"},"PeriodicalIF":2.9000,"publicationDate":"2025-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Digital Signal Processing","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S105120042500466X","RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Text-to-Audio Grounding (TAG) and Acoustic Scene Classification (ASC) are two fundamental components of smart-city acoustic perception frameworks, enabling robust environmental monitoring and anomaly detection. However, existing methods typically treat the two tasks independently, which increases system complexity and overlooks potential synergies between them. Although multi-task joint learning has made progress, existing approaches remain largely confined to a single audio modality and predefined event-category libraries, unable to exploit multimodal information and ill-suited to the diversity of complex acoustic scenes in open environments. This paper presents the first multimodal joint learning framework that integrates TAG with ASC, addressing three significant challenges: cross-modal feature heterogeneity, global-local objective conflicts, and modal-task feature coupling, thereby achieving deep task collaboration. The core contributions of this work are: an Adaptive Transformer with Scene-aware Fusion (ATSF) that optimizes audio-text cross-modal interaction through dual-modal feature decoupling and scene-adaptive recombination; a Multimodal Progressive Layered Expert Network (PLE) that suppresses negative transfer in multi-task learning by separating task-specific and shared knowledge; and a dynamic gradient-balanced joint optimization strategy that supports efficient cross-modal multi-objective training. Experiments on the extended AudioGrounding dataset show that the framework significantly outperforms single-task baselines, raising the TAG PSDS from 14.7% to 36.83% and reaching an ASC classification accuracy of 79.46%. The proposed ATSF-PLE framework offers an efficient and precise solution for intelligent urban acoustic perception systems, with substantial application value in intelligent security, traffic management, and other scenarios.
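To make the ATSF idea concrete, below is a minimal PyTorch sketch of scene-aware audio-text fusion: text tokens attend over audio frames, and a scene-conditioned gate recombines the attended features with the originals. All module names, dimensions, and the gating design are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of scene-aware audio-text cross-modal fusion in the
# spirit of ATSF. Names, dimensions, and the gate design are hypothetical.
import torch
import torch.nn as nn


class CrossModalFusion(nn.Module):
    def __init__(self, dim=256, n_heads=4):
        super().__init__()
        # Text tokens query the audio frame sequence.
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        # A scene-conditioned gate blends attended and original features
        # (one possible reading of "scene-adaptive recombination").
        self.scene_gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, text, audio, scene_ctx):
        # text: (B, T_txt, dim), audio: (B, T_aud, dim), scene_ctx: (B, dim)
        attended, _ = self.attn(query=text, key=audio, value=audio)
        gate_in = torch.cat(
            [attended, scene_ctx.unsqueeze(1).expand_as(attended)], dim=-1
        )
        g = self.scene_gate(gate_in)          # per-token gate in [0, 1]
        return g * attended + (1 - g) * text  # scene-adaptive recombination
```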
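The PLE component, as described, follows the general Progressive Layered Extraction pattern: each layer holds shared experts serving both tasks plus private experts per task, with per-task gates mixing them. The sketch below shows one such layer under assumed expert counts and dimensions; the paper's actual configuration may differ.

```python
# One PLE-style layer with shared and task-specific experts plus per-task
# gating. Expert counts and dimensions are assumptions for illustration.
import torch
import torch.nn as nn


class PLELayer(nn.Module):
    def __init__(self, dim=256, n_shared=2, n_task=2, n_tasks=2):
        super().__init__()
        # Shared experts serve every task; task experts are private.
        self.shared = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(n_shared)]
        )
        self.task_experts = nn.ModuleList(
            [
                nn.ModuleList(
                    [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(n_task)]
                )
                for _ in range(n_tasks)
            ]
        )
        # Each task's gate selects over its own experts plus the shared ones.
        self.gates = nn.ModuleList(
            [nn.Linear(dim, n_task + n_shared) for _ in range(n_tasks)]
        )

    def forward(self, x):
        # x: (B, dim) fused audio-text features; returns one tensor per task.
        shared_out = [e(x) for e in self.shared]
        outputs = []
        for t, gate in enumerate(self.gates):
            experts = [e(x) for e in self.task_experts[t]] + shared_out
            stacked = torch.stack(experts, dim=1)     # (B, n_experts, dim)
            weights = torch.softmax(gate(x), dim=-1)  # (B, n_experts)
            outputs.append((weights.unsqueeze(-1) * stacked).sum(dim=1))
        return outputs  # e.g. [TAG features, ASC features]
```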
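For the joint optimization, one simple way to realize dynamic gradient balancing is to weight each task loss inversely to its gradient norm on shared parameters, so that neither task's gradients dominate training. The helper below is a hypothetical sketch of that idea; the paper's exact update rule is not specified here.

```python
# Hypothetical gradient-balanced loss weighting: weights are inversely
# proportional to each task's gradient norm on a shared parameter tensor.
import torch


def gradient_balanced_weights(losses, shared_param, eps=1e-8):
    norms = []
    for loss in losses:
        # Gradient of this task's loss w.r.t. the shared parameters.
        (g,) = torch.autograd.grad(loss, shared_param, retain_graph=True)
        norms.append(g.norm() + eps)
    inv = torch.stack([1.0 / n for n in norms])
    # Normalize so the weights sum to the number of tasks; detach so the
    # weights themselves receive no gradient.
    return (len(losses) * inv / inv.sum()).detach()


# Usage (hypothetical names): with loss_tag and loss_asc computed on a batch,
#   w = gradient_balanced_weights([loss_tag, loss_asc], encoder.weight)
#   total = w[0] * loss_tag + w[1] * loss_asc
#   total.backward()
```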
Journal Introduction
Digital Signal Processing: A Review Journal is one of the oldest and most established journals in the field of signal processing, yet it aims to be the most innovative. The Journal invites top-quality research articles at the frontiers of research in all aspects of signal processing. Our objective is to provide a platform for the publication of ground-breaking research in signal processing with both academic and industrial appeal.
The journal has a special emphasis on statistical signal processing methodology such as Bayesian signal processing, and encourages articles on emerging applications of signal processing such as:
• big data
• machine learning
• internet of things
• information security
• systems biology and computational biology
• financial time series analysis
• autonomous vehicles
• quantum computing
• neuromorphic engineering
• human-computer interaction and intelligent user interfaces
• environmental signal processing
• geophysical signal processing, including seismic signal processing
• chemoinformatics and bioinformatics
• audio, visual and performance arts
• disaster management and prevention
• renewable energy