{"title":"Transformer-Prompted Network: Efficient Audio–Visual Segmentation via Transformer and Prompt Learning","authors":"Yusen Wang;Xiaohong Qian;Wujie Zhou","doi":"10.1109/LSP.2024.3524120","DOIUrl":null,"url":null,"abstract":"Audio–visual segmentation (AVS) is a challenging task that focuses on segmenting sound-producing objects within video frames by leveraging audio signals. Existing convolutional neural networks (CNNs) and Transformer-based methods extract features separately from modality-specific encoders and then use fusion modules to integrate the visual and auditory features. We propose an effective Transformer-prompted network, TPNet, which utilizes prompt learning with a Transformer to guide the CNN in addressing AVS tasks. Specifically, during feature encoding, we incorporate a frequency-based prompt-supplement module to fine-tune and enhance the encoded features through frequency-domain methods. Furthermore, during audio–visual fusion, we integrate a self-supplementing cross-fusion module that uses self-attention, two-dimensional selective scanning, and cross-attention mechanisms to merge and enhance audio–visual features effectively. The prompt features undergo the same processing in cross-modal fusion, further refining the fused features to achieve more accurate segmentation results. Finally, we apply self-knowledge distillation to the network, further enhancing the model performance. Extensive experiments on the AVSBench dataset validate the effectiveness of TPNet.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"516-520"},"PeriodicalIF":3.2000,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Signal Processing Letters","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10820826/","RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Citations: 0
Abstract
Audio–visual segmentation (AVS) is a challenging task that focuses on segmenting sound-producing objects within video frames by leveraging audio signals. Existing convolutional neural network (CNN)- and Transformer-based methods extract features separately with modality-specific encoders and then use fusion modules to integrate the visual and auditory features. We propose an effective Transformer-prompted network, TPNet, which uses prompt learning with a Transformer to guide the CNN in addressing AVS tasks. Specifically, during feature encoding, we incorporate a frequency-based prompt-supplement module that fine-tunes and enhances the encoded features through frequency-domain operations. Furthermore, during audio–visual fusion, we integrate a self-supplementing cross-fusion module that uses self-attention, two-dimensional selective scanning, and cross-attention mechanisms to merge and enhance audio–visual features effectively. The prompt features undergo the same processing in cross-modal fusion, further refining the fused features and yielding more accurate segmentation results. Finally, we apply self-knowledge distillation to the network, further improving model performance. Extensive experiments on the AVSBench dataset validate the effectiveness of TPNet.
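The abstract only names the modules, so the sketch below is a minimal, hedged illustration of two of the ideas it describes: a frequency-domain refinement of encoder features (one plausible reading of the frequency-based prompt-supplement module) and a cross-attention step for audio–visual fusion. All class names, tensor shapes, and the exact wiring (FrequencyPromptSupplement, AudioVisualCrossFusion, the learnable spectral filter) are illustrative assumptions, not the authors' implementation.

# Illustrative sketch only; module names, shapes, and wiring are assumptions.
import torch
import torch.nn as nn


class FrequencyPromptSupplement(nn.Module):
    """Refine visual features with a learnable filter applied in the
    frequency domain (a plausible reading of the frequency-based
    prompt-supplement module; details are assumed)."""

    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        # Learnable complex-valued filter over the rFFT spectrum.
        self.filter = nn.Parameter(
            torch.ones(channels, height, width // 2 + 1, dtype=torch.cfloat)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")           # to frequency domain
        spec = spec * self.filter                         # modulate spectrum
        prompt = torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")
        return x + prompt                                 # residual supplement


class AudioVisualCrossFusion(nn.Module):
    """Fuse audio and visual tokens with cross-attention: visual queries
    attend to audio keys/values (one of several mechanisms the abstract
    lists; the exact arrangement is assumed)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, N_v, D) flattened spatial tokens; audio: (B, N_a, D)
        fused, _ = self.cross_attn(query=visual, key=audio, value=audio)
        return self.norm(visual + fused)                  # residual + norm


if __name__ == "__main__":
    vis = torch.randn(2, 64, 32, 32)
    fps = FrequencyPromptSupplement(channels=64, height=32, width=32)
    vis = fps(vis)                                        # (2, 64, 32, 32)

    tokens = vis.flatten(2).transpose(1, 2)               # (2, 1024, 64)
    audio = torch.randn(2, 4, 64)                         # 4 audio tokens
    fusion = AudioVisualCrossFusion(dim=64)
    out = fusion(tokens, audio)                           # (2, 1024, 64)
    print(out.shape)

In this sketch the frequency-domain branch acts as a residual "prompt" added back to the spatial features, and the cross-attention block injects audio context into the visual tokens; the paper's actual modules also involve two-dimensional selective scanning and self-knowledge distillation, which are not reproduced here.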
Journal Introduction:
The IEEE Signal Processing Letters is a monthly, archival publication designed to provide rapid dissemination of original, cutting-edge ideas and timely, significant contributions in signal, image, speech, language and audio processing. Papers published in the Letters can be presented within one year of their appearance at signal processing conferences such as ICASSP, GlobalSIP and ICIP, as well as at several workshops organized by the Signal Processing Society.