Consistency-Queried Transformer for Audio-Visual Segmentation

Ying Lv; Zhi Liu; Xiaojun Chang
IEEE Transactions on Image Processing, vol. 34, pp. 2616-2627
DOI: 10.1109/TIP.2025.3563076, published 2025-04-28
https://ieeexplore.ieee.org/document/10979212/
Audio-visual segmentation (AVS) aims to segment objects in audio-visual content. Effective interaction between audio and visual features has attracted significant attention in the multimodal domain. Despite notable advances, most existing AVS methods are hampered by multimodal inconsistency: the audio cues and the visual information they are meant to guide become mismatched, with visual features often dominating the audio modality. To address this issue, we propose the Consistency-Queried Transformer (CQFormer), a novel transformer-based framework for AVS tasks. The framework features a Consistency Query Generator (CQG) and a Query-Aligned Matching (QAM) module. A Noise Contrastive Estimation (NCE) loss improves modality matching and consistency by minimizing the distributional gap between audio and visual features, enabling effective fusion and interaction between them. In addition, introducing the consistency query during the decoding stage strengthens the consistency constraint and supplies object-level semantic information, further improving the accuracy and stability of audio-visual segmentation. Extensive experiments on the popular audio-visual segmentation benchmark demonstrate that the proposed CQFormer achieves state-of-the-art performance.
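The abstract does not give the exact form of the NCE objective, but losses of this kind are commonly instantiated as InfoNCE: paired audio and visual embeddings from the same clip are pulled together while embeddings from other clips in the batch act as negatives. The sketch below is a minimal, generic NumPy illustration of that idea, not the paper's implementation; the function name, the symmetric two-direction formulation, and the temperature value are all illustrative assumptions.

```python
import numpy as np

def info_nce_loss(audio_emb: np.ndarray, visual_emb: np.ndarray,
                  temperature: float = 0.07) -> float:
    """Symmetric InfoNCE loss over paired audio/visual embeddings.

    Row i of each (N, D) matrix is assumed to come from the same clip,
    so the (i, i) similarity is the positive pair and every other row
    in the batch serves as a negative. (Illustrative sketch only.)
    """
    # L2-normalize so the dot product becomes cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    logits = a @ v.T / temperature  # (N, N) similarity matrix

    def cross_entropy_diag(l: np.ndarray) -> float:
        # Numerically stable log-softmax per row; the target index of
        # row i is i, i.e. the diagonal entry is the positive pair
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return float(-np.mean(np.diag(log_probs)))

    # Average the audio->visual and visual->audio directions
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

Minimizing this quantity drives matched audio/visual pairs toward high cosine similarity relative to mismatched pairs, which is one concrete way of "minimizing the distributional differences between audio and visual features" that the abstract describes.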