{"title":"Exploiting EfficientSAM and Temporal Coherence for Audio-Visual Segmentation","authors":"Yue Zhu;Kun Li;Zongxin Yang","doi":"10.1109/TMM.2025.3557637","DOIUrl":null,"url":null,"abstract":"Audio-Visual Segmentation (AVS) aims to accurately identify and segment sound sources within video content at the pixel level and requires a fine-grained semantic understanding of both visual and audio cues. While the Segment Anything Model (SAM) has demonstrated outstanding results across various segmentation tasks, its design is primarily focused on single-image segmentation with points, boxes, and mask prompts. As a result, when SAM is applied directly to AVS, it struggles to effectively leverage contextual information from audio data and capture temporal correlations across video frames. Additionally, its high computational requirements pose challenges to its practical applicability in AVS applications. In this paper, we introduce ESAM-AVS, a new framework built on EfficientSAM, aimed at transferring SAM's prior knowledge to the AVS domain. Specifically, we utilize the EfficientSAM as the backbone to maintain model adaptability while significantly lowering computational and processing costs. To tackle the challenges posed by temporal and audio-visual correlations, we designed the Inter-Frame Coherence module, which independently integrates the temporal information from both visual and audio modalities. Furthermore, we incorporate an audio-guided prompt encoder that generates audio prompts to provide guidance, effectively integrating audio cues into the segmentation process. By combining these components, our model maximizes the potential of SAM's prior knowledge, and adapts it to the more complex AVS task. Extensive experiments on the AVSBench dataset demonstrate that ESAM-AVS outperforms existing state-of-the-art methods.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2999-3008"},"PeriodicalIF":9.7000,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10960649/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
Audio-Visual Segmentation (AVS) aims to accurately identify and segment sound sources within video content at the pixel level, which requires a fine-grained semantic understanding of both visual and audio cues. While the Segment Anything Model (SAM) has demonstrated outstanding results across various segmentation tasks, its design primarily targets single-image segmentation with point, box, and mask prompts. As a result, when SAM is applied directly to AVS, it struggles to effectively leverage contextual information from audio data and to capture temporal correlations across video frames. Additionally, its high computational requirements limit its practical applicability to AVS. In this paper, we introduce ESAM-AVS, a new framework built on EfficientSAM that transfers SAM's prior knowledge to the AVS domain. Specifically, we utilize EfficientSAM as the backbone to maintain model adaptability while significantly lowering computational and processing costs. To tackle the challenges posed by temporal and audio-visual correlations, we design an Inter-Frame Coherence module, which independently integrates temporal information from the visual and audio modalities. Furthermore, we incorporate an audio-guided prompt encoder that generates audio prompts to guide segmentation, effectively integrating audio cues into the segmentation process. By combining these components, our model maximizes the potential of SAM's prior knowledge and adapts it to the more complex AVS task. Extensive experiments on the AVSBench dataset demonstrate that ESAM-AVS outperforms existing state-of-the-art methods.
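The abstract names two components built around the EfficientSAM backbone: an Inter-Frame Coherence module that fuses temporal context within each modality, and an audio-guided prompt encoder that turns audio embeddings into prompts for the mask decoder. As a rough illustration of how such components could be wired, here is a minimal PyTorch-style sketch; the class names, tensor shapes, attention-based temporal fusion, and prompt-token count are all assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class InterFrameCoherence(nn.Module):
    """Hypothetical sketch: fuse temporal context within one modality
    (visual or audio) across T frames via self-attention over time."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, D) -- one feature vector per frame
        ctx, _ = self.attn(feats, feats, feats)
        return self.norm(feats + ctx)  # residual temporal fusion

class AudioGuidedPromptEncoder(nn.Module):
    """Hypothetical sketch: project an audio embedding into a small set
    of prompt tokens consumable by a SAM-style mask decoder."""
    def __init__(self, audio_dim: int, prompt_dim: int, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        self.prompt_dim = prompt_dim
        self.proj = nn.Linear(audio_dim, prompt_dim * num_tokens)

    def forward(self, audio_emb: torch.Tensor) -> torch.Tensor:
        # audio_emb: (B*T, audio_dim) -> (B*T, num_tokens, prompt_dim)
        return self.proj(audio_emb).view(-1, self.num_tokens, self.prompt_dim)

# Toy usage: a 5-frame clip with 256-d visual and 128-d audio features
vis = torch.randn(2, 5, 256)                        # (B, T, D)
aud = torch.randn(2 * 5, 128)                       # (B*T, audio_dim)
vis_ctx = InterFrameCoherence(256)(vis)             # temporally fused visual features
prompts = AudioGuidedPromptEncoder(128, 256)(aud)   # (B*T, 4, 256) prompt tokens
```

In a full pipeline of this kind, the generated prompt tokens would stand in for SAM's usual point/box prompt embeddings at the mask decoder, while the coherence module would be applied to per-frame visual and audio features before audio-visual fusion.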
Journal Introduction:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.