{"title":"保存:分割视听使用分割任何模型的简单方法","authors":"Khanh-Binh Nguyen, Chae Jung Park","doi":"10.1016/j.cviu.2025.104460","DOIUrl":null,"url":null,"abstract":"<div><div>Audio-visual segmentation (AVS) primarily aims to accurately detect and pinpoint sound elements in visual contexts by predicting pixel-level segmentation masks. To address this task effectively, it is essential to thoroughly consider both the data and model aspects. This study introduces a streamlined approach, SAVE, which directly modifies the pretrained segment anything model (SAM) for the AVS task. By integrating an image encoder adapter within the transformer blocks for improved dataset-specific information capture and introducing a residual audio encoder adapter to encode audio features as a sparse prompt, our model achieves robust audio-visual fusion and interaction during encoding. Our method enhances the training and inference speeds by reducing the input resolution from 1024 to 256 pixels while still surpassing the previous state-of-the-art (SOTA) in performance. Extensive experiments validated our approach, indicating that our model significantly outperforms other SOTA methods. Additionally, utilizing the pretrained model on synthetic data enhances performance on real AVSBench data, attaining mean intersection over union (mIoU) of 84.59 on the S4 (V1S) subset and 70.28 on the MS3 (V1M) set with image inputs of 256 pixels. This performance increases to 86.16 mIoU on the S4 (V1S) and 70.83 mIoU on the MS3 (V1M) with 1024-pixel inputs. These findings show that simple adaptations of pretrained models can enhance AVS and support real-world applications.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"260 ","pages":"Article 104460"},"PeriodicalIF":3.5000,"publicationDate":"2025-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SAVE: Segment Audio-Visual Easy way using the Segment Anything Model\",\"authors\":\"Khanh-Binh Nguyen, Chae Jung Park\",\"doi\":\"10.1016/j.cviu.2025.104460\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Audio-visual segmentation (AVS) primarily aims to accurately detect and pinpoint sound elements in visual contexts by predicting pixel-level segmentation masks. To address this task effectively, it is essential to thoroughly consider both the data and model aspects. This study introduces a streamlined approach, SAVE, which directly modifies the pretrained segment anything model (SAM) for the AVS task. By integrating an image encoder adapter within the transformer blocks for improved dataset-specific information capture and introducing a residual audio encoder adapter to encode audio features as a sparse prompt, our model achieves robust audio-visual fusion and interaction during encoding. Our method enhances the training and inference speeds by reducing the input resolution from 1024 to 256 pixels while still surpassing the previous state-of-the-art (SOTA) in performance. Extensive experiments validated our approach, indicating that our model significantly outperforms other SOTA methods. Additionally, utilizing the pretrained model on synthetic data enhances performance on real AVSBench data, attaining mean intersection over union (mIoU) of 84.59 on the S4 (V1S) subset and 70.28 on the MS3 (V1M) set with image inputs of 256 pixels. This performance increases to 86.16 mIoU on the S4 (V1S) and 70.83 mIoU on the MS3 (V1M) with 1024-pixel inputs. 
These findings show that simple adaptations of pretrained models can enhance AVS and support real-world applications.</div></div>\",\"PeriodicalId\":50633,\"journal\":{\"name\":\"Computer Vision and Image Understanding\",\"volume\":\"260 \",\"pages\":\"Article 104460\"},\"PeriodicalIF\":3.5000,\"publicationDate\":\"2025-08-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Vision and Image Understanding\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1077314225001833\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1077314225001833","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
SAVE: Segment Audio-Visual Easy way using the Segment Anything Model
Audio-visual segmentation (AVS) primarily aims to accurately detect and pinpoint sound-emitting elements in visual contexts by predicting pixel-level segmentation masks. To address this task effectively, it is essential to thoroughly consider both the data and the model. This study introduces a streamlined approach, SAVE, which directly adapts the pretrained Segment Anything Model (SAM) to the AVS task. By integrating an image encoder adapter within the transformer blocks for improved dataset-specific information capture and introducing a residual audio encoder adapter that encodes audio features as a sparse prompt, our model achieves robust audio-visual fusion and interaction during encoding. Our method speeds up training and inference by reducing the input resolution from 1024 to 256 pixels while still surpassing the previous state-of-the-art (SOTA) in performance. Extensive experiments validated our approach, indicating that our model significantly outperforms other SOTA methods. Additionally, pretraining on synthetic data enhances performance on real AVSBench data, attaining a mean intersection over union (mIoU) of 84.59 on the S4 (V1S) subset and 70.28 on the MS3 (V1M) subset with 256-pixel image inputs. This improves to 86.16 mIoU on S4 (V1S) and 70.83 mIoU on MS3 (V1M) with 1024-pixel inputs. These findings show that simple adaptations of pretrained models can enhance AVS and support real-world applications.
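To make the adapter idea in the abstract concrete, below is a minimal, hypothetical sketch of the two components it describes: a residual bottleneck adapter inserted alongside frozen image-encoder transformer blocks, and an audio adapter that projects audio features into SAM-style sparse prompt tokens. All class names, dimensions, and the pooled audio embedding are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the SAVE adapter concept (assumed shapes and names, not the paper's code).
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Residual bottleneck adapter attached to a frozen transformer block's token features."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.act = nn.GELU()
        self.up = nn.Linear(dim // reduction, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen backbone path intact; only the adapter is trained.
        return x + self.up(self.act(self.down(x)))


class AudioPromptAdapter(nn.Module):
    """Maps a pooled audio feature to a few sparse prompt tokens for a SAM-style mask decoder."""
    def __init__(self, audio_dim: int, prompt_dim: int, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(audio_dim, prompt_dim * num_tokens)

    def forward(self, audio_feat: torch.Tensor) -> torch.Tensor:
        # audio_feat: (B, audio_dim) -> sparse prompts: (B, num_tokens, prompt_dim)
        b = audio_feat.shape[0]
        return self.proj(audio_feat).view(b, self.num_tokens, -1)


# Usage sketch: the adapters are the only trainable parts; the SAM backbone stays frozen.
image_tokens = torch.randn(2, 256, 768)                      # (B, N, C) tokens from a frozen ViT block
audio_feat = torch.randn(2, 128)                             # e.g. a pooled audio embedding (assumed size)
adapted_tokens = BottleneckAdapter(768)(image_tokens)        # dataset-specific refinement of visual tokens
sparse_prompts = AudioPromptAdapter(128, 256)(audio_feat)    # fed to the prompt/mask decoder as sparse prompts
```

This only illustrates the interface of the two adapters; how they are interleaved with SAM's encoder blocks and decoder is described in the paper itself.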
Journal description:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems