{"title":"保存:分割视听使用分割任何模型的简单方法","authors":"Khanh-Binh Nguyen, Chae Jung Park","doi":"10.1016/j.cviu.2025.104460","DOIUrl":null,"url":null,"abstract":"<div><div>Audio-visual segmentation (AVS) primarily aims to accurately detect and pinpoint sound elements in visual contexts by predicting pixel-level segmentation masks. To address this task effectively, it is essential to thoroughly consider both the data and model aspects. This study introduces a streamlined approach, SAVE, which directly modifies the pretrained segment anything model (SAM) for the AVS task. By integrating an image encoder adapter within the transformer blocks for improved dataset-specific information capture and introducing a residual audio encoder adapter to encode audio features as a sparse prompt, our model achieves robust audio-visual fusion and interaction during encoding. Our method enhances the training and inference speeds by reducing the input resolution from 1024 to 256 pixels while still surpassing the previous state-of-the-art (SOTA) in performance. Extensive experiments validated our approach, indicating that our model significantly outperforms other SOTA methods. Additionally, utilizing the pretrained model on synthetic data enhances performance on real AVSBench data, attaining mean intersection over union (mIoU) of 84.59 on the S4 (V1S) subset and 70.28 on the MS3 (V1M) set with image inputs of 256 pixels. This performance increases to 86.16 mIoU on the S4 (V1S) and 70.83 mIoU on the MS3 (V1M) with 1024-pixel inputs. These findings show that simple adaptations of pretrained models can enhance AVS and support real-world applications.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"260 ","pages":"Article 104460"},"PeriodicalIF":3.5000,"publicationDate":"2025-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"SAVE: Segment Audio-Visual Easy way using the Segment Anything Model\",\"authors\":\"Khanh-Binh Nguyen, Chae Jung Park\",\"doi\":\"10.1016/j.cviu.2025.104460\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Audio-visual segmentation (AVS) primarily aims to accurately detect and pinpoint sound elements in visual contexts by predicting pixel-level segmentation masks. To address this task effectively, it is essential to thoroughly consider both the data and model aspects. This study introduces a streamlined approach, SAVE, which directly modifies the pretrained segment anything model (SAM) for the AVS task. By integrating an image encoder adapter within the transformer blocks for improved dataset-specific information capture and introducing a residual audio encoder adapter to encode audio features as a sparse prompt, our model achieves robust audio-visual fusion and interaction during encoding. Our method enhances the training and inference speeds by reducing the input resolution from 1024 to 256 pixels while still surpassing the previous state-of-the-art (SOTA) in performance. Extensive experiments validated our approach, indicating that our model significantly outperforms other SOTA methods. Additionally, utilizing the pretrained model on synthetic data enhances performance on real AVSBench data, attaining mean intersection over union (mIoU) of 84.59 on the S4 (V1S) subset and 70.28 on the MS3 (V1M) set with image inputs of 256 pixels. This performance increases to 86.16 mIoU on the S4 (V1S) and 70.83 mIoU on the MS3 (V1M) with 1024-pixel inputs. 
These findings show that simple adaptations of pretrained models can enhance AVS and support real-world applications.</div></div>\",\"PeriodicalId\":50633,\"journal\":{\"name\":\"Computer Vision and Image Understanding\",\"volume\":\"260 \",\"pages\":\"Article 104460\"},\"PeriodicalIF\":3.5000,\"publicationDate\":\"2025-08-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Vision and Image Understanding\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1077314225001833\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1077314225001833","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
SAVE: Segment Audio-Visual Easy way using the Segment Anything Model
Audio-visual segmentation (AVS) primarily aims to accurately detect and pinpoint sound-emitting elements in visual contexts by predicting pixel-level segmentation masks. To address this task effectively, it is essential to thoroughly consider both the data and the model. This study introduces a streamlined approach, SAVE, which directly adapts the pretrained Segment Anything Model (SAM) to the AVS task. By integrating an image encoder adapter within the transformer blocks for improved dataset-specific information capture and introducing a residual audio encoder adapter that encodes audio features as a sparse prompt, our model achieves robust audio-visual fusion and interaction during encoding. Our method speeds up training and inference by reducing the input resolution from 1024 to 256 pixels while still surpassing the previous state-of-the-art (SOTA) in performance. Extensive experiments validated our approach, indicating that our model significantly outperforms other SOTA methods. Additionally, pretraining on synthetic data enhances performance on real AVSBench data, attaining a mean intersection over union (mIoU) of 84.59 on the S4 (V1S) subset and 70.28 on the MS3 (V1M) subset with 256-pixel image inputs. This improves to 86.16 mIoU on S4 (V1S) and 70.83 mIoU on MS3 (V1M) with 1024-pixel inputs. These findings show that simple adaptations of pretrained models can enhance AVS and support real-world applications.
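To make the adapter idea in the abstract concrete, below is a minimal, hypothetical sketch of the two components it describes: a residual bottleneck adapter inserted alongside frozen image-encoder transformer blocks, and an audio adapter that projects audio features into SAM-style sparse prompt tokens. All class names, dimensions, and the pooled audio embedding are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the SAVE adapter concept (assumed shapes and names, not the paper's code).
import torch
import torch.nn as nn


class BottleneckAdapter(nn.Module):
    """Residual bottleneck adapter attached to a frozen transformer block's token features."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.act = nn.GELU()
        self.up = nn.Linear(dim // reduction, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen backbone path intact; only the adapter is trained.
        return x + self.up(self.act(self.down(x)))


class AudioPromptAdapter(nn.Module):
    """Maps a pooled audio feature to a few sparse prompt tokens for a SAM-style mask decoder."""
    def __init__(self, audio_dim: int, prompt_dim: int, num_tokens: int = 4):
        super().__init__()
        self.num_tokens = num_tokens
        self.proj = nn.Linear(audio_dim, prompt_dim * num_tokens)

    def forward(self, audio_feat: torch.Tensor) -> torch.Tensor:
        # audio_feat: (B, audio_dim) -> sparse prompts: (B, num_tokens, prompt_dim)
        b = audio_feat.shape[0]
        return self.proj(audio_feat).view(b, self.num_tokens, -1)


# Usage sketch: the adapters are the only trainable parts; the SAM backbone stays frozen.
image_tokens = torch.randn(2, 256, 768)                      # (B, N, C) tokens from a frozen ViT block
audio_feat = torch.randn(2, 128)                             # e.g. a pooled audio embedding (assumed size)
adapted_tokens = BottleneckAdapter(768)(image_tokens)        # dataset-specific refinement of visual tokens
sparse_prompts = AudioPromptAdapter(128, 256)(audio_feat)    # fed to the prompt/mask decoder as sparse prompts
```

This only illustrates the interface of the two adapters; how they are interleaved with SAM's encoder blocks and decoder is described in the paper itself.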
Journal description:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems