SAVE: Segment Audio-Visual Easy way using the Segment Anything Model

Impact Factor 3.5 · JCR Q2, Computer Science, Artificial Intelligence · CAS Tier 3 (Computer Science)
Khanh-Binh Nguyen, Chae Jung Park
DOI: 10.1016/j.cviu.2025.104460
Journal: Computer Vision and Image Understanding, Volume 260, Article 104460
Published: 2025-08-14 (Journal Article)
Citations: 0

Abstract


Audio-visual segmentation (AVS) primarily aims to accurately detect and pinpoint sound elements in visual contexts by predicting pixel-level segmentation masks. To address this task effectively, it is essential to thoroughly consider both the data and model aspects. This study introduces a streamlined approach, SAVE, which directly modifies the pretrained segment anything model (SAM) for the AVS task. By integrating an image encoder adapter within the transformer blocks for improved dataset-specific information capture and introducing a residual audio encoder adapter to encode audio features as a sparse prompt, our model achieves robust audio-visual fusion and interaction during encoding. Our method enhances the training and inference speeds by reducing the input resolution from 1024 to 256 pixels while still surpassing the previous state-of-the-art (SOTA) in performance. Extensive experiments validated our approach, indicating that our model significantly outperforms other SOTA methods. Additionally, utilizing the pretrained model on synthetic data enhances performance on real AVSBench data, attaining mean intersection over union (mIoU) of 84.59 on the S4 (V1S) subset and 70.28 on the MS3 (V1M) set with image inputs of 256 pixels. This performance increases to 86.16 mIoU on the S4 (V1S) and 70.83 mIoU on the MS3 (V1M) with 1024-pixel inputs. These findings show that simple adaptations of pretrained models can enhance AVS and support real-world applications.
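The adapter design described in the abstract (a residual adapter inside the image encoder's transformer blocks, plus an audio adapter that encodes audio features as a sparse prompt for the mask decoder) can be sketched as follows. This is an illustrative NumPy sketch, not the authors' implementation: the bottleneck-adapter form, all dimensions, and all names (`bottleneck_adapter`, the 128-dim audio embedding, the 16×16 patch grid) are assumptions for demonstration only.

```python
import numpy as np

def bottleneck_adapter(x, w_down, w_up):
    """Residual bottleneck adapter: project down, nonlinearity, project up, add skip.

    x:       (tokens, d) features from a frozen transformer block
    w_down:  (d, r) down-projection to a small rank r
    w_up:    (r, d) up-projection back to the model width d
    """
    h = x @ w_down
    h = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))  # tanh-approx GELU
    return x + h @ w_up  # residual connection keeps the pretrained features intact

rng = np.random.default_rng(0)
d, r, n_tokens = 256, 32, 16 * 16  # model width, bottleneck rank, 16x16 patch grid (assumed)

# Image branch: adapt the output of one frozen transformer block
img_feats = rng.standard_normal((n_tokens, d))
img_adapted = bottleneck_adapter(
    img_feats,
    rng.standard_normal((d, r)) * 0.02,
    rng.standard_normal((r, d)) * 0.02,
)

# Audio branch: map an audio embedding into one sparse prompt token for the mask decoder
audio_emb = rng.standard_normal((1, 128))        # e.g. one per-frame audio embedding (assumed dim)
w_audio = rng.standard_normal((128, d)) * 0.02
sparse_prompt = audio_emb @ w_audio              # (1, d): prompt token in the model's width

assert img_adapted.shape == (n_tokens, d)
assert sparse_prompt.shape == (1, d)
```

In this parameter-efficient setup only the small adapter matrices and the audio projection would be trained, while the pretrained SAM weights stay frozen; that is consistent with the abstract's claim of directly modifying a pretrained model rather than retraining it.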
Source Journal
Computer Vision and Image Understanding (Engineering: Electrical & Electronic)
CiteScore: 7.80
Self-citation rate: 4.40%
Articles per year: 112
Review time: 79 days
Aims and scope: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis, from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views. Research areas include: theory; early vision; data structures and representations; shape; range; motion; matching and recognition; architecture and languages; vision systems.