Listen To The Pixels

2021 IEEE International Conference on Image Processing (ICIP). Pub Date: 2021-09-19. DOI: 10.1109/ICIP42928.2021.9506019

S. Chowdhury, Subhrajyoti Dasgupta, Sudip Das, U. Bhattacharya


Citations: 3

Abstract


Performing sound source separation and visual object segmentation jointly in naturally occurring videos is a notoriously difficult task, especially in the absence of annotated data. In this study, we leverage the concurrency between the audio and visual modalities in an attempt to solve the joint audio-visual segmentation problem in a self-supervised manner. Human beings interact with the physical world through several sensory systems, such as vision, hearing, and movement. The usefulness of the interplay among such systems lies in the concept of degeneracy [1], which tells us that cross-modal signals can educate each other without an external supervisor. In this work, we efficiently exploit the fact that learning from one modality inherently helps to find patterns in the others by introducing a novel audio-visual fusion technique. Also, to the best of our knowledge, we are the first to address the task of segmenting partially occluded sound sources. Our study shows that the proposed model significantly outperforms existing state-of-the-art methods in both visual and audio source separation tasks.

