Efficient Semisupervised Object Segmentation for Long-Term Videos Using Adaptive Memory Network

Shan Zhong; Guoqiang Li; Wenhao Ying; Fuzhou Zhao; Gengsheng Xie; Shengrong Gong

IEEE Transactions on Cognitive and Developmental Systems, vol. 16, no. 5, pp. 1789-1802. Published 2024-04-08. DOI: 10.1109/TCDS.2024.3385849. Available at: https://ieeexplore.ieee.org/document/10494676/
Abstract
Video object segmentation (VOS) uses the first annotated mask of a video to achieve consistent and precise segmentation of the target in subsequent frames. Recently, memory-based methods have received significant attention owing to their substantial performance gains. However, these approaches rely on a fixed global memory strategy, which limits both segmentation accuracy and speed on longer videos. To alleviate this limitation, we propose a novel semisupervised VOS model founded on an adaptive memory network. The proposed model adaptively extracts object features by focusing on the object area while effectively filtering out extraneous background noise. An identification mechanism is also applied to distinguish each object in multiobject scenarios. To further reduce storage consumption without sacrificing the saliency of object information, outdated features residing in the memory pool are compressed into salient features using a self-attention mechanism. Furthermore, we introduce a local matching module, specifically devised to refine object features by fusing contextual information from historical frames. Experiments demonstrate the efficiency of our approach, which substantially improves both the speed and precision of segmentation for long-term videos while maintaining comparable performance on short videos.
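The memory-compression idea in the abstract can be pictured with a short sketch. The following PyTorch snippet is a minimal illustration, not the authors' implementation: the learnable query bank, feature dimensions, and slot count are all assumptions. It shows how a small set of salient queries could summarize a large pool of outdated frame features through attention, bounding memory growth for long videos.

```python
import torch
import torch.nn.functional as F

def compress_memory(outdated: torch.Tensor, salient_queries: torch.Tensor) -> torch.Tensor:
    """Summarize T outdated memory features (T, C) into K salient ones (K, C), K << T."""
    scale = outdated.shape[-1] ** 0.5
    # Each salient query attends over every outdated feature vector.
    attn = F.softmax(salient_queries @ outdated.t() / scale, dim=-1)  # (K, T)
    # Weighted sum of outdated features yields the compressed representation.
    return attn @ outdated  # (K, C)

# Hypothetical usage: 500 stored frame features collapsed into 16 salient slots.
mem = torch.randn(500, 256)      # outdated features in the memory pool (assumed shape)
queries = torch.randn(16, 256)   # salient queries; learnable parameters in practice (assumed)
compressed = compress_memory(mem, queries)
print(compressed.shape)          # torch.Size([16, 256])
```

Under this reading, the memory pool's footprint stays constant at K slots regardless of video length, which is one plausible way a fixed-size compressed memory would keep segmentation speed from degrading on long videos.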
About the Journal:
The IEEE Transactions on Cognitive and Developmental Systems (TCDS) focuses on advances in the study of development and cognition in natural (humans, animals) and artificial (robots, agents) systems. It welcomes contributions from multiple related disciplines including cognitive systems, cognitive robotics, developmental and epigenetic robotics, autonomous and evolutionary robotics, social structures, multi-agent and artificial life systems, computational neuroscience, and developmental psychology. Articles on theoretical, computational, application-oriented, and experimental studies as well as reviews in these areas are considered.