基于语义交替增强和双向聚合的参考视频对象分割

IF 9.7 1区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia Pub Date : 2025-04-03 DOI:10.1109/TMM.2025.3557689

Jiaxing Yang;Lihe Zhang;Huchuan Lu

{"title":"基于语义交替增强和双向聚合的参考视频对象分割","authors":"Jiaxing Yang;Lihe Zhang;Huchuan Lu","doi":"10.1109/TMM.2025.3557689","DOIUrl":null,"url":null,"abstract":"Referring Video Object Segmentation (RVOS) aims at segmenting out the described object in a video clip according to given expression. The task requires methods to effectively fuse cross-modality features, communicate temporal information, and delineate referent appearance. However, existing solutions bias their focus to mainly mining one or two clues, causing their performance inferior. In this paper, we propose Semantics Alternating Enhancement (SAE) to achieve cross-modality fusion and temporal-spatial semantics mining in an alternate way that makes comprehensive exploit of three cues possible. During each update, SAE will generate a cross-modality and temporal-aware vector that guides vision feature to amplify its referent semantics while filtering out irrelevant contents. In return, the purified feature will provide the contextual soil to produce a more refined guider. Overall, cross-modality interaction and temporal communication are together interleaved into axial semantics enhancement steps. Moreover, we design a simplified SAE by dropping spatial semantics enhancement steps, and employ the variant in the early stages of vision encoder to further enhance usability. To integrate features of different scales, we propose Bidirectional Semantic Aggregation decoder (BSA) to obtain referent mask. The BSA arranges the comprehensively-enhanced features into two groups, and then employs difference awareness step to achieve intra-group feature aggregation bidirectionally and consistency constraint step to realize inter-group integration of semantics-dense and appearance-rich features. Extensive results on challenging benchmarks show that our method performs favorably against the state-of-the-art competitors.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2987-2998"},"PeriodicalIF":9.7000,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Semantics Alternating Enhancement and Bidirectional Aggregation for Referring Video Object Segmentation\",\"authors\":\"Jiaxing Yang;Lihe Zhang;Huchuan Lu\",\"doi\":\"10.1109/TMM.2025.3557689\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Referring Video Object Segmentation (RVOS) aims at segmenting out the described object in a video clip according to given expression. The task requires methods to effectively fuse cross-modality features, communicate temporal information, and delineate referent appearance. However, existing solutions bias their focus to mainly mining one or two clues, causing their performance inferior. In this paper, we propose Semantics Alternating Enhancement (SAE) to achieve cross-modality fusion and temporal-spatial semantics mining in an alternate way that makes comprehensive exploit of three cues possible. During each update, SAE will generate a cross-modality and temporal-aware vector that guides vision feature to amplify its referent semantics while filtering out irrelevant contents. In return, the purified feature will provide the contextual soil to produce a more refined guider. Overall, cross-modality interaction and temporal communication are together interleaved into axial semantics enhancement steps. Moreover, we design a simplified SAE by dropping spatial semantics enhancement steps, and employ the variant in the early stages of vision encoder to further enhance usability. To integrate features of different scales, we propose Bidirectional Semantic Aggregation decoder (BSA) to obtain referent mask. The BSA arranges the comprehensively-enhanced features into two groups, and then employs difference awareness step to achieve intra-group feature aggregation bidirectionally and consistency constraint step to realize inter-group integration of semantics-dense and appearance-rich features. Extensive results on challenging benchmarks show that our method performs favorably against the state-of-the-art competitors.\",\"PeriodicalId\":13273,\"journal\":{\"name\":\"IEEE Transactions on Multimedia\",\"volume\":\"27 \",\"pages\":\"2987-2998\"},\"PeriodicalIF\":9.7000,\"publicationDate\":\"2025-04-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Multimedia\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10948307/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10948307/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

参考视频对象分割（RVOS）的目的是根据给定的表达式将视频片段中描述的对象分割出来。该任务需要有效地融合跨模态特征、交流时间信息和描述参考外观的方法。然而，现有的解决方案将重点放在主要挖掘一两个线索上，导致其性能较差。在本文中，我们提出了语义交替增强（SAE），以一种替代的方式实现跨模态融合和时空语义挖掘，从而使综合利用三个线索成为可能。在每次更新过程中，SAE将生成一个跨模态和时间感知向量，引导视觉特征放大其引用语义，同时过滤掉无关内容。作为回报，纯化的特征将提供上下文土壤，以产生更精细的向导。总的来说，跨模态交互和时间通信一起交织在轴向语义增强步骤中。此外，我们设计了一个简化的SAE，通过减少空间语义增强步骤，并在视觉编码器的早期阶段使用该变体，以进一步提高可用性。为了整合不同尺度的特征，我们提出了双向语义聚合解码器（BSA）来获得参考掩码。BSA将综合增强的特征分成两组，然后采用差分感知步骤双向实现组内特征聚合，采用一致性约束步骤实现语义密集和外观丰富的特征的组间集成。在具有挑战性的基准测试上的广泛结果表明，我们的方法比最先进的竞争对手表现得更好。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Semantics Alternating Enhancement and Bidirectional Aggregation for Referring Video Object Segmentation

Referring Video Object Segmentation (RVOS) aims at segmenting out the described object in a video clip according to given expression. The task requires methods to effectively fuse cross-modality features, communicate temporal information, and delineate referent appearance. However, existing solutions bias their focus to mainly mining one or two clues, causing their performance inferior. In this paper, we propose Semantics Alternating Enhancement (SAE) to achieve cross-modality fusion and temporal-spatial semantics mining in an alternate way that makes comprehensive exploit of three cues possible. During each update, SAE will generate a cross-modality and temporal-aware vector that guides vision feature to amplify its referent semantics while filtering out irrelevant contents. In return, the purified feature will provide the contextual soil to produce a more refined guider. Overall, cross-modality interaction and temporal communication are together interleaved into axial semantics enhancement steps. Moreover, we design a simplified SAE by dropping spatial semantics enhancement steps, and employ the variant in the early stages of vision encoder to further enhance usability. To integrate features of different scales, we propose Bidirectional Semantic Aggregation decoder (BSA) to obtain referent mask. The BSA arranges the comprehensively-enhanced features into two groups, and then employs difference awareness step to achieve intra-group feature aggregation bidirectionally and consistency constraint step to realize inter-group integration of semantics-dense and appearance-rich features. Extensive results on challenging benchmarks show that our method performs favorably against the state-of-the-art competitors.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE Transactions on Multimedia 工程技术-电信学

CiteScore

11.70

自引率

11.00%

发文量

576

审稿时长

5.5 months

期刊介绍： The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.