{"title":"基于语义交替增强和双向聚合的参考视频对象分割","authors":"Jiaxing Yang;Lihe Zhang;Huchuan Lu","doi":"10.1109/TMM.2025.3557689","DOIUrl":null,"url":null,"abstract":"Referring Video Object Segmentation (RVOS) aims at segmenting out the described object in a video clip according to given expression. The task requires methods to effectively fuse cross-modality features, communicate temporal information, and delineate referent appearance. However, existing solutions bias their focus to mainly mining one or two clues, causing their performance inferior. In this paper, we propose Semantics Alternating Enhancement (SAE) to achieve cross-modality fusion and temporal-spatial semantics mining in an alternate way that makes comprehensive exploit of three cues possible. During each update, SAE will generate a cross-modality and temporal-aware vector that guides vision feature to amplify its referent semantics while filtering out irrelevant contents. In return, the purified feature will provide the contextual soil to produce a more refined guider. Overall, cross-modality interaction and temporal communication are together interleaved into axial semantics enhancement steps. Moreover, we design a simplified SAE by dropping spatial semantics enhancement steps, and employ the variant in the early stages of vision encoder to further enhance usability. To integrate features of different scales, we propose Bidirectional Semantic Aggregation decoder (BSA) to obtain referent mask. The BSA arranges the comprehensively-enhanced features into two groups, and then employs difference awareness step to achieve intra-group feature aggregation bidirectionally and consistency constraint step to realize inter-group integration of semantics-dense and appearance-rich features. Extensive results on challenging benchmarks show that our method performs favorably against the state-of-the-art competitors.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2987-2998"},"PeriodicalIF":9.7000,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Semantics Alternating Enhancement and Bidirectional Aggregation for Referring Video Object Segmentation\",\"authors\":\"Jiaxing Yang;Lihe Zhang;Huchuan Lu\",\"doi\":\"10.1109/TMM.2025.3557689\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Referring Video Object Segmentation (RVOS) aims at segmenting out the described object in a video clip according to given expression. The task requires methods to effectively fuse cross-modality features, communicate temporal information, and delineate referent appearance. However, existing solutions bias their focus to mainly mining one or two clues, causing their performance inferior. In this paper, we propose Semantics Alternating Enhancement (SAE) to achieve cross-modality fusion and temporal-spatial semantics mining in an alternate way that makes comprehensive exploit of three cues possible. During each update, SAE will generate a cross-modality and temporal-aware vector that guides vision feature to amplify its referent semantics while filtering out irrelevant contents. In return, the purified feature will provide the contextual soil to produce a more refined guider. Overall, cross-modality interaction and temporal communication are together interleaved into axial semantics enhancement steps. Moreover, we design a simplified SAE by dropping spatial semantics enhancement steps, and employ the variant in the early stages of vision encoder to further enhance usability. To integrate features of different scales, we propose Bidirectional Semantic Aggregation decoder (BSA) to obtain referent mask. The BSA arranges the comprehensively-enhanced features into two groups, and then employs difference awareness step to achieve intra-group feature aggregation bidirectionally and consistency constraint step to realize inter-group integration of semantics-dense and appearance-rich features. Extensive results on challenging benchmarks show that our method performs favorably against the state-of-the-art competitors.\",\"PeriodicalId\":13273,\"journal\":{\"name\":\"IEEE Transactions on Multimedia\",\"volume\":\"27 \",\"pages\":\"2987-2998\"},\"PeriodicalIF\":9.7000,\"publicationDate\":\"2025-04-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Multimedia\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10948307/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10948307/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Semantics Alternating Enhancement and Bidirectional Aggregation for Referring Video Object Segmentation
Referring Video Object Segmentation (RVOS) aims at segmenting out the described object in a video clip according to given expression. The task requires methods to effectively fuse cross-modality features, communicate temporal information, and delineate referent appearance. However, existing solutions bias their focus to mainly mining one or two clues, causing their performance inferior. In this paper, we propose Semantics Alternating Enhancement (SAE) to achieve cross-modality fusion and temporal-spatial semantics mining in an alternate way that makes comprehensive exploit of three cues possible. During each update, SAE will generate a cross-modality and temporal-aware vector that guides vision feature to amplify its referent semantics while filtering out irrelevant contents. In return, the purified feature will provide the contextual soil to produce a more refined guider. Overall, cross-modality interaction and temporal communication are together interleaved into axial semantics enhancement steps. Moreover, we design a simplified SAE by dropping spatial semantics enhancement steps, and employ the variant in the early stages of vision encoder to further enhance usability. To integrate features of different scales, we propose Bidirectional Semantic Aggregation decoder (BSA) to obtain referent mask. The BSA arranges the comprehensively-enhanced features into two groups, and then employs difference awareness step to achieve intra-group feature aggregation bidirectionally and consistency constraint step to realize inter-group integration of semantics-dense and appearance-rich features. Extensive results on challenging benchmarks show that our method performs favorably against the state-of-the-art competitors.
期刊介绍:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.