Frequency-aware fusion for improved video object segmentation

IF 6.5 · CAS Zone 2 (Computer Science) · JCR Q1 (Computer Science, Artificial Intelligence)

Neurocomputing · Pub Date: 2025-09-16 · DOI: 10.1016/j.neucom.2025.131585

Zhiqiang Hou, Hao Cui, Chenxu Wang, Sugang Ma, Xiaobao Yang, Lei Pu


Citations: 0

Abstract



Currently, most mainstream memory-based semi-supervised video object segmentation (VOS) methods rely on pixel-level matching to identify target objects. However, the majority of these approaches depend solely on spatial-domain features for representation, which limits their ability to preserve fine-grained details. In addition, they typically adopt a single bottom-up matching strategy, which lacks sufficient global semantic guidance, ultimately leading to suboptimal segmentation performance. To address these issues, we propose a Frequency-Aware Fusion for Improved Video Object Segmentation algorithm (FAFVOS), which incorporates frequency-domain information enhancement and a bidirectional matching mechanism to improve segmentation accuracy. First, we design a Hierarchical Frequency-Aware Encoder (HFAE), which enhances shallow features by leveraging high-frequency components to preserve edge and texture details, and strengthens deep features via low-frequency components to maintain global structural consistency, thereby achieving multi-scale frequency–spatial feature fusion. Second, a frequency-guided bidirectional matching Transformer module is proposed to establish pixel-level and object-level dual-path interactions. By incorporating a cross-attention mechanism, the model effectively facilitates joint reasoning between local pixel-wise details and global object-level semantics. Finally, a high-order moment refinement module is introduced to integrate high-order statistical features, enhancing the model’s ability to capture object deformation and leading to high-quality segmentation results. The proposed method is evaluated on the DAVIS, YouTube-VOS, and MOSE datasets. Experimental results demonstrate that, without relying on complex pretraining strategies or additional datasets, our approach achieves a real-time inference speed of 56 FPS with a J&F score of 88.5% on the DAVIS 2017 benchmark, surpassing existing representative methods. Moreover, it also achieves consistently superior performance on the more challenging YouTube-VOS and MOSE datasets, further validating the generalization ability and robustness of the proposed approach.
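The abstract does not say how the HFAE separates high- and low-frequency components, so the following is only an illustrative sketch of one common way to do it: a low-pass disc applied to the centered 2D Fourier spectrum, with the high-frequency part taken as the residual. The function name `split_frequency_bands` and the parameter `cutoff_ratio` are assumptions made for this example and do not come from the paper.

```python
import numpy as np


def split_frequency_bands(feature, cutoff_ratio=0.25):
    """Split a 2D feature map into low- and high-frequency parts.

    `cutoff_ratio` is a hypothetical parameter: the radius of the low-pass
    disc, expressed as a fraction of the smaller spatial dimension.
    """
    h, w = feature.shape
    # Centered spectrum: the zero-frequency term sits at (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(feature))
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = dist <= cutoff_ratio * min(h, w)
    # Keep only frequencies inside the disc, then return to the spatial domain.
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    # The residual carries edges and texture; low + high equals the input.
    high = feature - low
    return low, high
```

Under the description in the abstract, a component like `high` would be used to enhance shallow features (edges, texture) and a component like `low` to strengthen deep features (global structure). How the paper actually fuses them with the spatial features is not stated here.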


Source journal

Neurocomputing (Engineering & Technology, Computer Science: Artificial Intelligence)

CiteScore: 13.10
Self-citation rate: 10.00%
Articles published: 1382
Review time: 70 days

Journal description: Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.