Jie Wang;Xiangji Kong;Nana Yu;Zihao Zhang;Yahong Han
{"title":"半监督双峰显著目标检测的显式解纠缠和独占融合","authors":"Jie Wang;Xiangji Kong;Nana Yu;Zihao Zhang;Yahong Han","doi":"10.1109/TCSVT.2024.3514897","DOIUrl":null,"url":null,"abstract":"Bi-modal (RGB-T and RGB-D) salient object detection (SOD) aims to enhance detection performance by leveraging the complementary information between modalities. While significant progress has been made, two major limitations persist. Firstly, mainstream fully supervised methods come with a substantial burden of manual annotation, while weakly supervised or unsupervised methods struggle to achieve satisfactory performance. Secondly, the indiscriminate modeling of local detailed information (object edge) and global contextual information (object body) often results in predicted objects with incomplete edges or inconsistent internal representations. In this work, we propose a novel paradigm to effectively alleviate the above limitations. Specifically, we first enhance the consistency regularization strategy to build a basic semi-supervised architecture for the bi-modal SOD task, which ensures that the model can benefit from massive unlabeled samples while effectively alleviating the annotation burden. Secondly, to ensure detection performance (i.e., complete edges and consistent bodies), we disentangle the SOD task into two parallel sub-tasks: edge integrity fusion prediction and body consistency fusion prediction. Achieving these tasks involves two key steps: 1) the explicitly disentangling scheme decouples salient object features into edge and body features, and 2) the exclusively fusing scheme performs exclusive integrity or consistency fusion for each of them. Eventually, our approach demonstrates significant competitiveness compared to 26 fully supervised methods, while effectively alleviating 90% of the annotation burden. Furthermore, it holds a substantial advantage over 15 non-fully supervised methods.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 5","pages":"4479-4492"},"PeriodicalIF":8.3000,"publicationDate":"2024-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Explicitly Disentangling and Exclusively Fusing for Semi-Supervised Bi-Modal Salient Object Detection\",\"authors\":\"Jie Wang;Xiangji Kong;Nana Yu;Zihao Zhang;Yahong Han\",\"doi\":\"10.1109/TCSVT.2024.3514897\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Bi-modal (RGB-T and RGB-D) salient object detection (SOD) aims to enhance detection performance by leveraging the complementary information between modalities. While significant progress has been made, two major limitations persist. Firstly, mainstream fully supervised methods come with a substantial burden of manual annotation, while weakly supervised or unsupervised methods struggle to achieve satisfactory performance. Secondly, the indiscriminate modeling of local detailed information (object edge) and global contextual information (object body) often results in predicted objects with incomplete edges or inconsistent internal representations. In this work, we propose a novel paradigm to effectively alleviate the above limitations. Specifically, we first enhance the consistency regularization strategy to build a basic semi-supervised architecture for the bi-modal SOD task, which ensures that the model can benefit from massive unlabeled samples while effectively alleviating the annotation burden. 
Secondly, to ensure detection performance (i.e., complete edges and consistent bodies), we disentangle the SOD task into two parallel sub-tasks: edge integrity fusion prediction and body consistency fusion prediction. Achieving these tasks involves two key steps: 1) the explicitly disentangling scheme decouples salient object features into edge and body features, and 2) the exclusively fusing scheme performs exclusive integrity or consistency fusion for each of them. Eventually, our approach demonstrates significant competitiveness compared to 26 fully supervised methods, while effectively alleviating 90% of the annotation burden. Furthermore, it holds a substantial advantage over 15 non-fully supervised methods.\",\"PeriodicalId\":13082,\"journal\":{\"name\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"volume\":\"35 5\",\"pages\":\"4479-4492\"},\"PeriodicalIF\":8.3000,\"publicationDate\":\"2024-12-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Circuits and Systems for Video Technology\",\"FirstCategoryId\":\"5\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10788520/\",\"RegionNum\":1,\"RegionCategory\":\"工程技术\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10788520/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Explicitly Disentangling and Exclusively Fusing for Semi-Supervised Bi-Modal Salient Object Detection
Bi-modal (RGB-T and RGB-D) salient object detection (SOD) aims to enhance detection performance by leveraging the complementary information between modalities. While significant progress has been made, two major limitations persist. Firstly, mainstream fully supervised methods come with a substantial burden of manual annotation, while weakly supervised or unsupervised methods struggle to achieve satisfactory performance. Secondly, the indiscriminate modeling of local detail information (object edges) and global contextual information (object bodies) often results in predicted objects with incomplete edges or inconsistent internal representations. In this work, we propose a novel paradigm to effectively alleviate the above limitations. Specifically, we first enhance the consistency regularization strategy to build a basic semi-supervised architecture for the bi-modal SOD task, which ensures that the model can benefit from massive unlabeled samples while effectively alleviating the annotation burden. Secondly, to ensure detection performance (i.e., complete edges and consistent bodies), we disentangle the SOD task into two parallel sub-tasks: edge integrity fusion prediction and body consistency fusion prediction. Achieving these tasks involves two key steps: 1) the explicit disentangling scheme decouples salient object features into edge and body features, and 2) the exclusive fusing scheme performs dedicated integrity or consistency fusion for each of them. Ultimately, our approach demonstrates significant competitiveness compared to 26 fully supervised methods, while effectively alleviating 90% of the annotation burden. Furthermore, it holds a substantial advantage over 15 non-fully supervised methods.
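The abstract outlines two mechanisms: a consistency-regularization strategy that lets the model learn from unlabeled bi-modal pairs, and a disentangling of the prediction into an edge branch and a body branch that are fused separately. The PyTorch-style sketch below only illustrates how such a semi-supervised step with disentangled edge/body heads could be organized; the module structure, the perturbation, the loss weighting, and every name (BiModalDisentangledSOD, semi_supervised_step, etc.) are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the paper's code): consistency-regularized
# semi-supervised training for bi-modal SOD with edge/body disentangling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiModalDisentangledSOD(nn.Module):
    """Toy bi-modal network: two encoders, a fusion layer, and separate
    edge and body prediction heads (stand-ins for the paper's branches)."""
    def __init__(self, ch=16):
        super().__init__()
        self.enc_rgb = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())
        self.enc_aux = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1), nn.ReLU())  # thermal or depth map
        self.fuse = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.edge_head = nn.Conv2d(ch, 1, 3, padding=1)   # edge-integrity branch
        self.body_head = nn.Conv2d(ch, 1, 3, padding=1)   # body-consistency branch

    def forward(self, rgb, aux):
        f = F.relu(self.fuse(torch.cat([self.enc_rgb(rgb), self.enc_aux(aux)], dim=1)))
        edge, body = self.edge_head(f), self.body_head(f)
        # Simple average of the two branches as the fused saliency map.
        return edge, body, 0.5 * torch.sigmoid(edge) + 0.5 * torch.sigmoid(body)

def semi_supervised_step(model, labeled, unlabeled, optimizer, lambda_u=1.0):
    """One step: supervised loss on labeled pairs plus a consistency loss
    between predictions on original and perturbed unlabeled pairs."""
    (rgb_l, aux_l, gt_edge, gt_body), (rgb_u, aux_u) = labeled, unlabeled

    # Supervised branch: edge and body maps each trained on their own target.
    edge_l, body_l, _ = model(rgb_l, aux_l)
    loss_sup = F.binary_cross_entropy_with_logits(edge_l, gt_edge) + \
               F.binary_cross_entropy_with_logits(body_l, gt_body)

    # Consistency regularization: the prediction on the clean view serves as
    # a pseudo target for a perturbed view of the same unlabeled sample.
    with torch.no_grad():
        _, _, pseudo = model(rgb_u, aux_u)
    rgb_perturbed = rgb_u + 0.1 * torch.randn_like(rgb_u)  # stand-in perturbation
    _, _, pred_perturbed = model(rgb_perturbed, aux_u)
    loss_cons = F.mse_loss(pred_perturbed, pseudo)

    loss = loss_sup + lambda_u * loss_cons
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the paper, the edge and body branches are described as being fused through dedicated integrity and consistency fusion schemes; the sketch collapses each branch into a single convolutional head purely to keep the example short and self-contained.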
Journal introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.