Weide Liu , Jieming Lou , Xingxing Wang , Wei Zhou , Jun Cheng , Xulei Yang
{"title":"Physically-guided open vocabulary segmentation with weighted patched alignment loss","authors":"Weide Liu , Jieming Lou , Xingxing Wang , Wei Zhou , Jun Cheng , Xulei Yang","doi":"10.1016/j.neucom.2024.128788","DOIUrl":null,"url":null,"abstract":"<div><div>Open vocabulary segmentation is a challenging task that aims to segment out the thousands of unseen categories. Directly applying CLIP to open-vocabulary semantic segmentation is challenging due to the granularity gap between its image-level contrastive learning and the pixel-level recognition required for segmentation. To address these challenges, we propose a unified pipeline that leverages physical structure regularization to enhance the generalizability and robustness of open vocabulary segmentation. By incorporating physical structure information, which is independent of the training data, we aim to reduce bias and improve the model’s performance on unseen classes. We utilize low-level structures such as edges and keypoints as regularization terms, as they are easier to obtain and strongly correlated with segmentation boundary information. These structures are used as pseudo-ground truth to supervise the model. Furthermore, inspired by the effectiveness of comparative learning in human cognition, we introduce the weighted patched alignment loss. This loss function contrasts similar and dissimilar samples to acquire low-dimensional representations that capture the distinctions between different object classes. By incorporating physical knowledge and leveraging weighted patched alignment loss, we aim to improve the model’s generalizability, robustness, and capability to recognize diverse object classes. The experiments on the COCO Stuff, Pascal VOC, Pascal Context-59, Pascal Context-459, ADE20K-150, and ADE20K-847 datasets demonstrate that our proposed method consistently improves baselines and achieves new state-of-the-art in the open vocabulary segmentation task.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"614 ","pages":"Article 128788"},"PeriodicalIF":5.5000,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neurocomputing","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0925231224015595","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Open vocabulary segmentation is a challenging task that aims to segment out the thousands of unseen categories. Directly applying CLIP to open-vocabulary semantic segmentation is challenging due to the granularity gap between its image-level contrastive learning and the pixel-level recognition required for segmentation. To address these challenges, we propose a unified pipeline that leverages physical structure regularization to enhance the generalizability and robustness of open vocabulary segmentation. By incorporating physical structure information, which is independent of the training data, we aim to reduce bias and improve the model’s performance on unseen classes. We utilize low-level structures such as edges and keypoints as regularization terms, as they are easier to obtain and strongly correlated with segmentation boundary information. These structures are used as pseudo-ground truth to supervise the model. Furthermore, inspired by the effectiveness of comparative learning in human cognition, we introduce the weighted patched alignment loss. This loss function contrasts similar and dissimilar samples to acquire low-dimensional representations that capture the distinctions between different object classes. By incorporating physical knowledge and leveraging weighted patched alignment loss, we aim to improve the model’s generalizability, robustness, and capability to recognize diverse object classes. The experiments on the COCO Stuff, Pascal VOC, Pascal Context-59, Pascal Context-459, ADE20K-150, and ADE20K-847 datasets demonstrate that our proposed method consistently improves baselines and achieves new state-of-the-art in the open vocabulary segmentation task.
期刊介绍:
Neurocomputing publishes articles describing recent fundamental contributions in the field of neurocomputing. Neurocomputing theory, practice and applications are the essential topics being covered.