{"title":"MENSA: Multi-Dataset Harmonized Pretraining for Semantic Segmentation","authors":"Bowen Shi;Xiaopeng Zhang;Yaoming Wang;Wenrui Dai;Junni Zou;Hongkai Xiong","doi":"10.1109/TMM.2024.3521851","DOIUrl":null,"url":null,"abstract":"Existing pretraining methods for semantic segmentation are hampered by the task gap between global image -level pretraining and local pixel-level finetuning. Joint dense-level pretraining is a promising alternative to exploit off-the-shelf annotations from diverse segmentation datasets but suffers from low-quality class embeddings and inconsistent data and supervision signals across multiple datasets by directly employing CLIP. To overcome these challenges, we propose a novel <underline>M</u>ulti-datas<underline>E</u>t harmo<underline>N</u>ized pretraining framework for <underline>S</u>emantic s<underline>E</u>gmentation (MENSA). MENSA incorporates high-quality language embeddings and momentum-updated visual embeddings to effectively model the class relationships in the embedding space and thereby provide reliable supervision information for each category. To further adapt to multiple datasets, we achieve one-to-many pixel-embedding pairing with cross-dataset multi-label mapping through cross-modal information exchange to mitigate inconsistent supervision signals and introduce region-level and pixel-level cross-dataset mixing for varying data distribution. Experimental results demonstrate that MENSA is a powerful foundation segmentation model that consistently outperforms popular supervised or unsupervised ImageNet pretrained models for various benchmarks under standard fine-tuning. Furthermore, MENSA is shown to significantly benefit frozen-backbone fine-tuning and zero-shot learning by endowing pixel-level distinctiveness to learned representations.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2127-2140"},"PeriodicalIF":8.4000,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10814086/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
Existing pretraining methods for semantic segmentation are hampered by the task gap between global image -level pretraining and local pixel-level finetuning. Joint dense-level pretraining is a promising alternative to exploit off-the-shelf annotations from diverse segmentation datasets but suffers from low-quality class embeddings and inconsistent data and supervision signals across multiple datasets by directly employing CLIP. To overcome these challenges, we propose a novel Multi-datasEt harmoNized pretraining framework for Semantic sEgmentation (MENSA). MENSA incorporates high-quality language embeddings and momentum-updated visual embeddings to effectively model the class relationships in the embedding space and thereby provide reliable supervision information for each category. To further adapt to multiple datasets, we achieve one-to-many pixel-embedding pairing with cross-dataset multi-label mapping through cross-modal information exchange to mitigate inconsistent supervision signals and introduce region-level and pixel-level cross-dataset mixing for varying data distribution. Experimental results demonstrate that MENSA is a powerful foundation segmentation model that consistently outperforms popular supervised or unsupervised ImageNet pretrained models for various benchmarks under standard fine-tuning. Furthermore, MENSA is shown to significantly benefit frozen-backbone fine-tuning and zero-shot learning by endowing pixel-level distinctiveness to learned representations.
期刊介绍:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.