Tonglin Chen, Yinxuan Huang, Jinghao Huang, Bin Li, Xiangyang Xue
Title: Unsupervised Learning of Global Object-Centric Representations for Compositional Scene Understanding
Journal: IEEE Transactions on Visualization and Computer Graphics
Published: 2025-05-15
DOI: 10.1109/TVCG.2025.3570426 (https://doi.org/10.1109/TVCG.2025.3570426)
Abstract
Humans have an innate ability to extract invariant visual features of objects from complex scenes and to identify the same objects across different scenes. To endow AI systems with this capability, we introduce a novel compositional scene understanding method, Compositional Scene understanding via Global Object-centric representations (CSGO). CSGO achieves comprehensive scene understanding, including the discovery and identification of objects, by leveraging a set of learnable global object-centric representations in an unsupervised manner. CSGO comprises three components: 1) Local Object-Centric Learning, which extracts localized, scene-specific object-centric representations to discover objects; 2) Image Decoding, which reconstructs object and scene images from the obtained object-centric representations; and 3) Global Object-Centric Learning, which identifies objects across diverse scenes according to a set of learnable global object-centric representations that capture the scene-free intrinsic attributes (i.e., appearance and shape) of objects. Experimental results on three synthetic datasets and one real-world scene dataset demonstrate that CSGO has excellent object identification and attribute disentanglement abilities. Furthermore, the scene decomposition performance of CSGO, which reflects its object discovery performance, is superior to that of the comparison methods.
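To make the idea of identification through global representations concrete, the following is a minimal sketch and not the CSGO implementation. It assumes that each local, scene-specific object representation is assigned to its most similar global representation by cosine similarity; the abstract does not state the matching rule, so that choice is an assumption. The function names `cosine` and `identify_objects` and all vectors are hypothetical illustration values.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def identify_objects(local_reps, global_reps):
    """Return, for each local representation, the index of the most similar
    global representation (assumed nearest-prototype rule, not from the paper)."""
    ids = []
    for rep in local_reps:
        scores = [cosine(rep, g) for g in global_reps]
        ids.append(scores.index(max(scores)))
    return ids


# Hypothetical global object-centric representations (one row per object identity).
global_reps = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

# Hypothetical local representations extracted from two different scenes.
scene_a = [[0.9, 0.1], [0.1, 0.8]]
scene_b = [[-0.7, 0.2], [2.0, 0.3]]

ids_a = identify_objects(scene_a, global_reps)
ids_b = identify_objects(scene_b, global_reps)
print(ids_a, ids_b)
```

In this toy setup, the first object in `scene_a` and the second object in `scene_b` map to the same global index, which is the sense in which an object would be recognized as the same across scenes.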