Mir Mohammad Khaleghi, Mehran Safayani, Abdolreza Mirzaei
{"title":"GraPLUS: Graph-based Placement Using Semantics for image composition","authors":"Mir Mohammad Khaleghi, Mehran Safayani, Abdolreza Mirzaei","doi":"10.1016/j.cviu.2025.104427","DOIUrl":null,"url":null,"abstract":"<div><div>We present GraPLUS (Graph-based Placement Using Semantics), a novel framework for plausible object placement in images that leverages scene graphs and large language models. Our approach uniquely combines graph-structured scene representation with semantic understanding to determine contextually appropriate object positions. The framework employs GPT-2 to transform categorical node and edge labels into rich semantic embeddings that capture both definitional characteristics and typical spatial contexts, enabling a nuanced understanding of object relationships and placement patterns. GraPLUS achieves a placement accuracy of 92.1% and an FID score of 28.83 on the OPA dataset, outperforming state-of-the-art methods by 8.3% while maintaining competitive visual quality. In human evaluation studies involving 964 samples assessed by 38 participants, our method was preferred in 51.8% of cases, significantly outperforming previous approaches (25.8% and 22.4% for the next best methods). The framework’s key innovations include: (i) leveraging pre-trained scene graph models that transfer knowledge from other domains, eliminating the need to train feature extraction parameters from scratch, (ii) edge-aware graph neural networks that process scene semantics through structured relationships, (iii) a cross-modal attention mechanism that aligns categorical embeddings with enhanced scene features, and (iv) a multiobjective training strategy incorporating semantic consistency constraints. Extensive experiments demonstrate GraPLUS’s superior performance in both placement plausibility and spatial precision, with particular strengths in maintaining object proportions and contextual relationships across diverse scene types.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"259 ","pages":"Article 104427"},"PeriodicalIF":4.3000,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S107731422500150X","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
We present GraPLUS (Graph-based Placement Using Semantics), a novel framework for plausible object placement in images that leverages scene graphs and large language models. Our approach uniquely combines graph-structured scene representation with semantic understanding to determine contextually appropriate object positions. The framework employs GPT-2 to transform categorical node and edge labels into rich semantic embeddings that capture both definitional characteristics and typical spatial contexts, enabling a nuanced understanding of object relationships and placement patterns. GraPLUS achieves a placement accuracy of 92.1% and an FID score of 28.83 on the OPA dataset, outperforming state-of-the-art methods by 8.3% while maintaining competitive visual quality. In human evaluation studies involving 964 samples assessed by 38 participants, our method was preferred in 51.8% of cases, significantly outperforming previous approaches (25.8% and 22.4% for the next best methods). The framework’s key innovations include: (i) leveraging pre-trained scene graph models that transfer knowledge from other domains, eliminating the need to train feature extraction parameters from scratch, (ii) edge-aware graph neural networks that process scene semantics through structured relationships, (iii) a cross-modal attention mechanism that aligns categorical embeddings with enhanced scene features, and (iv) a multiobjective training strategy incorporating semantic consistency constraints. Extensive experiments demonstrate GraPLUS’s superior performance in both placement plausibility and spatial precision, with particular strengths in maintaining object proportions and contextual relationships across diverse scene types.
期刊介绍:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems