GraPLUS: Graph-based Placement Using Semantics for image composition

IF 3.5 | CAS Zone 3, Computer Science | JCR Q2, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Vision and Image Understanding | Pub Date: 2025-06-20 | DOI: 10.1016/j.cviu.2025.104427

Mir Mohammad Khaleghi, Mehran Safayani, Abdolreza Mirzaei

Volume 259, Article 104427. Not open access. Full text: https://www.sciencedirect.com/science/article/pii/S107731422500150X

Citations: 0

Abstract

We present GraPLUS (Graph-based Placement Using Semantics), a novel framework for plausible object placement in images that leverages scene graphs and large language models. Our approach uniquely combines graph-structured scene representation with semantic understanding to determine contextually appropriate object positions. The framework employs GPT-2 to transform categorical node and edge labels into rich semantic embeddings that capture both definitional characteristics and typical spatial contexts, enabling a nuanced understanding of object relationships and placement patterns. GraPLUS achieves a placement accuracy of 92.1% and an FID score of 28.83 on the OPA dataset, outperforming state-of-the-art methods by 8.3% while maintaining competitive visual quality. In human evaluation studies involving 964 samples assessed by 38 participants, our method was preferred in 51.8% of cases, significantly outperforming previous approaches (25.8% and 22.4% for the next best methods). The framework’s key innovations include: (i) leveraging pre-trained scene graph models that transfer knowledge from other domains, eliminating the need to train feature extraction parameters from scratch, (ii) edge-aware graph neural networks that process scene semantics through structured relationships, (iii) a cross-modal attention mechanism that aligns categorical embeddings with enhanced scene features, and (iv) a multiobjective training strategy incorporating semantic consistency constraints. Extensive experiments demonstrate GraPLUS’s superior performance in both placement plausibility and spatial precision, with particular strengths in maintaining object proportions and contextual relationships across diverse scene types.
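The abstract names two mechanisms: edge-aware message passing over a scene graph, and attention that aligns a category embedding with scene features. The following is a minimal, dependency-free sketch of those two general ideas, not the authors' implementation. Everything in it is an assumption made for illustration: the function names `edge_aware_update` and `cross_attention`, the 0.5 mixing weight, and the hand-written 3-dimensional vectors that stand in for GPT-2 label embeddings.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def edge_aware_update(nodes, edges, mix=0.5):
    """One round of message passing in which each message depends on its edge.

    nodes: dict mapping node name -> feature vector (list of floats)
    edges: list of (src, dst, edge_vector) triples; all vectors share one length
    mix:   hypothetical weight blending a node's own feature with its messages
    """
    updated = {}
    for name, feat in nodes.items():
        # A message is the source feature shifted by the edge (relation) vector.
        incoming = [
            [s + e for s, e in zip(nodes[src], edge_vec)]
            for src, dst, edge_vec in edges
            if dst == name
        ]
        if not incoming:
            updated[name] = list(feat)  # no neighbours: keep the feature as is
            continue
        mean_msg = [sum(col) / len(incoming) for col in zip(*incoming)]
        updated[name] = [(1 - mix) * f + mix * m for f, m in zip(feat, mean_msg)]
    return updated


def cross_attention(query, nodes):
    """Scaled dot-product attention of one query vector over node features."""
    names = list(nodes)
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, nodes[n]) / scale for n in names])
    context = [
        sum(w * nodes[n][i] for w, n in zip(weights, names))
        for i in range(len(query))
    ]
    return dict(zip(names, weights)), context


# Toy scene graph: made-up vectors standing in for label embeddings.
scene_nodes = {
    "table": [1.0, 0.0, 0.0],
    "floor": [0.0, 1.0, 0.0],
    "window": [0.0, 0.0, 1.0],
}
scene_edges = [("table", "floor", [0.1, 0.0, 0.0])]  # relation: table "on" floor
category_query = [1.0, 0.2, 0.0]  # made-up embedding of the object to place

enhanced = edge_aware_update(scene_nodes, scene_edges)
attn_weights, context_vector = cross_attention(category_query, enhanced)
```

In this toy setup the attention weights indicate which scene nodes are most related to the object category. A placement model would need a further learned head to turn `context_vector` into a position and scale, which this sketch does not attempt.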


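Source journal

Computer Vision and Image Understanding (category: Engineering & Technology, Engineering: Electrical & Electronic)

CiteScore: 7.80

Self-citation rate: 4.40%

Articles published: 112

Review time: 79 days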

About the journal: The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views. Research Areas Include: • Theory • Early vision • Data structures and representations • Shape • Range • Motion • Matching and recognition • Architecture and languages • Vision systems