{"title":"MAAN: Memory-Augmented Auto-Regressive Network for Text-Driven 3D Indoor Scene Generation","authors":"Zhaoda Ye;Yang Liu;Yuxin Peng","doi":"10.1109/TMM.2024.3443657","DOIUrl":null,"url":null,"abstract":"The objective of text-driven 3D indoor scene generation is to automatically generate and arrange the objects to form a 3D scene that accurately captures the semantics detailed in the given text description. Existing approaches are mainly guided by specific object categories and room layout to generate and position objects like furniture within 3D indoor scenes. However, few methods harness the potential of the text description to precisely control both \n<italic>spatial relationships</i>\n and \n<italic>object combinations</i>\n. Consequently, these methods lack a robust mechanism for determining accurate object attributes necessary to craft a plausible 3D scene that maintains consistent spatial relationships in alignment with the provided text description. To tackle these issues, we propose the Memory-Augmented Auto-regressive Network (MAAN), which is a text-driven method for synthesizing 3D indoor scenes with controllable spatial relationships and object compositions. Firstly, we propose a memory-augmented network to help the model decide the attributes of the objects, such as 3D coordinates, rotation and size, which improves the consistency of the object spatial relations with text descriptions. Our approach constructs a memory context to select relevant objects within the scene, which provides spatial information that aids in generating the new object with the correct attributes. Secondly, we develop a prior attribute prediction network to learn how to generate a complete scene with suitable and reasonable object compositions. This prior attribute prediction network adopts a pre-training strategy to extract composition priors from existing scenes, which enables the organization of multiple objects to form a reasonable scene and keeps the object relations according to the text descriptions. We conduct experiments on three different room types (bedroom, living room, and dining room) on the 3D-FRONT dataset. The results of these experiments underscore the accuracy of our method in governing spatial relationships among objects, showcasing its superior flexibility compared to existing techniques.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"11057-11069"},"PeriodicalIF":8.4000,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10646560/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
The objective of text-driven 3D indoor scene generation is to automatically generate and arrange the objects to form a 3D scene that accurately captures the semantics detailed in the given text description. Existing approaches are mainly guided by specific object categories and room layout to generate and position objects like furniture within 3D indoor scenes. However, few methods harness the potential of the text description to precisely control both
spatial relationships
and
object combinations
. Consequently, these methods lack a robust mechanism for determining accurate object attributes necessary to craft a plausible 3D scene that maintains consistent spatial relationships in alignment with the provided text description. To tackle these issues, we propose the Memory-Augmented Auto-regressive Network (MAAN), which is a text-driven method for synthesizing 3D indoor scenes with controllable spatial relationships and object compositions. Firstly, we propose a memory-augmented network to help the model decide the attributes of the objects, such as 3D coordinates, rotation and size, which improves the consistency of the object spatial relations with text descriptions. Our approach constructs a memory context to select relevant objects within the scene, which provides spatial information that aids in generating the new object with the correct attributes. Secondly, we develop a prior attribute prediction network to learn how to generate a complete scene with suitable and reasonable object compositions. This prior attribute prediction network adopts a pre-training strategy to extract composition priors from existing scenes, which enables the organization of multiple objects to form a reasonable scene and keeps the object relations according to the text descriptions. We conduct experiments on three different room types (bedroom, living room, and dining room) on the 3D-FRONT dataset. The results of these experiments underscore the accuracy of our method in governing spatial relationships among objects, showcasing its superior flexibility compared to existing techniques.
期刊介绍:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.