Scene Synthesis with Automated Generation of Textual Descriptions
Authors: Julian Müller-Huschke, Marcel Ritter, M. Harders
Venue: Eurographics ... Workshop on 3D Object Retrieval : EG 3DOR. Eurographics Workshop on 3D Object Retrieval, volume 27 1, pages 33-36
Publication date: 2022-01-01
DOI: https://doi.org/10.2312/egs.20221026
Citation count: 0
Abstract
Most current research on automatically captioning and describing scenes with spatial content focuses on images. We outline that generating descriptive text for a synthesized 3D scene can be achieved via a suitable intermediate representation employed in the synthesis algorithm. As an example, we synthesize scenes of medieval village settings, and generate their descriptions. Our system employs graph grammars, Markov Chain Monte Carlo optimization, and a natural language generation pipeline. Randomly placed objects are evaluated and optimized by a cost function capturing neighborhood relations, path layouts, and collisions. Further, in a pilot study we assess the performance of our framework by comparing the generated descriptions to others provided by human subjects. While the latter were often short and low-effort, the highest-rated ones clearly outperform our generated ones. Nevertheless, the average of all collected human descriptions was indeed rated by the study participants as being less accurate than the automated ones.

CCS Concepts • Computing methodologies → Computer graphics; Natural language generation

Figure 1: (Left) Example of procedurally generated 3D scene. (Right) Automatically generated description with our framework:

"The scene consists of three roads meeting at an intersection, a group of trees, an oak tree and three market stands. The three market stands are next to the first road. The group of trees consists of three pine trees and three bushes. The first market stand consists of a sign to the right of a table. A big pot of stew is in the middle of this table. The second market stand consists of a sign besides of a table. A big pot of stew is in the middle of this table. The third market stand consists of three flowerpots on top of a table and a sign. This sign is to the right of this table."
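
The abstract mentions that randomly placed objects are optimized with Markov Chain Monte Carlo against a cost function covering neighborhood relations and collisions. The following is only a minimal sketch of that general idea, not the authors' implementation: the disc-shaped objects, the two cost terms and their weights, and the names `cost` and `optimize` are all assumptions made for illustration, and the path-layout term is omitted entirely.

```python
import math
import random


def cost(positions, radius=1.0, preferred=3.0):
    """Toy cost (assumed, not from the paper): collisions plus neighbor spacing."""
    total = 0.0
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(positions[i], positions[j])
            # Collision term: quadratic penalty when two discs overlap.
            overlap = max(0.0, 2 * radius - d)
            total += 10.0 * overlap ** 2
        # Neighborhood term: the nearest neighbor should sit about
        # `preferred` units away.
        nearest = min(
            math.dist(positions[i], positions[j]) for j in range(n) if j != i
        )
        total += (nearest - preferred) ** 2
    return total


def optimize(positions, steps=5000, step_size=0.5, temperature=0.1, seed=0):
    """Metropolis sampling over 2D object positions (needs at least 2 objects)."""
    rng = random.Random(seed)
    current = list(positions)
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    for _ in range(steps):
        # Propose a symmetric Gaussian move of one randomly chosen object.
        i = rng.randrange(len(current))
        x, y = current[i]
        proposal = list(current)
        proposal[i] = (x + rng.gauss(0, step_size), y + rng.gauss(0, step_size))
        proposal_cost = cost(proposal)
        # Metropolis rule: always accept improvements, and accept a worse
        # layout with probability exp(-delta / temperature).
        delta = proposal_cost - current_cost
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = proposal, proposal_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
    return best, best_cost
```

Calling `optimize` on a list of random `(x, y)` tuples returns the lowest-cost layout visited together with its cost. Tracking the best layout seen, rather than returning the final sample, is a common convenience when the sampler is used as an optimizer; whether the paper does the same is not stated in the abstract.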