Scene Synthesis with Automated Generation of Textual Descriptions

Eurographics ... Workshop on 3D Object Retrieval : EG 3DOR. Pub Date: 2022-01-01. DOI: 10.2312/egs.20221026

Julian Müller-Huschke, Marcel Ritter, M. Harders

Pages: 33-36


Abstract

Most current research on automatically captioning and describing scenes with spatial content focuses on images. We outline that generating descriptive text for a synthesized 3D scene can be achieved via a suitable intermediate representation employed in the synthesis algorithm. As an example, we synthesize scenes of medieval village settings, and generate their descriptions. Our system employs graph grammars, Markov Chain Monte Carlo optimization, and a natural language generation pipeline. Randomly placed objects are evaluated and optimized by a cost function capturing neighborhood relations, path layouts, and collisions. Further, in a pilot study we assess the performance of our framework by comparing the generated descriptions to others provided by human subjects. While the latter were often short and low-effort, the highest-rated ones clearly outperform our generated ones. Nevertheless, the average of all collected human descriptions was indeed rated by the study participants as being less accurate than the automated ones.

CCS Concepts • Computing methodologies → Computer graphics; Natural language generation;

Figure 1: (Left) Example of procedurally generated 3D scene. (Right) Automatically generated description with our framework:

"The scene consists of three roads meeting at an intersection, a group of trees, an oak tree and three market stands. The three market stands are next to the first road. The group of trees consists of three pine trees and three bushes. The first market stand consists of a sign to the right of a table. A big pot of stew is in the middle of this table. The second market stand consists of a sign besides of a table. A big pot of stew is in the middle of this table. The third market stand consists of three flowerpots on top of a table and a sign. This sign is to the right of this table."
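The abstract mentions Markov Chain Monte Carlo optimization of randomly placed objects under a cost function. The following is a minimal sketch of that general idea, not the authors' implementation: it uses a plain Metropolis acceptance rule over 2D positions, and the names `layout_cost` and `optimize`, the circular collision model, the `neighbors` pair list, and all weights are assumptions made for illustration. The path-layout term named in the abstract is omitted here.

```python
import math
import random


def layout_cost(positions, neighbors, radius=1.0, preferred_dist=3.0):
    """Toy cost with a collision term and a neighborhood term (both hypothetical)."""
    cost = 0.0
    n = len(positions)
    # Collision term: penalize overlapping bounding circles of equal radius.
    for i in range(n):
        for j in range(i + 1, n):
            overlap = 2 * radius - math.dist(positions[i], positions[j])
            if overlap > 0:
                cost += 10.0 * overlap ** 2
    # Neighborhood term: listed pairs should sit roughly preferred_dist apart.
    for i, j in neighbors:
        cost += (math.dist(positions[i], positions[j]) - preferred_dist) ** 2
    return cost


def optimize(positions, neighbors, steps=5000, temperature=0.5,
             step_size=0.5, seed=0):
    """Metropolis sampling over object positions; returns the best layout seen."""
    rng = random.Random(seed)
    current = list(positions)
    current_cost = layout_cost(current, neighbors)
    best, best_cost = list(current), current_cost
    for _ in range(steps):
        # Propose: jitter one randomly chosen object with Gaussian noise.
        i = rng.randrange(len(current))
        x, y = current[i]
        proposal = list(current)
        proposal[i] = (x + rng.gauss(0, step_size), y + rng.gauss(0, step_size))
        proposal_cost = layout_cost(proposal, neighbors)
        # Accept improvements always, worse layouts with Boltzmann probability.
        delta = proposal_cost - current_cost
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = proposal, proposal_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
    return best, best_cost


if __name__ == "__main__":
    start = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1)]
    layout, cost = optimize(start, neighbors=[(0, 1), (1, 2)])
    print(layout, cost)
```

Occasionally accepting worse layouts lets the sampler escape local minima, for instance when two colliding objects must first move further from their preferred neighbors before they can separate.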

