Frame-Level Event Representation Learning for Semantic-Level Generation and Editing of Avatar Motion

Ayaka Ideno, Takuhiro Kaneko, Tatsuya Harada
{"title":"Frame-Level Event Representation Learning for Semantic-Level Generation and Editing of Avatar Motion","authors":"Ayaka Ideno, Takuhiro Kaneko, Tatsuya Harada","doi":"10.1145/3577190.3614175","DOIUrl":null,"url":null,"abstract":"Understanding an avatar’s motion and controlling its content is important for content creation and has been actively studied in computer vision and graphics. An avatar’s motion consists of frames representing poses each time, and a subsequence of frames can be grouped into a segment based on semantic meaning. To enable semantic-level control of motion, it is important to understand the semantic division of the avatar’s motion. We define a semantic division of avatar’s motion as an “event”, which switches only when the frame in the motion cannot be predicted from the previous frames and information of the last event, and tackled editing motion and inferring motion from text based on events. However, it is challenging because we need to obtain the event information, and control the content of motion based on the obtained event information. To overcome this challenge, we propose obtaining frame-level event representation from the pair of motion and text and using it to edit events in motion and predict motion from the text. Specifically, we learn a frame-level event representation by reconstructing the avatar’s motion from the corresponding frame-level event representation sequence while inferring the sequence from the text. By doing so, we can predict motion from the text. Also, since the event at each motion frame is represented with the corresponding event representation, we can edit events in motion by editing the corresponding event representation sequence. We evaluated our method on the HumanML3D dataset and demonstrated that our model can generate motion from the text while editing motion flexibly (e.g., allowing the change of the event duration, modification of the event characteristics, and the addition of new events).","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"105 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Companion Publication of the 2020 International Conference on Multimodal Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3577190.3614175","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Understanding an avatar’s motion and controlling its content are important for content creation and have been actively studied in computer vision and graphics. An avatar’s motion consists of frames, each representing a pose at a given time, and a subsequence of frames can be grouped into a segment based on its semantic meaning. To enable semantic-level control of motion, it is important to understand the semantic division of the avatar’s motion. We define a semantic division of an avatar’s motion as an “event,” which switches only when a frame in the motion cannot be predicted from the previous frames and the information of the last event, and we tackle editing motion and inferring motion from text based on events. This is challenging because we must obtain the event information and then control the content of the motion based on it. To overcome this challenge, we propose obtaining a frame-level event representation from a pair of motion and text and using it to edit events in the motion and to predict motion from the text. Specifically, we learn frame-level event representations by reconstructing the avatar’s motion from the corresponding sequence of frame-level event representations while inferring that sequence from the text. By doing so, we can predict motion from text. Moreover, since the event at each motion frame is captured by its event representation, we can edit events in the motion by editing the corresponding event representation sequence. We evaluated our method on the HumanML3D dataset and demonstrated that our model can generate motion from text while editing motion flexibly (e.g., changing an event’s duration, modifying an event’s characteristics, and adding new events).
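To make the pipeline concrete, below is a minimal sketch of the two-part design the abstract describes: a text-to-event model that infers one event representation per motion frame, and a decoder that reconstructs motion from that sequence, trained jointly so the representations become editable handles. It assumes PyTorch; the module names, dimensions, loss, and the stand-in text encoder are illustrative assumptions, not the authors’ implementation (the 263-dimensional pose features follow HumanML3D for concreteness).

```python
# Minimal sketch of the pipeline described in the abstract, assuming PyTorch.
# Module names, dimensions, and the plain MSE loss are illustrative
# assumptions, NOT the authors' implementation.
import torch
import torch.nn as nn

class TextToEvents(nn.Module):
    """Infers one event representation per motion frame from a text embedding."""
    def __init__(self, text_dim=512, event_dim=64, n_frames=196):
        super().__init__()
        self.event_dim, self.n_frames = event_dim, n_frames
        self.proj = nn.Linear(text_dim, n_frames * event_dim)

    def forward(self, text_emb):                           # (B, text_dim)
        e = self.proj(text_emb)
        return e.view(-1, self.n_frames, self.event_dim)   # (B, T, event_dim)

class EventsToMotion(nn.Module):
    """Reconstructs per-frame poses from an event representation sequence."""
    def __init__(self, event_dim=64, pose_dim=263, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(event_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, pose_dim)

    def forward(self, events):                             # (B, T, event_dim)
        h, _ = self.rnn(events)
        return self.head(h)                                # (B, T, pose_dim)

# Stand-in text encoder (a frozen pretrained encoder in practice).
text_enc = lambda texts: torch.randn(len(texts), 512)

t2e, e2m = TextToEvents(), EventsToMotion()
opt = torch.optim.Adam(list(t2e.parameters()) + list(e2m.parameters()), lr=1e-4)

texts = ["a person walks forward then waves"]
gt_motion = torch.randn(1, 196, 263)   # (B, T, pose_dim); 263-dim HumanML3D features

# Joint training step: reconstruct motion from events inferred from text,
# forcing the per-frame event representations to carry semantic content.
events = t2e(text_enc(texts))          # (B, T, event_dim)
loss = nn.functional.mse_loss(e2m(events), gt_motion)
opt.zero_grad(); loss.backward(); opt.step()

# Editing sketch: because events are per frame, stretching an event means
# repeating its representations; the frame range [40, 80) is hypothetical.
stretched = events[:, 40:80].repeat_interleave(2, dim=1)   # double its duration
edited = torch.cat([events[:, :40], stretched, events[:, 80:]], dim=1)
edited_motion = e2m(edited.detach())   # decode the edited sequence: (B, T', pose_dim)
```

The editing step at the end shows why frame-level representations enable flexible editing: changing an event’s duration reduces to repeating or removing its per-frame representations before decoding, with no retraining required.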