{"title":"Frame-Level Event Representation Learning for Semantic-Level Generation and Editing of Avatar Motion","authors":"Ayaka Ideno, Takuhiro Kaneko, Tatsuya Harada","doi":"10.1145/3577190.3614175","DOIUrl":null,"url":null,"abstract":"Understanding an avatar’s motion and controlling its content is important for content creation and has been actively studied in computer vision and graphics. An avatar’s motion consists of frames representing poses each time, and a subsequence of frames can be grouped into a segment based on semantic meaning. To enable semantic-level control of motion, it is important to understand the semantic division of the avatar’s motion. We define a semantic division of avatar’s motion as an “event”, which switches only when the frame in the motion cannot be predicted from the previous frames and information of the last event, and tackled editing motion and inferring motion from text based on events. However, it is challenging because we need to obtain the event information, and control the content of motion based on the obtained event information. To overcome this challenge, we propose obtaining frame-level event representation from the pair of motion and text and using it to edit events in motion and predict motion from the text. Specifically, we learn a frame-level event representation by reconstructing the avatar’s motion from the corresponding frame-level event representation sequence while inferring the sequence from the text. By doing so, we can predict motion from the text. Also, since the event at each motion frame is represented with the corresponding event representation, we can edit events in motion by editing the corresponding event representation sequence. We evaluated our method on the HumanML3D dataset and demonstrated that our model can generate motion from the text while editing motion flexibly (e.g., allowing the change of the event duration, modification of the event characteristics, and the addition of new events).","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"105 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Companion Publication of the 2020 International Conference on Multimodal Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3577190.3614175","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Understanding an avatar’s motion and controlling its content are important for content creation and have been actively studied in computer vision and graphics. An avatar’s motion consists of frames, each representing a pose at a particular time, and subsequences of frames can be grouped into segments based on their semantic meaning. To enable semantic-level control of motion, it is important to understand this semantic division of the avatar’s motion. We define a semantic division of an avatar’s motion as an “event”, which switches only when a frame in the motion cannot be predicted from the previous frames and the information of the last event, and we tackle motion editing and text-to-motion inference based on events. This is challenging because we need both to obtain the event information and to control the content of the motion based on the obtained event information. To overcome this challenge, we propose obtaining a frame-level event representation from a paired motion and text and using it to edit events in the motion and to predict motion from the text. Specifically, we learn the frame-level event representation by reconstructing the avatar’s motion from the corresponding frame-level event representation sequence while inferring that sequence from the text. By doing so, we can predict motion from text. Moreover, since the event at each motion frame is represented by its corresponding event representation, we can edit events in the motion by editing the corresponding event representation sequence. We evaluated our method on the HumanML3D dataset and demonstrated that our model can generate motion from text while editing motion flexibly (e.g., changing an event’s duration, modifying an event’s characteristics, and adding new events).
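To make the pipeline described in the abstract concrete, the sketch below illustrates the overall idea in PyTorch: a motion encoder produces one event representation per frame, a decoder reconstructs the motion from that representation sequence, and a text-side predictor infers the sequence from a text embedding, so that motion can be generated from text or edited by modifying the per-frame representations (e.g., repeating them to lengthen an event). This is only a minimal sketch under assumed module names, feature dimensions, and losses; it is not the authors' implementation.

```python
# Minimal sketch of frame-level event representation learning.
# All names and sizes (POSE_DIM, EVENT_DIM, etc.) are illustrative assumptions.
import torch
import torch.nn as nn

POSE_DIM, EVENT_DIM, TEXT_DIM, HIDDEN = 63, 32, 64, 128  # assumed sizes


class MotionEventEncoder(nn.Module):
    """Encode a motion sequence into one event representation per frame."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(POSE_DIM, HIDDEN, batch_first=True)
        self.proj = nn.Linear(HIDDEN, EVENT_DIM)

    def forward(self, motion):               # motion: (B, T, POSE_DIM)
        h, _ = self.rnn(motion)
        return self.proj(h)                  # (B, T, EVENT_DIM)


class MotionDecoder(nn.Module):
    """Reconstruct the motion from the frame-level event representation sequence."""
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(EVENT_DIM, HIDDEN, batch_first=True)
        self.proj = nn.Linear(HIDDEN, POSE_DIM)

    def forward(self, events):               # events: (B, T, EVENT_DIM)
        h, _ = self.rnn(events)
        return self.proj(h)                  # (B, T, POSE_DIM)


class TextToEventPredictor(nn.Module):
    """Predict the event representation sequence from a pre-embedded text prompt."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(TEXT_DIM, HIDDEN)
        self.rnn = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, EVENT_DIM)

    def forward(self, text_emb, num_frames):  # text_emb: (B, TEXT_DIM)
        h = self.proj(text_emb).unsqueeze(1).repeat(1, num_frames, 1)
        h, _ = self.rnn(h)
        return self.out(h)                     # (B, num_frames, EVENT_DIM)


if __name__ == "__main__":
    B, T = 2, 40
    motion = torch.randn(B, T, POSE_DIM)
    text_emb = torch.randn(B, TEXT_DIM)

    enc, dec, txt2ev = MotionEventEncoder(), MotionDecoder(), TextToEventPredictor()

    # Training signal (sketch): reconstruct the motion from its own event sequence,
    # and make the text-predicted event sequence match the encoded one.
    events = enc(motion)
    recon_loss = nn.functional.mse_loss(dec(events), motion)
    event_loss = nn.functional.mse_loss(txt2ev(text_emb, T), events.detach())
    (recon_loss + event_loss).backward()

    # Editing (sketch): because each frame has its own event representation, an
    # event can be lengthened by repeating its representations before decoding.
    edited = torch.cat(
        [events[:, :10], events[:, 9:10].repeat(1, 5, 1), events[:, 10:]], dim=1
    )
    edited_motion = dec(edited)               # (B, T + 5, POSE_DIM)
```

In this toy setup, text-to-motion generation corresponds to decoding the sequence returned by the text-side predictor, while event editing (duration changes, modifications, insertions) corresponds to manipulating the per-frame representation sequence before decoding.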