{"title":"响应性和自发行为生成RNN模型中多模态融合和时间特征提取的有效性研究","authors":"Hung-Hsuan Huang, Masato Fukuda, T. Nishida","doi":"10.1145/3349537.3351908","DOIUrl":null,"url":null,"abstract":"Like a human listener, a listener agent reacts to its communicational partners' non-verbal behaviors such as head nods, facial expressions, and voice tone. When adopting these modalities as inputs and develop the generative model of reactive and spontaneous behaviors using machine learning techniques, the issues of multimodal fusion emerge. That is, the effectiveness of different modalities, frame-wise interaction of multiple modalities, and temporal feature extraction of individual modalities. This paper describes our investigation on these issues of the task in generating of virtual listeners' reactive and spontaneous idling behaviors. The work is based on the comparison of corresponding recurrent neural network (RNN) configurations in the performance of generating listener's (the agent) head movements, gaze directions, facial expressions, and postures from the speaker's head movements, gaze directions, facial expressions, and voice tone. A data corpus recorded in a subject experiment of active listening is used as the ground truth. The results showed that video information is more effective than audio information, and frame-wise interaction of modalities is more effective than temporal characteristics of individual modalities.","PeriodicalId":188834,"journal":{"name":"Proceedings of the 7th International Conference on Human-Agent Interaction","volume":"25 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-09-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"An Investigation on the Effectiveness of Multimodal Fusion and Temporal Feature Extraction in Reactive and Spontaneous Behavior Generative RNN Models for Listener Agents\",\"authors\":\"Hung-Hsuan Huang, Masato Fukuda, T. Nishida\",\"doi\":\"10.1145/3349537.3351908\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Like a human listener, a listener agent reacts to its communicational partners' non-verbal behaviors such as head nods, facial expressions, and voice tone. When adopting these modalities as inputs and develop the generative model of reactive and spontaneous behaviors using machine learning techniques, the issues of multimodal fusion emerge. That is, the effectiveness of different modalities, frame-wise interaction of multiple modalities, and temporal feature extraction of individual modalities. This paper describes our investigation on these issues of the task in generating of virtual listeners' reactive and spontaneous idling behaviors. The work is based on the comparison of corresponding recurrent neural network (RNN) configurations in the performance of generating listener's (the agent) head movements, gaze directions, facial expressions, and postures from the speaker's head movements, gaze directions, facial expressions, and voice tone. A data corpus recorded in a subject experiment of active listening is used as the ground truth. 
The results showed that video information is more effective than audio information, and frame-wise interaction of modalities is more effective than temporal characteristics of individual modalities.\",\"PeriodicalId\":188834,\"journal\":{\"name\":\"Proceedings of the 7th International Conference on Human-Agent Interaction\",\"volume\":\"25 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-09-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 7th International Conference on Human-Agent Interaction\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3349537.3351908\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 7th International Conference on Human-Agent Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3349537.3351908","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
An Investigation on the Effectiveness of Multimodal Fusion and Temporal Feature Extraction in Reactive and Spontaneous Behavior Generative RNN Models for Listener Agents
Like a human listener, a listener agent reacts to its conversational partner's non-verbal behaviors such as head nods, facial expressions, and voice tone. When these modalities are adopted as inputs to a machine-learned generative model of reactive and spontaneous behaviors, several multimodal fusion issues emerge: the relative effectiveness of the different modalities, the frame-wise interaction among multiple modalities, and the temporal feature extraction within individual modalities. This paper describes our investigation of these issues in the task of generating a virtual listener's reactive and spontaneous idling behaviors. The work compares corresponding recurrent neural network (RNN) configurations on their performance in generating the listener's (the agent's) head movements, gaze directions, facial expressions, and postures from the speaker's head movements, gaze directions, facial expressions, and voice tone. A data corpus recorded in a subject experiment on active listening is used as the ground truth. The results show that video information is more effective than audio information, and that frame-wise interaction of modalities is more effective than temporal characteristics of individual modalities.
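The paper itself does not include code; the following is a minimal sketch, assuming PyTorch, of the two fusion strategies the abstract contrasts: frame-wise fusion, where per-frame features of all speaker modalities are concatenated before a single RNN, versus per-modality temporal extraction, where each modality is first encoded by its own RNN and the outputs are then fused. The feature dimensions, GRU cells, layer sizes, and output dimension are illustrative assumptions, not the authors' actual configuration.

```python
# Sketch of the two RNN configurations compared in the abstract (illustrative only).
import torch
import torch.nn as nn

# Hypothetical per-frame feature sizes for the speaker's modalities.
DIMS = {"head": 6, "gaze": 4, "face": 17, "voice": 13}
OUT_DIM = 20  # assumed size of the generated listener-behavior vector


class FrameWiseFusionRNN(nn.Module):
    """Concatenate all modalities at each frame, then run a single GRU."""
    def __init__(self, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(sum(DIMS.values()), hidden, batch_first=True)
        self.head = nn.Linear(hidden, OUT_DIM)

    def forward(self, feats):  # feats: dict of (batch, time, dim) tensors
        x = torch.cat([feats[k] for k in DIMS], dim=-1)
        out, _ = self.rnn(x)
        return self.head(out)  # (batch, time, OUT_DIM)


class PerModalityRNN(nn.Module):
    """Encode each modality with its own GRU, then fuse the per-modality outputs."""
    def __init__(self, hidden=32):
        super().__init__()
        self.rnns = nn.ModuleDict(
            {k: nn.GRU(d, hidden, batch_first=True) for k, d in DIMS.items()}
        )
        self.head = nn.Linear(hidden * len(DIMS), OUT_DIM)

    def forward(self, feats):
        outs = [self.rnns[k](feats[k])[0] for k in DIMS]  # per-modality sequences
        return self.head(torch.cat(outs, dim=-1))


if __name__ == "__main__":
    batch, frames = 2, 50
    feats = {k: torch.randn(batch, frames, d) for k, d in DIMS.items()}
    print(FrameWiseFusionRNN()(feats).shape)  # torch.Size([2, 50, 20])
    print(PerModalityRNN()(feats).shape)      # torch.Size([2, 50, 20])
```

Under this reading, the reported result (frame-wise interaction beating per-modality temporal characteristics) corresponds to the first configuration outperforming the second on the active-listening corpus.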