Predicting multimodal presentation skills based on instance weighting domain adaptation
Yutaro Yagi, Shogo Okada, Shota Shiobara, Sota Sugimura
Journal on Multimodal User Interfaces (published 2021-02-18)
DOI: 10.1007/s12193-021-00367-x
Citations: 3
Abstract
Presentation skills assessment is one of the central challenges of multimodal modeling. Presentation skills comprise verbal and nonverbal components, but because people demonstrate these skills in a variety of manners, the observed multimodal features vary widely. As a result, when test samples are drawn from a distribution that differs from that of the training samples, the prediction accuracy of the skills often degrades. In machine learning theory, this problem, in which the training (source) data are biased, is known as instance selection bias or covariate shift. To address it, this paper presents an instance weighting adaptation method that estimates the presentation skills of each participant from multimodal (verbal and nonverbal) features. For this purpose, we collect a novel multimodal presentation dataset that includes audio signals, body motion sensor data, and transcripts of the speech content for participants observed in 58 presentation sessions. The dataset also includes both verbal and nonverbal presentation skill ratings, assessed by two external experts from a human resources department. We extract multimodal features, such as spoken utterances, acoustic features, and the amount of body motion, to estimate the presentation skills. We propose two approaches, early fusion and late fusion, for regression models based on multimodal instance weighting adaptation. The experimental results show that the early fusion regression model with instance weighting adaptation achieved a Pearson correlation of \(\rho = 0.39\) when predicting the clarity of presentation goal elements. In the best case, instance weighting adaptation improved the correlation coefficient from \(-0.34\) to \(+0.35\).
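The abstract gives no implementation details, but the core idea, reweighting source instances so that a regressor trained on biased data better matches the test distribution, can be sketched as follows. This is a minimal, hypothetical illustration and not the authors' code: it estimates importance weights with a domain classifier (a common density-ratio surrogate; the paper may use a different estimator) and fits a weighted ridge regressor on early-fused verbal and nonverbal features. All function and variable names here are invented for the example.

```python
# Hypothetical sketch of instance-weighted regression under covariate shift.
# Not the paper's implementation; weights come from a domain classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

def estimate_importance_weights(X_source, X_target):
    """Approximate p_target(x) / p_source(x) with a probabilistic classifier."""
    X = np.vstack([X_source, X_target])
    d = np.concatenate([np.zeros(len(X_source)), np.ones(len(X_target))])
    clf = LogisticRegression(max_iter=1000).fit(X, d)
    p = clf.predict_proba(X_source)[:, 1]   # P(sample is from target | x)
    p = np.clip(p, 1e-6, 1 - 1e-6)          # avoid division by zero
    w = p / (1.0 - p)                       # density-ratio estimate
    return w / w.mean()                     # normalize to mean 1 for stability

def fit_early_fusion(X_src_verbal, X_src_nonverbal, y_src,
                     X_tgt_verbal, X_tgt_nonverbal):
    """Early fusion: concatenate modalities, then fit one weighted regressor."""
    X_src = np.hstack([X_src_verbal, X_src_nonverbal])
    X_tgt = np.hstack([X_tgt_verbal, X_tgt_nonverbal])
    weights = estimate_importance_weights(X_src, X_tgt)
    model = Ridge(alpha=1.0).fit(X_src, y_src, sample_weight=weights)
    return model.predict(X_tgt)
```

Predictions can then be scored with scipy.stats.pearsonr(y_true, y_pred)[0], the correlation metric reported above. A late-fusion variant would instead train one weighted regressor per modality and combine their outputs.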
About the journal
The Journal on Multimodal User Interfaces publishes work on the design, implementation, and evaluation of multimodal interfaces. Research on multimodal interaction is by its very essence multidisciplinary, involving several fields including signal processing, human-machine interaction, computer science, cognitive science, and ergonomics. The journal focuses on multimodal interfaces involving advanced modalities, the combination and fusion of several modalities, user-centric design, usability, and architectural considerations. Use cases and descriptions of specific application areas are welcome, including, for example, e-learning, assistance, serious games, affective and social computing, and interaction with avatars and robots.