Combining Two Perspectives on Classifying Multimodal Data for Recognizing Speaker Traits

Moitreya Chatterjee, Sunghyun Park, Louis-Philippe Morency, Stefan Scherer
{"title":"Combining Two Perspectives on Classifying Multimodal Data for Recognizing Speaker Traits","authors":"Moitreya Chatterjee, Sunghyun Park, Louis-Philippe Morency, Stefan Scherer","doi":"10.1145/2818346.2820747","DOIUrl":null,"url":null,"abstract":"Human communication involves conveying messages both through verbal and non-verbal channels (facial expression, gestures, prosody, etc.). Nonetheless, the task of learning these patterns for a computer by combining cues from multiple modalities is challenging because it requires effective representation of the signals and also taking into consideration the complex interactions between them. From the machine learning perspective this presents a two-fold challenge: a) Modeling the intermodal variations and dependencies; b) Representing the data using an apt number of features, such that the necessary patterns are captured but at the same time allaying concerns such as over-fitting. In this work we attempt to address these aspects of multimodal recognition, in the context of recognizing two essential speaker traits, namely passion and credibility of online movie reviewers. We propose a novel ensemble classification approach that combines two different perspectives on classifying multimodal data. Each of these perspectives attempts to independently address the two-fold challenge. In the first, we combine the features from multiple modalities but assume inter-modality conditional independence. In the other one, we explicitly capture the correlation between the modalities but in a space of few dimensions and explore a novel clustering based kernel similarity approach for recognition. Additionally, this work investigates a recent technique for encoding text data that captures semantic similarity of verbal content and preserves word-ordering. The experimental results on a recent public dataset shows significant improvement of our approach over multiple baselines. Finally, we also analyze the most discriminative elements of a speaker's non-verbal behavior that contribute to his/her perceived credibility/passionateness.","PeriodicalId":20486,"journal":{"name":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2015-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2015 ACM on International Conference on Multimodal Interaction","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2818346.2820747","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 12

Abstract

Human communication involves conveying messages through both verbal and non-verbal channels (facial expressions, gestures, prosody, etc.). Nonetheless, learning these patterns computationally by combining cues from multiple modalities is challenging, because it requires an effective representation of the signals as well as consideration of the complex interactions between them. From the machine learning perspective this presents a two-fold challenge: a) modeling the inter-modal variations and dependencies; b) representing the data with an apt number of features, such that the necessary patterns are captured while mitigating concerns such as over-fitting. In this work we attempt to address these aspects of multimodal recognition in the context of recognizing two essential speaker traits, namely the passion and credibility of online movie reviewers. We propose a novel ensemble classification approach that combines two different perspectives on classifying multimodal data, each of which independently addresses the two-fold challenge. In the first, we combine the features from multiple modalities but assume inter-modality conditional independence. In the second, we explicitly capture the correlation between the modalities, but in a low-dimensional space, and explore a novel clustering-based kernel-similarity approach for recognition. Additionally, this work investigates a recent technique for encoding text data that captures the semantic similarity of verbal content and preserves word ordering. Experimental results on a recent public dataset show that our approach significantly improves over multiple baselines. Finally, we analyze the most discriminative elements of a speaker's non-verbal behavior that contribute to his or her perceived credibility and passion.
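The abstract describes the method only at a high level. The following is a minimal, illustrative Python sketch of one way such a two-perspective ensemble could be assembled: per-modality classifiers whose posteriors are multiplied (the conditional-independence perspective), and a CCA projection into a low-dimensional correlated space followed by clustering-based kernel-similarity features for an SVM (the explicit-correlation perspective). The choice of CCA, logistic regression, and SVM, the RBF similarity to cluster centers, and all dimensions and hyperparameters are assumptions made for illustration, not the paper's actual implementation.

```python
# Hedged sketch of a two-perspective ensemble for bimodal classification.
# Synthetic data stands in for the paper's audio/visual features.
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n = 200                                    # toy sample count (assumption)
X_audio = rng.normal(size=(n, 30))         # stand-in prosodic features
X_visual = rng.normal(size=(n, 40))        # stand-in facial/gesture features
y = rng.integers(0, 2, size=n)             # binary trait label (e.g. credible or not)

# Perspective 1: fuse modalities under inter-modality conditional independence.
# Train one classifier per modality and multiply their class posteriors.
clf_a = LogisticRegression(max_iter=1000).fit(X_audio, y)
clf_v = LogisticRegression(max_iter=1000).fit(X_visual, y)
p1 = clf_a.predict_proba(X_audio) * clf_v.predict_proba(X_visual)
p1 /= p1.sum(axis=1, keepdims=True)        # renormalize the product of posteriors

# Perspective 2: capture cross-modal correlation explicitly in a
# low-dimensional space (CCA here, as one plausible choice), then build a
# clustering-based kernel-similarity representation for an SVM.
cca = CCA(n_components=5).fit(X_audio, X_visual)
Za, Zv = cca.transform(X_audio, X_visual)
Z = np.hstack([Za, Zv])                    # joint low-dimensional embedding

km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(Z)
gamma = 1.0 / Z.shape[1]
# Represent each sample by its RBF similarity to every cluster center.
d2 = ((Z[:, None, :] - km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-gamma * d2)
clf_k = SVC(probability=True, random_state=0).fit(K, y)
p2 = clf_k.predict_proba(K)

# Ensemble: average the two perspectives' posteriors and take the argmax.
p_final = 0.5 * (p1 + p2)
pred = p_final.argmax(axis=1)
print("toy training accuracy:", (pred == y).mean())
```

In a faithful reproduction, the two perspectives would be trained and evaluated with proper train/test splits on the paper's movie-review dataset, and the fusion weights could be tuned rather than fixed at 0.5.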