Proceedings of the 20th ACM International Conference on Multimodal Interaction: Latest Publications

Responding with Sentiment Appropriate for the User's Current Sentiment in Dialog as Inferred from Prosody and Gaze Patterns
Pub Date: 2018-10-02 | DOI: 10.1145/3242969.3264974
Anindita Nath
{"title":"Responding with Sentiment Appropriate for the User's Current Sentiment in Dialog as Inferred from Prosody and Gaze Patterns","authors":"Anindita Nath","doi":"10.1145/3242969.3264974","DOIUrl":"https://doi.org/10.1145/3242969.3264974","url":null,"abstract":"Multi-modal sentiment detection from natural video/audio streams has recently received much attention. I propose to use this multi-modal information to develop a technique, Sentiment Coloring , that utilizes the detected sentiments to generate effective responses. In particular, I aim to produce suggested responses colored with sentiment appropriate for that present in the interlocutor's speech. To achieve this, contextual information pertaining to sentiment, extracted from the past as well as the current speech of both the speakers in a dialog, will be utilized. Sentiment, here, includes the three polarities: positive, neutral and negative, as well as other expressions of stance and attitude. Utilizing only the non-verbal cues, namely, prosody and gaze, I will implement two algorithmic approaches and compare their performance in sentiment detection: a simple machine learning algorithm (neural networks), that will act as the baseline, and a deep learning approach, an end-to-end bidirectional LSTM RNN, which is the state-of-the-art in emotion classification. I will build a responsive spoken dialog system(s) with this Sentiment Coloring technique and evaluate the same with human subjects to measure benefits of the technique in various interactive environments.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124797723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Deep End-to-End Representation Learning for Food Type Recognition from Speech
Pub Date: 2018-10-02 | DOI: 10.1145/3242969.3243683
Benjamin Sertolli, N. Cummins, A. Şengür, Björn Schuller
{"title":"Deep End-to-End Representation Learning for Food Type Recognition from Speech","authors":"Benjamin Sertolli, N. Cummins, A. Şengür, Björn Schuller","doi":"10.1145/3242969.3243683","DOIUrl":"https://doi.org/10.1145/3242969.3243683","url":null,"abstract":"The use of Convolutional Neural Networks (CNN) pre-trained for a particular task, as a feature extractor for an alternate task, is a standard practice in many image classification paradigms. However, to date there have been comparatively few works exploring this technique for speech classification tasks. Herein, we utilise a pre-trained end-to-end Automatic Speech Recognition CNN as a feature extractor for the task of food-type recognition from speech. Furthermore, we also explore the benefits of Compact Bilinear Pooling for combining multiple feature representations extracted from the CNN. Key results presented indicate the suitability of this approach. When combined with a Recurrent Neural Network classifier, our strongest system achieves, for a seven-class food-type classification task an unweighted average recall of 73.3% on the test set of the iHEARu-EAT database.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127447716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Human, Chameleon or Nodding Dog?
Pub Date: 2018-10-02 | DOI: 10.1145/3242969.3242998
Leshao Zhang, P. Healey
{"title":"Human, Chameleon or Nodding Dog?","authors":"Leshao Zhang, P. Healey","doi":"10.1145/3242969.3242998","DOIUrl":"https://doi.org/10.1145/3242969.3242998","url":null,"abstract":"Immersive virtual environments (IVEs) present rich possibilities for the experimental study of non-verbal communication. Here, the 'digital chameleon' effect, -which suggests that a virtual speaker (agent) is more persuasive if they mimic their addresses head movements-, was tested. Using a specially constructed IVE, we recreate a full-body analogue version of the 'digital chameleon' experiment. The agent's behaviour is manipulated in three conditions 1) Mimic (Chameleon) in which it copies the participant's nodding 2) Playback (Nodding Dog) which uses nods from playback of a previous participant and are therefore unconnected with the content and 3) Original (Human) in which it uses the prerecorded actor's movements. The results do not support the original finding of differences in ratings of agent persuasiveness between conditions. However, motion capture data reveals systematic differences in a) the real-time movements of speakers and listeners b) between the Original, Mimic and Playback conditions. We conclude that the automatic mimicry model is too simplistic and that this paradigm must address the reciprocal dynamics of non-verbal interaction to achieve its full potential.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127503308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
End-to-end Learning for 3D Facial Animation from Speech
Pub Date: 2018-10-02 | DOI: 10.1145/3242969.3243017
Hai Xuan Pham, Yuting Wang, V. Pavlovic
{"title":"End-to-end Learning for 3D Facial Animation from Speech","authors":"Hai Xuan Pham, Yuting Wang, V. Pavlovic","doi":"10.1145/3242969.3243017","DOIUrl":"https://doi.org/10.1145/3242969.3243017","url":null,"abstract":"We present a deep learning framework for real-time speech-driven 3D facial animation from speech audio. Our deep neural network directly maps an input sequence of speech spectrograms to a series of micro facial action unit intensities to drive a 3D blendshape face model. In particular, our deep model is able to learn the latent representations of time-varying contextual information and affective states within the speech. Hence, our model not only activates appropriate facial action units at inference to depict different utterance generating actions, in the form of lip movements, but also, without any assumption, automatically estimates emotional intensity of the speaker and reproduces her ever-changing affective states by adjusting strength of related facial unit activations. For example, in a happy speech, the mouth opens wider than normal, while other facial units are relaxed; or both eyebrows raise higher in a surprised state. Experiments on diverse audiovisual corpora of different actors across a wide range of facial actions and emotional states show promising results of our approach. Being speaker-independent, our generalized model is readily applicable to various tasks in human-machine interaction and animation.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126905824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 34
Online Privacy-Safe Engagement Tracking System
Pub Date: 2018-10-02 | DOI: 10.1145/3242969.3266295
Cheng Zhang, Cheng Chang, L. Chen, Yang Liu
{"title":"Online Privacy-Safe Engagement Tracking System","authors":"Cheng Zhang, Cheng Chang, L. Chen, Yang Liu","doi":"10.1145/3242969.3266295","DOIUrl":"https://doi.org/10.1145/3242969.3266295","url":null,"abstract":"Tracking learners' engagement is useful for monitoring their learning quality. With an increasing number of online video courses, a system that can automatically track learners' engagement is expected to significantly help in improving the outcomes of learners' study. In this demo, we show such a system to predict a user's engagement changes in real time. Our system utilizes webcams ubiquitously existing in nowadays computers, the face tracking function that runs inside the Web browsers to avoid sending learners' videos to the cloud, and a Python Flask web service. Our demo provides a solution of using mature technologies to provide real-time engagement monitoring with privacy protection.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126834581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Using Technology for Health and Wellbeing
Pub Date: 2018-10-02 | DOI: 10.1145/3242969.3243392
M. Czerwinski
{"title":"Using Technology for Health and Wellbeing","authors":"M. Czerwinski","doi":"10.1145/3242969.3243392","DOIUrl":"https://doi.org/10.1145/3242969.3243392","url":null,"abstract":"Abstract: How can we create technologies to help us reflect on and change our behavior, improving our health and overall wellbeing? In this talk, I will briefly describe the last several years of work our research team has been doing in this area. We have developed wearable technology to help families manage tense situations with their children, mobile phone-based applications for handling stress and depression, as well as logging tools that can help you stay focused or recommend good times to take a break at work. The overarching goal in all of this research is to develop tools that adapt to the user so that they can maximize their productivity and improve their health.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121644901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Toward Objective, Multifaceted Characterization of Psychotic Disorders: Lexical, Structural, and Disfluency Markers of Spoken Language
Pub Date: 2018-10-02 | DOI: 10.1145/3242969.3243020
A. Vail, E. Liebson, J. Baker, Louis-Philippe Morency
{"title":"Toward Objective, Multifaceted Characterization of Psychotic Disorders: Lexical, Structural, and Disfluency Markers of Spoken Language","authors":"A. Vail, E. Liebson, J. Baker, Louis-Philippe Morency","doi":"10.1145/3242969.3243020","DOIUrl":"https://doi.org/10.1145/3242969.3243020","url":null,"abstract":"Psychotic disorders are forms of severe mental illness characterized by abnormal social function and a general sense of disconnect with reality. The evaluation of such disorders is often complex, as their multifaceted nature is often difficult to quantify. Multimodal behavior analysis technologies have the potential to help address this need and supply timelier and more objective decision support tools in clinical settings. While written language and nonverbal behaviors have been previously studied, the present analysis takes the novel approach of examining the rarely-studied modality of spoken language of individuals with psychosis as naturally used in social, face-to-face interactions. Our analyses expose a series of language markers associated with psychotic symptom severity, as well as interesting interactions between them. In particular, we examine three facets of spoken language: (1) lexical markers, through a study of the function of words; (2) structural markers, through a study of grammatical fluency; and (3) disfluency markers, through a study of dialogue self-repair. Additionally, we develop predictive models of psychotic symptom severity, which achieve significant predictive power on both positive and negative psychotic symptom scales. These results constitute a significant step toward the design of future multimodal clinical decision support tools for computational phenotyping of mental illness.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116718383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Multimodal Dialogue Management for Multiparty Interaction with Infants
Pub Date: 2018-09-05 | DOI: 10.1145/3242969.3243029
Setareh Nasihati Gilani, D. Traum, A. Merla, Eugenia Hee, Zoey Walker, Barbara Manini, Grady Gallagher, L. Petitto
{"title":"Multimodal Dialogue Management for Multiparty Interaction with Infants","authors":"Setareh Nasihati Gilani, D. Traum, A. Merla, Eugenia Hee, Zoey Walker, Barbara Manini, Grady Gallagher, L. Petitto","doi":"10.1145/3242969.3243029","DOIUrl":"https://doi.org/10.1145/3242969.3243029","url":null,"abstract":"We present dialogue management routines for a system to engage in multiparty agent-infant interaction. The ultimate purpose of this research is to help infants learn a visual sign language by engaging them in naturalistic and socially contingent conversations during an early-life critical period for language development (ages 6 to 12 months) as initiated by an artificial agent. As a first step, we focus on creating and maintaining agent-infant engagement that elicits appropriate and socially contingent responses from the baby. Our system includes two agents, a physical robot and an animated virtual human. The system's multimodal perception includes an eye-tracker (measures attention) and a thermal infrared imaging camera (measures patterns of emotional arousal). A dialogue policy is presented that selects individual actions and planned multiparty sequences based on perceptual inputs about the baby's internal changing states of emotional engagement. The present version of the system was evaluated in interaction with 8 babies. All babies demonstrated spontaneous and sustained engagement with the agents for several minutes, with patterns of conversationally relevant and socially contingent behaviors. We further performed a detailed case-study analysis with annotation of all agent and baby behaviors. Results show that the baby's behaviors were generally relevant to agent conversations and contained direct evidence for socially contingent responses by the baby to specific linguistic samples produced by the avatar. This work demonstrates the potential for language learning from agents in very young babies and has especially broad implications regarding the use of artificial agents with babies who have minimal language exposure in early life.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132852154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition
Pub Date: 2018-09-05 | DOI: 10.1145/3242969.3243014
George Sterpu, Christian Saam, N. Harte
{"title":"Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition","authors":"George Sterpu, Christian Saam, N. Harte","doi":"10.1145/3242969.3243014","DOIUrl":"https://doi.org/10.1145/3242969.3243014","url":null,"abstract":"Automatic speech recognition can potentially benefit from the lip motion patterns, complementing acoustic speech to improve the overall recognition performance, particularly in noise. In this paper we propose an audio-visual fusion strategy that goes beyond simple feature concatenation and learns to automatically align the two modalities, leading to enhanced representations which increase the recognition accuracy in both clean and noisy conditions. We test our strategy on the TCD-TIMIT and LRS2 datasets, designed for large vocabulary continuous speech recognition, applying three types of noise at different power ratios. We also exploit state of the art Sequence-to-Sequence architectures, showing that our method can be easily integrated. Results show relative improvements from 7% up to 30% on TCD-TIMIT over the acoustic modality alone, depending on the acoustic noise level. We anticipate that the fusion strategy can easily generalise to many other multimodal tasks which involve correlated modalities.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132971777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 51
Multimodal Continuous Turn-Taking Prediction Using Multiscale RNNs
Pub Date: 2018-08-31 | DOI: 10.1145/3242969.3242997
Matthew Roddy, Gabriel Skantze, N. Harte
{"title":"Multimodal Continuous Turn-Taking Prediction Using Multiscale RNNs","authors":"Matthew Roddy, Gabriel Skantze, N. Harte","doi":"10.1145/3242969.3242997","DOIUrl":"https://doi.org/10.1145/3242969.3242997","url":null,"abstract":"In human conversational interactions, turn-taking exchanges can be coordinated using cues from multiple modalities. To design spoken dialog systems that can conduct fluid interactions it is desirable to incorporate cues from separate modalities into turn-taking models. We propose that there is an appropriate temporal granularity at which modalities should be modeled. We design a multiscale RNN architecture to model modalities at separate timescales in a continuous manner. Our results show that modeling linguistic and acoustic features at separate temporal rates can be beneficial for turn-taking modeling. We also show that our approach can be used to incorporate gaze features into turn-taking models.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129653608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 34