Proceedings of the 20th ACM International Conference on Multimodal Interaction: Latest Publications

Responding with Sentiment Appropriate for the User's Current Sentiment in Dialog as Inferred from Prosody and Gaze Patterns
Pub Date: 2018-10-02 | DOI: 10.1145/3242969.3264974
Anindita Nath
{"title":"Responding with Sentiment Appropriate for the User's Current Sentiment in Dialog as Inferred from Prosody and Gaze Patterns","authors":"Anindita Nath","doi":"10.1145/3242969.3264974","DOIUrl":"https://doi.org/10.1145/3242969.3264974","url":null,"abstract":"Multi-modal sentiment detection from natural video/audio streams has recently received much attention. I propose to use this multi-modal information to develop a technique, Sentiment Coloring , that utilizes the detected sentiments to generate effective responses. In particular, I aim to produce suggested responses colored with sentiment appropriate for that present in the interlocutor's speech. To achieve this, contextual information pertaining to sentiment, extracted from the past as well as the current speech of both the speakers in a dialog, will be utilized. Sentiment, here, includes the three polarities: positive, neutral and negative, as well as other expressions of stance and attitude. Utilizing only the non-verbal cues, namely, prosody and gaze, I will implement two algorithmic approaches and compare their performance in sentiment detection: a simple machine learning algorithm (neural networks), that will act as the baseline, and a deep learning approach, an end-to-end bidirectional LSTM RNN, which is the state-of-the-art in emotion classification. I will build a responsive spoken dialog system(s) with this Sentiment Coloring technique and evaluate the same with human subjects to measure benefits of the technique in various interactive environments.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"112 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124797723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Deep End-to-End Representation Learning for Food Type Recognition from Speech
Pub Date: 2018-10-02 | DOI: 10.1145/3242969.3243683
Benjamin Sertolli, N. Cummins, A. Şengür, Björn Schuller
{"title":"Deep End-to-End Representation Learning for Food Type Recognition from Speech","authors":"Benjamin Sertolli, N. Cummins, A. Şengür, Björn Schuller","doi":"10.1145/3242969.3243683","DOIUrl":"https://doi.org/10.1145/3242969.3243683","url":null,"abstract":"The use of Convolutional Neural Networks (CNN) pre-trained for a particular task, as a feature extractor for an alternate task, is a standard practice in many image classification paradigms. However, to date there have been comparatively few works exploring this technique for speech classification tasks. Herein, we utilise a pre-trained end-to-end Automatic Speech Recognition CNN as a feature extractor for the task of food-type recognition from speech. Furthermore, we also explore the benefits of Compact Bilinear Pooling for combining multiple feature representations extracted from the CNN. Key results presented indicate the suitability of this approach. When combined with a Recurrent Neural Network classifier, our strongest system achieves, for a seven-class food-type classification task an unweighted average recall of 73.3% on the test set of the iHEARu-EAT database.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127447716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Human, Chameleon or Nodding Dog?
Pub Date: 2018-10-02 | DOI: 10.1145/3242969.3242998
Leshao Zhang, P. Healey
{"title":"Human, Chameleon or Nodding Dog?","authors":"Leshao Zhang, P. Healey","doi":"10.1145/3242969.3242998","DOIUrl":"https://doi.org/10.1145/3242969.3242998","url":null,"abstract":"Immersive virtual environments (IVEs) present rich possibilities for the experimental study of non-verbal communication. Here, the 'digital chameleon' effect, -which suggests that a virtual speaker (agent) is more persuasive if they mimic their addresses head movements-, was tested. Using a specially constructed IVE, we recreate a full-body analogue version of the 'digital chameleon' experiment. The agent's behaviour is manipulated in three conditions 1) Mimic (Chameleon) in which it copies the participant's nodding 2) Playback (Nodding Dog) which uses nods from playback of a previous participant and are therefore unconnected with the content and 3) Original (Human) in which it uses the prerecorded actor's movements. The results do not support the original finding of differences in ratings of agent persuasiveness between conditions. However, motion capture data reveals systematic differences in a) the real-time movements of speakers and listeners b) between the Original, Mimic and Playback conditions. We conclude that the automatic mimicry model is too simplistic and that this paradigm must address the reciprocal dynamics of non-verbal interaction to achieve its full potential.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127503308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
End-to-end Learning for 3D Facial Animation from Speech
Pub Date: 2018-10-02 | DOI: 10.1145/3242969.3243017
Hai Xuan Pham, Yuting Wang, V. Pavlovic
{"title":"End-to-end Learning for 3D Facial Animation from Speech","authors":"Hai Xuan Pham, Yuting Wang, V. Pavlovic","doi":"10.1145/3242969.3243017","DOIUrl":"https://doi.org/10.1145/3242969.3243017","url":null,"abstract":"We present a deep learning framework for real-time speech-driven 3D facial animation from speech audio. Our deep neural network directly maps an input sequence of speech spectrograms to a series of micro facial action unit intensities to drive a 3D blendshape face model. In particular, our deep model is able to learn the latent representations of time-varying contextual information and affective states within the speech. Hence, our model not only activates appropriate facial action units at inference to depict different utterance generating actions, in the form of lip movements, but also, without any assumption, automatically estimates emotional intensity of the speaker and reproduces her ever-changing affective states by adjusting strength of related facial unit activations. For example, in a happy speech, the mouth opens wider than normal, while other facial units are relaxed; or both eyebrows raise higher in a surprised state. Experiments on diverse audiovisual corpora of different actors across a wide range of facial actions and emotional states show promising results of our approach. Being speaker-independent, our generalized model is readily applicable to various tasks in human-machine interaction and animation.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126905824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 34
Online Privacy-Safe Engagement Tracking System
Pub Date: 2018-10-02 | DOI: 10.1145/3242969.3266295
Cheng Zhang, Cheng Chang, L. Chen, Yang Liu
{"title":"Online Privacy-Safe Engagement Tracking System","authors":"Cheng Zhang, Cheng Chang, L. Chen, Yang Liu","doi":"10.1145/3242969.3266295","DOIUrl":"https://doi.org/10.1145/3242969.3266295","url":null,"abstract":"Tracking learners' engagement is useful for monitoring their learning quality. With an increasing number of online video courses, a system that can automatically track learners' engagement is expected to significantly help in improving the outcomes of learners' study. In this demo, we show such a system to predict a user's engagement changes in real time. Our system utilizes webcams ubiquitously existing in nowadays computers, the face tracking function that runs inside the Web browsers to avoid sending learners' videos to the cloud, and a Python Flask web service. Our demo provides a solution of using mature technologies to provide real-time engagement monitoring with privacy protection.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126834581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Using Technology for Health and Wellbeing
Pub Date: 2018-10-02 | DOI: 10.1145/3242969.3243392
M. Czerwinski
{"title":"Using Technology for Health and Wellbeing","authors":"M. Czerwinski","doi":"10.1145/3242969.3243392","DOIUrl":"https://doi.org/10.1145/3242969.3243392","url":null,"abstract":"Abstract: How can we create technologies to help us reflect on and change our behavior, improving our health and overall wellbeing? In this talk, I will briefly describe the last several years of work our research team has been doing in this area. We have developed wearable technology to help families manage tense situations with their children, mobile phone-based applications for handling stress and depression, as well as logging tools that can help you stay focused or recommend good times to take a break at work. The overarching goal in all of this research is to develop tools that adapt to the user so that they can maximize their productivity and improve their health.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121644901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Toward Objective, Multifaceted Characterization of Psychotic Disorders: Lexical, Structural, and Disfluency Markers of Spoken Language
Pub Date: 2018-10-02 | DOI: 10.1145/3242969.3243020
A. Vail, E. Liebson, J. Baker, Louis-Philippe Morency
{"title":"Toward Objective, Multifaceted Characterization of Psychotic Disorders: Lexical, Structural, and Disfluency Markers of Spoken Language","authors":"A. Vail, E. Liebson, J. Baker, Louis-Philippe Morency","doi":"10.1145/3242969.3243020","DOIUrl":"https://doi.org/10.1145/3242969.3243020","url":null,"abstract":"Psychotic disorders are forms of severe mental illness characterized by abnormal social function and a general sense of disconnect with reality. The evaluation of such disorders is often complex, as their multifaceted nature is often difficult to quantify. Multimodal behavior analysis technologies have the potential to help address this need and supply timelier and more objective decision support tools in clinical settings. While written language and nonverbal behaviors have been previously studied, the present analysis takes the novel approach of examining the rarely-studied modality of spoken language of individuals with psychosis as naturally used in social, face-to-face interactions. Our analyses expose a series of language markers associated with psychotic symptom severity, as well as interesting interactions between them. In particular, we examine three facets of spoken language: (1) lexical markers, through a study of the function of words; (2) structural markers, through a study of grammatical fluency; and (3) disfluency markers, through a study of dialogue self-repair. Additionally, we develop predictive models of psychotic symptom severity, which achieve significant predictive power on both positive and negative psychotic symptom scales. These results constitute a significant step toward the design of future multimodal clinical decision support tools for computational phenotyping of mental illness.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116718383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Multimodal Dialogue Management for Multiparty Interaction with Infants
Pub Date: 2018-09-05 | DOI: 10.1145/3242969.3243029
Setareh Nasihati Gilani, D. Traum, A. Merla, Eugenia Hee, Zoey Walker, Barbara Manini, Grady Gallagher, L. Petitto
{"title":"Multimodal Dialogue Management for Multiparty Interaction with Infants","authors":"Setareh Nasihati Gilani, D. Traum, A. Merla, Eugenia Hee, Zoey Walker, Barbara Manini, Grady Gallagher, L. Petitto","doi":"10.1145/3242969.3243029","DOIUrl":"https://doi.org/10.1145/3242969.3243029","url":null,"abstract":"We present dialogue management routines for a system to engage in multiparty agent-infant interaction. The ultimate purpose of this research is to help infants learn a visual sign language by engaging them in naturalistic and socially contingent conversations during an early-life critical period for language development (ages 6 to 12 months) as initiated by an artificial agent. As a first step, we focus on creating and maintaining agent-infant engagement that elicits appropriate and socially contingent responses from the baby. Our system includes two agents, a physical robot and an animated virtual human. The system's multimodal perception includes an eye-tracker (measures attention) and a thermal infrared imaging camera (measures patterns of emotional arousal). A dialogue policy is presented that selects individual actions and planned multiparty sequences based on perceptual inputs about the baby's internal changing states of emotional engagement. The present version of the system was evaluated in interaction with 8 babies. All babies demonstrated spontaneous and sustained engagement with the agents for several minutes, with patterns of conversationally relevant and socially contingent behaviors. We further performed a detailed case-study analysis with annotation of all agent and baby behaviors. Results show that the baby's behaviors were generally relevant to agent conversations and contained direct evidence for socially contingent responses by the baby to specific linguistic samples produced by the avatar. This work demonstrates the potential for language learning from agents in very young babies and has especially broad implications regarding the use of artificial agents with babies who have minimal language exposure in early life.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132852154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition
Pub Date: 2018-09-05 | DOI: 10.1145/3242969.3243014
George Sterpu, Christian Saam, N. Harte
{"title":"Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition","authors":"George Sterpu, Christian Saam, N. Harte","doi":"10.1145/3242969.3243014","DOIUrl":"https://doi.org/10.1145/3242969.3243014","url":null,"abstract":"Automatic speech recognition can potentially benefit from the lip motion patterns, complementing acoustic speech to improve the overall recognition performance, particularly in noise. In this paper we propose an audio-visual fusion strategy that goes beyond simple feature concatenation and learns to automatically align the two modalities, leading to enhanced representations which increase the recognition accuracy in both clean and noisy conditions. We test our strategy on the TCD-TIMIT and LRS2 datasets, designed for large vocabulary continuous speech recognition, applying three types of noise at different power ratios. We also exploit state of the art Sequence-to-Sequence architectures, showing that our method can be easily integrated. Results show relative improvements from 7% up to 30% on TCD-TIMIT over the acoustic modality alone, depending on the acoustic noise level. We anticipate that the fusion strategy can easily generalise to many other multimodal tasks which involve correlated modalities.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132971777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 51
Multimodal Continuous Turn-Taking Prediction Using Multiscale RNNs
Pub Date: 2018-08-31 | DOI: 10.1145/3242969.3242997
Matthew Roddy, Gabriel Skantze, N. Harte
{"title":"Multimodal Continuous Turn-Taking Prediction Using Multiscale RNNs","authors":"Matthew Roddy, Gabriel Skantze, N. Harte","doi":"10.1145/3242969.3242997","DOIUrl":"https://doi.org/10.1145/3242969.3242997","url":null,"abstract":"In human conversational interactions, turn-taking exchanges can be coordinated using cues from multiple modalities. To design spoken dialog systems that can conduct fluid interactions it is desirable to incorporate cues from separate modalities into turn-taking models. We propose that there is an appropriate temporal granularity at which modalities should be modeled. We design a multiscale RNN architecture to model modalities at separate timescales in a continuous manner. Our results show that modeling linguistic and acoustic features at separate temporal rates can be beneficial for turn-taking modeling. We also show that our approach can be used to incorporate gaze features into turn-taking models.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-08-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129653608","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 34