Companion Publication of the 2022 International Conference on Multimodal Interaction: Latest Publications

An Emotional Respiration Speech Dataset
Authors: Rozemarijn Roes, Francisca Pessanha, Almila Akdag Salah
DOI: https://doi.org/10.1145/3536220.3558803 | Published: 2022-11-07
Abstract: Natural interaction with human-like embodied agents, such as social robots or virtual agents, relies on the generation of realistic non-verbal behaviours, including body language, gaze and facial expressions. Humans can read and interpret somatic social signals, such as blushing or changes in respiration rate and depth, as part of such non-verbal behaviours. Studies show that realistic breathing changes in an agent improve the communication of emotional cues, but there are scarcely any affect-analysis databases with breathing ground truth from which to learn how affect and breathing correlate. Emotional speech databases typically contain utterances coloured by emotional intonation rather than natural conversation, and lack breathing annotations. In this paper, we introduce the Emotional Speech Respiration Dataset, collected from 20 subjects in a spontaneous speech setting where emotions are elicited via music. Four emotion classes (happy, sad, annoying, calm) are elicited, with 20 minutes of data per participant. The breathing ground truth is collected with piezoelectric respiration sensors, and affective labels are collected via self-reported valence and arousal levels. Along with these, we extract and share visual features of the participants (such as facial keypoints, action units, and gaze directions), transcriptions of the speech instances, and paralinguistic features. Our analysis shows that the music-induced emotions produce significant changes in valence levels for all four emotions compared to the baseline. Furthermore, breathing patterns change significantly with happy music, while the changes for the other elicitors are less prominent. We believe this resource can be used with different embodied agents to signal affect via simulated breathing.
Citations: 1
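As a rough illustration of working with a respiration trace like the piezoelectric signals described in this dataset, the sketch below estimates a breathing rate by simple peak detection. The sampling rate, synthetic signal, and thresholds are assumed placeholders, not properties of the dataset itself.

```python
# Minimal sketch: estimating breaths per minute from a respiration signal.
# The synthetic signal below stands in for a piezoelectric sensor trace;
# the dataset's actual file format is not described in the abstract.
import numpy as np
from scipy.signal import find_peaks

fs = 50                       # assumed sampling rate in Hz
t = np.arange(0, 60, 1 / fs)  # one minute of signal
# Synthetic respiration trace: ~0.3 Hz breathing plus noise.
signal = np.sin(2 * np.pi * 0.3 * t) + 0.1 * np.random.randn(t.size)

# Each inhalation shows up as a peak; enforce a minimum spacing of 2 s
# between peaks so noise does not create spurious breaths.
peaks, _ = find_peaks(signal, distance=2 * fs, prominence=0.5)
breaths_per_minute = len(peaks) * 60 / (t[-1] - t[0])
print(f"Estimated breathing rate: {breaths_per_minute:.1f} breaths/min")
```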
An Architecture Supporting Configurable Autonomous Multimodal Joint-Attention-Therapy for Various Robotic Systems
Authors: André Groß, Christian Schütze, B. Wrede, Birte Richter
DOI: https://doi.org/10.1145/3536220.3558070 | Published: 2022-11-07
Abstract: In this paper, we present a software architecture for robot-assisted, configurable, and autonomous Joint-Attention-Training scenarios to support autism therapy. The focus of the work is the expandability of the architecture to different robots, as well as maximizing the usability of the interface for the therapeutic user. By evaluating the user experience, we draw initial conclusions about the usability of the system for computer scientists and non-computer scientists. Both groups could solve different tasks without any major issues, and the overall usability of the system was rated as good.
Citations: 5
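One common way to keep such an architecture expandable across robot platforms is to hide each robot behind a shared adapter interface. The sketch below is purely illustrative; the class and method names are invented and are not taken from the paper's system.

```python
# Hypothetical sketch of a robot-agnostic action interface, illustrating how
# an architecture can stay expandable across robot platforms.
from abc import ABC, abstractmethod


class RobotBackend(ABC):
    """Adapter that each supported robot platform implements."""

    @abstractmethod
    def look_at(self, x: float, y: float, z: float) -> None:
        """Direct the robot's gaze to a 3D point (joint-attention prompt)."""

    @abstractmethod
    def say(self, text: str) -> None:
        """Produce a spoken utterance."""


class ConsoleRobot(RobotBackend):
    """Stand-in backend that just prints, useful for testing the flow."""

    def look_at(self, x, y, z):
        print(f"[robot] looking at ({x:.2f}, {y:.2f}, {z:.2f})")

    def say(self, text):
        print(f"[robot] says: {text}")


class JointAttentionTrial:
    """One configurable prompt-and-look trial, independent of the robot used."""

    def __init__(self, robot: RobotBackend, prompt: str, target: tuple):
        self.robot = robot
        self.prompt = prompt
        self.target = target

    def run(self):
        self.robot.say(self.prompt)
        self.robot.look_at(*self.target)


if __name__ == "__main__":
    trial = JointAttentionTrial(ConsoleRobot(), "Look over there!", (0.5, 1.0, 1.2))
    trial.run()
```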
Speaker Motion Patterns during Self-repairs in Natural Dialogue
Authors: Elif Ecem Ozkan, Tom Gurion, J. Hough, P. Healey, L. Jamone
DOI: https://doi.org/10.1145/3536220.3563684 | Published: 2022-11-07
Abstract: An important milestone for any agent interacting with humans on a regular basis is to achieve natural and efficient methods of communication. Such strategies should be derived from the hallmarks of human-human interaction. So far, work on embodied conversational agents (ECAs) implementing such signals has predominantly imitated human-like positive back-channels, such as nodding, rather than engaging in active interaction. The field of Conversation Analysis (CA), which focuses on natural human dialogue, suggests that people continuously collaborate on achieving mutual understanding by frequently repairing misunderstandings as they happen. Detecting repairs from speech in real time is challenging, even with state-of-the-art Natural Language Processing (NLP) models. We present specific human motion patterns during key moments of interaction, namely self-initiated self-repairs, which could help agents recognise and collaboratively resolve speaker trouble. The features we present in this paper are the pairwise joint distances of the head and hands, which are more discriminative than the positions themselves.
Citations: 2
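The pairwise joint distances named above are straightforward to compute once per frame of motion data. The sketch below shows one plausible way to do this for head and hand keypoints; the coordinates are made up for illustration, not taken from the paper's data.

```python
# Minimal sketch: pairwise Euclidean distances between head and hand joints
# for a single frame, the type of feature described in the abstract above.
import itertools
import numpy as np

joints = {
    "head":       np.array([0.00, 1.70, 0.10]),
    "left_hand":  np.array([-0.35, 1.10, 0.30]),
    "right_hand": np.array([0.40, 1.05, 0.25]),
}

# One distance per unordered pair of joints.
features = {
    f"{a}-{b}": float(np.linalg.norm(joints[a] - joints[b]))
    for a, b in itertools.combinations(joints, 2)
}
print(features)  # {'head-left_hand': ..., 'head-right_hand': ..., 'left_hand-right_hand': ...}
```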
Predicting User Confidence in Video Recordings with Spatio-Temporal Multimodal Analytics
Authors: Andrew Emerson, Patrick Houghton, Ke Chen, Vinay Basheerabad, Rutuja Ubale, C. W. Leong
DOI: https://doi.org/10.1145/3536220.3558007 | Published: 2022-11-07
Abstract: A critical component of effective communication is the ability to project confidence. In video presentations (e.g., video interviews), many factors influence the confidence perceived by a listener. Advances in computer vision, speech processing, and natural language processing have enabled the automatic extraction of salient features that can be used to model a presenter's perceived confidence. Moreover, these multimodal features can be used to automatically give a user feedback on how they can improve their projected confidence. This paper introduces a multimodal approach to modeling user confidence in video presentations by leveraging features from visual cues (i.e., eye gaze) and speech patterns. We investigate the degree to which the extracted multimodal features are predictive of user confidence with a dataset of 48 two-minute videos, in which participants used a webcam and microphone to record themselves responding to a prompt. Comparative experimental results indicate that our modeling approach using both visual and speech features achieves 83% and 78% improvements over the random and majority-label baselines, respectively. We discuss the implications of using multimodal features for modeling confidence, as well as the potential for automated feedback to users who want to improve their confidence in video presentations.
Citations: 0
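A plausible minimal instantiation of the fusion step described above is early fusion: concatenating per-video gaze and speech feature vectors and fitting a simple classifier. The feature names, data, and classifier choice below are assumptions for illustration, not the paper's pipeline.

```python
# Hedged sketch: early fusion of gaze and speech feature vectors per video,
# followed by a simple classifier. All data here are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_videos = 48
gaze_feats = rng.normal(size=(n_videos, 4))    # e.g. fixation ratio, gaze dispersion, ...
speech_feats = rng.normal(size=(n_videos, 6))  # e.g. pitch stats, speaking rate, pauses, ...
labels = rng.integers(0, 2, size=n_videos)     # low vs high perceived confidence

# Early fusion: concatenate the modality-specific vectors for each video.
X = np.hstack([gaze_feats, speech_feats])
scores = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=5)
print(f"Cross-validated accuracy: {scores.mean():.2f}")
```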
Can you tell that I’m confused? An overhearer study for German backchannels by an embodied agent
Authors: Isabel Donya Meywirth, Jana Götze
DOI: https://doi.org/10.1145/3536220.3558804 | Published: 2022-11-07
Abstract: In spoken interaction, humans constantly display and interpret each other's state of understanding. For an embodied agent, displaying its internal state of understanding in an efficient manner can be an important means of making user interaction more natural and of initiating error recovery as early as possible. We carry out an overhearer study with 62 participants to investigate whether German verbal and non-verbal backchannels produced by the virtual Furhat embodied agent can be interpreted by an overhearer of a human-robot conversation. We compare three positive, three negative, and one neutral feedback reaction. We find that, even though it is difficult to generate certain verbal backchannels, our participants can recognize displays of understanding with an accuracy of up to 0.92. Attempts to communicate a lack of understanding are misunderstood more often (accuracy: 0.55), meaning that interaction designers need to craft them carefully in order for them to be useful for the interaction flow.
Citations: 1
Towards Multimodal Dialog-Based Speech & Facial Biomarkers of Schizophrenia
Authors: Vanessa Richter, Michael Neumann, Hardik Kothare, Oliver Roesler, J. Liscombe, David Suendermann-Oeft, Sebastian Prokop, Anzalee Khan, C. Yavorsky, J. Lindenmayer, Vikram Ramanarayanan
DOI: https://doi.org/10.1145/3536220.3558075 | Published: 2022-11-07
Abstract: We present a scalable multimodal dialog platform for the remote digital assessment and monitoring of schizophrenia. Patients diagnosed with schizophrenia and healthy controls interacted with Tina, a virtual conversational agent, as she guided them through a brief set of structured tasks, while their speech and facial video were streamed in real time to a back-end analytics module. Patients were concurrently assessed by trained raters on validated clinical scales. We find that multiple speech and facial biomarkers extracted from these data streams show significant differences (as measured by effect sizes) between patients and controls, and furthermore, that machine learning models built on such features can classify patients and controls with high sensitivity and specificity. We further investigate, using correlation analysis between the extracted metrics and standardized clinical scales for the assessment of schizophrenia symptoms, how such speech and facial biomarkers can provide further insight into schizophrenia symptomatology.
Citations: 2
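The group comparison reported above relies on effect sizes between patients and controls. As a minimal worked example of that statistic, the sketch below computes Cohen's d for a single hypothetical speech metric; the numbers are synthetic placeholders, not the study's data.

```python
# Minimal sketch: Cohen's d effect size for one speech metric between a
# patient group and a control group.
import numpy as np


def cohens_d(a: np.ndarray, b: np.ndarray) -> float:
    """Effect size using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)


rng = np.random.default_rng(1)
patients = rng.normal(loc=3.2, scale=0.8, size=40)  # e.g. mean pause duration (s)
controls = rng.normal(loc=2.5, scale=0.7, size=40)
print(f"Cohen's d: {cohens_d(patients, controls):.2f}")
```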
Contextual modulation of affect: Comparing humans and deep neural networks
Authors: Soomin Shin, Doo-Hyun Kim, C. Wallraven
DOI: https://doi.org/10.1145/3536220.3558036 | Published: 2022-11-07
Abstract: When inferring emotions, humans rely on a number of cues, including not only facial expressions and body posture but also expressor-external, contextual information. The goal of the present study was to compare the impact of such contextual information on emotion processing in humans and in two deep neural network (DNN) models. We used results from a human experiment in which two types of pictures were rated for valence and arousal: the first type depicted people expressing an emotion in a social context including other people; the second was a context-reduced version in which all information except the target expressor was blurred out. The resulting human ratings of valence and arousal were systematically lower in the context-reduced version, highlighting the importance of context. We then compared the human ratings with those of two DNN models (one trained on face images only, and the other also trained on contextual information). Analyses of both the categorical and the valence/arousal ratings showed that, despite some superficial similarities, both models failed to capture human rating patterns in both the context-rich and context-reduced conditions. Our study emphasizes the importance of a more holistic, multimodal training regime with richer human data for building better emotion-understanding systems in affective computing.
Citations: 1
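One simple way to quantify how well model ratings track human ratings in the two conditions described above is a per-condition correlation. The sketch below is an assumption-laden illustration with invented rating arrays, not the study's analysis pipeline.

```python
# Illustrative sketch: Pearson correlation between human and model valence
# ratings in context-rich vs context-reduced conditions (synthetic data).
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
human_context = rng.uniform(1, 9, size=30)                  # human valence, full context
human_reduced = human_context - rng.uniform(0.5, 1.5, 30)   # lower when context is blurred
model_context = human_context + rng.normal(0, 2.0, 30)      # model only loosely tracks humans
model_reduced = human_reduced + rng.normal(0, 2.0, 30)

for name, h, m in [("context-rich", human_context, model_context),
                   ("context-reduced", human_reduced, model_reduced)]:
    r, p = pearsonr(h, m)
    print(f"{name}: r = {r:.2f} (p = {p:.3f})")
```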
Impact of aesthetic movie highlights on semantics and emotions: a preliminary analysis
Authors: Michal Muszynski, Elenor Morgenroth, Laura Vilaclara, D. Van de Ville, P. Vuilleumier
DOI: https://doi.org/10.1145/3536220.3558544 | Published: 2022-11-07
Abstract: Aesthetic highlight detection is a challenge for understanding the affective processes underlying emotional movie experience. Aesthetic highlights in movies are scenes with aesthetic value and attributes in terms of form and content. A deep understanding of human emotions while watching movies, and the automatic recognition of emotions evoked by watching movies, are critically important for a wide range of applications, such as affective content creation, analysis, and summarization. Many empirical studies on emotions have formulated theory-driven and data-driven models to uncover the underlying mechanisms of emotion using discrete and dimensional paradigms. Nevertheless, these approaches do not fully reveal all the underlying processes of emotional experience. Recent neuroscience findings have led to the development of multi-process frameworks that aim to characterize emotion as a multi-componential phenomenon. In particular, multi-process frameworks can be useful for studying emotional movie experience. In this work, we carry out a statistical analysis of the componential paradigm of emotions while watching aesthetic highlights in full-length movies. We focus on the effect of the aesthetic highlights on the intensity of emotional movie experience. We explore the occurrence frequency of different semantic categories involved in constructing different types of aesthetic highlights. Moreover, we investigate the applicability of machine learning classifiers for predicting aesthetic highlights from features based on movie scene semantics.
Citations: 0
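To make the final step above concrete, a classifier can be trained on per-scene semantic category frequencies to predict highlight labels. The category names, counts, and classifier choice in the sketch below are invented; the abstract does not specify the paper's feature schema.

```python
# Hedged sketch: predicting aesthetic-highlight labels from semantic category
# frequencies per movie scene, using synthetic data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

categories = ["faces", "landscape", "music", "dialogue", "action"]
rng = np.random.default_rng(3)
X = rng.poisson(lam=2.0, size=(200, len(categories)))  # per-scene category counts
y = rng.integers(0, 2, size=200)                       # 1 = aesthetic highlight

scores = cross_val_score(RandomForestClassifier(n_estimators=100), X, y, cv=5)
print(f"Cross-validated accuracy: {scores.mean():.2f}")
```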
Symbiosis: Design and Development of Novel Soft Robotic Structures for Interactive Public Spaces
Authors: A. Chooi, Truman Stalin, Aby Raj Plamootil Mathai, Arturo Castillo Ugalde, Yixiao Wang, E. Kanhere, G. Hiramandala, Deborah Loh, P. Valdivia y Alvarado
DOI: https://doi.org/10.1145/3536220.3558541 | Published: 2022-11-07
Abstract: High-rise concrete structures and crowded public spaces are familiar scenes in today's fast-paced world, leaving people with less restorative time in nature. Interior designers and architects have therefore strived to incorporate nature-inspired installations into architectural design in an attempt to make interior spaces more restorative. In this paper, we explored the development of nature-inspired robotic structures for playful and interactive experiences, which are essential factors contributing to an inhabitant's perceived restorativeness. Our work focused on developing three kinetic-art-based "robotic plant" installations for indoor public spaces: soft oscillating lalang (Imperata cylindrica) fields, a blooming dandelion-like flower, and a blooming lotus-like flower. These installations aim to create a relaxing and restorative user experience through bio-inspired design with playful and pleasant plant-human interactions. During the art exhibition, the mesmerizing kinetic movement of the devices and their physical interactivity successfully attracted visitors to engage with them. This paper discusses the design and development of these three robotic plant installations.
Citations: 0
Investigating Transformer Encoders and Fusion Strategies for Speech Emotion Recognition in Emergency Call Center Conversations
Authors: Théo Deschamps-Berger, L. Lamel, L. Devillers
DOI: https://doi.org/10.1145/3536220.3558038 | Published: 2022-11-07
Abstract: There has been growing interest in using deep learning techniques to recognize emotions from speech. However, real-life emotion datasets collected in call centers are relatively rare and small, making the use of deep learning techniques quite challenging. This research focuses on Transformer-based models to improve speech emotion recognition of patients' speech in French emergency call center dialogues. The experiments were conducted on a corpus called CEMO, which was collected in a French emergency call center and includes telephone conversations with more than 800 callers and 6 agents. Four emotion classes were selected for these experiments: Anger, Fear, Positive, and Neutral state. We compare different Transformer encoders based on the wav2vec2 and BERT models, and explore their fine-tuning as well as fusion of the encoders for emotion recognition from speech. Our objective is to explore how to use these pre-trained models to improve model robustness in the context of a real-life application. We show that using specific pre-trained Transformer encoders improves model performance for emotion recognition on the CEMO corpus. The Unweighted Accuracy (UA) of the French pre-trained wav2vec2 adapted to our task is 73.1%, whereas the UA of our baseline model (a temporal CNN-LSTM without pre-training) is 55.8%. We also tested BERT encoder models; in particular, FlauBERT obtained good performance on both manual (67.1%) and automatic (67.9%) transcripts. Late and model-level fusion of the speech and text models also improves performance (late: 77.1%, model-level: 76.9%) compared to our best pre-trained speech model at 73.1% UA. To situate our work in the scientific community, we also report results on the widely used IEMOCAP corpus with our best fusion strategy: 70.8% UA. Our results are promising for constructing more robust speech emotion recognition systems for real-world applications.
Citations: 8
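The late-fusion strategy mentioned above is, in its simplest form, an averaging of the class-probability outputs of the speech and text branches. The sketch below shows that idea with placeholder probability vectors; it does not reproduce the paper's exact fusion weights or model architectures.

```python
# Minimal sketch of late fusion: averaging per-class probabilities from a
# speech classifier (e.g. wav2vec2-based) and a text classifier (e.g.
# FlauBERT-based). The probability vectors are placeholders.
import numpy as np

classes = ["Anger", "Fear", "Positive", "Neutral"]
p_speech = np.array([0.55, 0.20, 0.10, 0.15])  # softmax output of the speech branch
p_text = np.array([0.30, 0.40, 0.10, 0.20])    # softmax output of the text branch

# Simple unweighted late fusion; weighted averaging is a common variant.
p_fused = (p_speech + p_text) / 2
print(classes[int(np.argmax(p_fused))])  # predicted emotion after fusion
```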