{"title":"FEIN-Z: Autoregressive Behavior Cloning for Speech-Driven Gesture Generation","authors":"Leon Harz, Hendric Voß, Stefan Kopp","doi":"10.1145/3577190.3616115","DOIUrl":"https://doi.org/10.1145/3577190.3616115","url":null,"abstract":"Human communication relies on multiple modalities such as verbal expressions, facial cues, and bodily gestures. Developing computational approaches to process and generate these multimodal signals is critical for seamless human-agent interaction. A particular challenge is the generation of co-speech gestures due to the large variability and number of gestures that can accompany a verbal utterance, leading to a one-to-many mapping problem. This paper presents an approach based on a Feature Extraction Infusion Network (FEIN-Z) that adopts insights from robot imitation learning and applies them to co-speech gesture generation. Building on the BC-Z architecture, our framework combines transformer architectures and Wasserstein generative adversarial networks. We describe the FEIN-Z methodology and evaluation results obtained within the GENEA Challenge 2023, demonstrating good results and significant improvements in human-likeness over the GENEA baseline. We discuss potential areas for improvement, such as refining input segmentation, employing more fine-grained control networks, and exploring alternative inference methods.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135043301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"How Noisy is Too Noisy? The Impact of Data Noise on Multimodal Recognition of Confusion and Conflict During Collaborative Learning","authors":"Yingbo Ma, Mehmet Celepkolu, Kristy Elizabeth Boyer, Collin F. Lynch, Eric Wiebe, Maya Israel","doi":"10.1145/3577190.3614127","DOIUrl":"https://doi.org/10.1145/3577190.3614127","url":null,"abstract":"Intelligent systems to support collaborative learning rely on real-time behavioral data, including language, audio, and video. However, noisy data, such as word errors in speech recognition, audio static or background noise, and facial mistracking in video, often limit the utility of multimodal data. It is an open question of how we can build reliable multimodal models in the face of substantial data noise. In this paper, we investigate the impact of data noise on the recognition of confusion and conflict moments during collaborative programming sessions by 25 dyads of elementary school learners. We measure language errors with word error rate (WER), audio noise with speech-to-noise ratio (SNR), and video errors with frame-by-frame facial tracking accuracy. The results showed that the model’s accuracy for detecting confusion and conflict in the language modality decreased drastically from 0.84 to 0.73 when the WER exceeded 20%. Similarly, in the audio modality, the model’s accuracy decreased sharply from 0.79 to 0.61 when the SNR dropped below 5 dB. Conversely, the model’s accuracy remained relatively constant in the video modality at a comparable level (> 0.70) so long as at least one learner’s face was successfully tracked. Moreover, we trained several multimodal models and found that integrating multimodal data could effectively offset the negative effect of noise in unimodal data, ultimately leading to improved accuracy in recognizing confusion and conflict. These findings have practical implications for the future deployment of intelligent systems that support collaborative learning in actual classroom settings.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"273 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Video-based Respiratory Waveform Estimation in Dialogue: A Novel Task and Dataset for Human-Machine Interaction","authors":"Takao Obi, Kotaro Funakoshi","doi":"10.1145/3577190.3614154","DOIUrl":"https://doi.org/10.1145/3577190.3614154","url":null,"abstract":"Respiration is closely related to speech, so respiratory information is useful for improving human-machine multimodal spoken interaction from various perspectives. A machine-learning task is presented for multimodal interactive systems to improve the compatibility of the systems and promote smooth interaction with them. This “video-based respiration waveform estimation (VRWE)” task consists of two subtasks: waveform amplitude estimation and waveform gradient estimation. A dataset consisting of respiratory data for 30 participants was created for this task, and a strong baseline method based on 3DCNN-ConvLSTM was evaluated on the dataset. Finally, VRWE, especially gradient estimation, was shown to be effective in predicting user voice activity after 200 ms. These results suggest that VRWE is effective for improving human-machine multimodal interaction.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gait Event Prediction of People with Cerebral Palsy using Feature Uncertainty: A Low-Cost Approach","authors":"Saikat Chakraborty, Noble Thomas, Anup Nandy","doi":"10.1145/3577190.3614125","DOIUrl":"https://doi.org/10.1145/3577190.3614125","url":null,"abstract":"Incorporation of feature uncertainty during model construction explores the real generalization ability of that model. But this factor has been avoided often during automatic gait event detection for Cerebral Palsy patients. Again, the prevailing vision-based gait event detection systems are expensive due to incorporation of high-end motion tracking cameras. This study proposes a low-cost gait event detection system for heel strike and toe-off events. A state-space model was constructed where the temporal evolution of gait signal was devised by quantifying feature uncertainty. The model was trained using Cardiff classifier. Ankle velocity was taken as the input feature. The frame associated with state transition was marked as a gait event. The model was tested on 15 Cerebral Palsy patients and 15 normal subjects. Data acquisition was performed using low-cost Kinect cameras. The model identified gait events on an average of 2 frame error. All events were predicted before the actual occurrence. Error for toe-off was less than the heel strike. Incorporation of the uncertainty factor in the detection of gait events exhibited a competing performance with respect to state-of-the-art.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"36 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"µGeT: Multimodal eyes-free text selection technique combining touch interaction and microgestures","authors":"Gauthier Robert Jean Faisandaz, Alix Goguey, Christophe Jouffrais, Laurence Nigay","doi":"10.1145/3577190.3614131","DOIUrl":"https://doi.org/10.1145/3577190.3614131","url":null,"abstract":"We present µGeT, a novel multimodal eyes-free text selection technique. µGeT combines touch interaction with microgestures. µGeT is especially suited for People with Visual Impairments (PVI) by expanding the input bandwidth of touchscreen devices, thus shortening the interaction paths for routine tasks. To do so, µGeT extends touch interaction (left/right and up/down flicks) using two simple microgestures: thumb touching either the index or the middle finger. For text selection, the multimodal technique allows us to directly modify the positioning of the two selection handles and the granularity of text selection. Two user studies, one with 9 PVI and one with 8 blindfolded sighted people, compared µGeT with a baseline common technique (VoiceOver like on iPhone). Despite a large variability in performance, the two user studies showed that µGeT is globally faster and yields fewer errors than VoiceOver. A detailed analysis of the interaction trajectories highlights the different strategies adopted by the participants. Beyond text selection, this research shows the potential of combining touch interaction and microgestures for improving the accessibility of touchscreen devices for PVI.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring Feedback Modality Designs to Improve Young Children's Collaborative Actions","authors":"Amy Melniczuk, Egesa Vrapi","doi":"10.1145/3577190.3614140","DOIUrl":"https://doi.org/10.1145/3577190.3614140","url":null,"abstract":"Tangible user interfaces offer the benefit of incorporating physical aspects in the interaction with digital systems, enriching how system information can be conveyed. We investigated how visual, haptic, and audio modalities influence young children’s joint actions. We used a design-based research method to design and develop a multi-sensory tangible device. Two kindergarten teachers and 31 children were involved in our design process. We tested the final prototype with 20 children aged 5-6 from three kindergartens. The main findings were: a) involving and getting approval from kindergarten teachers in the design process was essential; b) simultaneously providing visual and audio feedback might help improve children’s collaborative actions. Our study was an interdisciplinary research on human-computer interaction and children’s education, which contributed an empirical understanding of the factors influencing children collaboration and communication.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Out of Sight,... How Asymmetry in Video-Conference Affects Social Interaction","authors":"Camille Sallaberry, Gwenn Englebienne, Jan Van Erp, Vanessa Evers","doi":"10.1145/3577190.3614168","DOIUrl":"https://doi.org/10.1145/3577190.3614168","url":null,"abstract":"As social-mediated interaction is becoming increasingly important and multi-modal, even expanding into virtual reality and physical telepresence with robotic avatars, new challenges emerge. For instance, video calls have become the norm and it is increasingly common that people experience a form of asymmetry, such as not being heard or seen by their communication partners online due to connection issues. Previous research has not yet extensively explored the effect on social interaction. In this study, 61 Dyads, i.e. 122 adults, played a quiz-like game using a video-conferencing platform and evaluated the quality of their social interaction by measuring five sub-scales of social presence. The Dyads had either symmetrical access to social cues (both only audio, or both audio and video) or asymmetrical access (one partner receiving only audio, the other audio and video). Our results showed that in the case of asymmetrical access, the party receiving more modalities, i.e. audio and video from the other, felt significantly less connected than their partner. We discuss these results in relation to the Media Richness Theory (MRT) and the Hyperpersonal Model: in asymmetry, more modalities or cues will not necessarily increase feeling socially connected, in opposition to what was predicted by MRT. We hypothesize that participants sending fewer cues compensate by increasing the richness of their expressions and that the interaction shifts towards an equivalent richness for both participants.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Speech Patterns to Model the Dimensions of Teamness in Human-Agent Teams","authors":"Emily Doherty, Cara A Spencer, Lucca Eloy, Nitin Kumar, Rachel Dickler, Leanne Hirshfield","doi":"10.1145/3577190.3614121","DOIUrl":"https://doi.org/10.1145/3577190.3614121","url":null,"abstract":"Teamness is a newly proposed multidimensional construct aimed to characterize teams and their dynamic levels of interdependence over time. Specifically, teamness is deeply rooted in team cognition literature, considering how a team’s composition, processes, states, and actions affect collaboration. With this multifaceted construct being recently proposed, there is a call to the research community to investigate, measure, and model dimensions of teamness. In this study, we explored the speech content of 21 human-human-agent teams during a remote collaborative search task. Using self-report surveys of their social and affective states throughout the task, we conducted factor analysis to condense the survey measures into four components closely aligned with the dimensions outlined in the teamness framework: social dynamics and trust, affect, cognitive load, and interpersonal reliance. We then extracted features from teams’ speech using Linguistic Inquiry and Word Count (LIWC) and performed Epistemic Network Analyses (ENA) across these four teamwork components as well as team performance. We developed six hypotheses of how we expected specific LIWC features to correlate with self-reported team processes and performance, which we investigated through our ENA analyses. Through quantitative and qualitative analyses of the networks, we explore differences of speech patterns across the four components and relate these findings to the dimensions of teamness. Our results indicate that ENA models based on selected LIWC features were able to capture elements of teamness as well as team performance; this technique therefore shows promise for modeling of these states during CSCW, to ultimately design intelligent systems to promote greater teamness using speech-based measures.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Projecting life onto machines","authors":"Simone Natale","doi":"10.1145/3577190.3616522","DOIUrl":"https://doi.org/10.1145/3577190.3616522","url":null,"abstract":"Public discussions and imaginaries about AI often center around the idea that technologies such as neural networks might one day lead to the emergence of machines that think or even feel like humans. Drawing on histories of how people project lives onto talking things, from spiritualist seances in the Victorian era to contemporary advances in robotics, this talk argues that the “lives” of AI have more to do with how humans perceive and relate to machines exhibiting communicative behavior, than with the functioning of computing technologies in itself. Taking up this point of view helps acknowledge and further interrogate how perceptions and cultural representations inform the outcome of technologies that are programmed to interact and communicate with human users.","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135045189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ReNeLiB: Real-time Neural Listening Behavior Generation for Socially Interactive Agents","authors":"Daksitha Senel Withanage Don, Philipp Müller, Fabrizio Nunnari, Elisabeth André, Patrick Gebhard","doi":"10.1145/3577190.3614133","DOIUrl":"https://doi.org/10.1145/3577190.3614133","url":null,"abstract":"Flexible and natural nonverbal reactions to human behavior remain a challenge for socially interactive agents (SIAs) that are predominantly animated using hand-crafted rules. While recently proposed machine learning based approaches to conversational behavior generation are a promising way to address this challenge, they have not yet been employed in SIAs. The primary reason for this is the lack of a software toolkit integrating such approaches with SIA frameworks that conforms to the challenging real-time requirements of human-agent interaction scenarios. In our work, we for the first time present such a toolkit consisting of three main components: (1) real-time feature extraction capturing multi-modal social cues from the user; (2) behavior generation based on a recent state-of-the-art neural network approach; (3) visualization of the generated behavior supporting both FLAME-based and Apple ARKit-based interactive agents. We comprehensively evaluate the real-time performance of the whole framework and its components. In addition, we introduce pre-trained behavioral generation models derived from psychotherapy sessions for domain-specific listening behaviors. Our software toolkit, pivotal for deploying and assessing SIAs’ listening behavior in real-time, is publicly available. Resources, including code, behavioural multi-modal features extracted from therapeutic interactions, are hosted at https://daksitha.github.io/ReNeLib","PeriodicalId":93171,"journal":{"name":"Companion Publication of the 2020 International Conference on Multimodal Interaction","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"135044542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}