Finally on Par?! Multimodal and Unimodal Interaction for Open Creative Design Tasks in Virtual Reality
C. Zimmerer, Erik Wolf, Sara Wolf, Martin Fischbach, Jean-Luc Lugrin, Marc Erich Latoschik
Proceedings of the 2020 International Conference on Multimodal Interaction, 2020-10-21. DOI: https://doi.org/10.1145/3382507.3418850
Abstract: Multimodal Interfaces (MMIs) have been considered a promising interaction paradigm for Virtual Reality (VR) for some time. However, they are still far less common than unimodal interfaces (UMIs). This paper presents a summative user study comparing an MMI to a typical UMI for a design task in VR. We developed an application targeting creative 3D object manipulations, i.e., creating 3D objects and modifying typical object properties such as color or size. The associated open user task is based on the Torrance Tests of Creative Thinking. We compared a synergistic multimodal interface using speech-accompanied pointing/grabbing gestures with a more typical unimodal interface using a hierarchical radial menu to trigger actions on selected objects. Independent judges rated the creativity of the resulting products using the Consensual Assessment Technique. Additionally, we measured the creativity-promoting factors flow, usability, and presence. Our results show that the MMI performs on par with the UMI in all measurements despite its limited flexibility and reliability. These promising results demonstrate the technological maturity of MMIs and their potential to efficiently extend traditional interaction techniques in VR.

Influence of Electric Taste, Smell, Color, and Thermal Sensory Modalities on the Liking and Mediated Emotions of Virtual Flavor Perception
Nimesha Ranasinghe, Meetha Nesam James, Michael Gecawicz, Jonathan Bland, David Smith
Proceedings of the 2020 International Conference on Multimodal Interaction, 2020-10-21. DOI: https://doi.org/10.1145/3382507.3418862
Abstract: Little is known about the influence of sensory modalities such as taste, smell, color, and thermal cues on the perception of simulated flavor sensations, let alone their influence on people's emotions and liking. Although flavor sensations are essential in our daily experiences and closely associated with our memories and emotions, the concept of flavor and the emotions elicited by different sensory modalities are not thoroughly integrated into Virtual and Augmented Reality technologies. Hence, this paper presents 1) an interactive technology that simulates different flavor sensations by overlaying taste (via electrical stimulation on the tongue), smell (via micro air pumps), color (via RGB lights), and thermal (via Peltier elements) sensations on plain water, and 2) a set of experiments investigating a) the influence of different sensory modalities on the perception and liking of virtual flavors and b) the varying emotions mediated through virtual flavor sensations. Our findings reveal that the participants perceived and liked various stimulus configurations and mostly associated them with positive emotions, while highlighting important avenues for future research.

{"title":"Fusical: Multimodal Fusion for Video Sentiment","authors":"Bo Jin, L. Abdelrahman, C. Chen, Amil Khanzada","doi":"10.1145/3382507.3417966","DOIUrl":"https://doi.org/10.1145/3382507.3417966","url":null,"abstract":"Determining the emotional sentiment of a video remains a challenging task that requires multimodal, contextual understanding of a situation. In this paper, we describe our entry into the EmotiW 2020 Audio-Video Group Emotion Recognition Challenge to classify group videos containing large variations in language, people, and environment, into one of three sentiment classes. Our end-to-end approach consists of independently training models for different modalities, including full-frame video scenes, human body keypoints, embeddings extracted from audio clips, and image-caption word embeddings. Novel combinations of modalities, such as laughter and image-captioning, and transfer learning are further developed. We use fully-connected (FC) fusion ensembling to aggregate the modalities, achieving a best test accuracy of 63.9% which is 16 percentage points higher than that of the baseline ensemble.","PeriodicalId":402394,"journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128703912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Supporting Instructors to Provide Emotional and Instructional Scaffolding for English Language Learners through Biosensor-based Feedback","authors":"Heera Lee","doi":"10.1145/3382507.3421159","DOIUrl":"https://doi.org/10.1145/3382507.3421159","url":null,"abstract":"Delivering a presentation has been reported as one of the most anxiety-provoking tasks faced by English Language Learners. Researchers suggest that instructors should be more aware of the learners' emotional states to provide appropriate emotional and instructional scaffolding to improve their performance when presenting. Despite the critical role of instructors in perceiving the emotional states among English language learners, it can be challenging to do this solely by observing the learners? facial expressions, behaviors, and their limited verbal expressions due to language and cultural barriers. To address the ambiguity and inconsistency in interpreting the emotional states of the students, this research focuses on identifying the potential of using biosensor-based feedback of learners to support instructors. A novel approach has been adopted to classify the intensity and characteristics of public speaking anxiety and foreign language anxiety among English language learners and to provide tailored feedback to instructors while supporting teaching and learning. As part of this work, two further studies were proposed. The first study was designed to identify educators' needs for solutions providing emotional and instructional support. The second study aims to evaluate a resulting prototype as a view of instructors to offer tailored emotional and instructional scaffolding to students. The contribution of these studies includes the development of guidance in using biosensor-based feedback that will assist English language instructors in teaching and identifying the students' anxiety levels and types while delivering a presentation.","PeriodicalId":402394,"journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126220980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Touch Recognition with Attentive End-to-End Model","authors":"Wail El Bani, M. Chetouani","doi":"10.1145/3382507.3418834","DOIUrl":"https://doi.org/10.1145/3382507.3418834","url":null,"abstract":"Touch is the earliest sense to develop and the first mean of contact with the external world. Touch also plays a key role in our socio-emotional communication: we use it to communicate our feelings, elicit strong emotions in others and modulate behavior (e.g compliance). Although its relevance, touch is an understudied modality in Human-Machine-Interaction compared to audition and vision. Most of the social touch recognition systems require a feature engineering step making them difficult to compare and to generalize to other databases. In this paper, we propose an end-to-end approach. We present an attention-based end-to-end model for touch gesture recognition evaluated on two public datasets (CoST and HAART) in the context of the ICMI 15 Social Touch Challenge. Our model gave a similar level of accuracy: 61% for CoST and 68% for HAART and uses self-attention as an alternative to feature engineering and Recurrent Neural Networks.","PeriodicalId":402394,"journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114707871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LDNN: Linguistic Knowledge Injectable Deep Neural Network for Group Cohesiveness Understanding
Yanan Wang, Jianming Wu, Jinfa Huang, Gen Hattori, Y. Takishima, Shinya Wada, Rui Kimura, Jie Chen, Satoshi Kurihara
Proceedings of the 2020 International Conference on Multimodal Interaction, 2020-10-21. DOI: https://doi.org/10.1145/3382507.3418830
Abstract: Group cohesiveness reflects the level of intimacy that people feel with each other, and a dialogue robot that can understand group cohesiveness would help promote human communication. However, group cohesiveness is a complex concept that is difficult to predict from image pixels alone. Inspired by the fact that humans intuitively associate linguistic knowledge accumulated in the brain with the visual images they see, we propose a linguistic knowledge injectable deep neural network (LDNN) that builds a visual model (visual LDNN) for predicting group cohesiveness, which automatically associates the linguistic knowledge hidden behind images. LDNN consists of a visual encoder and a language encoder, and applies domain adaptation and a linguistic knowledge transition mechanism to transfer linguistic knowledge from a language model to the visual LDNN. We train LDNN by adding descriptions to the training and validation sets of the Group AFfect Dataset 3.0 (GAF 3.0) and test the visual LDNN without any descriptions. Comparing the visual LDNN with various fine-tuned DNN models and three state-of-the-art models on the test set, the results demonstrate that the visual LDNN not only improves on the fine-tuned DNN models, reaching an MSE very similar to the state of the art, but is also a practical and efficient method that requires relatively little preprocessing. Furthermore, ablation studies confirm that LDNN is an effective way to inject linguistic knowledge into visual models.

Multi-modal Fusion Using Spatio-temporal and Static Features for Group Emotion Recognition
Mo Sun, Jian Li, Hui Feng, Wei Gou, Haifeng Shen, Jian-Bo Tang, Yi Yang, Jieping Ye
Proceedings of the 2020 International Conference on Multimodal Interaction, 2020-10-21. DOI: https://doi.org/10.1145/3382507.3417971
Abstract: This paper presents our approach for the Audio-video Group Emotion Recognition sub-challenge of EmotiW 2020. The task is to classify a video into one of three group emotions: positive, neutral, or negative. Our approach exploits two feature levels: spatio-temporal features and static features. At the spatio-temporal level, we feed multiple input modalities (RGB, RGB difference, optical flow, and warped optical flow) into several video classification networks to train the spatio-temporal model. At the static level, we crop all faces and bodies in each image using a state-of-the-art human pose estimation method and train several CNNs with the image-level group-emotion labels. Finally, we fuse the results of all 14 models, achieving third place in this sub-challenge with classification accuracies of 71.93% and 70.77% on the validation set and test set, respectively.

{"title":"The AI-Medic: A Multimodal Artificial Intelligent Mentor for Trauma Surgery","authors":"Edgar Rojas-Muñoz, K. Couperus, J. Wachs","doi":"10.1145/3382507.3421167","DOIUrl":"https://doi.org/10.1145/3382507.3421167","url":null,"abstract":"Telementoring generalist surgeons as they treat patients can be essential when in situ expertise is not readily available. However, adverse cyber-attacks, unreliable network conditions, and remote mentors' predisposition can significantly jeopardize the remote intervention. To provide medical practitioners with guidance when mentors are unavailable, we present the AI-Medic, the initial steps towards the development of a multimodal intelligent artificial system for autonomous medical mentoring. The system uses a tablet device to acquire the view of an operating field. This imagery is provided to an encoder-decoder neural network trained to predict medical instructions from the current view of a surgery. The network was training using DAISI, a dataset including images and instructions providing step-by-step demonstrations of surgical procedures. The predicted medical instructions are conveyed to the user via visual and auditory modalities.","PeriodicalId":402394,"journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130087006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hand-eye Coordination for Textual Difficulty Detection in Text Summarization","authors":"Jun Wang, G. Ngai, H. Leong","doi":"10.1145/3382507.3418831","DOIUrl":"https://doi.org/10.1145/3382507.3418831","url":null,"abstract":"The task of summarizing a document is a complex task that requires a person to multitask between reading and writing processes. Since a person's cognitive load during reading or writing is known to be dependent upon the level of comprehension or difficulty of the article, this suggests that it should be possible to analyze the cognitive process of the user when carrying out the task, as evidenced through their eye gaze and typing features, to obtain an insight into the different difficulty levels. In this paper, we categorize the summary writing process into different phases and extract different gaze and typing features from each phase according to characteristics of eye-gaze behaviors and typing dynamics. Combining these multimodal features, we build a classifier that achieves an accuracy of 91.0% for difficulty level detection, which is around 55% performance improvement above the baseline and at least 15% improvement above models built on a single modality. We also investigate the possible reasons for the superior performance of our multimodal features.","PeriodicalId":402394,"journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130308151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multimodal Interaction in Psychopathology","authors":"Itir Onal Ertugrul, J. Cohn, Hamdi Dibeklioğlu","doi":"10.1145/3382507.3419751","DOIUrl":"https://doi.org/10.1145/3382507.3419751","url":null,"abstract":"This paper presents an introduction to the Multimodal Interaction in Psychopathology workshop, which is held virtually in conjunction with the 22nd ACM International Conference on Multimodal Interaction on October 25th, 2020. This workshop has attracted submissions in the context of investigating multimodal interaction to reveal mechanisms and assess, monitor, and treat psychopathology. Keynote speakers from diverse disciplines present an overview of the field from different vantages and comment on future directions. Here we summarize the goals and the content of the workshop.","PeriodicalId":402394,"journal":{"name":"Proceedings of the 2020 International Conference on Multimodal Interaction","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126637592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}