{"title":"Towards Automated Assessment of Public Speaking Skills Using Multimodal Cues","authors":"L. Chen, G. Feng, Jilliam Joe, C. W. Leong, Christopher Kitchen, Chong Min Lee","doi":"10.1145/2663204.2663265","DOIUrl":"https://doi.org/10.1145/2663204.2663265","url":null,"abstract":"Traditional assessments of public speaking skills rely on human scoring. We report an initial study on the development of an automated scoring model for public speaking performances using multimodal technologies. Task design, rubric development, and human rating were conducted according to standards in educational assessment. An initial corpus of 17 speakers with 4 speaking tasks was collected using audio, video, and 3D motion capturing devices. A scoring model based on basic features in the speech content, speech delivery, and hand, body, and head movements significantly predicts human rating, suggesting the feasibility of using multimodal technologies in the assessment of public speaking skills.","PeriodicalId":389037,"journal":{"name":"Proceedings of the 16th International Conference on Multimodal Interaction","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116895641","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Why We Watch the News: A Dataset for Exploring Sentiment in Broadcast Video News","authors":"Joseph G. Ellis, Brendan Jou, Shih-Fu Chang","doi":"10.1145/2663204.2663237","DOIUrl":"https://doi.org/10.1145/2663204.2663237","url":null,"abstract":"We present a multimodal sentiment study performed on a novel collection of videos mined from broadcast and cable television news programs. To the best of our knowledge, this is the first dataset released for studying sentiment in the domain of broadcast video news. We describe our algorithm for the processing and creation of person-specific segments from news video, yielding 929 sentence-length videos, and are annotated via Amazon Mechanical Turk. The spoken transcript and the video content itself are each annotated for their expression of positive, negative or neutral sentiment. Based on these gathered user annotations, we demonstrate for news video the importance of taking into account multimodal information for sentiment prediction, and in particular, challenging previous text-based approaches that rely solely on available transcripts. We show that as much as 21.54% of the sentiment annotations for transcripts differ from their respective sentiment annotations when the video clip itself is presented. We present audio and visual classification baselines over a three-way sentiment prediction of positive, negative and neutral, as well as person-dependent versus person-independent classification influence on performance. Finally, we release the News Rover Sentiment dataset to the greater research community.","PeriodicalId":389037,"journal":{"name":"Proceedings of the 16th International Conference on Multimodal Interaction","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114359928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"UM3I 2014: International Workshop on Understanding and Modeling Multiparty, Multimodal Interactions","authors":"S. Moubayed, D. Bohus, A. Esposito, D. Heylen, Maria Koutsombogera, Haris Papageorgiou, Gabriel Skantze","doi":"10.1145/2663204.2668321","DOIUrl":"https://doi.org/10.1145/2663204.2668321","url":null,"abstract":"In this paper, we present a brief summary of the international workshop on Modeling Multiparty, Multimodal Interactions. The UM3I 2014 workshop is held in conjunction with the ICMI 2014 conference. The workshop will highlight recent developments and adopted methodologies in the analysis and modeling of multiparty and multimodal interactions, the design and implementation principles of related human-machine interfaces, as well as the identification of potential limitations and ways of overcoming them.","PeriodicalId":389037,"journal":{"name":"Proceedings of the 16th International Conference on Multimodal Interaction","volume":"196 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114198710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring a Model of Gaze for Grounding in Multimodal HRI","authors":"Gregor Mehlmann, M. Häring, Kathrin Janowski, Tobias Baur, Patrick Gebhard, E. André","doi":"10.1145/2663204.2663275","DOIUrl":"https://doi.org/10.1145/2663204.2663275","url":null,"abstract":"Grounding is an important process that underlies all human interaction. Hence, it is crucial for building social robots that are expected to collaborate effectively with humans. Gaze behavior plays versatile roles in establishing, maintaining and repairing the common ground. Integrating all these roles in a computational dialog model is a complex task since gaze is generally combined with multiple parallel information modalities and involved in multiple processes for the generation and recognition of behavior. Going beyond related work, we present a modeling approach focusing on these multi-modal, parallel and bi-directional aspects of gaze that need to be considered for grounding and their interleaving with the dialog and task management. We illustrate and discuss the different roles of gaze as well as advantages and drawbacks of our modeling approach based on a first user study with a technically sophisticated shared workspace application with a social humanoid robot.","PeriodicalId":389037,"journal":{"name":"Proceedings of the 16th International Conference on Multimodal Interaction","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133585260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Managing Human-Robot Engagement with Forecasts and... um... Hesitations","authors":"D. Bohus, E. Horvitz","doi":"10.1145/2663204.2663241","DOIUrl":"https://doi.org/10.1145/2663204.2663241","url":null,"abstract":"We explore methods for managing conversational engagement in open-world, physically situated dialog systems. We investigate a self-supervised methodology for constructing forecasting models that aim to anticipate when participants are about to terminate their interactions with a situated system. We study how these models can be leveraged to guide a disengagement policy that uses linguistic hesitation actions, such as filled and non-filled pauses, when uncertainty about the continuation of engagement arises. The hesitations allow for additional time for sensing and inference, and convey the system's uncertainty. We report results from a study of the proposed approach with a directions-giving robot deployed in the wild.","PeriodicalId":389037,"journal":{"name":"Proceedings of the 16th International Conference on Multimodal Interaction","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129695502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Detecting conversing groups with a single worn accelerometer","authors":"H. Hung, G. Englebienne, L. C. Quiros","doi":"10.1145/2663204.2663228","DOIUrl":"https://doi.org/10.1145/2663204.2663228","url":null,"abstract":"In this paper we propose the novel task of detecting groups of conversing people using only a single body-worn accelerometer per person. Our approach estimates each individual's social actions and uses the co-ordination of these social actions between pairs to identify group membership. The aim of such an approach is to be deployed in dense crowded environments. Our work differs significantly from previous approaches, which have tended to rely on audio and/or proximity sensing, often in much less crowded scenarios, for estimating whether people are talking together or who is speaking. Ultimately, we are interested in detecting who is speaking, who is conversing with whom, and from that, to infer socially relevant information about the interaction such as whether people are enjoying themselves, or the quality of their relationship in these extremely dense crowded scenarios. Striving towards this long-term goal, this paper presents a systematic study to understand how to detect groups of people who are conversing together in this setting, where we achieve a $64%$ classification accuracy using a fully automated system.","PeriodicalId":389037,"journal":{"name":"Proceedings of the 16th International Conference on Multimodal Interaction","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130423834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Impact of Coordinate Systems on 3D Manipulations in Mobile Augmented Reality","authors":"Philipp Tiefenbacher, Steven Wichert, D. Merget, G. Rigoll","doi":"10.1145/2663204.2663234","DOIUrl":"https://doi.org/10.1145/2663204.2663234","url":null,"abstract":"Mobile touch PCs allow interactions with virtual objects in augmented reality scenes. Manipulations of 3D objects are a common way of such interactions, which can be performed in three different coordinate systems: the camera-, object- and world coordinate systems. The camera coordinate system changes continuously in augmented reality as it depends on the mobile device's pose. The axis orientations of the world coordinate system are steady, whereas the axes of the object coordinates base on previous manipulations. The selection of a coordinate system therefore influences the 3D transformation's orientation independent from the used manipulation type. In this paper, we evaluate the impact of the three possible coordinate systems on rotation and on translation of a 3D item in an augmented reality scenario. A study with 36 participants determines the best coordinates for translation and rotation.","PeriodicalId":389037,"journal":{"name":"Proceedings of the 16th International Conference on Multimodal Interaction","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128248552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Non-Visual Navigation Using Combined Audio Music and Haptic Cues","authors":"Emily Fujimoto, M. Turk","doi":"10.1145/2663204.2663243","DOIUrl":"https://doi.org/10.1145/2663204.2663243","url":null,"abstract":"While a great deal of work has been done exploring non-visual navigation interfaces using audio and haptic cues, little is known about the combination of the two. We investigate combining different state-of-the-art interfaces for communicating direction and distance information using vibrotactile and audio music cues, limiting ourselves to interfaces that are possible with current off-the-shelf smartphones. We use experimental logs, subjective task load questionnaires, and user comments to see how users' perceived performance, objective performance, and acceptance of the system varied for different combinations. Users' perceived performance did not differ much between the unimodal and multimodal interfaces, but a few users commented that the multimodal interfaces added some cognitive load. Objective performance showed that some multimodal combinations resulted in significantly less direction or distance error over some of the unimodal ones, especially the purely haptic interface. Based on these findings we propose a few design considerations for multimodal haptic/audio navigation interfaces.","PeriodicalId":389037,"journal":{"name":"Proceedings of the 16th International Conference on Multimodal Interaction","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125572565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combining Multiple Kernel Methods on Riemannian Manifold for Emotion Recognition in the Wild","authors":"Mengyi Liu, Ruiping Wang, Shaoxin Li, S. Shan, Zhiwu Huang, Xilin Chen","doi":"10.1145/2663204.2666274","DOIUrl":"https://doi.org/10.1145/2663204.2666274","url":null,"abstract":"In this paper, we present the method for our submission to the Emotion Recognition in the Wild Challenge (EmotiW 2014). The challenge is to automatically classify the emotions acted by human subjects in video clips under real-world environment. In our method, each video clip can be represented by three types of image set models (i.e. linear subspace, covariance matrix, and Gaussian distribution) respectively, which can all be viewed as points residing on some Riemannian manifolds. Then different Riemannian kernels are employed on these set models correspondingly for similarity/distance measurement. For classification, three types of classifiers, i.e. kernel SVM, logistic regression, and partial least squares, are investigated for comparisons. Finally, an optimal fusion of classifiers learned from different kernels and different modalities (video and audio) is conducted at the decision level for further boosting the performance. We perform an extensive evaluation on the challenge data (including validation set and blind test set), and evaluate the effects of different strategies in our pipeline. The final recognition accuracy achieved 50.4% on test set, with a significant gain of 16.7% above the challenge baseline 33.7%.","PeriodicalId":389037,"journal":{"name":"Proceedings of the 16th International Conference on Multimodal Interaction","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122181251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gaze-Based Proactive User Interface for Pen-Based Systems","authors":"Çagla Çig","doi":"10.1145/2663204.2666287","DOIUrl":"https://doi.org/10.1145/2663204.2666287","url":null,"abstract":"In typical human-computer interaction, users convey their intentions through traditional input devices (e.g. keyboards, mice, joysticks) coupled with standard graphical user interface elements. Recently, pen-based interaction has emerged as a more intuitive alternative to these traditional means. However, existing pen-based systems are limited by the fact that they rely heavily on auxiliary mode switching mechanisms during interaction (e.g. hard or soft modifier keys, buttons, menus). In this paper, I describe the roadmap for my PhD research which aims at using eye gaze movements that naturally occur during pen-based interaction to reduce dependency on explicit mode selection mechanisms in pen-based systems.","PeriodicalId":389037,"journal":{"name":"Proceedings of the 16th International Conference on Multimodal Interaction","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122489232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}