{"title":"Identification of the Driver's Interest Point using a Head Pose Trajectory for Situated Dialog Systems","authors":"Young-Ho Kim, Teruhisa Misu","doi":"10.1145/2663204.2663230","DOIUrl":"https://doi.org/10.1145/2663204.2663230","url":null,"abstract":"This paper addresses issues existing in situated language understanding in a moving car. Particularly, we propose a method for understanding user queries regarding specific target buildings in their surroundings based on the driver's head pose and speech information. To identify a meaningful head pose motion related to the user query that is among spontaneous motions while driving, we construct a model describing the relationship between sequences of a driver's head pose and the relative direction to an interest point using the Gaussian process regression. We also consider time-varying interest point using kernel density estimation. We collected situated queries from subject drivers by using our research system embedded in a real car. The proposed method achieves an improvement in the target identification rate by 14% in the user-independent training condition and 27% in the user-dependent training condition over the method that uses the head motion at the start-of-speech timing.","PeriodicalId":389037,"journal":{"name":"Proceedings of the 16th International Conference on Multimodal Interaction","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122725402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Neural Networks for Emotion Recognition in the Wild","authors":"Michal Grosicki","doi":"10.1145/2663204.2666270","DOIUrl":"https://doi.org/10.1145/2663204.2666270","url":null,"abstract":"In this paper we present neural networks based method for emotion recognition. Proposed model was developed as part of 2014 Emotion Recognition in the Wild Challenge. It is composed of modality specific neural networks, which where trained separately on audio and video data extracted from short video clips taken from various movies. Each network was trained on frame-level data, which in later stages were aggregated by simple averaging of predicted class distributions for each clip. In the next stage various techniques for combining modalities where investigated with the best being support vector machine with RBF kernel. Our method achieved accuracy of 37.84%, which is better than 33.7% obtained by the best baseline model provided by organisers.","PeriodicalId":389037,"journal":{"name":"Proceedings of the 16th International Conference on Multimodal Interaction","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133701945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MLA'14: Third Multimodal Learning Analytics Workshop and Grand Challenges","authors":"X. Ochoa, M. Worsley, K. Chiluiza, S. Luz","doi":"10.1145/2663204.2668318","DOIUrl":"https://doi.org/10.1145/2663204.2668318","url":null,"abstract":"This paper summarizes the third Multimodal Learning Analytics Workshop and Grand Challenges (MLA'14). This subfield of Learning Analytics focuses on the interpretation of the multimodal interactions that occurs in learning environments, both digital and physical. This is a hybrid event that includes presentations about methods and techniques to analyze and merge the different signals captured from these environments (workshop session) and more concrete results from the application of Multimodal Learning Analytics techniques to predict the performance of students while solving math problems or presenting in the classroom (challenges sessions). A total of eight articles will be presented in this event. The main conclusion from this event is that Multimodal Learning Analytics is a desirable research endeavour that could produce results that can be currently applied to improve the learning process.","PeriodicalId":389037,"journal":{"name":"Proceedings of the 16th International Conference on Multimodal Interaction","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133811427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CrossMotion: Fusing Device and Image Motion for User Identification, Tracking and Device Association","authors":"Andrew D. Wilson, Hrvoje Benko","doi":"10.1145/2663204.2663270","DOIUrl":"https://doi.org/10.1145/2663204.2663270","url":null,"abstract":"Identifying and tracking people and mobile devices indoors has many applications, but is still a challenging problem. We introduce a cross-modal sensor fusion approach to track mobile devices and the users carrying them. The CrossMotion technique matches the acceleration of a mobile device, as measured by an onboard internal measurement unit, to similar acceleration observed in the infrared and depth images of a Microsoft Kinect v2 camera. This matching process is conceptually simple and avoids many of the difficulties typical of more common appearance-based approaches. In particular, CrossMotion does not require a model of the appearance of either the user or the device, nor in many cases a direct line of sight to the device. We demonstrate a real time implementation that can be applied to many ubiquitous computing scenarios. In our experiments, CrossMotion found the person's body 99% of the time, on average within 7cm of a reference device position.","PeriodicalId":389037,"journal":{"name":"Proceedings of the 16th International Conference on Multimodal Interaction","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129271619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analysis of Respiration for Prediction of \"Who Will Be Next Speaker and When?\" in Multi-Party Meetings","authors":"Ryo Ishii, K. Otsuka, Shiro Kumano, Junji Yamato","doi":"10.1145/2663204.2663271","DOIUrl":"https://doi.org/10.1145/2663204.2663271","url":null,"abstract":"To build a model for predicting the next speaker and the start time of the next utterance in multi-party meetings, we performed a fundamental study of how respiration could be effective for the prediction model. The results of the analysis reveal that a speaker inhales more rapidly and quickly right after the end of a unit of utterance in turn-keeping. The next speaker takes a bigger breath toward speaking in turn-changing than listeners who will not become the next speaker. Based on the results of the analysis, we constructed the prediction models to evaluate how effective the parameters are. The results of the evaluation suggest that the speaker's inhalation right after a unit of utterance, such as the start time from the end of the unit of utterance and the slope and duration of the inhalation phase, is effective for predicting whether turn-keeping or turn-changing happen about 350 ms before the start time of the next utterance on average and that listener's inhalation before the next utterance, such as the maximal inspiration and amplitude of the inhalation phase, is effective for predicting the next speaker in turn-changing about 900 ms before the start time of the next utterance on average.","PeriodicalId":389037,"journal":{"name":"Proceedings of the 16th International Conference on Multimodal Interaction","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127641867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Multimodal Context-based Approach for Distress Assessment","authors":"Sayan Ghosh, Moitreya Chatterjee, Louis-Philippe Morency","doi":"10.1145/2663204.2663274","DOIUrl":"https://doi.org/10.1145/2663204.2663274","url":null,"abstract":"The increasing prevalence of psychological distress disorders, such as depression and post-traumatic stress, necessitates a serious effort to create new tools and technologies to help with their diagnosis and treatment. In recent years, new computational approaches were proposed to objectively analyze patient non-verbal behaviors over the duration of the entire interaction between the patient and the clinician. In this paper, we go beyond non-verbal behaviors and propose a tri-modal approach which integrates verbal behaviors with acoustic and visual behaviors to analyze psychological distress during the course of the dyadic semi-structured interviews. Our approach exploits the advantages of the dyadic nature of these interactions to contextualize the participant responses based on the affective components (intimacy and polarity levels) of the questions. We validate our approach using one of the largest corpus of semi-structured interviews for distress assessment which consists of 154 multimodal dyadic interactions. Our results show significant improvement on distress prediction performance when integrating verbal behaviors with acoustic and visual behaviors. In addition, our analysis shows that contextualizing the responses improves the prediction performance, most significantly with positive and intimate questions.","PeriodicalId":389037,"journal":{"name":"Proceedings of the 16th International Conference on Multimodal Interaction","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129056559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Keynote Address 4","authors":"J. Cohn","doi":"10.1145/3246752","DOIUrl":"https://doi.org/10.1145/3246752","url":null,"abstract":"","PeriodicalId":389037,"journal":{"name":"Proceedings of the 16th International Conference on Multimodal Interaction","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123987935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rhythmic Body Movements of Laughter","authors":"Radoslaw Niewiadomski, M. Mancini, Yu Ding, C. Pelachaud, G. Volpe","doi":"10.1145/2663204.2663240","DOIUrl":"https://doi.org/10.1145/2663204.2663240","url":null,"abstract":"In this paper we focus on three aspects of multimodal expressions of laughter. First, we propose a procedural method to synthesize rhythmic body movements of laughter based on spectral analysis of laughter episodes. For this purpose, we analyze laughter body motions from motion capture data and we reconstruct them with appropriate harmonics. Then we reduce the parameter space to two dimensions. These are the inputs of the actual model to generate a continuum of laughs rhythmic body movements. In the paper, we also propose a method to integrate rhythmic body movements generated by our model with other synthetized expressive cues of laughter such as facial expressions and additional body movements. Finally, we present a real-time human-virtual character interaction scenario where virtual character applies our model to answer to human's laugh in real-time. ","PeriodicalId":389037,"journal":{"name":"Proceedings of the 16th International Conference on Multimodal Interaction","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129443089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Oral Session 6: Healthcare and Assistive Technologies","authors":"D. Bohus","doi":"10.1145/3246751","DOIUrl":"https://doi.org/10.1145/3246751","url":null,"abstract":"","PeriodicalId":389037,"journal":{"name":"Proceedings of the 16th International Conference on Multimodal Interaction","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126338910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Personal Aesthetics for Soft Biometrics: A Generative Multi-resolution Approach","authors":"Cristina Segalin, A. Perina, M. Cristani","doi":"10.1145/2663204.2663259","DOIUrl":"https://doi.org/10.1145/2663204.2663259","url":null,"abstract":"Are we recognizable by our image preferences? This paper answers affirmatively the question, presenting a soft biometric approach where the preferred images of an individual are used as his personal signature in identification tasks. The approach builds a multi-resolution latent space, formed by multiple Counting Grids, where similar images are mapped nearby. On this space, a set of preferred images of a user produces an ensemble of intensity maps, highlighting in an intuitive way his personal aesthetic preferences. These maps are then used for learning a battery of discriminative classifiers (one for each resolution), which characterizes the user and serves to perform identification. Results are promising: on a dataset of 200 users, and 40K images, using 20 preferred images as biometric template gives 66% of probability of guessing the correct user. This makes the \"personal aesthetics\" a very hot topic for soft biometrics, while its usage in standard biometric applications seems to be far from being effective, as we show in a simple user study.","PeriodicalId":389037,"journal":{"name":"Proceedings of the 16th International Conference on Multimodal Interaction","volume":"293 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124208570","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}