Exploring A New Method for Food Likability Rating Based on DT-CWT Theory
Yanan Guo, Jing Han, Zixing Zhang, Björn Schuller, Yide Ma
Proceedings of the 20th ACM International Conference on Multimodal Interaction (ICMI 2018). DOI: 10.1145/3242969.3243684
Abstract: In this paper, we investigate subjects' food likability based on audio-related features as a contribution to EAT, the ICMI 2018 Eating Analysis and Tracking challenge. Specifically, we conduct a 4-level Dual-Tree Complex Wavelet Transform (DT-CWT) decomposition of each audio signal and obtain five sub-signals with frequencies ranging from low to high. For each sub-signal, we calculate not only 'traditional' functional-based features but also deep learning-based features via pretrained CNNs applied to the sliCQ non-stationary Gabor transform and a cochleagram map. In addition, Bag-of-Audio-Words features extracted from the original audio signals with the openXBOW toolkit are used to enhance the model. Finally, early fusion of these three kinds of features leads to promising results, yielding the highest UAR of 79.2% under leave-one-speaker-out cross-validation, a 12.7% absolute gain over the baseline of 66.5% UAR.

{"title":"Smart Arse: Posture Classification with Textile Sensors in Trousers","authors":"Sophie Skach, R. Stewart, P. Healey","doi":"10.1145/3242969.3242977","DOIUrl":"https://doi.org/10.1145/3242969.3242977","url":null,"abstract":"Body posture is a good indicator of, amongst other things, people's state of arousal, focus of attention and level of interest in a conversation. Posture is conventionally measured by observation and hand coding of videos or, more recently, through automated computer vision and motion capture techniques. Here we introduce a novel alternative approach exploiting a new modality: posture classification using bespoke 'smart' trousers with integrated textile pressure sensors. Changes in posture translate to changes in pressure patterns across the surface of our clothing. We describe the construction of the textile pressure sensor and, using simple machine learning techniques on data gathered from 10 participants, demonstrate its ability to discriminate between 19 different basic posture types with high accuracy. This technology has the potential to support anonymous, unintrusive sensing of interest, attention and engagement in a wide variety of settings.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127089392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Human-Habitat for Health (H3): Human-habitat Multimodal Interaction for Promoting Health and Well-being in the Internet of Things Era
Theodora Chaspari, A. Metallinou, L. S. Duker, A. Behzadan
Proceedings of the 20th ACM International Conference on Multimodal Interaction (ICMI 2018). DOI: 10.1145/3242969.3265862
Abstract: This paper presents an introduction to the "Human-Habitat for Health (H3): Human-habitat multimodal interaction for promoting health and well-being in the Internet of Things era" workshop, which was held at the 20th ACM International Conference on Multimodal Interaction on October 16th, 2018, in Boulder, CO, USA. The main theme of the workshop was the effect of the physical or virtual environment on individuals' behavior, well-being, and health. The H3 workshop included keynote speeches that provided an overview and future directions of the field, as well as presentations of position papers and research contributions. The workshop brought together experts from academia and industry spanning a set of multi-disciplinary fields, including computer science, speech and spoken language understanding, construction science, life sciences, health sciences, and psychology, to discuss their respective views and identify synergistic and converging research directions and solutions.

{"title":"Multimodal Modeling of Coordination and Coregulation Patterns in Speech Rate during Triadic Collaborative Problem Solving","authors":"Angela E. B. Stewart, Z. Keirn, S. D’Mello","doi":"10.1145/3242969.3242989","DOIUrl":"https://doi.org/10.1145/3242969.3242989","url":null,"abstract":"We model coordination and coregulation patterns in 33 triads engaged in collaboratively solving a challenging computer programming task for approximately 20 minutes. Our goal is to prospectively model speech rate (words/sec) - an important signal of turn taking and active participation - of one teammate (A or B or C) from time lagged nonverbal signals (speech rate and acoustic-prosodic features) of the other two (i.e., A + B → C; A + C → B; B + C → A) and task-related context features. We trained feed-forward neural networks (FFNNs) and long short-term memory recurrent neural networks (LSTMs) using group-level nested cross-validation. LSTMs outperformed FFNNs and a chance baseline and could predict speech rate up to 6s into the future. A multimodal combination of speech rate, acoustic-prosodic, and task context features outperformed unimodal and bimodal signals. The extent to which the models could predict an individual's speech rate was positively related to that individual's scores on a subsequent posttest, suggesting a link between coordination/coregulation and collaborative learning outcomes. We discuss applications of the models for real-time systems that monitor the collaborative process and intervene to promote positive collaborative outcomes.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114351994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interpretable Multimodal Deception Detection in Videos","authors":"Hamid Karimi","doi":"10.1145/3242969.3264967","DOIUrl":"https://doi.org/10.1145/3242969.3264967","url":null,"abstract":"There are various real-world applications such as video ads, airport screenings, courtroom trials, and job interviews where deception detection can play a crucial role. Hence, there are immense demands on deception detection in videos. Videos contain rich information including acoustic, visual, temporal, and/or linguistic information, which provides great opportunities for advanced deception detection. However, videos are inherently complex; moreover, they lack detective labels in many real-world applications, which poses tremendous challenges to traditional deception detection. In this manuscript, I present my Ph.D. research on the problem of deception detection in videos. In particular, I provide a principled way to capture rich information into a coherent model and propose an end-to-end framework DEV to detect DEceptive Videos automatically. Preliminary results on real-world videos demonstrate the effectiveness of the proposed framework.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"97 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114744195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hand, Foot or Voice: Alternative Input Modalities for Touchless Interaction in the Medical Domain","authors":"Benjamin Hatscher, C. Hansen","doi":"10.1145/3242969.3242971","DOIUrl":"https://doi.org/10.1145/3242969.3242971","url":null,"abstract":"During medical interventions, direct interaction with medical image data is a cumbersome task for physicians due to the sterile environment. Even though touchless input via hand, foot or voice is possible, these modalities are not available for these tasks all the time. Therefore, we investigated touchless input methods as alternatives to each other with focus on two common interaction tasks in sterile settings: activation of a system to avoid unintentional input and manipulation of continuous values. We created a system where activation could be achieved via voice, hand or foot gestures and continuous manipulation via hand and foot gestures. We conducted a comparative user study and found that foot interaction performed best in terms of task completion times and scored highest in the subjectively assessed measures usability and usefulness. Usability and usefulness scores for hand and voice were only slightly worse and all participants were able to perform all tasks in a sufficient short amount of time. This work contributes by proposing methods to interact with computers in sterile, dynamic environments and by providing evaluation results for direct comparison of alternative modalities for common interaction tasks.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127881205","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling Empathy in Embodied Conversational Agents: Extended Abstract","authors":"Ö. Yalçın","doi":"10.1145/3242969.3264977","DOIUrl":"https://doi.org/10.1145/3242969.3264977","url":null,"abstract":"This paper is intended to outline the PhD research that is aimed to model empathy in embodied conversational systems. Our goal is to determine the requirements for implementation of an empathic interactive agent and develop evaluation methods that is aligned with the empathy research from various fields. The thesis is composed of three scientific contributions: (i) developing a computational model of empathy, (ii) implementation of the model in embodied conversational agents and (iii) enhance the understanding of empathy in interaction by generating data and build evaluation tools. The paper will give results for the contribution (i) and preliminary results for contribution (ii). Moreover, we will present the future plan for contribution (ii) and (iii).","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132844090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Estimating Visual Focus of Attention in Multiparty Meetings using Deep Convolutional Neural Networks","authors":"K. Otsuka, Keisuke Kasuga, Martina Köhler","doi":"10.1145/3242969.3242973","DOIUrl":"https://doi.org/10.1145/3242969.3242973","url":null,"abstract":"Convolutional neural networks (CNNs) are employed to estimate the visual focus of attention (VFoA), also called gaze direction , in multiparty face-to-face meetings on the basis of multimodal nonverbal behaviors including head pose, direction of the eyeball, and presence/absence of utterance. To reveal the potential of CNNs, we focus on aspects of multimodal and multiparty fusion including individual/group models, early/late fusion, and robustness when using inputs from image-based trackers. In contrast to the individual model that separately targets each person specific to one's seat, the group model aims to jointly estimate the gaze directions of all participants. Experiments confirmed that the group model outperformed the individual model especially in predicting listeners' VFoA when the inputs did not include eyeball directions. This result indicates that the group CNN model can implicitly learn underlying conversation structures, e.g., the listeners' gazes converge on the speaker. When the eyeball direction feature is available, both models outperformed the Bayes models used for comparison. In this case, the individual model was superior to the group model, particularly in estimating the speaker's VFoA. Moreover, it was revealed that in group models, two-stage late fusion, which integrates an individual features first, and multiparty features second, outperformed other structures. Furthermore, our experiment confirmed that image-based tracking can provide a comparable level of performance to that of sensor-based measurements. Overall, the results suggest that the CNN is a promising approach for VFoA estimation.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130849081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Interlocutor-Modulated Attention BLSTM to Predict Personality Traits in Small Group Interaction","authors":"Yun-Shao Lin, Chi-Chun Lee","doi":"10.1145/3242969.3243001","DOIUrl":"https://doi.org/10.1145/3242969.3243001","url":null,"abstract":"Small group interaction occurs often in workplace and education settings. Its dynamic progression is an essential factor in dictating the final group performance outcomes. The personality of each individual within the group is reflected in his/her interpersonal behaviors with other members of the group as they engage in these task-oriented interactions. In this work, we propose an interlocutor-modulated attention BSLTM (IM-aBLSTM) architecture that models an individual's vocal behaviors during small group interactions in order to automatically infer his/her personality traits. The interlocutor-modulated attention mechanism jointly optimize the relevant interpersonal vocal behaviors of other members of group during interactions. In specifics, we evaluate our proposed IM-aBLSTM in one of the largest small group interaction database, the ELEA corpus. Our framework achieves a promising unweighted recall accuracy of 87.9% in ten different binary personality trait prediction tasks, which outperforms the best results previously reported on the same database by 10.4% absolute. Finally, by analyzing the interpersonal vocal behaviors in the region of high attention weights, we observe several distinct intra- and inter-personal vocal behavior patterns that vary as a function of personality traits.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133581533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Ensemble Model Using Face and Body Tracking for Engagement Detection","authors":"Cheng Chang, Cheng Zhang, L. Chen, Yang Liu","doi":"10.1145/3242969.3264986","DOIUrl":"https://doi.org/10.1145/3242969.3264986","url":null,"abstract":"Precise detection and localization of learners' engagement levels are useful for monitoring their learning quality. In the emotiW Challenge's engagement detection task, we proposed a series of novel improvements, including (a) a cluster-based framework for fast engagement level predictions, (b) a neural network using the attention pooling mechanism, (c) heuristic rules using body posture information, and (d) model ensemble for more accurate and robust predictions. Our experimental results suggest that our proposed methods effectively improved engagement detection performance. On the validation set, our system can reduce the baseline Mean Squared Error (MSE) by about 56%. On the final test set, our system yielded a competitively low MSE of 0.081.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"604 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132728287","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}