Title: Pen + Mid-Air Gestures: Eliciting Contextual Gestures
Authors: Ilhan Aslan, Tabea Schmidt, Jens Woehrle, Lukas Vogel, E. André
DOI: 10.1145/3242969.3242979
Published in: Proceedings of the 20th ACM International Conference on Multimodal Interaction (ICMI 2018)
Abstract: Combining mid-air gestures with pen input for bi-manual input on tablets has been reported as an alternative and attractive input technique in drawing applications. Previous work has also argued that mid-air gestural input can cause discomfort and arm fatigue over time, which can be addressed in a desktop setting by allowing users to gesture in alternative restful arm positions (e.g., with the elbow resting on the desk). However, it is unclear if and how gesture preferences and gesture designs differ across these alternative arm positions. To investigate these research questions, we report on a user- and choice-based gesture elicitation study in which 10 participants designed gestures for different arm positions. We provide an in-depth qualitative analysis and a detailed categorization of gestures, discussing commonalities and differences in the gesture sets based on a "think aloud" protocol, video recordings, and self-reports on user preferences.

Title: Predicting ADHD Risk from Touch Interaction Data
Authors: Philipp Mock, Maike Tibus, A. Ehlis, H. Baayen, Peter Gerjets
DOI: 10.1145/3242969.3242986
Published in: Proceedings of the 20th ACM International Conference on Multimodal Interaction (ICMI 2018)
Abstract: This paper presents a novel approach for automatic prediction of risk of ADHD in schoolchildren based on touch interaction data. We performed a study with 129 fourth-grade students solving math problems on a multiple-choice interface to obtain a large dataset of touch trajectories. Using Support Vector Machines, we analyzed the predictive power of such data for ADHD scales. For regression of overall ADHD scores, we achieve a mean squared error of 0.0962 on a four-point scale (R² = 0.5667). Classification accuracy for increased ADHD risk (upper vs. lower third of collected scores) is 91.1%.

Title: Listening Skills Assessment through Computer Agents
Authors: Hiroki Tanaka, Hideki Negoro, H. Iwasaka, Satoshi Nakamura
DOI: 10.1145/3242969.3242970
Published in: Proceedings of the 20th ACM International Conference on Multimodal Interaction (ICMI 2018)
Abstract: Social skills training, performed by human trainers, is a well-established method for acquiring appropriate skills in social interaction. Previous work automated the process of social skills training by developing a dialogue system that teaches social skills through interaction with a computer agent. However, while that work considered speaking skills, human social skills trainers also take other skills such as listening into account. In this paper, we propose assessing users' listening skills during conversation with computer agents as a step toward automated social skills training. We recorded data of 27 Japanese graduate students interacting with a female agent. The agent spoke to the participants about a recent memorable story and about how to make a telephone call, and the participants listened. Two expert external raters assessed the participants' listening skills. We manually extracted features relating to the participants' eye fixations and behavioral cues, and confirmed that a simple linear regression with selected features can predict a user's listening skills with a correlation coefficient above 0.45.

Title: RainCheck
Authors: Ying-Chao Tung, Mayank Goel, Isaac Zinda, Jacob O. Wobbrock
DOI: 10.1145/3242969.3243028
Published in: Proceedings of the 20th ACM International Conference on Multimodal Interaction (ICMI 2018)
Abstract: Modern smartphones are built with capacitive-sensing touchscreens, which can detect anything that is conductive or has a dielectric differential with air. The human finger is an example of such a dielectric, and works wonderfully with such touchscreens. However, touch interactions are disrupted by raindrops, water smear, and wet fingers because capacitive touchscreens cannot distinguish finger touches from other conductive materials. When users' screens get wet, the screen's usability is significantly reduced. RainCheck addresses this hazard by filtering out potential touch points caused by water to differentiate fingertips from raindrops and water smear, adapting in real-time to restore successful interaction to the user. Specifically, RainCheck uses the low-level raw sensor data from touchscreen drivers and employs precise selection techniques to resolve water-fingertip ambiguity. Our study shows that RainCheck improves gesture accuracy by 75.7%, touch accuracy by 47.9%, and target selection time by 80.0%, making it a successful remedy to interference caused by rain and other water.

Title: "Honey, I Learned to Talk": Multimodal Fusion for Behavior Analysis
Authors: Shao-Yen Tseng, Haoqi Li, Brian R. Baucom, P. Georgiou
DOI: 10.1145/3242969.3242996
Published in: Proceedings of the 20th ACM International Conference on Multimodal Interaction (ICMI 2018)
Abstract: In this work we analyze the importance of lexical and acoustic modalities in behavioral expression and perception. We demonstrate that this importance relates to the amount of therapy, and hence communication training, that a person received. It also exhibits some relationship to gender. We proceed to provide an analysis on couple therapy data by splitting the data into clusters based on gender or stage in therapy. Our analysis demonstrates the significant difference between optimal modality weights per cluster and relationship to therapy stage. Given this finding we propose the use of communication-skill aware fusion models to account for these differences in modality importance. The fusion models operate on partitions of the data according to the gender of the speaker or the therapy stage of the couple. We show that while most multimodal fusion methods can improve mean absolute error of behavioral estimates, the best results are given by a model that considers the degree of communication training among the interlocutors.

Title: EVA: A Multimodal Argumentative Dialogue System
Authors: Niklas Rach, Klaus Weber, Louisa Pragst, E. André, W. Minker, Stefan Ultes
DOI: 10.1145/3242969.3266292
Published in: Proceedings of the 20th ACM International Conference on Multimodal Interaction (ICMI 2018)
Abstract: This work introduces EVA, a multimodal argumentative dialogue system that is capable of discussing controversial topics with the user. The interaction is structured as an argument game in which the user and the system select their respective moves in order to convince their opponent. EVA's response is presented as a natural language utterance by a virtual agent that supports the respective content with characteristic gestures and facial expressions.

{"title":"Multiple Spatio-temporal Feature Learning for Video-based Emotion Recognition in the Wild","authors":"Cheng Lu, Wenming Zheng, Chaolong Li, Chuangao Tang, Suyuan Liu, Simeng Yan, Yuan Zong","doi":"10.1145/3242969.3264992","DOIUrl":"https://doi.org/10.1145/3242969.3264992","url":null,"abstract":"The difficulty of emotion recognition in the wild (EmotiW) is how to train a robust model to deal with diverse scenarios and anomalies. The Audio-video Sub-challenge in EmotiW contains audio-video short clips with several emotional labels and the task is to distinguish which label the video belongs to. For the better emotion recognition in videos, we propose a multiple spatio-temporal feature fusion (MSFF) framework, which can more accurately depict emotional information in spatial and temporal dimensions by two mutually complementary sources, including the facial image and audio. The framework is consisted of two parts: the facial image model and the audio model. With respect to the facial image model, three different architectures of spatial-temporal neural networks are employed to extract discriminative features about different emotions in facial expression images. Firstly, the high-level spatial features are obtained by the pre-trained convolutional neural networks (CNN), including VGG-Face and ResNet-50 which are all fed with the images generated by each video. Then, the features of all frames are sequentially input to the Bi-directional Long Short-Term Memory (BLSTM) so as to capture dynamic variations of facial appearance textures in a video. In addition to the structure of CNN-RNN, another spatio-temporal network, namely deep 3-Dimensional Convolutional Neural Networks (3D CNN) by extending the 2D convolution kernel to 3D, is also applied to attain evolving emotional information encoded in multiple adjacent frames. For the audio model, the spectrogram images of speech generated by preprocessing audio, are also modeled in a VGG-BLSTM framework to characterize the affective fluctuation more efficiently. Finally, a fusion strategy with the score matrices of different spatio-temporal networks gained from the above framework is proposed to boost the performance of emotion recognition complementally. Extensive experiments show that the overall accuracy of our proposed MSFF is 60.64%, which achieves a large improvement compared with the baseline and outperform the result of champion team in 2017.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115616910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Generating fMRI-Enriched Acoustic Vectors using a Cross-Modality Adversarial Network for Emotion Recognition
Authors: Gao-Yi Chao, Chun-Min Chang, Jeng-Lin Li, Ya-Tse Wu, Chi-Chun Lee
DOI: 10.1145/3242969.3242992
Published in: Proceedings of the 20th ACM International Conference on Multimodal Interaction (ICMI 2018)
Abstract: Automatic emotion recognition has long been developed by concentrating on modeling human expressive behavior. At the same time, neuroscientific evidence has shown that neural responses (i.e., blood oxygen level-dependent (BOLD) signals measured with functional magnetic resonance imaging (fMRI)) also vary with the type of emotion perceived. While past research has indicated that fusing acoustic features and fMRI improves overall speech emotion recognition performance, obtaining fMRI data is not feasible in real-world applications. In this work, we propose a cross-modality adversarial network that jointly models the bi-directional generative relationship between acoustic features of speech samples and fMRI signals of human perceptual responses by leveraging a parallel dataset. We encode the acoustic descriptors of a speech sample using the learned cross-modality adversarial network to generate fMRI-enriched acoustic vectors, which are then used in the emotion classifier. The generated fMRI-enriched acoustic vectors are evaluated not only on the parallel dataset but also on an additional dataset without fMRI scans. Our proposed framework significantly outperforms using acoustic features alone in a four-class emotion recognition task on both datasets, and the use of a cyclic loss in learning the bi-directional mapping is also shown to be crucial for achieving improved recognition rates.

Title: Modeling Cognitive Processes from Multimodal Signals
Authors: F. Putze, Jutta Hild, A. Sano, Enkelejda Kasneci, E. Solovey, Tanja Schultz
DOI: 10.1145/3242969.3265861
Published in: Proceedings of the 20th ACM International Conference on Multimodal Interaction (ICMI 2018)
Abstract: Multimodal signals allow us to gain insights into a person's internal cognitive processes: for example, speech and gesture analysis yields cues about hesitation, knowledgeability, or alertness; eye tracking yields information about a person's focus of attention, task, or cognitive state; and EEG yields information about a person's cognitive load or information appraisal. Capturing cognitive processes is an important research tool for understanding human behavior as well as a crucial part of the user model of an adaptive interactive system such as a robot or a tutoring system. As cognitive processes are often multifaceted, a comprehensive model requires the combination of multiple complementary signals. In this workshop at the ACM International Conference on Multimodal Interaction (ICMI) in Boulder, Colorado, USA, we discussed the state of the art in monitoring and modeling cognitive processes from multimodal signals.

{"title":"Simultaneous Multimodal Access to Wheelchair and Computer for People with Tetraplegia","authors":"M. N. Sahadat, Nordine Sebkhi, Maysam Ghovanloo","doi":"10.1145/3242969.3242980","DOIUrl":"https://doi.org/10.1145/3242969.3242980","url":null,"abstract":"Existing assistive technologies often capture and utilize a single remaining ability to assist people with tetraplegia which is unable to do complex interaction efficiently. In this work, we developed a multimodal assistive system (MAS) to utilize multiple remaining abilities (speech, tongue, and head motion) sequentially or simultaneously to facilitate complex computer interactions such as scrolling, drag and drop, and typing long sentences. Inputs of MAS can be used to drive a wheelchair using only tongue motion, mouse functionalities (e.g., clicks, navigation) by combining the tongue and head motions. To enhance seamless interface, MAS processes both head and tongue motions in the headset with an average accuracy of 88.5%. In a pilot study, a modified center-out tapping task was performed by four able-bodied participants to navigate cursor, using head tracking, click using tongue command, and text entry through speech recognition, respectively. The average throughput in the final round was 1.28 bits/s and a cursor navigation path efficiency of 68.62%.","PeriodicalId":308751,"journal":{"name":"Proceedings of the 20th ACM International Conference on Multimodal Interaction","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121589009","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}