Automatic Speech Recognition and Natural Language Understanding for Emotion Detection in Multi-party Conversations
Ilja Popovic, D. Culibrk, Milan Mirković, S. Vukmirović
DOI: https://doi.org/10.1145/3423325.3423737
Abstract: Conversational emotion and sentiment analysis approaches rely on Natural Language Understanding (NLU) and audio processing components to detect emotions and sentiment based on what is being said. While there has been marked progress in pushing the state of the art of these methods on benchmark multimodal data sets, such as the Multimodal EmotionLines Dataset (MELD), the advances still seem to lag behind what has been achieved in mainstream Automatic Speech Recognition (ASR) and NLU applications, and we were unable to identify any widely used products, services or production-ready systems that would enable the user to reliably detect emotions from audio recordings of multi-party conversations. Published state-of-the-art scientific studies of multi-view emotion recognition seem to take it for granted that a human-generated or edited transcript is available as input to the NLU modules, providing no information about what happens in a realistic application scenario, where only audio is available and the NLU processing has to rely on text generated by ASR. Motivated by this insight, we present a study designed to evaluate the possibility of applying widely used, state-of-the-art commercial ASR products as the initial audio processing component in an emotion-from-speech detection system. We propose an approach which relies on commercially available products and services, such as Google Speech-to-Text, Mozilla DeepSpeech and the NVIDIA NeMo toolkit, to process the audio and applies state-of-the-art NLU approaches for emotion recognition, in order to quickly create a robust, production-ready emotion-from-speech detection system applicable to multi-party conversations.

{"title":"Augment Machine Intelligence with Multimodal Information","authors":"Zhou Yu","doi":"10.1145/3423325.3424123","DOIUrl":"https://doi.org/10.1145/3423325.3424123","url":null,"abstract":"Humans interact with other humans or the world through information from various channels including vision, audio, language, haptics, etc. To simulate intelligence, machines require similar abilities to process and combine information from different channels to acquire better situation awareness, better communication ability, and better decision-making ability. In this talk, we describe three projects. In the first study, we enable a robot to utilize both vision and audio information to achieve better user understanding [1]. Then we use incremental language generation to improve the robot's communication with a human. In the second study, we utilize multimodal history tracking to optimize policy planning in task-oriented visual dialogs. In the third project, we tackle the well-known trade-off between dialog response relevance and policy effectiveness in visual dialog generation. We propose a new machine learning procedure that alternates from supervised learning and reinforcement learning to optimum language generation and policy planning jointly in visual dialogs [2]. We will also cover some recent ongoing work on image synthesis through dialogs.","PeriodicalId":142947,"journal":{"name":"Proceedings of the 1st International Workshop on Multimodal Conversational AI","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124786813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FUN-Agent: A 2020 HUMAINE Competition Entrant
R. Geraghty, James Hale, S. Sen, Timothy S. Kroecker
DOI: https://doi.org/10.1145/3423325.3423736
Abstract: Of late, there has been a significant surge of interest in industry and the general populace about the future potential of human-AI collaboration [20]. Academic researchers have been pushing the frontier of new modalities of peer-level and ad-hoc human-agent collaboration [10; 22] for a longer period. We have been particularly interested in research on agents representing human users in negotiating deals with other human and autonomous agents [12; 16; 18]. Here we present the design of the conversational aspect of our agent entry into the HUMAINE League of the 2020 Automated Negotiation Agent Competition (ANAC). We discuss how our agent uses conversational and negotiation strategies that mimic those used in human negotiations to maximize its utility as a simulated street vendor. We leverage verbal influence tactics, offer pricing, and increased human convenience to entice the buyer, build trust, and discourage exploitation. Additionally, we discuss the results of some in-house testing we conducted.

{"title":"Assisted Speech to Enable Second Language","authors":"Mehmet Altinkaya, A. Smeulders","doi":"10.1145/3423325.3423735","DOIUrl":"https://doi.org/10.1145/3423325.3423735","url":null,"abstract":"Speaking a second language (L2) is a desired capability for billionsof people. Currently, the only way to achieve it naturally is througha lengthy and tedious training, which ends up various stages offluency. The process is far away from the natural acquisition of alanguage.In this paper, we propose a system that enables any person withsome basic understanding of L2 speak fluently through \"Instant As-sistance\" provided by digital conversational agents such as GoogleAssistant, Microsoft Cortana, or Apple Siri, which monitors thespeaker. It attends to provide assistance to continue to speak whenspeech is interrupted as it is not yet completely mastered. The notyet acquired elements of language can be missing words, unfa-miliarity with expressions, the implicit rules of articles, and thehabits of sayings. We can employ the hardware and software of theassistants to create an immersive, adaptive learning environmentto train the speaker online by a symbiotic interaction for implicit,unnoticeable correction.","PeriodicalId":142947,"journal":{"name":"Proceedings of the 1st International Workshop on Multimodal Conversational AI","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125822880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Motivation and Design of the Conversational Components of DraftAgent for Human-Agent Negotiation
Dale Peasley, Michael Naguib, Bohan Xu, S. Sen, Timothy S. Kroecker
DOI: https://doi.org/10.1145/3423325.3423734
Abstract: In sync with the significant interest in industry and the general populace about the future potential of human-AI collaboration [14], academic researchers have been pushing the frontier of new modalities of peer-level and ad-hoc human-agent collaboration [4, 15]. We have been particularly interested in research on agents representing human users in negotiating deals with other human and autonomous agents [6, 11, 13]. We present the design motivation and key components of the conversational aspect of our agent entry into the Human-Agent League (HAL) (http://web.tuat.ac.jp/~katfuji/ANAC2020/cfp/ham_cfp.pdf) of the 2020 Automated Negotiation Agent Competition (ANAC). We explore how language can be used to promote human-agent collaboration even in the domain of a competitive negotiation. We present small-scale in-lab testing to demonstrate the potential of our approach.

{"title":"A Dynamic, Self Supervised, Large Scale AudioVisual Dataset for Stuttered Speech","authors":"Mehmet Altinkaya, A. Smeulders","doi":"10.1145/3423325.3423733","DOIUrl":"https://doi.org/10.1145/3423325.3423733","url":null,"abstract":"Stuttering affects at least 1% of the world population. It is caused by irregular disruptions in speech production. These interruptions occur in various forms and frequencies. Repetition of words or parts of words, prolongations, or blocks in getting the words out are the most common ones. Accurate detection and classification of stuttering would be important in the assessment of severity for speech therapy. Furthermore, real time detection might create many new possibilities to facilitate reconstruction into fluent speech. Such an interface could help people to utilize voice-based interfaces like Apple Siri and Google Assistant, or to make (video) phone calls more fluent by delayed delivery. In this paper we present the first expandable audio-visual database of stuttered speech. We explore an end-to-end, real-time, multi-modal model for detection and classification of stuttered blocks in unbound speech. We also make use of video signals since acoustic signals cannot be produced immediately. We use multiple modalities as acoustic signals together with secondary characteristics exhibited in visual signals will permit an increased accuracy of detection.","PeriodicalId":142947,"journal":{"name":"Proceedings of the 1st International Workshop on Multimodal Conversational AI","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-10-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131247702","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 1st International Workshop on Multimodal Conversational AI","authors":"","doi":"10.1145/3423325","DOIUrl":"https://doi.org/10.1145/3423325","url":null,"abstract":"","PeriodicalId":142947,"journal":{"name":"Proceedings of the 1st International Workshop on Multimodal Conversational AI","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124322834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}