Eye Gaze for Spoken Language Understanding in Multi-modal Conversational Interactions
Dilek Z. Hakkani-Tür, M. Slaney, Asli Celikyilmaz, Larry Heck
Proceedings of the 16th International Conference on Multimodal Interaction, 2014-11-12
DOI: 10.1145/2663204.2663277
Citations: 28
Abstract
When humans converse with each other, they naturally amalgamate information from multiple modalities (e.g., speech, gestures, speech prosody, facial expressions, and eye gaze). This paper focuses on eye gaze and its combination with speech. We develop a model that resolves references to visual (screen) elements in a conversational web browsing system. The system detects eye gaze, recognizes speech, and then interprets the user's browsing intent (e.g., click on a specific element) through a combination of spoken language understanding and eye gaze tracking. We experiment with multi-turn interactions collected in a Wizard-of-Oz scenario where users are asked to perform several web-browsing tasks. We compare several gaze features and evaluate their effectiveness when combined with speech-based lexical features. The resulting multi-modal system not only increases user intent (turn) accuracy by 17%, but also resolves the referring expression ambiguity commonly observed in dialog systems, with a 10% increase in F-measure.
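To make the idea of fusing gaze and lexical evidence concrete, here is a minimal sketch (not the authors' model) of ranking on-screen elements for a spoken referring expression. The `ScreenElement` structure, the `fixation_ms` feature, and the interpolation weight `gaze_weight` are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: combine a lexical overlap score with a gaze fixation
# score to rank screen elements for a spoken referring expression.
from dataclasses import dataclass


@dataclass
class ScreenElement:
    element_id: str
    text: str            # visible text of the element (e.g., link anchor text)
    fixation_ms: float   # total gaze fixation time on this element in the turn


def lexical_score(utterance: str, element: ScreenElement) -> float:
    """Fraction of the element's words that appear in the spoken utterance."""
    utt_tokens = set(utterance.lower().split())
    elem_tokens = element.text.lower().split()
    if not elem_tokens:
        return 0.0
    return sum(tok in utt_tokens for tok in elem_tokens) / len(elem_tokens)


def gaze_score(element: ScreenElement, elements: list) -> float:
    """Share of the turn's total fixation time spent on this element."""
    total = sum(e.fixation_ms for e in elements) or 1.0
    return element.fixation_ms / total


def resolve_reference(utterance: str, elements: list,
                      gaze_weight: float = 0.5) -> ScreenElement:
    """Rank elements by an interpolated lexical + gaze score (weight is a guess)."""
    return max(
        elements,
        key=lambda e: (1 - gaze_weight) * lexical_score(utterance, e)
                      + gaze_weight * gaze_score(e, elements),
    )


if __name__ == "__main__":
    elements = [
        ScreenElement("btn-1", "Sign in", fixation_ms=120.0),
        ScreenElement("lnk-2", "Sports news", fixation_ms=850.0),
        ScreenElement("lnk-3", "Weather forecast", fixation_ms=90.0),
    ]
    # "click on that one" carries little lexical content, so gaze dominates.
    print(resolve_reference("click on that one", elements).element_id)  # lnk-2
```

The sketch illustrates why gaze helps with ambiguous referring expressions: when the utterance contains no informative words, the lexical score is flat across elements and the fixation share decides the ranking, which is the kind of ambiguity resolution the abstract reports improving by 10% F-measure.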