Eye Gaze for Spoken Language Understanding in Multi-modal Conversational Interactions
Dilek Z. Hakkani-Tür, M. Slaney, Asli Celikyilmaz, Larry Heck
Proceedings of the 16th International Conference on Multimodal Interaction, 2014. DOI: 10.1145/2663204.2663277
When humans converse with each other, they naturally amalgamate information from multiple modalities (i.e., speech, gestures, speech prosody, facial expressions, and eye gaze). This paper focuses on eye gaze and its combination with speech. We develop a model that resolves references to visual (screen) elements in a conversational web browsing system. The system detects eye gaze, recognizes speech, and then interprets the user's browsing intent (e.g., click on a specific element) through a combination of spoken language understanding and eye gaze tracking. We experiment with multi-turn interactions collected in a Wizard-of-Oz scenario where users are asked to perform several web-browsing tasks. We compare several gaze features and evaluate their effectiveness when combined with speech-based lexical features. The resulting multi-modal system not only increases user intent (turn) accuracy by 17%, but also resolves the referring-expression ambiguity commonly observed in dialog systems, with a 10% increase in F-measure.
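To make the idea of fusing lexical and gaze evidence concrete, the Python sketch below shows one hypothetical way a spoken referring expression could be resolved to an on-screen element by linearly combining a lexical-overlap score with a gaze-dwell score. This is an illustrative assumption, not the model described in the paper; the element representation, the dwell-time feature, the weighting scheme, and all names (ScreenElement, resolve_reference, gaze_weight) are invented for the example.

from dataclasses import dataclass
from typing import List

@dataclass
class ScreenElement:
    element_id: str
    text: str               # visible text of the element (e.g., a link label)
    gaze_dwell_ms: float    # total fixation time on the element during the turn

def lexical_score(utterance: str, element: ScreenElement) -> float:
    # Fraction of the element's words that also occur in the utterance.
    spoken = set(utterance.lower().split())
    target = set(element.text.lower().split())
    return len(spoken & target) / max(len(target), 1)

def gaze_score(element: ScreenElement, elements: List[ScreenElement]) -> float:
    # Element's share of the total gaze dwell time in the turn.
    total = sum(e.gaze_dwell_ms for e in elements)
    return element.gaze_dwell_ms / total if total > 0 else 0.0

def resolve_reference(utterance: str, elements: List[ScreenElement],
                      gaze_weight: float = 0.5) -> ScreenElement:
    # Hypothetical fusion: linear interpolation of lexical and gaze scores.
    def score(e: ScreenElement) -> float:
        return ((1.0 - gaze_weight) * lexical_score(utterance, e)
                + gaze_weight * gaze_score(e, elements))
    return max(elements, key=score)

if __name__ == "__main__":
    elements = [
        ScreenElement("btn-1", "sign in", gaze_dwell_ms=120.0),
        ScreenElement("lnk-2", "latest news", gaze_dwell_ms=850.0),
        ScreenElement("lnk-3", "sports news", gaze_dwell_ms=90.0),
    ]
    # "click on that one" has no useful lexical overlap, so gaze decides.
    print(resolve_reference("click on that one", elements).element_id)  # lnk-2

In this toy example the utterance carries no discriminative lexical content, so the gaze term drives the decision, which mirrors the abstract's intuition that gaze is most useful exactly when speech alone leaves the referring expression ambiguous.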