{"title":"Context-sensitive help for multimodal dialogue","authors":"H. Hastie, Michael Johnston, Patrick Ehlen","doi":"10.1109/ICMI.2002.1166975","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1166975","url":null,"abstract":"Multimodal interfaces offer users unprecedented flexibility in choosing a style of interaction. However, users are frequently unaware of or forget shorter or more effective multimodal or pen-based commands. This paper describes a working help system that leverages the capabilities of a multimodal interface in order to provide targeted, unobtrusive, context-sensitive help. This multimodal help system guides the user to the most effective way to specify a request, providing transferable knowledge that can be used in future requests without repeatedly invoking the help system.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128073931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Talking heads: which matching between faces and synthetic voices?","authors":"Marc Mersiol, N. Chateau, V. Maffiolo","doi":"10.1109/ICMI.2002.1166971","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1166971","url":null,"abstract":"The integration of synthetic faces and text-to-speech voice synthesis (what we call \"talking heads\") allows new applications in the area of man-machine interfaces. In the near future, talking heads might be useful communicative interface agents. But before making an extensive use of talking heads, several issues have to be checked according to their acceptability by users. An important issue is to make sure that the used synthetic voices match their faces. The scope of this paper is to study the coherence that might exist between synthetic voices and faces. Twenty-four subjects rated the coherence of all the combinations between ten faces and six voices. The main results of this paper show that not all associations between faces and voices are relevant and that some associations are better rated than others according to qualitative criteria.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128135886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data driven design of an ANN/HMM system for on-line unconstrained handwritten character recognition","authors":"Haifeng Li, T. Artières, P. Gallinari","doi":"10.1109/ICMI.2002.1166984","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1166984","url":null,"abstract":"This paper is dedicated to a data driven design method for a hybrid ANN/HMM based handwriting recognition system. On one hand, a data driven designed neural modelling of handwriting primitives is proposed. ANNs are firstly used as state models in a HMM primitive divider that associates each signal frame with an ANN by minimizing the accumulated prediction error. Then, the neural modelling is realized by training each network on its own frame set. Organizing these two steps in an EM algorithm, precise primitive models are obtained. On the other hand, a data driven systematic method is proposed for the HMM topology inference task. All possible prototypes of a pattern class are firstly merged into several clusters by a tabu search aided clustering algorithm. Then a multiple parallel-path HMM is constructed for the pattern class. Experiments prove an 8% recognition improvement with a saving of 50% of system resources, compared to an intuitively designed referential ANN/HMM system.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"121 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133516149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-modal embodied agents scripting","authors":"Y. Arafa, A. Mamdani","doi":"10.1109/ICMI.2002.1167038","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1167038","url":null,"abstract":"Embodied agents present ongoing challenging agenda for research in multi-modal user interfaces and human-computer-interaction. Such agent metaphors will only be widely applicable to online applications when there is a standardised way to map underlying engines with the visual presentation of the agents. This paper delineates the functions and specifications of a mark-up language for scripting the animation of virtual characters. The language is called: Character Mark-up Language (CML) and is an XML-based character attribute definition and animation scripting language designed to aid in the rapid incorporation of lifelike characters/agents into online applications or virtual reality worlds. This multi-modal scripting language is designed to be easily understandable by human animators and easily generated by a software process such as software agents. CML is constructed based jointly on motion and multi-modal capabilities of virtual life-like figures. The paper further illustrates the constructs of the language and describes a real-time execution architecture that demonstrates the use of such a language as a 4G language to easily utilise and integrate MPEG-4 media objects in online interfaces and virtual environments.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114442807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The NESPOLE! multimodal interface for cross-lingual communication $experience and lessons learned","authors":"Loredana Taddei, E. Costantini, A. Lavie","doi":"10.1109/ICMI.2002.1166997","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1166997","url":null,"abstract":"We describe the design, evolution, and development of the user interface components of the NESPOLE! speech-to-speech translation system. The NESPOLE! system was designed for users with medium-to-low levels of computer literacy and Web expertise. The user interface was designed to effectively combine Web browsing, real-time sharing of graphical information and multi-modal annotations using a shared whiteboard, and real-time multilingual speech communication, all within an e-commerce scenario. Data collected in sessions with naive users in several stages in the process of system development formed the basis for improving the effectiveness and usability of the system. We describe this development process, the resulting interface components and the lessons learned.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117157605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel computing-based architecture for mixed-initiative spoken dialogue","authors":"Ryuta Taguma, T. Moriyama, K. Iwano, S. Furui","doi":"10.1109/ICMI.2002.1166968","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1166968","url":null,"abstract":"This paper describes a new method of implementing mixed-initiative spoken dialogue systems based on parallel computing architecture. In a mixed-initiative dialogue, the user as well as the system needs to be capable of controlling the dialogue sequence. In our implementation, various language models corresponding to different dialogue contents, such as requests for information or replies to the system, are built and multiple recognizers using these language models are driven under a parallel computing architecture. The dialogue content of the user is automatically detected based on likelihood scores given by the recognizers, and the content is used to build the dialogue. A transitional probability from one dialogue state uttering a kind of content to another state uttering a different content is incorporated into the likelihood score. A flexible dialogue structure that gives users the initiative to control the dialogue is implemented by this architecture. Real-time dialogue systems for retrieving information about restaurants and food stores are built and evaluated in terms of dialogue content identification rate and keyword accuracy. The proposed architecture has the advantage that the dialogue system can be easily modified without remaking the whole language model.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124688203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Do multimodal signals need to come from the same place? Crossmodal attentional links between proximal and distal surfaces","authors":"R. Gray, H. Tan, J. Young","doi":"10.1109/ICMI.2002.1167035","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1167035","url":null,"abstract":"Previous research has shown that the use of multimodal signals can lead to faster and more accurate responses compared to purely unimodal displays. However, in most cases response facilitation only occurs when the signals are presented in roughly the same spatial location. This would suggest a severe restriction on interface designers: to use multimodal displays effectively all signals must be presented from the same location on the display. We previously reported evidence that the use of haptic cues may provide a solution to this problem as haptic cues presented to a user's back can be used to redirect visual attention to locations on a screen in front of the user (Tan et al., 2001). In the present experiment we used a visual change detection task to investigate whether (i) this type of visual-haptic interaction is robust at low cue validity rates and (ii) similar effects occur for auditory cues. Valid haptic cues resulted in significantly faster change detection times even when they accurately indicated the location of the change on only 20% of the trials. Auditory cues had a much smaller effect on detection times at the high validity rate (80%) than haptic cues and did not significantly improve performance at the 20% validity rate. These results suggest that the use haptic attentional cues may be particularly effective in environments in which information cannot be presented in the same spatial location.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127275694","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The added value of multimodality in the NESPOLE! speech-to-speech translation system: an experimental study","authors":"E. Costantini, F. Pianesi, Susanne Burger","doi":"10.1109/ICMI.2002.1166999","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1166999","url":null,"abstract":"Multimodal interfaces, which combine two or more input modes (speech, pen, touch...), are expected to be more efficient, natural and usable than single-input interfaces. However, the advantage of multimodal input has only been ascertained in highly controlled experimental conditions (S.L. Oviatt, 1997; 1999); in particular, we lack data about what happens with \"real\" human-human, multilingual communication systems. We discuss the results of an experiment aiming to evaluate the added value of multimodality in a \"true\" speech-to-speech translation system, the NESPOLE! system, which provides for multilingual and multimodal communication in the tourism domain, allowing users to interact through the Internet sharing maps, Web-pages and pen-based gestures. We compared two experimental conditions differing as to whether multimodal resources were available: a speech-only condition (SO), and a multimodal condition (MM). Most of the data show tendencies for MM to be better than SO.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120947922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A real-time framework for natural multimodal interaction with large screen displays","authors":"N. Krahnstoever, S. Kettebekov, M. Yeasin, Rajeev Sharma","doi":"10.1109/ICMI.2002.1167020","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1167020","url":null,"abstract":"This paper presents a framework for designing a natural multimodal human computer interaction (HCI) system. The core of the proposed framework is a principled method for combining information derived from audio and visual cues. To achieve natural interaction, both audio and visual modalities are fused along with feedback through a large screen display. Careful design along with due considerations of possible aspects of a systems interaction cycle and integration has resulted in a successful system. The performance of the proposed framework has been validated through the development of several prototype systems as well as commercial applications for the retail and entertainment industry. To assess the impact of these multimodal systems (MMS), informal studies have been conducted. It was found that the system performed according to its specifications in 95% of the cases and that users showed ad-hoc proficiency, indicating natural acceptance of such systems.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127099182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Achieving real-time lip-synch via SVM-based phoneme classification and lip shape refinement","authors":"Taeyoon Kim, Yongsung Kang, Hanseok Ko","doi":"10.1109/ICMI.2002.1167010","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1167010","url":null,"abstract":"In this paper, we develop a real time lip-synch system that activates a 2D avatar's lip motion in synch with incoming speech utterance. To realize \"real time\" operation of the system, we contain the processing time by invoking a merge and split procedure performing coarse-to-fine phoneme classification. At each stage of phoneme classification, we apply a support vector machine (SVM) to constrain the computational load while attaining desirable accuracy. Coarse-to-fine phoneme classification is accomplished via 2 stages of feature extraction, where each speech frame is acoustically analyzed first for 3 classes of lip opening using MFCC as the feature and then a further refined classification for detailed lip shape using formant information. We implemented the system with 2D lip animation that shows the effectiveness of the proposed 2-stage procedure accomplishing the real-time lip-synch task.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126606948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}