{"title":"Gesture patterns during speech repairs","authors":"L. Chen, M. Harper, Francis K. H. Quek","doi":"10.1109/ICMI.2002.1166985","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1166985","url":null,"abstract":"Speech and gesture are two primary modes used in natural human communication; hence, they are important inputs for a multimodal interface to process. One of the challenges for multimodal interfaces is to accurately recognize the words in spontaneous speech. This is partly due to the presence of speech repairs, which seriously degrade the accuracy of current speech recognition systems. Based on the assumption that speech and gesture arise from the same thought process, we would expect to find patterns of gesture that co-occur with speech repairs that can be exploited by a multimodal processing system to more effectively process spontaneous speech. To evaluate this hypothesis, we have conducted a measurement study of gesture and speech repair data extracted from videotapes of natural dialogs. Although we have found that gestures do not always co-occur with speech repairs, we observed that modification gesture patterns have a high correlation with content replacement speech repairs, but rarely occur with content repetitions. These results suggest that gesture patterns can help us to classify different types of speech repairs in order to correct them more accurately.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116624777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The role of gesture in multimodal referring actions","authors":"Frédéric Landragin","doi":"10.1109/ICMI.2002.1166988","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1166988","url":null,"abstract":"When deictic gestures are produced on a touch screen, they can take forms which can lead to several sorts of ambiguities. Considering that the resolution of a multimodal reference requires the identification of the referents and of the context (\"reference domain\") from which these referents are extracted, we focus on the linguistic, gestural, and visual clues that a dialogue system may exploit to comprehend the referring intention. We explore the links between words, gestures and perceptual groups, doing so in terms of the clues that delimit the reference domain. We also show the importance of taking the domain into account for dialogue management, particularly for the comprehension of further utterances, when they seem to implicitly use a pre-existing restriction to a subset of objects. We propose a strategy of multimodal reference resolution based on this notion of reference domain, and we illustrate its efficiency with prototypic examples built from a study of significant referring situations extracted from a corpus. We also present the future directions of our works, concerning some linguistic and task aspects that are not integrated here.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129598788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Active gaze tracking for human-robot interaction","authors":"Rowel Atienza, A. Zelinsky","doi":"10.1109/ICMI.2002.1167004","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1167004","url":null,"abstract":"In our effort to make human-robot interfaces more user-friendly, we built an active gaze tracking system that can measure a person's gaze direction in real-time. Gaze normally tells which object in his/her surrounding a person is interested in. Therefore, it can be used as a medium for human-robot interaction like instructing a robot arm to pick a certain object a user is looking at. We discuss how we developed and put together algorithms for zoom camera calibration, low-level control of active head, face and gaze tracking to create an active gaze tracking system.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130730818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Designing transition networks for multimodal VR-interactions using a markup language","authors":"Marc Erich Latoschik","doi":"10.1109/ICMI.2002.1167030","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1167030","url":null,"abstract":"This article presents one core component for enabling multimodal-speech and gesture-driven interaction in and for virtual environments. A so-called temporal Augmented Transition Network (tATN) is introduced. It allows to integrate and evaluate information from speech, gesture, and a given application context using a combined syntactic/semantic parse approach. This tATN represents the target structure for a multimodal integration markup language (MIML). MIML centers around the specification of multimodal interactions by letting an application designer declare temporal and semantic relations between given input utterance percepts and certain application states in a declarative and portable manner. A subsequent parse pass translates MIML into corresponding tATNs which are directly loaded and executed by a simulation engines scripting facility.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131199249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"State sharing in a hybrid neuro-Markovian on-line handwriting recognition system through a simple hierarchical clustering algorithm","authors":"Haifeng Li, T. Artières, P. Gallinari","doi":"10.1109/ICMI.2002.1166993","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1166993","url":null,"abstract":"HMM has been largely applied in many fields with great success. To achieve a better performance, an easy way is using more states or more free parameters for a better signal modelling. Thus, state sharing and state clipping methods have been proposed to reduce parameter redundancy and to limit the explosive consummation of system resources. We focus on a simple state sharing method for a hybrid neuro-Markovian on-line handwriting recognition system. At first, a likelihood-based distance is proposed for measuring the similarity between two HMM state models. Afterwards, a minimum quantification error aimed hierarchical clustering algorithm is also proposed to select the most representative models. Here, models are shared to the most under the constraint of the minimum system performance loss. As the result, we maintain about 98% of the system performance while about 60% of the parameters are reduced.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"250 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133516443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Covariance-tied clustering method in speaker identification","authors":"Ziqiang Wang, Yang Liu, Peng Ding, Bo Xu","doi":"10.1109/ICMI.2002.1166973","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1166973","url":null,"abstract":"Gaussian mixture models (GMMs) have been successfully applied to the classifier for speaker modeling in speaker identification. However, there are still problems to solve, such as the clustering methods. The conditional k-means algorithm utilizes Euclidean distance taking all data distribution as sphericity, which is not the distribution of the actual data. In this paper we present a new method making use of covariance information to direct the clustering of GMMs, namely covariance-tied clustering. This method consists of two parts: obtaining covariance matrices using the data sharing technique based on a binary tree, and making use of covariance matrices to direct clustering. The experimental results prove that this method leads to worthwhile reductions of error rates in speaker identification. Much remains to be done to explore fully the covariance information.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132102866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards visually-grounded spoken language acquisition","authors":"D. Roy","doi":"10.1109/ICMI.2002.1166977","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1166977","url":null,"abstract":"A characteristic shared by most approaches to natural language understanding and generation is the use of symbolic representations of word and sentence meanings. Frames and semantic nets are examples of symbolic representations. Symbolic methods are inappropriate for applications which require natural language semantics to be linked to perception, as is the case in tasks such as scene description or human-robot interaction. This paper presents two implemented systems, one that learns to generate, and one that learns to understand visually-grounded spoken language. These implementations are part of our on-going effort to develop a comprehensive model of perceptually-grounded semantics.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116269884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A probabilistic dynamic contour model for accurate and robust lip tracking","authors":"Qiang Wang, H. Ai, Guangyou Xu","doi":"10.1109/ICMI.2002.1167007","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1167007","url":null,"abstract":"In this paper a new condensation style contour tracking method called probabilistic dynamic contour (PDC) is proposed for lip tracking: a novel mixture dynamic model is designed to represent shape more compactly and to tolerate larger motions between frames, a measurement model is designed to include multiple visual cues. The proposed PDC tracker has the advantage that it is conceptually general but effectively suitable for lip tracking with the designed dynamic and measurement model. The new tracker improves the traditional condensation style tracker in three aspects: Firstly, the dynamic model is partially derived from the image sequence, so the tracker does not need to learn the dynamics in advance. Secondly, the measurement model is easy to be updated during tracking, which avoids modeling the foreground object in prior. Thirdly, to improve the tracker's speed, a compact representation of shape and a noise model are proposed to reduce the samples required to represent the posterior distribution. An experiment on lip contour tracking shows that the proposed method tracks contours robustly as well as accurately compared to the existing tracking method.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128398820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Techniques for interactive audience participation","authors":"Dan Maynes-Aminzade, R. Pausch, S. Seitz","doi":"10.1109/ICMI.2002.1166962","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1166962","url":null,"abstract":"At SIGGRAPH in 1991, Loren and Rachel Carpenter unveiled an interactive entertainment system that allowed members of a large audience to control an onscreen game using red and green reflective paddles. In the spirit of this approach, we present a new set of techniques that enable members of an audience to participate, either cooperatively or competitively, in shared entertainment experiences. Our techniques allow audiences with hundreds of people to control onscreen activity by (1) leaning left and right in their seats, (2) batting a beach ball while its shadow is used as a pointing device, and (3) pointing laser pointers at the screen. All of these techniques can be implemented with inexpensive, off the shelf hardware. Me have tested these techniques with a variety of audiences; in this paper we describe both the computer vision based implementation and the lessons we learned about designing effective content for interactive audience participation.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134618263","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An improved active shape model for face alignment","authors":"Wei Wang, S. Shan, Wen Gao, B. Cao, Baocai Yin","doi":"10.1109/ICMI.2002.1167050","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1167050","url":null,"abstract":"In this paper, we present several improvements on conventional active shape models (ASM) for face alignment. Despite the accuracy and robustness of ASMs in image alignment, its performance depends heavily on the initial parameters of the shape model, as well as the local texture model for each landmark and the corresponding local matching strategy. In this work, to improve ASMs for face alignment, several measures are taken. First, salient facial features, such as the eyes and the mouth, are localized based on a face detector. These salient features are then utilized to initialize the shape model and provide region constraints on the subsequent iterative shape searching. Secondly, we exploit edge information to construct better local texture models for landmarks on the face contour. The edge intensity at the contour landmark is used as a self-adaptive weight when calculating the Mahalanobis distance between the candidate and reference profile. Thirdly, to avoid unreasonable shift from pre-localized salient features, landmarks around the salient features are adjusted before applying global subspace constraints. Experiments on a database containing 300 labeled face images show that the proposed method performs significantly better than traditional ASMs.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133333330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}