{"title":"Towards monitoring human activities using an omnidirectional camera","authors":"Xilin Chen, Jie Yang","doi":"10.1109/ICMI.2002.1167032","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1167032","url":null,"abstract":"We propose an approach for monitoring human activities in an indoor environment using an omnidirectional camera. Robustly tracking people is prerequisite for modeling and recognizing human activities. An omnidirectional camera mounted on the ceiling is less prone to problems of occlusion. We use the Markov Random Field (MRF) to present both background and foreground, and adapt models effectively against environment changes. We employ a deformable model to adapt the foreground models to optimally match objects in different position within a pattern of view of the omnidirectional camera. In order to monitor human activity, we represent positions of people as spatial points and analyze moving trajectories within a time-spatial window. The method provides an efficient way to monitoring high-level human activities without exploring identities.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"623 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117085765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards universal speech recognition","authors":"Zhirong Wang, Umut Topkara, Tanja Schultz, A. Waibel","doi":"10.1109/ICMI.2002.1167001","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1167001","url":null,"abstract":"The increasing interest in multilingual applications like speech-to-speech translation systems is accompanied by the need for speech recognition front-ends in many languages that can also handle multiple input languages at the same time. We describe a universal speech recognition system that fulfills such needs. It is trained by sharing speech and text data across languages and thus reduces the number of parameters and overhead significantly at the cost of only slight accuracy loss. The final recognizer eases the burden of maintaining several monolingual engines, makes dedicated language identification obsolete and allows for code-switching within an utterance. To achieve these goals we developed new methods for constructing multilingual acoustic models and multilingual n-gram language models.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117182865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Viewing and analyzing multimodal human-computer tutorial dialogue: a database approach","authors":"Jack Mostow, J. Beck, Raghuvee Chalasani, Andrew Cuneo, Peng Jia","doi":"10.1109/ICMI.2002.1166981","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1166981","url":null,"abstract":"It is easier to record logs of multimodal human-computer tutorial dialogue than to make sense of them. In the 2000-2001 school year, we logged the interactions of approximately 400 students who used Project LISTEN's Reading Tutor and who read aloud over 2.4 million words. We discuss some difficulties we encountered converting the logs into a more easily understandable database. It is faster to write SQL queries to answer research questions than to analyze complex log files each time. The database also permits us to construct a viewer to examine individual Reading Tutor-student interactions. This combination of queries and viewable data has turned out to be very powerful, and we discuss how we have combined them to answer research questions.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134101380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prosody based co-analysis for continuous recognition of coverbal gestures","authors":"S. Kettebekov, M. Yeasin, Rajeev Sharma","doi":"10.1109/ICMI.2002.1166986","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1166986","url":null,"abstract":"Although recognition of natural speech and gestures have been studied extensively, previous attempts at combining them in a unified framework to boost classification were mostly semantically motivated, e.g., keyword-gesture co-occurrence. Such formulations inherit the complexity of natural language processing. This paper presents a Bayesian formulation that uses a phenomenon of gesture and speech articulation for improving accuracy of automatic recognition of continuous coverbal gestures. The prosodic features from the speech signal were co-analyzed with the visual signal to learn the prior probability of co-occurrence of the prominent spoken segments with the particular kinematical phases of gestures. It was found that the above co-analysis helps in detecting and disambiguating small hand movements, which subsequently improves the rate of continuous gesture recognition. The efficacy of the proposed approach was demonstrated on a large database collected front the weather channel broadcast. This formulation opens new avenues for bottom-up frameworks of multimodal integration.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127570013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive dialog based upon multimodal language acquisition","authors":"Sorin Dusan, J. Flanagan","doi":"10.1109/ICMI.2002.1166982","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1166982","url":null,"abstract":"Communicating by voice with speech-enabled computer applications based on preprogrammed rule grammars suffers from constrained vocabulary and sentence structures. Deviations from the allowed language result in an unrecognized utterance that will not be understood and processed by the system. One way to alleviate this restriction consists in allowing the user to expand the computer's recognized and understood language by teaching the computer system new language knowledge. We present an adaptive dialog system capable of learning from users new words, phrases and sentences, and their corresponding meanings. User input incorporates multiple modalities, including speaking, typing, pointing, drawing and image capturing. The allowed language can thus be expanded in real time by users according to their preferences. By acquiring new language knowledge the system becomes more capable in specific tasks, although its language is still constrained.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133115913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lecture and presentation tracking in an intelligent meeting room","authors":"I. Rogina, Thomas Schaaf","doi":"10.1109/ICMI.2002.1166967","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1166967","url":null,"abstract":"Archiving, indexing, and later browsing through stored presentations and lectures is increasingly being used. We have investigated the special problems and advantages of lectures and propose the design and adaptation of a speech recognizer to a lecture such that the recognition accuracy can be significantly improved by prior analysis of the presented documents using a special class-based language model. We define a tracking accuracy measure which measures how well a system can automatically align recognized words with parts of a presentation and show that by prior exploitation of the presented documents, the tracking accuracy can be improved. The system described in this paper is part of an intelligent meeting room developed in the European Union-sponsored project FAME (Facilitating Agent for Multicultural Exchange).","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126386204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improved information maximization based face and facial feature detection from real-time video and application in a multi-modal person identification system","authors":"Ziyou Xiong, Yunqiang Chen, Roy Wang, Thomas S. Huang","doi":"10.1109/ICMI.2002.1167048","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1167048","url":null,"abstract":"In this paper an improved face detection method based on our previous information-based maximum discrimination approach is presented that maximizes the discrimination between face and non-face examples in a training set without using color or motion information. A short review of our previous method is given together with a description of a recent improvement of its detection speed. A person identification system has been developed that performs multi-modal person identification in real-time video based on this newly improved face detection method together with speaker identification.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116064863","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi modal user interaction in an automatic pool trainer","authors":"L. B. Larsen, Morten Damm Jensen, Wisdom Kobby Vodzi","doi":"10.1109/ICMI.2002.1167022","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1167022","url":null,"abstract":"This paper presents the human-computer interaction in an automatic pool trainer currently being developed at the Center for PersonKommunikation, Aalborg University. The aim of the system is to automate (parts of) the learning process, in this case of the game of pool. The automated pool trainer (APT) utilises multi modal, agent driven user-system communication, to facilitate the user interaction. To allow the user the necessary freedom of movement when addressing the task, system output is presented on a wall-mounted screen and is augmented by a laser drawing lines and points directly on the pool table surface. User interaction is either carried out via a spoken dialogue with an animated interface agent, or by using a touch screen panel. The paper describes the philosophy on which the system is designed, as well as the system architecture and individual modules. The user interaction is described and the paper concludes with a presentation of some test results and a discussion of the suitability of the presented and similar systems.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"379 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116578936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Flexi-modal and multi-machine user interfaces","authors":"B. Myers, Robert G. Malkin, M. Bett, A. Waibel, Benjamin Bostwick, Robert C. Miller, Jie Yang, Matthias Denecke, Edgar Seemann, Jie Zhu, Choon Hong Peck, Dave Kong, Jeffrey Nichols, W. Scherlis","doi":"10.1109/ICMI.2002.1167019","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1167019","url":null,"abstract":"We describe our system which facilitates collaboration using multiple modalities, including speech, handwriting, gestures, gaze tracking, direct manipulation, large projected touch-sensitive displays, laser pointer tracking, regular monitors with a mouse and keyboard, and wireless networked handhelds. Our system allows multiple, geographically dispersed participants to simultaneously and flexibly mix different modalities using the right interface at the right time on one or more machines. We discuss each of the modalities provided, how they were integrated in the system architecture, and how the user interface enabled one or more people to flexibly use one or more devices.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131203976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Animating arbitrary topology 3D facial model using the MPEG-4 FaceDefTables","authors":"D. Jiang, Wen Gao, Zhiguo Li, Zhaoqi Wang","doi":"10.1109/ICMI.2002.1167049","DOIUrl":"https://doi.org/10.1109/ICMI.2002.1167049","url":null,"abstract":"In this paper we put forward a method to animate an arbitrary topology facial model (ATFM) based on the MPEG-4 standard. This paper deals mainly with the problem of building the FaceDefTables, which play a very important role in the MPEG-4 based facial animation system. The FaceDefTables for our predefined standard facial model (SFM) are built using the interpolation method. Since the FaceDefTables depend on facial models, the FaceDefTables for the SFM can be applied only to those facial models having the same topology as the SFM. For those facial models that have different topology, we have to build the FaceDefTables accordingly. To acquire the FaceDefTables for ATFM, we first select feature points on ATFM, then transform the SFM according to those feature points. Finally, we project each vertex on the ATFM to the transformed SFM and build the FaceDefTables for the ATFM according to the projection position. With the FaceDefTables we built, realistic animation results have been acquired.","PeriodicalId":208377,"journal":{"name":"Proceedings. Fourth IEEE International Conference on Multimodal Interfaces","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-10-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133834648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}