{"title":"Similarity matching of continuous melody contours for humming querying of melody databases","authors":"Yongwei Zhu, M. Kankanhalli, Q. Tian","doi":"10.1109/MMSP.2002.1203293","DOIUrl":"https://doi.org/10.1109/MMSP.2002.1203293","url":null,"abstract":"Music query-by-humming is a challenging problem since the humming query inevitably contains much variation and inaccuracy. In this paper, we present a novel melody similarity matching technique, which is based on continuous melody contour. We introduce a contour alignment technique, which addresses the robustness and efficiency issues. We also present a new melody similarity metric, which is computed directly on continuous melody contours of the query data. This approach cleanly separates alignment and similarity measurement in the retrieval process. Our melody alignment method can reduce the matching candidates to 1.7% with a 90% correct alignment rate. The overall retrieval system achieved 88% correct retrieval in the top-20 rank lists.","PeriodicalId":398813,"journal":{"name":"2002 IEEE Workshop on Multimedia Signal Processing.","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123444745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the capacity of the reachback channel in wireless sensor networks","authors":"J. Barros, S. Servetto","doi":"10.1109/MMSP.2002.1203332","DOIUrl":"https://doi.org/10.1109/MMSP.2002.1203332","url":null,"abstract":"We consider the problem of reachback communication in wireless sensor networks: multiple sensors are deployed on a field, and they collect local measurements of some random process which then need to be encoded and reproduced at a remote location. In this paper we present a number of information theoretic bounds on the performance of a distributed transmission array that is formed by a large number of cheap, unreliable sensors. We formulate this problem in terms of classical network information theory concepts, a formulation which leads us to consider two important cases: transmission of correlated sources over multiple independent channels, and rate/distortion with separate encoders.","PeriodicalId":398813,"journal":{"name":"2002 IEEE Workshop on Multimedia Signal Processing.","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131917258","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A speech-centric perspective for human-computer interface","authors":"L. Deng, A. Acero, Ye-Yi Wang, Kuansan Wang, H. Hon, J. Droppo, M. Mahajan, Xuedong Huang","doi":"10.1109/MMSP.2002.1203296","DOIUrl":"https://doi.org/10.1109/MMSP.2002.1203296","url":null,"abstract":"Speech technology has been playing a central role in enhancing human-machine interactions, especially for small devices for which GUI has obvious limitations. The speech-centric perspective for human-computer interface advanced in this paper derives from the view that speech is the only natural and expressive modality to enable people to access information from and to interact with any device. In this paper, we describe the work conducted at Microsoft Research, in the project codenamed Dr.Who, aimed at the development of enabling technologies for speech-centric multimodal human-computer interaction. In particular, we present MiPad as the first Dr.Who application, which specifically addresses the mobile user interaction scenario. MiPad is a wireless mobile PDA prototype that enables users to accomplish many common tasks using a multimodal spoken language interface and wireless-data technologies. It fully integrates continuous speech recognition and spoken language understanding, and provides a novel solution to the prevailing problem of pecking with tiny styluses or typing on minuscule keyboards in today's PDAs and smart phones.","PeriodicalId":398813,"journal":{"name":"2002 IEEE Workshop on Multimedia Signal Processing.","volume":"71 27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116558024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Context based coding of quantized alpha planes for video objects","authors":"S. M. Aghito, Søren Forchhammer","doi":"10.1109/MMSP.2002.1203258","DOIUrl":"https://doi.org/10.1109/MMSP.2002.1203258","url":null,"abstract":"In object based video, each frame is a composition of objects that are coded separately. The composition is performed through the alpha plane that represents the transparency of the object. We present an alternative to MPEG-4 for coding of alpha planes that considers their specific properties. Comparisons in terms of rate and distortion are provided, showing that the proposed coding scheme for still alpha planes is better than the algorithms for I-frames used in MPEG-4.","PeriodicalId":398813,"journal":{"name":"2002 IEEE Workshop on Multimedia Signal Processing.","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120959524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An open source development tool for anthropomorphic dialog agent: face image synthesis and lip synchronization","authors":"T. Yotsukura, S. Morishima","doi":"10.1109/MMSP.2002.1203298","DOIUrl":"https://doi.org/10.1109/MMSP.2002.1203298","url":null,"abstract":"We describe the design and report on the development of an open source toolkit for building an easily customizable anthropomorphic dialog agent. This toolkit consists of four modules: multi-modal dialog integration, speech recognition, speech synthesis, and face image synthesis. In this paper, we focus on the construction of the agent's face image synthesis.","PeriodicalId":398813,"journal":{"name":"2002 IEEE Workshop on Multimedia Signal Processing.","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131210136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Recent progress in spontaneous speech recognition and understanding","authors":"S. Furui","doi":"10.1109/MMSP.2002.1203294","DOIUrl":"https://doi.org/10.1109/MMSP.2002.1203294","url":null,"abstract":"How to recognize and understand spontaneous speech is one of the most important issues in state-of-the-art speech recognition technology. In this context, a five-year large scale national project entitled \"Spontaneous speech: corpus and processing technology\" started in Japan in 1999. This paper gives an overview of the project and reports on the major results of experiments that have been conducted so far at Tokyo Institute of Technology, including spontaneous presentation speech recognition, automatic speech summarization, and message-driven speech recognition. The paper also discusses the most important research problems to be solved in order to achieve ultimate spontaneous speech recognition systems.","PeriodicalId":398813,"journal":{"name":"2002 IEEE Workshop on Multimedia Signal Processing.","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128280509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design-Trotter: a multimedia embedded systems design space exploration tool","authors":"Y. Moullec, J. Diguet, J. Philippe","doi":"10.1109/MMSP.2002.1203342","DOIUrl":"https://doi.org/10.1109/MMSP.2002.1203342","url":null,"abstract":"In this paper we present the intra-function dynamic estimation step of our system-level design space exploration tool. The aim of our global methodology is to fill the gap between system specification and the tasks of the system design flow, in order to converge towards an efficient system-on-chip architecture for multimedia applications. In this context, the intra-function estimation step rapidly provides, for each functional block of the specification, trade-off curves that represent a large set of parallelism options for both data-transfer and processing resources. A set of methods used to achieve this estimation process is detailed.","PeriodicalId":398813,"journal":{"name":"2002 IEEE Workshop on Multimedia Signal Processing.","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128597886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A ranking technique for fast audio identification","authors":"F. Kurth","doi":"10.1109/MMSP.2002.1203278","DOIUrl":"https://doi.org/10.1109/MMSP.2002.1203278","url":null,"abstract":"We introduce a novel ranking technique for fast and robust audio identification. In this approach, identification is performed by evaluating certain parts of correlation sequences. Starting from a general framework for content-based audio identification, we show how to integrate our new algorithm into a fast index-based search algorithm. We demonstrate the capabilities of our approach by considering the tasks of identifying highly distorted audio material and searching fragments in large-scale MP3 databases.","PeriodicalId":398813,"journal":{"name":"2002 IEEE Workshop on Multimedia Signal Processing.","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122313557","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A closed-form solution to the autocorrelation matching method for wireless MIMO communications","authors":"Hui Luo, L. Luo, Ruey-Wen Liu","doi":"10.1109/MMSP.2002.1203328","DOIUrl":"https://doi.org/10.1109/MMSP.2002.1203328","url":null,"abstract":"Wireless MIMO (multiple input multiple output) communication techniques are proposed to boost spectrum efficiency using antenna arrays so that broadband data such as multimedia signals can be transmitted over a limited bandwidth. The AM (autocorrelation matching) method is a SOS (second-order statistics) based blind MIMO-FIR equalization technique that may enable wireless MIMO communications without identifying MIMO-FIR channels. This paper presents a closed-form solution to computing the optimal zero-forcing equalizer for the AM method using the knowledge of second-order statistics of the transmitted signals and received signals. Numerical simulations are given to show the effectiveness of the AM method and the performance of the closed-form solution.","PeriodicalId":398813,"journal":{"name":"2002 IEEE Workshop on Multimedia Signal Processing.","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124676748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Personalized video summary using visual semantic annotations and automatic speech transcriptions","authors":"Belle L. Tseng, Ching-Yung Lin","doi":"10.1109/MMSP.2002.1203234","DOIUrl":"https://doi.org/10.1109/MMSP.2002.1203234","url":null,"abstract":"A personalized video summary is dynamically generated in our video personalization and summary system based on user preference and usage environment. The three-tier personalization system adopts the server-middleware-client architecture in order to maintain, select, adapt, and deliver rich media content to the user. The server stores the content sources along with their corresponding MPEG-7 metadata descriptions. In this paper, the metadata includes visual semantic annotations and automatic speech transcriptions. Our personalization and summarization engine in the middleware selects the optimal set of desired video segments by matching shot annotations and sentence transcripts with user preferences. The process includes shot-to-sentence alignment, summary segment selection, and user preference matching and propagation. As a result, the relevant visual shot and audio sentence segments are aggregated and composed into a personalized video summary.","PeriodicalId":398813,"journal":{"name":"2002 IEEE Workshop on Multimedia Signal Processing.","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-12-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128977936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}