{"title":"Binaural speech segregation based on pitch and azimuth tracking","authors":"John F. Woodruff, Deliang Wang","doi":"10.1109/ICASSP.2012.6287862","DOIUrl":"https://doi.org/10.1109/ICASSP.2012.6287862","url":null,"abstract":"We propose an approach to binaural speech segregation in reverberation based on pitch and azimuth cues. These cues are integrated within a statistical tracking framework to estimate up to two concurrent pitch frequencies and three concurrent azimuth angles. The tracking framework implicitly estimates binary time-frequency masks by solving a data association problem, thereby performing speech segregation. Experimental results show that the proposed approach compares favorably to existing two-microphone systems despite requiring less prior information. The benefit of the proposed approach is most pronounced in conditions with substantial reverberation or for closely spaced sources.","PeriodicalId":6443,"journal":{"name":"2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2012-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85732199","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improved minimum converted trajectory error training for real-time speech-to-lips conversion","authors":"Wei Han, Lijuan Wang, F. Soong, Bo Yuan","doi":"10.1109/ICASSP.2012.6288921","DOIUrl":"https://doi.org/10.1109/ICASSP.2012.6288921","url":null,"abstract":"Gaussian mixture model (GMM) based speech-to-lips conversion often operates in two alternative ways: batch conversion and sliding window-based conversion for real-time processing. Previously, Minimum Converted Trajectory Error (MCTE) training has been proposed to improve the performance of batch conversion. In this paper, we extend previous work and propose a new training criterion, MCTE for Real-time conversion (R-MCTE), to explicitly optimize the quality of sliding window-based conversion. In R-MCTE, we use the probabilistic descent method to refine model parameters by minimizing the error on real-time converted visual trajectories over training data. Objective evaluations on the LIPS 2008 Visual Speech Synthesis Challenge data set show that the proposed method achieves both good lip animation performance and low delay in real-time conversion.","PeriodicalId":6443,"journal":{"name":"2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2012-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85859314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A novel eye region based privacy protection scheme","authors":"Dohyoung Lee, K. Plataniotis","doi":"10.1109/ICASSP.2012.6288261","DOIUrl":"https://doi.org/10.1109/ICASSP.2012.6288261","url":null,"abstract":"This paper introduces a novel eye region scrambling scheme capable of protecting privacy sensitive eye region information present in video contents. The proposed system consists of an automatic eye detection module followed by a privacy enabling JPEG XR encoder module. An object detection method based on a probabilistic model of image generation is used in conjunction with skin-tone segmentation to accurately locate eye regions in real time. The utilized JPEG XR encoder effectively deteriorates the visual quality of the privacy sensitive eye region at low computational cost. Performance of the proposed solution is validated using benchmark face recognition algorithms on a face image database. Experimental results indicate that the proposed solution is able to conceal identity by preventing successful identification at low computational cost.","PeriodicalId":6443,"journal":{"name":"2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2012-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85941566","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analysis of the spherical wave truncation error for spherical harmonic soundfield expansions","authors":"S. Brown, Shuai Wang, D. Sen","doi":"10.1109/ICASSP.2012.6287803","DOIUrl":"https://doi.org/10.1109/ICASSP.2012.6287803","url":null,"abstract":"Three dimensional soundfield recording and reproduction is an area of ongoing investigation, and its implementation is increasingly achieved through use of the infinite Spherical Harmonic soundfield expansion. Perfect recording or reconstruction requires an infinite number of microphones or loudspeakers, respectively. Thus, real-world approximations to both require spatial discretisation, which truncates the soundfield expansion and loses some of the soundfield information. The resulting truncation error is the focus of this paper, specifically for soundfields comprising spherical waves. We define two norms of the truncation-error-to-signal ratio, L2 and L∞, for comparison and use in different situations. Finally, we observe how some of these errors converge to the plane wave case under certain circumstances.","PeriodicalId":6443,"journal":{"name":"2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2012-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85985395","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Inventory-style speech enhancement with uncertainty-of-observation techniques","authors":"R. M. Nickel, Ramón Fernández Astudillo, D. Kolossa, Steffen Zeiler, Rainer Martin","doi":"10.1109/ICASSP.2012.6288954","DOIUrl":"https://doi.org/10.1109/ICASSP.2012.6288954","url":null,"abstract":"We present a new method for inventory-style speech enhancement that significantly improves over earlier approaches [1]. Inventory-style enhancement attempts to resynthesize a clean speech signal from a noisy signal via corpus-based speech synthesis. The advantage of such an approach is that one is not bound to trade noise suppression against signal distortion in the same way that most traditional methods do. A significant improvement in perceptual quality is typically the result. Disadvantages of this new approach, however, include speaker dependency, increased processing delays, and the necessity of substantial system training. Earlier published methods relied on a-priori knowledge of the expected noise type during the training process [1]. In this paper we present a new method that exploits uncertainty-of-observation techniques to circumvent the need for noise-specific training. Experimental results show that the new method not only matches but outperforms the earlier approaches in perceptual quality.","PeriodicalId":6443,"journal":{"name":"2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2012-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76721211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Face recognition based on nonsubsampled contourlet transform and block-based kernel Fisher linear discriminant","authors":"Biao Wang, Weifeng Li, Q. Liao","doi":"10.1109/ICASSP.2012.6288183","DOIUrl":"https://doi.org/10.1109/ICASSP.2012.6288183","url":null,"abstract":"Face representation, including both feature extraction and feature selection, is the key issue for a successful face recognition system. In this paper, we propose a novel face representation scheme based on nonsubsampled contourlet transform (NSCT) and block-based kernel Fisher linear discriminant (BKFLD). NSCT is a newly developed multiresolution analysis tool and has the ability to extract both intrinsic geometrical structure and directional information in images, which implies its discriminative potential for effective feature extraction of face images. By encoding the NSCT coefficient images with the local binary pattern (LBP) operator, we can obtain a robust feature set. Furthermore, kernel Fisher linear discriminant is introduced to select the most discriminative feature sets, and the block-based scheme is incorporated to address the small sample size problem. Face recognition experiments on the FERET database demonstrate the effectiveness of our proposed approach.","PeriodicalId":6443,"journal":{"name":"2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2012-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76886966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Handling incomplete matrix data via continuous-valued infinite relational model","authors":"Tomohiko Suzuki, Takuma Nakamura, Yasutoshi Ida, Takashi Matsumoto","doi":"10.1109/ICASSP.2012.6288338","DOIUrl":"https://doi.org/10.1109/ICASSP.2012.6288338","url":null,"abstract":"A continuous-valued infinite relational model is proposed as a solution to the co-clustering problem which arises in matrix data or tensor data calculations. The model is a probabilistic model utilizing the framework of Bayesian Nonparametrics, which can estimate the number of components in posterior distributions. The original Infinite Relational Model cannot handle continuous-valued or multi-dimensional data directly. Our proposed model overcomes these data expression restrictions through the proposed likelihood, which can handle many types of data. The posterior distribution is estimated via variational inference. Using real-world data, we show that the proposed model outperforms the original model in terms of AUC score and efficiency for a movie recommendation task.","PeriodicalId":6443,"journal":{"name":"2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2012-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80838909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A study of discriminative feature extraction for i-vector based acoustic sniffing in IVN acoustic model training","authors":"Yu Zhang, Jian Xu, Zhijie Yan, Qiang Huo","doi":"10.1109/ICASSP.2012.6288814","DOIUrl":"https://doi.org/10.1109/ICASSP.2012.6288814","url":null,"abstract":"Recently, we proposed an i-vector approach to acoustic sniffing for irrelevant variability normalization based acoustic model training in large vocabulary continuous speech recognition (LVCSR). Its effectiveness has been confirmed by experimental results on the Switchboard-1 conversational telephone speech transcription task. In this paper, we study several discriminative feature extraction approaches in i-vector space to improve both recognition accuracy and run-time efficiency. New experimental results are reported on a much larger scale LVCSR task with about 2,000 hours of training data.","PeriodicalId":6443,"journal":{"name":"2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2012-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83603373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multichannel speech dereverberation and separation with optimized combination of linear and non-linear filtering","authors":"M. Togami, Y. Kawaguchi, Ryu Takeda, Y. Obuchi, N. Nukaga","doi":"10.1109/ICASSP.2012.6288809","DOIUrl":"https://doi.org/10.1109/ICASSP.2012.6288809","url":null,"abstract":"In this paper, we propose a multichannel speech dereverberation and separation technique which is effective even when there are multiple speakers and each speaker's transfer function is time-varying due to movement of the corresponding speaker's head. For robustness against such fluctuation, the proposed method optimizes linear and non-linear filtering simultaneously from a probabilistic perspective based on a probabilistic reverberant transfer-function model, PRTFM. PRTFM is an extension of the conventional time-invariant transfer-function model to uncertain conditions, and can also be regarded as an extension of the recently proposed blind local Gaussian modeling. The linear filtering and the non-linear filtering are optimized in the MMSE (Minimum Mean Square Error) sense during parameter optimization. The proposed method is evaluated in a reverberant meeting room and is shown to be effective.","PeriodicalId":6443,"journal":{"name":"2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2012-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76287121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Trade-off evaluation for speech enhancement algorithms with respect to the a priori SNR estimation","authors":"Pei Chee Yong, S. Nordholm, H. H. Dam","doi":"10.1109/ICASSP.2012.6288957","DOIUrl":"https://doi.org/10.1109/ICASSP.2012.6288957","url":null,"abstract":"In this paper, a modified a priori SNR estimator is proposed for speech enhancement. The well-known decision-directed (DD) approach is modified by matching each gain function with the noisy speech spectrum of the current frame rather than the previous one. The proposed algorithm eliminates speech transient distortion and reduces the impact of the choice of gain function on the level of smoothing in the SNR estimate. An objective evaluation metric is employed to measure the trade-off between musical noise, noise reduction and speech distortion. Performance is evaluated and compared among a modified sigmoid gain function, the state-of-the-art log-spectral amplitude estimator and the Wiener filter. Simulation results show that the modified DD approach performs better in terms of the trade-off evaluation.","PeriodicalId":6443,"journal":{"name":"2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2012-03-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73598202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}