{"title":"A practical, self-adaptive voice activity detector for speaker verification with noisy telephone and microphone data","authors":"T. Kinnunen, Padmanabhan Rajan","doi":"10.1109/ICASSP.2013.6639066","DOIUrl":"https://doi.org/10.1109/ICASSP.2013.6639066","url":null,"abstract":"A voice activity detector (VAD) plays a vital role in robust speaker verification, where energy VAD is most commonly used. Energy VAD works well in noise-free conditions but deteriorates in noisy conditions. One way to tackle this is to introduce speech enhancement preprocessing. We study an alternative, likelihood ratio based VAD that trains speech and nonspeech models on an utterance-by-utterance basis from mel-frequency cepstral coefficients (MFCCs). The training labels are obtained from enhanced energy VAD. As the speech and nonspeech models are re-trained for each utterance, minimum assumptions of the background noise are made. According to both VAD error analysis and speaker verification results utilizing state-of-the-art i-vector system, the proposed method outperforms energy VAD variants by a wide margin. We provide open-source implementation of the method.","PeriodicalId":183968,"journal":{"name":"2013 IEEE International Conference on Acoustics, Speech and Signal Processing","volume":"164 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134593848","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Grapheme and multilingual posterior features for under-resourced speech recognition: A study on Scottish Gaelic","authors":"Ramya Rasipuram, P. Bell, M. Magimai.-Doss","doi":"10.1109/ICASSP.2013.6639087","DOIUrl":"https://doi.org/10.1109/ICASSP.2013.6639087","url":null,"abstract":"Standard automatic speech recognition (ASR) systems use phonemes as subword units. Thus, one of the primary resource required to build a good ASR system is a well developed phoneme pronunciation lexicon. However, under-resourced languages typically lack such lexical resources. In this paper, we investigate recently proposed grapheme-based ASR in the framework of Kullback-Leibler divergence based hidden Markov model (KL-HMM) for underresourced languages, particularly Scottish Gaelic which has no lexical resources. More specifically, we study the use of grapheme and multilingual phoneme class conditional probabilities (posterior features) as feature observations in KL-HMM. ASR studies conducted show that the proposed approach yields better system compared to the conventional HMM/GMM approach using cepstral features. Furthermore, grapheme posterior features estimated using both auxiliary data and Gaelic data yield the best system.","PeriodicalId":183968,"journal":{"name":"2013 IEEE International Conference on Acoustics, Speech and Signal Processing","volume":"128 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128693161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Clustering similar acoustic classes in the Fishervoice framework","authors":"Na Li, W. Jiang, H. Meng, Zhifeng Li","doi":"10.1109/ICASSP.2013.6639167","DOIUrl":"https://doi.org/10.1109/ICASSP.2013.6639167","url":null,"abstract":"In the Fishervoice (FSH) based framework, the mean supervectors of the speaker models are divided into several subvectors by mixture index. However, this division strategy cannot capture local acoustic class structure information among similar acoustic classes or discriminative information between different acoustic classes. In order to verify whether or not local structure information can help improve system performance, we develop five different speaker supervector segmentation methods. Experiments on NIST SRE08 prove that clustering similar acoustic classes together improves the system performance. In particular, the proposed method of equal size clustering achieves 5.1% relative decrease on EER compared to FSH1.","PeriodicalId":183968,"journal":{"name":"2013 IEEE International Conference on Acoustics, Speech and Signal Processing","volume":"169 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127308659","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Identification of genes consistently co-expressed in multiple microarray datasets by a genome-wide Bi-CoPaM approach","authors":"Basel Abu-Jamous, Rui Fa, D. Roberts, A. Nandi","doi":"10.1109/ICASSP.2013.6637835","DOIUrl":"https://doi.org/10.1109/ICASSP.2013.6637835","url":null,"abstract":"Many methods have been proposed to identify informative subsets of genes in microarray studies in order to focus the research. For instance, the recently proposed binarization of consensus partition matrices (Bi-CoPaM) method has, amongst its various features, the ability to generate tight clusters of genes while leaving many genes unassigned from all clusters. We propose exploiting this particular feature by applying the Bi-CoPaM over genome-wide microarray data from multiple datasets to generate more clusters than required. Then, these clusters are tightened so that most of their genes are left unassigned from all clusters, and most of the clusters are left totally empty. The tightened clusters, which are still not empty, include those genes that are consistently co-expressed in multiple datasets when examined by various clustering methods. An example of this is demonstrated in this paper for cyclic and acyclic genes as well as for genes that are highly expressed and that are not. Thus, the results of our proposed approach cannot be reproduced by other methods of genes' periodicity identification or by other methods of clustering.","PeriodicalId":183968,"journal":{"name":"2013 IEEE International Conference on Acoustics, Speech and Signal Processing","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127850095","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A comparative study of different classifiers for detecting depression from spontaneous speech","authors":"Sharifa Alghowinem, Roland Göcke, M. Wagner, J. Epps, Tom Gedeon, M. Breakspear, G. Parker","doi":"10.1109/ICASSP.2013.6639227","DOIUrl":"https://doi.org/10.1109/ICASSP.2013.6639227","url":null,"abstract":"Accurate detection of depression from spontaneous speech could lead to an objective diagnostic aid to assist clinicians to better diagnose depression. Little thought has been given so far to which classifier performs best for this task. In this study, using a 60-subject real-world clinically validated dataset, we compare three popular classifiers from the affective computing literature - Gaussian Mixture Models (GMM), Support Vector Machines (SVM) and Multilayer Perceptron neural networks (MLP) - as well as the recently proposed Hierarchical Fuzzy Signature (HFS) classifier. Among these, a hybrid classifier using GMM models and SVM gave the best overall classification results. Comparing feature, score, and decision fusion, score fusion performed better for GMM, HFS and MLP, while decision fusion worked best for SVM (both for raw data and GMM models). Feature fusion performed worse than other fusion methods in this study. We found that loudness, root mean square, and intensity were the voice features that performed best to detect depression in this dataset.","PeriodicalId":183968,"journal":{"name":"2013 IEEE International Conference on Acoustics, Speech and Signal Processing","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126193558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mitigation of clipping in sensors","authors":"Shang-Kee Ting, A. H. Sayed","doi":"10.1109/ICASSP.2013.6638803","DOIUrl":"https://doi.org/10.1109/ICASSP.2013.6638803","url":null,"abstract":"One major source of nonlinear distortion in analog-to-digital converters (ADCs) is clipping. The problem introduces spurious noise across the bandwidth of the sampled data. Prior works recover the signal from the acquired samples by relying on oversampling or on the assumption of vacant frequency bands and on the use of sparse signal representations. In this work, we propose a different approach, which uses two streams of data to mitigate the clipping distortion. Simulation results show an SNR improvement of 9dB, while the conventional approaches may even degrade the SNR in some situations.","PeriodicalId":183968,"journal":{"name":"2013 IEEE International Conference on Acoustics, Speech and Signal Processing","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133139472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive feature split selection for co-training: Application to tire irregular wear classification","authors":"Wei Du, R. Phlypo, T. Adalı","doi":"10.1109/ICASSP.2013.6638308","DOIUrl":"https://doi.org/10.1109/ICASSP.2013.6638308","url":null,"abstract":"Co-training is a practical and powerful semi-supervised learning method. It yields high classification accuracy with a training data set containing only a small set of labeled data. Successful performance in co-training requires two important conditions on the features: diversity and sufficiency. In this paper, we propose a novel mutual information (MI) based approach inspired by the idea of dependent component analysis (DCA) to achieve feature splits that are maximally independent between-subsets (diversity) or within-subsets (sufficiency). We evaluate the relationship between the classification performance and the relative importance of the two conditions. Experimental results on actual tire data indicate that compared to diversity, sufficiency has a more significant impact on their classification accuracy. Further results show that co-training with feature splits obtained by the MI-based approach yields higher accuracy than supervised classification and significantly higher when using a small set of labeled training data.","PeriodicalId":183968,"journal":{"name":"2013 IEEE International Conference on Acoustics, Speech and Signal Processing","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121293894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interactive multimodal music transcription","authors":"J. Quereda, C. Pérez-Sancho","doi":"10.1109/ICASSP.2013.6637639","DOIUrl":"https://doi.org/10.1109/ICASSP.2013.6637639","url":null,"abstract":"Automatic music transcription has usually been performed as an autonomous task and its evaluation has been made in terms of precision, recall, accuracy, etc. Nevertheless, in this work, assuming that the state of the art is far from being perfect, it is considered as an interactive one, where an expert user is assisted in its work by a transcription tool. In this context, the performance evaluation of the system turns into an assessment of how many user interactions are needed to complete the work. The strategy is that the user interactions can be used by the system to improve its performance in an adaptive way, thus minimizing the workload. Also, a multimodal approach has been implemented, in such a way that different sources of information, like onsets, beats, and meter, are used to detect notes in a musical audio excerpt. The system is focused on monotimbral polyphonic transcription.","PeriodicalId":183968,"journal":{"name":"2013 IEEE International Conference on Acoustics, Speech and Signal Processing","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126560594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discriminatively trained Bayesian speaker comparison of i-vectors","authors":"B. J. Borgstrom, A. McCree","doi":"10.1109/ICASSP.2013.6639153","DOIUrl":"https://doi.org/10.1109/ICASSP.2013.6639153","url":null,"abstract":"This paper presents a framework for fully Bayesian speaker comparison of i-vectors. By generalizing the train/test paradigm, we derive an analytic expression for the speaker comparison log-likelihood ratio (LLR), as well as solutions for model training and Bayesian scoring. This framework is useful for enrollment sets of any size. For the specific case of single-cut enrollment, it is shown to be mathematically equivalent to probabilistic linear discriminant analysis (PLDA). Additionally, we present discriminative training of model hyper-parameters by minimizing the total cross entropy between LLRs and class labels. When applied to speaker recognition, significant performance gains are observed for various NIST SRE 2010 extended evaluation tasks.","PeriodicalId":183968,"journal":{"name":"2013 IEEE International Conference on Acoustics, Speech and Signal Processing","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125928860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tracking sparse signal sequences from nonlinear/non-Gaussian measurements and applications in illumination-motion tracking","authors":"Rituparna Sarkar, Samarjit Das, Namrata Vaswani","doi":"10.1109/ICASSP.2013.6638941","DOIUrl":"https://doi.org/10.1109/ICASSP.2013.6638941","url":null,"abstract":"In this work, we develop algorithms for tracking time sequences of sparse spatial signals with slowly changing sparsity patterns, and other unknown states, from a sequence of nonlinear observations corrupted by (possibly) non-Gaussian noise. A key example of the above problem occurs in tracking moving objects across spatially varying illumination changes, where motion is the small dimensional state while the illumination image is the sparse spatial signal satisfying the slow-sparsity-pattern-change property.","PeriodicalId":183968,"journal":{"name":"2013 IEEE International Conference on Acoustics, Speech and Signal Processing","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130934329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}