{"title":"Understanding Scores in Forensic Speaker Recognition","authors":"W. Campbell, K. Brady, J. Campbell, R. Granville, D. Reynolds","doi":"10.1109/ODYSSEY.2006.248091","DOIUrl":"https://doi.org/10.1109/ODYSSEY.2006.248091","url":null,"abstract":"Recent work in forensic speaker recognition has introduced many new scoring methodologies. First, confidence scores (posterior probabilities) have become a useful method of presenting results to an analyst. The introduction of an objective measure of confidence score quality, the normalized cross entropy, has resulted in a systematic manner of evaluating and designing these systems. A second scoring methodology that has become popular is support vector machines (SVMs) for high-level features. SVMs are accurate and produce excellent results across a wide variety of token types-words, phones, and prosodic features. In both cases, an analyst may be at a loss to explain the significance and meaning of the score produced by these methods. We tackle the problem of interpretation by exploring concepts from the statistical and pattern classification literature. In both cases, our preliminary results show interesting aspects of scores not obvious from viewing them \"only as numbers\"","PeriodicalId":215883,"journal":{"name":"2006 IEEE Odyssey - The Speaker and Language Recognition Workshop","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124407905","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LZW Based Distance Measures for Spoken Language Identification","authors":"S. Basavaraja, T. Sreenivas","doi":"10.1109/ODYSSEY.2006.248103","DOIUrl":"https://doi.org/10.1109/ODYSSEY.2006.248103","url":null,"abstract":"We present a new approach to spoken language modeling for language identification (LID) using the Lempel-Ziv-Welch (LZW) algorithm. The LZW technique is applicable to any kind of tokenization of the speech signal. Because of the efficiency of LZW algorithm to obtain variable length symbol strings in the training data, the LZW codebook captures the essentials of a language effectively. We develop two new deterministic measures for LID based on the LZW algorithm namely: (i) Compression ratio score (LZW-CR) and (ii) weighted discriminant score (LZW-WDS). To assess these measures, we consider error-free tokenization of speech as well as artificially induced noise in the tokenization. It is shown that for a 6 language LID task of OGI-TS database with clean tokenization, the new model (LZW-WDS) performs slightly better than the conventional bigram model. For noisy tokenization, which is the more realistic case, LZW-WDS significantly outperforms the bigram technique","PeriodicalId":215883,"journal":{"name":"2006 IEEE Odyssey - The Speaker and Language Recognition Workshop","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127213212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Geometry of the Channel Space in GMM-Based Speaker Recognition","authors":"P. Kenny, Gilles Boulianne, P. Ouellet, P. Dumouchel","doi":"10.1109/ODYSSEY.2006.248137","DOIUrl":"https://doi.org/10.1109/ODYSSEY.2006.248137","url":null,"abstract":"We describe an extension of the joint factor analysis model of speaker and channel variability in which channel supervectors are modeled by mixtures of low-rank Gaussians rather than by a unimodal Gaussian. This version of the joint factor analysis model includes data-driven feature mapping and the standard joint factor analysis models as limiting cases and it enables us to explore a range of possibilities between these two extremes. Our experimental results indicate that unimodal models of relatively high rank perform better than mixture models of lower rank and they confirm the appropriateness of the unimodal assumption in the standard joint factor analysis model","PeriodicalId":215883,"journal":{"name":"2006 IEEE Odyssey - The Speaker and Language Recognition Workshop","volume":"320 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127277545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speaker Recognition Score-Normalization to Compensate for SNR and Duration","authors":"J. Harmse, S. Beck, H. Nakasone","doi":"10.1109/ODYSSEY.2006.248092","DOIUrl":"https://doi.org/10.1109/ODYSSEY.2006.248092","url":null,"abstract":"The decision criterion for automatic speaker verification tests is based on minimization of a weighted sum of the miss and false alarm probabilities. These probabilities are derived from an evaluation of claimant and impostor scores using a representative population of recorded speech samples. However, in applications such as forensic speaker verification, the signal quality and the recording conditions of the speech samples are usually unknown and generally not matched to the evaluation conditions for the defined error probabilities. For example, test samples are often of short duration, have significant noise, and are from uncertain channels. It is therefore necessary to normalize the speaker test scores or to adjust detection thresholds in accordance with the recorded signal conditions. Instead of accounting for all possibilities, evaluations were conducted for a few specific joint combinations of signal-to-noise ratio (SNR) and speech duration for both the training and test sets. A composite regression model was developed to predict the necessary adjustments for any measured value of these conditions. In addition, a method is discussed to interpret the normalized scores relative to a set of desired Type I and Type II error probabilities","PeriodicalId":215883,"journal":{"name":"2006 IEEE Odyssey - The Speaker and Language Recognition Workshop","volume":"140 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116419970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Suspect-Adapted MAP Estimation of Within-Source Distributions in Generative Likelihood Ratio Estimation","authors":"D. Ramos, J. González-Rodríguez, A. Montero-Asenjo, J. Ortega-Garcia","doi":"10.1109/ODYSSEY.2006.248090","DOIUrl":"https://doi.org/10.1109/ODYSSEY.2006.248090","url":null,"abstract":"In this paper, a novel suspect-adaptive technique for robust Bayesian forensic speaker recognition via maximum a posteriori (MAP) estimation is presented, which addresses likelihood ratio (LR) computation in limited suspect speech data conditions obtaining good calibration performance. Robustness is achieved by the use of speaker-independent information, adapting it to the specificities of the suspect involved in the process. Thus, this procedure allows the system to weight the relevance of the suspect specificities depending on the amount of suspect data available via MAP estimation. Experimental results show robustness to suspect data scarcity and stable performance for any amount of suspect material. Also, the proposed technique outperforms other previously proposed non-adaptive approaches. Results are presented as discrimination capabilities (DET plots), distributions of LRs (Tippett plots) and expected cost of wrong decisions over any prior or decision cost (Cllr). The use of such evaluation metrics allows us to highlight the importance of LR calibration in the performance of a forensic system","PeriodicalId":215883,"journal":{"name":"2006 IEEE Odyssey - The Speaker and Language Recognition Workshop","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130008373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Speaker Segmentation and Clustering using Gender Information","authors":"Brian M. Ore, Raymond E. Slyh, Eric G. Hansen","doi":"10.1109/ODYSSEY.2006.248125","DOIUrl":"https://doi.org/10.1109/ODYSSEY.2006.248125","url":null,"abstract":"This paper considers the segmentation and clustering of conversational speech for the two-wire training (3conv2w) and two-wire testing (1conv2w) conditions of the NIST 2005 speaker recognition evaluation. A notable feature of the system described is that each file is labeled as containing either opposite- or same-gender speakers. The speech segments for opposite-gender files are clustered by gender, while those for same-gender files are processed by agglomerative clustering. By using gender information in the clustering of the opposite-gender files, the equal error rate in the 3conv2w training condition was reduced from 15.2% to 9.9%. For the 1conv2w testing condition, clustering opposite-gender files by gender did not improve performance over agglomerative clustering; however, it was over 100 times faster than agglomerative clustering on the opposite-gender files","PeriodicalId":215883,"journal":{"name":"2006 IEEE Odyssey - The Speaker and Language Recognition Workshop","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133174619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compensating for Mismatch in High-Level Speaker Recognition","authors":"William M. Campbell","doi":"10.1109/ODYSSEY.2006.248110","DOIUrl":"https://doi.org/10.1109/ODYSSEY.2006.248110","url":null,"abstract":"Speaker recognition using high-level features has been a successful area of exploration. Features obtained from many different levels-phones, words, prosodic events, etc.-are used to characterize the speaker. A good modeling technique for these features is the support vector machine (SVM). SVMs model the n-gram frequencies from speaker utterances in a high-dimensional SVM feature space and have shown excellent performance over a wide variety of high-level features. A complementary method of recent exploration in SVM speaker recognition is the use of nuisance attribute projection (NAP). NAP removes directions from SVM feature space that are superfluous to the task of speaker recognition-channel information, session variability, etc. In this paper, we consider the application of NAP to high-level speaker recognition. We describe the difficulties in applying this method and propose solutions. We also conduct experiments showing that NAP can reduce variability in SVM feature space leading to improved performance","PeriodicalId":215883,"journal":{"name":"2006 IEEE Odyssey - The Speaker and Language Recognition Workshop","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124536606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Support Vector Gmms for Speaker Verification","authors":"N. Dehak, G. Chollet","doi":"10.1109/ODYSSEY.2006.248131","DOIUrl":"https://doi.org/10.1109/ODYSSEY.2006.248131","url":null,"abstract":"This article presents a new approach using the discrimination power of support vector machines (SVMs) in combination with Gaussian mixture models (GMMs) for automatic speaker verification (ASV). In this combination SVMs are applied in the GMM model space. Each point of this space represents a GMM speaker model. The kernel which is used for the SVM allows the computation of a similarity between GMM models. It was calculated using the Kullback-Leibler (KL) divergence. The results of this new approach show a clear improvement compared to a simple GMM system on the NIST2005 Speaker Recognition Evaluation primary task","PeriodicalId":215883,"journal":{"name":"2006 IEEE Odyssey - The Speaker and Language Recognition Workshop","volume":"105 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117184231","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MASC: A Speech Corpus in Mandarin for Emotion Analysis and Affective Speaker Recognition","authors":"Tian Wu, Yingchun Yang, Zhaohui Wu, Dongdong Li","doi":"10.1109/ODYSSEY.2006.248084","DOIUrl":"https://doi.org/10.1109/ODYSSEY.2006.248084","url":null,"abstract":"In this paper, a large emotional speech database MASC (Mandarin affective speech corpus) is introduced. The database contains recordings of 68 native speakers (23 female and 45 male) and five kinds of emotional states: neutral, anger, elation, panic and sadness. Each speaker pronounces 5 phrases and 10 sentences three times for each emotional state, and 2 paragraphs only for neutral. These materials cover all the phonemes in Chinese. This corpus is constructed for prosodic and linguistic investigation of emotion expression in Mandarin. It can also be used for recognition of affectively stressed speakers. Furthermore, prosodic feature analysis and a speaker recognition baseline experiment are performed on this database","PeriodicalId":215883,"journal":{"name":"2006 IEEE Odyssey - The Speaker and Language Recognition Workshop","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129856585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Adaptive Score Normalization for Noise Robustness Speaker Verification on Cellular phone","authors":"Wei Huang, Yaxin Zhang","doi":"10.1109/ODYSSEY.2006.248140","DOIUrl":"https://doi.org/10.1109/ODYSSEY.2006.248140","url":null,"abstract":"Most commonly used score normalization methods can improve the performance of speaker verification systems, but need extra speech data or cohort models, more memory and computation MIPS. In this paper we present a low-cost adaptive online score normalization (LAOSN) method to improve the performance of speaker verification without any extra data. The computation and memory cost of LAOSN is very small. The procedure begins with initialization of the normalization parameters with existing scores of enrolment utterances from a given enrolment speaker model, and the normalization parameters are then updated online with the scores of subsequent test utterances. By this means, an accurate estimation of the unknown score distribution is achieved to normalize the current test score. Experiments on the Polycost corpus suggest that the LAOSN can achieve much better performance compared to the well-known Z-norm method without any extra memory and computation cost","PeriodicalId":215883,"journal":{"name":"2006 IEEE Odyssey - The Speaker and Language Recognition Workshop","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125668836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}