Raymond W. M. Ng, Tan Lee, C. Leung, B. Ma, Haizhou Li
{"title":"Spoken Language Recognition With Prosodic Features","authors":"Raymond W. M. Ng, Tan Lee, C. Leung, B. Ma, Haizhou Li","doi":"10.1109/TASL.2013.2260157","DOIUrl":null,"url":null,"abstract":"Speech prosody is believed to carry much language-specific information that can be used for spoken language recognition (SLR). In the past, the use of prosodic features for SLR has been studied sporadically and the reported performances were considered unsatisfactory. In this paper, we exploit a wide range of prosodic attributes for large-scale SLR tasks. These attributes describe the multifaceted variations of F0, intensity and duration in different spoken languages. Prosodic attributes are modeled by the bag of n-grams approach with support vector machine (SVM) as in the conventional phonotactic SLR systems. Experimental results on OGI and NIST-LRE tasks showed that the use of proposed attributes gives significantly better SLR performance than those previously reported. The full feature set includes 87 prosodic attributes and redundancy among attributes may exist. Attributes are broken down into particular bigrams called bins. Four entropy-based feature selection metrics with different selection criteria are derived. Attributes can be selected by individual bins, or by attributes as batches of bins. It can also be done in a language-dependent or language-independent manner. By comparing different selection sizes and criteria, an optimal attribute subset comprising 5,000 bins is found by using a bin-level language-independent criterion. Feature selection reduces model size by 2.5 times and shortens the runtime by 6 times. The optimal subset of bins gives the lowest EER of 20.18% on NIST-LRE 2007 SLR task in a prosodic attribute model (PAM) system which exclusively modeled prosodic attributes. In a phonotactic-prosodic fusion SLR system, the detection cost, Cavg is 2.09%. The relative detection cost reduction is 23%.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2013-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2260157","citationCount":"23","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Audio Speech and Language Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TASL.2013.2260157","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 23
Abstract
Speech prosody is believed to carry much language-specific information that can be used for spoken language recognition (SLR). In the past, the use of prosodic features for SLR has been studied sporadically and the reported performances were considered unsatisfactory. In this paper, we exploit a wide range of prosodic attributes for large-scale SLR tasks. These attributes describe the multifaceted variations of F0, intensity and duration in different spoken languages. Prosodic attributes are modeled by the bag of n-grams approach with support vector machine (SVM) as in the conventional phonotactic SLR systems. Experimental results on OGI and NIST-LRE tasks showed that the use of proposed attributes gives significantly better SLR performance than those previously reported. The full feature set includes 87 prosodic attributes and redundancy among attributes may exist. Attributes are broken down into particular bigrams called bins. Four entropy-based feature selection metrics with different selection criteria are derived. Attributes can be selected by individual bins, or by attributes as batches of bins. It can also be done in a language-dependent or language-independent manner. By comparing different selection sizes and criteria, an optimal attribute subset comprising 5,000 bins is found by using a bin-level language-independent criterion. Feature selection reduces model size by 2.5 times and shortens the runtime by 6 times. The optimal subset of bins gives the lowest EER of 20.18% on NIST-LRE 2007 SLR task in a prosodic attribute model (PAM) system which exclusively modeled prosodic attributes. In a phonotactic-prosodic fusion SLR system, the detection cost, Cavg is 2.09%. The relative detection cost reduction is 23%.
期刊介绍:
The IEEE Transactions on Audio, Speech and Language Processing covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language. In particular, audio processing also covers auditory modeling, acoustic modeling and source separation. Speech processing also covers speech production and perception, adaptation, lexical modeling and speaker recognition. Language processing also covers spoken language understanding, translation, summarization, mining, general language modeling, as well as spoken dialog systems.