Spoken Language Recognition With Prosodic Features

IEEE Transactions on Audio, Speech and Language Processing, pp. 1841-1853. Pub Date: 2013-09-01. DOI: 10.1109/TASL.2013.2260157

Raymond W. M. Ng, Tan Lee, C. Leung, B. Ma, Haizhou Li


Citations: 23

Abstract

Speech prosody is believed to carry much language-specific information that can be used for spoken language recognition (SLR). In the past, prosodic features for SLR were studied only sporadically, and the reported performance was considered unsatisfactory. In this paper, we exploit a wide range of prosodic attributes for large-scale SLR tasks. These attributes describe the multifaceted variations of F0, intensity and duration in different spoken languages. Prosodic attributes are modeled with the bag-of-n-grams approach and a support vector machine (SVM), as in conventional phonotactic SLR systems. Experimental results on the OGI and NIST-LRE tasks show that the proposed attributes give significantly better SLR performance than previously reported prosodic systems. The full feature set includes 87 prosodic attributes, and some of them may be redundant. Attributes are broken down into particular bigrams called bins. Four entropy-based feature selection metrics with different selection criteria are derived. Selection can operate on individual bins or on whole attributes, each treated as a batch of bins, and it can be language-dependent or language-independent. By comparing different selection sizes and criteria, an optimal subset of 5,000 bins is found using a bin-level, language-independent criterion. Feature selection reduces the model size by a factor of 2.5 and the runtime by a factor of 6. The optimal subset of bins gives the lowest EER, 20.18%, on the NIST-LRE 2007 SLR task in a prosodic attribute model (PAM) system that models prosodic attributes exclusively. In a phonotactic-prosodic fusion SLR system, the detection cost, Cavg, is 2.09%. The relative detection cost reduction is 23%.
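The bin and selection ideas in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names (`bigram_bins`, `bin_entropy_scores`, `select_bins`) are hypothetical, the input is a toy sequence of quantized attribute symbols, and the scoring rule (entropy of a bin's relative frequency across languages, lower meaning more language-specific) is only one plausible entropy-based, bin-level, language-independent criterion. It is not one of the four metrics derived in the paper, and the SVM stage is omitted.

```python
from collections import Counter
from math import log2


def bigram_bins(symbols):
    """Count adjacent symbol pairs (bigram "bins") in one quantized attribute sequence."""
    return Counter(zip(symbols, symbols[1:]))


def _relative(counts):
    """Turn raw bin counts into relative frequencies (empty input gives an empty dict)."""
    total = sum(counts.values())
    return {b: c / total for b, c in counts.items()} if total else {}


def bin_entropy_scores(per_language_counts):
    """Score each bin by the entropy of its distribution across languages.

    Assumed criterion: a bin concentrated in few languages has low entropy
    and is treated as more discriminative.
    """
    rel = {lang: _relative(c) for lang, c in per_language_counts.items()}
    all_bins = set().union(*[r.keys() for r in rel.values()])
    scores = {}
    for b in all_bins:
        freqs = [rel[lang].get(b, 0.0) for lang in rel]
        mass = sum(freqs)
        probs = [f / mass for f in freqs if f > 0]
        scores[b] = -sum(p * log2(p) for p in probs)
    return scores


def select_bins(scores, k):
    """Keep the k lowest-entropy bins; ties are broken by bin label for determinism."""
    return sorted(scores, key=lambda b: (scores[b], b))[:k]
```

With per-language counts built by `bigram_bins`, calling `select_bins(bin_entropy_scores(counts), k)` returns the `k` bins retained for modeling. In the paper the retained subset has 5,000 bins; the toy criterion above only shows the shape of such a selection step.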




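Source journal

IEEE Transactions on Audio Speech and Language Processing (Engineering and Technology: Electrical and Electronic Engineering)

Self-citation rate

0.00%

Publication volume

Review time

24.0 months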

Journal description: The IEEE Transactions on Audio, Speech and Language Processing covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language. In particular, audio processing also covers auditory modeling, acoustic modeling and source separation. Speech processing also covers speech production and perception, adaptation, lexical modeling and speaker recognition. Language processing also covers spoken language understanding, translation, summarization, mining, general language modeling, as well as spoken dialog systems.