Exploring different acoustic modeling techniques for the detection of vowels in speech signal

2016 Twenty Second National Conference on Communication (NCC) | Pub Date: 2016-03-01 | DOI: 10.1109/NCC.2016.7561195

Avinash Kumar, S. Shahnawazuddin, G. Pradhan

Citations: 13

Abstract

Exploring different acoustic modeling techniques for the detection of vowels in speech signal

In this paper, we explore acoustic modeling techniques based on Gaussian mixture modeling (GMM), the subspace GMM (SGMM), and deep neural networks (DNN) for the detection of vowels in a given speech signal. At the outset, we develop a recognition system on the TIMIT database that recognizes the sequence of phonetic units present in a given speech sample. Two recognizers are developed using speech data sampled at 16 kHz and 8 kHz, respectively. The phone error rates (classification errors) of the two recognizers help in studying the effect of sampling rate on classifier performance. The experimental evaluations presented in this study show a slight deterioration in recognition performance when the speech data is re-sampled to 8 kHz. Next, a three-class classifier (vowel, non-vowel, and silence) is developed on the TIMIT database and its classification performance is studied. Using the three-class classifier, a given speech sample is then force-aligned against the trained acoustic model under the constraints of true or first-pass transcriptions to detect the vowel regions. The correctly detected and spurious vowel regions are analyzed in detail to find the impact of semivowel and nasal sound units on the detection of vowel regions as well as on the determination of vowel onset and end points. Among the explored acoustic modeling techniques, the SGMM-based system is observed to be superior to all other systems. Furthermore, for all the studied modeling techniques, the spurious cases are mostly due to semivowels being detected as vowels.
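The three-class mapping and the correct/spurious analysis described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a phone-level alignment is already available as `(start_sec, end_sec, label)` tuples. The vowel and silence label sets are an assumed grouping of TIMIT-style labels. The helper names (`to_three_class`, `vowel_regions`, `split_detections`), the merging of adjacent vowel segments, and the "any overlap counts as correct" rule are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

# Assumed grouping of TIMIT-style labels; the paper's exact lists are not given.
VOWELS = {"iy", "ih", "eh", "ey", "ae", "aa", "aw", "ay", "ah", "ao",
          "oy", "ow", "uh", "uw", "ux", "er", "ax", "ix", "axr", "ax-h"}
SILENCE = {"h#", "pau", "epi", "sil"}

Segment = Tuple[float, float, str]   # (start_sec, end_sec, phone label)
Region = Tuple[float, float]         # (onset_sec, end_sec)


def to_three_class(phone: str) -> str:
    """Collapse a phone label into vowel / non-vowel / silence."""
    if phone in VOWELS:
        return "vowel"
    if phone in SILENCE:
        return "silence"
    return "non-vowel"  # includes semivowels and nasals


def vowel_regions(alignment: List[Segment]) -> List[Region]:
    """Return vowel (onset, end) regions, merging back-to-back vowel segments."""
    regions: List[Region] = []
    for start, end, phone in alignment:
        if to_three_class(phone) != "vowel":
            continue
        if regions and abs(regions[-1][1] - start) < 1e-9:
            regions[-1] = (regions[-1][0], end)  # extend the previous region
        else:
            regions.append((start, end))
    return regions


def split_detections(detected: List[Region],
                     reference: List[Region]) -> Tuple[List[Region], List[Region]]:
    """Label a detected region correct if it overlaps any reference vowel
    region (assumed criterion), otherwise spurious."""
    correct: List[Region] = []
    spurious: List[Region] = []
    for d_start, d_end in detected:
        overlaps = any(min(d_end, r_end) - max(d_start, r_start) > 0
                       for r_start, r_end in reference)
        (correct if overlaps else spurious).append((d_start, d_end))
    return correct, spurious


if __name__ == "__main__":
    # Made-up reference alignment; "w" is a semivowel between two vowels.
    reference_alignment = [(0.00, 0.10, "h#"), (0.10, 0.18, "sh"),
                           (0.18, 0.30, "iy"), (0.30, 0.36, "w"),
                           (0.36, 0.50, "aa"), (0.50, 0.60, "h#")]
    reference = vowel_regions(reference_alignment)
    # Made-up detector output in which the semivowel was also marked as a vowel.
    detected = [(0.19, 0.29), (0.30, 0.35), (0.37, 0.50)]
    correct, spurious = split_detections(detected, reference)
    print("correct:", correct)
    print("spurious:", spurious)
```

In this toy input, the detection that falls entirely inside the semivowel "w" is reported as spurious, which mirrors the kind of error the abstract identifies as most common. A stricter criterion, such as a tolerance on onset and end points, could replace the overlap test.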

Source journal

2016 Twenty Second National Conference on Communication (NCC)

Self-citation rate

0.00%

Publication count