Robust Detection of Vowel Onset and End Points
Avinash Kumar, S. Shahnawazuddin
2020 International Conference on Signal Processing and Communications (SPCOM), 2020-07-01
DOI: 10.1109/SPCOM50965.2020.9179535 (https://doi.org/10.1109/SPCOM50965.2020.9179535)
Citations: 1
Abstract
This paper presents a novel approach for detecting vowels, vowel onset points, and vowel end points. The study is motivated by the fact that some vowels carry a significant amount of spectral information even in the high-frequency region. Furthermore, high-pitched speakers such as adult females and children have relatively more high-frequency components than adult males. To capture that information effectively, we use linear-frequency cepstral coefficients (LFCC) alongside Mel-frequency cepstral coefficients (MFCC). MFCC features, being computed on the Mel scale, resolve high frequencies more coarsely than low frequencies, whereas LFCC features provide equal resolution at all frequencies; LFCC features therefore help resolve the high-frequency components as well. To detect vowels, two separate vowel/non-vowel classification systems based on deep learning architectures are developed, one using MFCC features and the other using LFCC features. Next, for any given test utterance, lattices are generated using the trained acoustic models. The start time, duration, and confidence score of each vowel/non-vowel occurrence are then extracted from the lattices. Weak evidence is discarded by applying a threshold to the confidence scores, which reduces spurious detections. Finally, the evidence obtained from the MFCC and LFCC systems is weighted by the respective confidence scores and combined. The proposed approach is observed to outperform existing ones. Using the detected vowel regions, we have also developed a simple scheme to determine whether a given speech utterance comes from an adult or a child speaker. The developed scheme is highly effective in discriminating between adult and child speakers.
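The thresholding and confidence-weighted combination described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the fusion rule, the threshold values, or the frame rate, so the equal-weight average, the `min_conf` and `decision` values, the 10 ms `frame_shift`, and the function names `evidence_contour` and `fuse` are all assumptions made for illustration. Each system's output is assumed to be a list of `(start_s, duration_s, confidence)` tuples for the segments it labels as vowels.

```python
import numpy as np

def evidence_contour(segments, n_frames, frame_shift=0.01, min_conf=0.5):
    """Turn (start_s, duration_s, confidence) vowel segments into a
    frame-level evidence contour. The threshold value is an assumption."""
    contour = np.zeros(n_frames)
    for start, dur, conf in segments:
        if conf < min_conf:
            continue  # discard weak evidence to reduce spurious detections
        first = int(round(start / frame_shift))
        last = min(int(round((start + dur) / frame_shift)), n_frames)
        # Weight the segment by its confidence; keep the strongest on overlap.
        contour[first:last] = np.maximum(contour[first:last], conf)
    return contour

def fuse(mfcc_segments, lfcc_segments, n_frames, decision=0.5, **kwargs):
    """Combine the two systems' evidence and return vowel regions as
    (onset_frame, end_frame) pairs, with end_frame exclusive.
    The equal-weight average is an assumed fusion rule."""
    combined = 0.5 * (evidence_contour(mfcc_segments, n_frames, **kwargs)
                      + evidence_contour(lfcc_segments, n_frames, **kwargs))
    is_vowel = combined >= decision
    # Pad with False so regions touching the utterance edges get both edges.
    padded = np.concatenate(([False], is_vowel, [False])).astype(int)
    edges = np.diff(padded)
    onsets = np.flatnonzero(edges == 1)   # rising edge: vowel onset point
    ends = np.flatnonzero(edges == -1)    # falling edge: vowel end point
    return list(zip(onsets.tolist(), ends.tolist()))
```

With these assumed settings, a region is reported only where the two systems together contribute enough confidence, so a segment proposed by a single system is not accepted on its own. Lowering `decision` would relax that behaviour.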