A Practical Way to Improve Automatic Phonetic Segmentation Performance
Wenjie Peng, Yingming Gao, Binghuai Lin, Jinsong Zhang
2021 12th International Symposium on Chinese Spoken Language Processing (ISCSLP), published 2021-01-24
DOI: 10.1109/ISCSLP49672.2021.9362107
Abstract
Automatic phonetic segmentation is a fundamental task for many applications. Segmentation systems rely heavily on the acoustic-phonetic relationship; however, the realization of phonemes varies in continuous speech. Segmentation systems therefore suffer from this variation, which comprises intra-phone dissimilarity and inter-phone similarity in acoustic properties. In this paper, we conducted experiments within the classic GMM-HMM framework to address these issues. In the baseline setup, we found that the top errors came from the diphthong /oy/ and from glide-to-vowel boundaries, respectively, which suggests the influence of the above variation on segmentation results. We present two approaches to improve automatic phonetic segmentation performance. First, we modeled intra-phone dissimilarity using GMMs with model selection at the state level. Second, we used context-dependent models to handle inter-phone similarity caused by coarticulation. The two approaches are combined with the goal of improving segmentation accuracy, and experimental results demonstrated their effectiveness on the aforementioned top errors. In addition, we took phone durations into account in the HMM topology design. After combining these refinements, segmentation accuracy on the TIMIT corpus was further improved to 91.32% of boundaries placed within 20 ms, a relative error reduction of 3.34% compared to the raw GMM-HMM segmentation in [1].
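The "91.32% within 20 ms" figure reflects the standard tolerance-based evaluation used for phonetic segmentation: a predicted boundary counts as correct if it lies within a fixed tolerance (here 20 ms) of the corresponding reference boundary. A minimal sketch of that metric, assuming a one-to-one pairing of predicted and reference boundaries (the function name and example times are hypothetical, not taken from the paper):

```python
def boundary_accuracy(ref_boundaries, hyp_boundaries, tolerance=0.020):
    """Fraction of predicted boundaries lying within `tolerance` seconds
    of the corresponding reference boundary (one-to-one pairing assumed)."""
    if len(ref_boundaries) != len(hyp_boundaries):
        raise ValueError("boundary lists must align one-to-one")
    hits = sum(abs(r - h) <= tolerance
               for r, h in zip(ref_boundaries, hyp_boundaries))
    return hits / len(ref_boundaries)

# Hypothetical boundary times (in seconds) for a short utterance
ref = [0.100, 0.250, 0.420, 0.600]
hyp = [0.105, 0.275, 0.430, 0.650]
print(boundary_accuracy(ref, hyp))  # 2 of 4 boundaries fall within 20 ms -> 0.5
```

In practice the tolerance is swept (10, 20, 30, 40 ms are common in the segmentation literature) to characterize how tightly a system places boundaries, with the 20 ms point, as reported here, the most frequently quoted single number.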