{"title":"Phonetic Analysis of Self-supervised Representations of English Speech","authors":"Dan Wells, Hao Tang, Korin Richmond","doi":"10.21437/interspeech.2022-10884","DOIUrl":null,"url":null,"abstract":"We present an analysis of discrete units discovered via self-supervised representation learning on English speech. We focus on units produced by a pre-trained HuBERT model due to its wide adoption in ASR, speech synthesis, and many other tasks. Whereas previous work has evaluated the quality of such quantization models in aggregate over all phones for a given language, we break our analysis down into broad phonetic classes, taking into account specific aspects of their articulation when consid-ering their alignment to discrete units. We find that these units correspond to sub-phonetic events, and that fine dynamics such as the distinct closure and release portions of plosives tend to be represented by sequences of discrete units. Our work provides a reference for the phonetic properties of discrete units discovered by HuBERT, facilitating analyses of many speech applications based on this model.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3583-3587"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Interspeech","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21437/interspeech.2022-10884","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 8
Abstract
We present an analysis of discrete units discovered via self-supervised representation learning on English speech. We focus on units produced by a pre-trained HuBERT model due to its wide adoption in ASR, speech synthesis, and many other tasks. Whereas previous work has evaluated the quality of such quantization models in aggregate over all phones for a given language, we break our analysis down into broad phonetic classes, taking into account specific aspects of their articulation when consid-ering their alignment to discrete units. We find that these units correspond to sub-phonetic events, and that fine dynamics such as the distinct closure and release portions of plosives tend to be represented by sequences of discrete units. Our work provides a reference for the phonetic properties of discrete units discovered by HuBERT, facilitating analyses of many speech applications based on this model.