Performance assessment of voice conversion models using speech production-based parameters
Authors: Ashwini Dasare, K.T. Deepak
Journal: Computer Speech and Language, volume 95, Article 101853 (impact factor 3.1; JCR Q2, Computer Science, Artificial Intelligence)
Published: 2025-06-28
DOI: 10.1016/j.csl.2025.101853
URL: https://www.sciencedirect.com/science/article/pii/S0885230825000786
Voice Conversion (VC) transforms a source voice to sound like a target voice. However, the field still needs more standardized objective metrics to evaluate VC performance thoroughly. Traditional evaluation methods, such as Mel-Cepstral Distortion (MCD), F0 Root Mean Squared Error (F0RMSE), and Modulated-Spectral Distance (MSD), focus primarily on perceptual features and often overlook speech production attributes. This can produce a mismatch between perceived voice similarity and the physiological aspects of the voice, which in turn leads to reliance on subjective methods such as the Mean Opinion Score (MOS). While MOS provides valuable insights, it is resource-intensive and inherently subjective, which limits its practicality for widespread use. This research proposes an objective framework for evaluating voice quality in VC tasks based on key speech production parameters: jitter, shimmer, harmonics-to-noise ratio, and vocal tract length. Our findings suggest that these parameters, which encapsulate the distinct characteristics of a speaker’s voice, provide a more precise basis for assessing perceptual similarity between converted and target voices. Compared to traditional objective metrics such as MCD, MSD, and F0RMSE, as well as non-intrusive measures such as MOSNet and UTMOS, the proposed method correlates consistently with MOS, suggesting that it aligns better with subjective evaluations of voice quality. It therefore offers a more reliable and practical alternative to conventional methods that primarily emphasize perceptual features. The study evaluates how well different VC models, namely StarGANv2-VC, Retrieval-based VC, Suno-Bark, and Diff-VC, replicate speech production parameters across several languages and accents, including English, Kannada, Hindi, and the low-resource Soliga language. The results provide insights into improving the evaluation of voice conversion technologies by focusing on speech production attributes, helping to bridge the gap between perceptual similarity and physiological accuracy. The proposed work lays the groundwork for developing standardized, objective evaluation methods for VC models based on speech production characteristics.
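To make the speech production parameters named in the abstract more concrete, the following Python sketch computes three of them from quantities that are assumed to have been extracted already (pitch periods, per-cycle peak amplitudes, and formant frequencies). It uses common textbook definitions: local jitter and local shimmer as the mean absolute cycle-to-cycle difference divided by the mean, and a uniform-tube estimate of vocal tract length. These definitions, the function names, the speed-of-sound constant, and the sample values are all illustrative assumptions and are not taken from the paper, whose exact extraction procedure is not described in the abstract. Harmonics-to-noise ratio is left out because it requires waveform-level analysis.

```python
from statistics import mean

# Assumed speed of sound in the warm, moist air of the vocal tract (cm/s).
SPEED_OF_SOUND_CM_S = 35000.0


def local_jitter(periods):
    """Mean absolute difference of consecutive pitch periods over the mean period."""
    diffs = [abs(a - b) for a, b in zip(periods[1:], periods[:-1])]
    return mean(diffs) / mean(periods)


def local_shimmer(amplitudes):
    """Same ratio as local jitter, applied to per-cycle peak amplitudes."""
    diffs = [abs(a - b) for a, b in zip(amplitudes[1:], amplitudes[:-1])]
    return mean(diffs) / mean(amplitudes)


def vocal_tract_length_cm(formants_hz):
    """Uniform-tube estimate: F_n = (2n - 1) * c / (4 * L), averaged over formants."""
    estimates = [
        (2 * n - 1) * SPEED_OF_SOUND_CM_S / (4.0 * f)
        for n, f in enumerate(formants_hz, start=1)
    ]
    return mean(estimates)


def relative_gap(converted, target):
    """Relative deviation of a converted-voice parameter from the target value."""
    return abs(converted - target) / abs(target)


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    target_vtl = vocal_tract_length_cm([500.0, 1500.0, 2500.0])
    converted_vtl = vocal_tract_length_cm([520.0, 1560.0, 2600.0])
    print("target VTL (cm):", target_vtl)
    print("relative VTL gap:", relative_gap(converted_vtl, target_vtl))
    print("jitter:", local_jitter([0.010, 0.011, 0.010, 0.011]))
    print("shimmer:", local_shimmer([1.0, 0.9, 1.0, 0.9]))
```

A smaller relative gap for each parameter would indicate that the converted voice reproduces the target speaker's production characteristics more closely. How the paper combines such gaps into a single score is not stated in the abstract.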
About the journal:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.