Performance assessment of voice conversion models using speech production-based parameters

Impact Factor 3.4 · Zone 3 (Computer Science) · JCR Q2: COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Computer Speech and Language Pub Date : 2025-06-28 DOI:10.1016/j.csl.2025.101853

Ashwini Dasare, K.T. Deepak

Volume 95, Article 101853 · https://www.sciencedirect.com/science/article/pii/S0885230825000786

Citations: 0

Performance assessment of voice conversion models using speech production-based parameters

Voice Conversion (VC) transforms a source voice to sound like a target voice. However, the field requires more standardized objective metrics to evaluate its performance thoroughly. Traditional evaluation methods, such as Mel-Cepstral Distortion (MCD), F0-Root Mean Squared Error (F0RMSE), and Modulated-Spectral Distance (MSD), primarily focus on perceptual features and often overlook speech production attributes. This can result in a mismatch between perceived voice similarity and the physiological aspects of the voice, leading to a reliance on subjective methods like the Mean Opinion Score (MOS). While MOS provides valuable insights, it is resource-intensive and inherently subjective, limiting its practicality for widespread use. This research proposes an objective framework for evaluating voice quality in VC tasks by focusing on key speech production parameters, including jitter, shimmer, harmonics-to-noise ratio, and vocal tract length. Our findings suggest that these parameters, which encapsulate the distinct characteristics of a speaker’s voice, provide a more precise basis for assessing perceptual similarity between converted and target voices. Compared to traditional objective metrics like MCD, MSD, and F0RMSE, and to non-intrusive measures like MOSNet and UTMOS, our proposed method consistently shows a correlation with MOS, suggesting that it better aligns with subjective evaluations of voice quality. This presents a more reliable and practical alternative to conventional methods that primarily emphasize perceptual features. This study evaluates how well different VC models, such as StarGANv2-VC, Retrieval-based VC, Suno-Bark, and Diff-VC, replicate speech production parameters across various languages and accents, including English, Kannada, Hindi, and the low-resource Soliga language. The results provide insights into improving the evaluation of voice conversion technologies by focusing on speech production attributes, helping to bridge the gap between perceptual similarity and physiological accuracy. The proposed work lays the groundwork for developing standardized, objective evaluation methods for VC models based on speech production characteristics.
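The abstract names four speech production parameters but does not give their formulas. The following is a minimal sketch of how such parameters are commonly computed, assuming the textbook definitions (local jitter and shimmer as relative mean absolute perturbation, HNR from the peak of the normalized autocorrelation, and vocal tract length from a uniform tube closed at the glottis). The function names, the speed-of-sound constant, and the `relative_gap` comparison are illustrative assumptions, not the authors' implementation, and the inputs (pitch periods, peak amplitudes, formants) are assumed to come from a separate analysis tool.

```python
import math

# Assumed speed of sound in warm, moist air (cm/s); a common textbook value.
SPEED_OF_SOUND_CM_S = 35000.0


def local_perturbation(values):
    """Mean absolute difference of consecutive values, divided by the mean value."""
    if len(values) < 2:
        raise ValueError("need at least two cycles")
    diffs = [abs(b - a) for a, b in zip(values, values[1:])]
    return (sum(diffs) / len(diffs)) / (sum(values) / len(values))


def jitter_local(periods_s):
    """Cycle-to-cycle variation of the pitch period (ratio, not percent)."""
    return local_perturbation(periods_s)


def shimmer_local(peak_amplitudes):
    """Cycle-to-cycle variation of the peak amplitude (ratio, not percent)."""
    return local_perturbation(peak_amplitudes)


def hnr_db(r_max):
    """Harmonics-to-noise ratio from the normalized autocorrelation peak r_max."""
    if not 0.0 < r_max < 1.0:
        raise ValueError("r_max must lie strictly between 0 and 1")
    return 10.0 * math.log10(r_max / (1.0 - r_max))


def vocal_tract_length_cm(formants_hz):
    """Average tube length implied by each formant: L = (2n - 1) * c / (4 * F_n)."""
    estimates = [
        (2 * n - 1) * SPEED_OF_SOUND_CM_S / (4.0 * f)
        for n, f in enumerate(formants_hz, start=1)
    ]
    return sum(estimates) / len(estimates)


def relative_gap(target, converted):
    """Hypothetical comparison: per-parameter |converted - target| / |target|."""
    return {k: abs(converted[k] - target[k]) / abs(target[k]) for k in target}


if __name__ == "__main__":
    # Made-up illustrative measurements for a target and a converted utterance.
    target = {
        "jitter": jitter_local([0.0100, 0.0101, 0.0100, 0.0102]),
        "shimmer": shimmer_local([0.80, 0.82, 0.79, 0.81]),
        "hnr_db": hnr_db(0.98),
        "vtl_cm": vocal_tract_length_cm([500.0, 1500.0, 2500.0]),
    }
    converted = {
        "jitter": jitter_local([0.0100, 0.0104, 0.0099, 0.0103]),
        "shimmer": shimmer_local([0.80, 0.86, 0.77, 0.83]),
        "hnr_db": hnr_db(0.95),
        "vtl_cm": vocal_tract_length_cm([520.0, 1540.0, 2600.0]),
    }
    for name, gap in relative_gap(target, converted).items():
        print(f"{name}: relative gap {gap:.3f}")
```

A smaller gap on every parameter would indicate that the converted voice reproduces the target's production characteristics more closely; how the paper aggregates or weights these parameters is not stated in the abstract.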


Source journal

Computer Speech and Language (Engineering & Technology: Computer Science, Artificial Intelligence)

CiteScore: 11.30

Self-citation rate: 4.70%

Review time: 22.9 weeks

Journal description: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.