Speech stimulus continuum synthesis using deep learning methods

Impact factor: 3.0; CAS Zone 3 (Computer Science); JCR Q2 (Acoustics)

Speech Communication Pub Date : 2025-06-17 DOI:10.1016/j.specom.2025.103266

Zhu Li, Yuqing Zhang, Yanlu Xie

Speech Communication, vol. 173, Article 103266. https://www.sciencedirect.com/science/article/pii/S0167639325000810

Citations: 0

Abstract

Creating a naturalistic speech stimulus continuum (i.e., a series of stimuli equally spaced along a specific acoustic dimension between two given categories) is an indispensable step in categorical perception studies. A common method is to manually modify the key acoustic parameter of the speech sounds, yet the quality of the resulting synthetic speech remains unsatisfactory. This work explores how to use deep learning techniques for speech stimulus continuum synthesis, with the aim of improving the naturalness of the synthesized continuum. Drawing on recent advances in speech disentanglement learning, we implement a supervised disentanglement framework based on adversarial training (AT) that separates the target acoustic feature (e.g., fundamental frequency, formant features) from the other content in the speech signal, and we achieve controllable stimulus generation by sampling from the latent space of that feature. In addition, drawing on the idea of mutual information (MI) from information theory, we design an unsupervised MI-based framework that disentangles the target acoustic feature from the other content in the speech signal. Experiments on generating several continua validate the effectiveness of the proposed method in both objective and subjective evaluations.
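The definition of a continuum given in the abstract (a series of stimuli equally spaced between two categories) can be illustrated with a minimal sketch. This is not the authors' method: the abstract states only that stimuli are generated by sampling from the latent space of the key acoustic feature, so the linear interpolation used here, the function name make_continuum, and the two-dimensional endpoint vectors are all assumptions made for illustration. Equal spacing in a latent space also does not by itself guarantee equal spacing along the acoustic dimension.

```python
from typing import List, Sequence


def make_continuum(
    start: Sequence[float], end: Sequence[float], n_steps: int
) -> List[List[float]]:
    """Return n_steps vectors equally spaced on the line from start to end.

    Hypothetical helper: start and end stand in for the feature (or latent)
    vectors of the two endpoint categories. Linear interpolation is an
    assumption; the source does not specify a sampling scheme.
    """
    if n_steps < 2:
        raise ValueError("a continuum needs at least two steps (the endpoints)")
    if len(start) != len(end):
        raise ValueError("endpoint vectors must have the same length")

    continuum = []
    for i in range(n_steps):
        t = i / (n_steps - 1)  # 0.0 at the first endpoint, 1.0 at the second
        continuum.append([(1 - t) * a + t * b for a, b in zip(start, end)])
    return continuum


# Five steps between two made-up endpoint vectors.
steps = make_continuum([0.0, 10.0], [1.0, 20.0], 5)
for index, vector in enumerate(steps, start=1):
    print(index, vector)
```

In a disentanglement setup like the one described, each interpolated vector would then be passed, together with the unchanged representation of the remaining speech content, to a decoder that produces the stimulus; that step depends on the trained model and is not shown here.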


Source journal

Speech Communication (Engineering & Technology; Computer Science, Interdisciplinary Applications)

CiteScore: 6.80

Self-citation rate: 6.20%

Review time: 19.2 weeks

About the journal: Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are: • to present a forum for the advancement of human and human-machine speech communication science; • to stimulate cross-fertilization between different fields of this domain; • to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.