{"title":"Speech-driven head motion generation from waveforms","authors":"JinHong Lu, Hiroshi Shimodaira","doi":"10.1016/j.specom.2024.103056","DOIUrl":null,"url":null,"abstract":"<div><p>Head motion generation task for speech-driven virtual agent animation is commonly explored with handcrafted audio features, such as MFCCs as input features, plus additional features, such as energy and F0 in the literature. In this paper, we study the direct use of speech waveform to generate head motion. We claim that creating a task-specific feature from waveform to generate head motion leads to better performance than using standard acoustic features to generate head motion overall. At the same time, we completely abandon the handcrafted feature extraction process, leading to more effectiveness. However, the difficulty of creating a task-specific feature from waveform is their staggering quantity of irrelevant information, implicating potential cumbrance for neural network training. Thus, we apply a canonical-correlation-constrained autoencoder (CCCAE), where we are able to compress the high-dimensional waveform into a low-dimensional embedded feature, with the minimal error in reconstruction, and sustain the relevant information with the maximal cannonical correlation to head motion. We extend our previous research by including more speakers in our dataset and also adapt with a recurrent neural network, to show the feasibility of our proposed feature. Through comparisons between different acoustic features, our proposed feature, <span><math><msub><mrow><mtext>Wav</mtext></mrow><mrow><mtext>CCCAE</mtext></mrow></msub></math></span>, shows at least a 20% improvement in the correlation from the waveform, and outperforms the popular acoustic feature, MFCC, by at least 5% respectively for all speakers. Through the comparison in the feedforward neural network regression (FNN-regression) system, the <span><math><msub><mrow><mtext>Wav</mtext></mrow><mrow><mtext>CCCAE</mtext></mrow></msub></math></span>-based system shows comparable performance in objective evaluation. In long short-term memory (LSTM) experiments, LSTM-models improve the overall performance in normalised mean square error (NMSE) and CCA metrics, and adapt the <span><math><msub><mrow><mtext>Wav</mtext></mrow><mrow><mtext>CCCAE</mtext></mrow></msub></math></span>feature better, which makes the proposed LSTM-regression system outperform the MFCC-based system. 
We also re-design the subjective evaluation, and the subjective results show the animations generated by models where <span><math><msub><mrow><mtext>Wav</mtext></mrow><mrow><mtext>CCCAE</mtext></mrow></msub></math></span>was chosen to be better than the other models by the participants of MUSHRA test.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"159 ","pages":"Article 103056"},"PeriodicalIF":2.4000,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167639324000281/pdfft?md5=3e4ce95ea878ead804890332c3362074&pid=1-s2.0-S0167639324000281-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Speech Communication","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167639324000281","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}
Citations: 0
Abstract
Head motion generation for speech-driven virtual agent animation is commonly explored with handcrafted audio features, such as MFCCs, as input, together with additional features such as energy and F0. In this paper, we study the direct use of the speech waveform to generate head motion. We claim that creating a task-specific feature from the waveform leads to better overall performance than using standard acoustic features, while completely abandoning the handcrafted feature-extraction process. The difficulty of creating a task-specific feature from the waveform, however, is its staggering quantity of irrelevant information, which can encumber neural-network training. We therefore apply a canonical-correlation-constrained autoencoder (CCCAE), which compresses the high-dimensional waveform into a low-dimensional embedded feature with minimal reconstruction error while retaining the information that has maximal canonical correlation with head motion. We extend our previous research by including more speakers in the dataset and by adopting a recurrent neural network to show the feasibility of the proposed feature. Through comparisons between different acoustic features, our proposed feature, Wav_CCCAE, shows at least a 20% improvement in correlation over the raw waveform and outperforms the popular acoustic feature MFCC by at least 5% for all speakers. In the feedforward neural network regression (FNN-regression) comparison, the Wav_CCCAE-based system shows comparable performance in objective evaluation. In the long short-term memory (LSTM) experiments, the LSTM models improve the overall performance on the normalised mean square error (NMSE) and canonical correlation analysis (CCA) metrics and exploit the Wav_CCCAE feature better, which makes the proposed LSTM-regression system outperform the MFCC-based system. We also redesign the subjective evaluation, and the results show that participants in the MUSHRA test preferred the animations generated by the Wav_CCCAE-based models over those from the other models.
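The core idea behind the Wav_CCCAE feature is the CCCAE objective: an autoencoder reconstruction loss on waveform frames combined with a term that rewards high canonical correlation between the embedding and the head-motion parameters. The sketch below (PyTorch) is a minimal illustration of such an objective under stated assumptions, not the authors' implementation; the layer sizes, frame and motion dimensions, weighting lambda, and regularisation eps are hypothetical values chosen for demonstration.

```python
# Minimal CCCAE sketch: an autoencoder trained to reconstruct waveform frames
# while keeping the embedding maximally canonically correlated with head motion.
# All dimensions and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CCCAE(nn.Module):
    def __init__(self, wav_dim=800, emb_dim=30):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(wav_dim, 256), nn.Tanh(),
                                     nn.Linear(256, emb_dim))
        self.decoder = nn.Sequential(nn.Linear(emb_dim, 256), nn.Tanh(),
                                     nn.Linear(256, wav_dim))

    def forward(self, wav_frames):
        z = self.encoder(wav_frames)   # low-dimensional embedded feature
        recon = self.decoder(z)        # waveform reconstruction
        return z, recon

def cca_score(x, y, eps=1e-4):
    """Sum of canonical correlations between mini-batch views x and y."""
    n = x.shape[0]
    x = x - x.mean(dim=0, keepdim=True)
    y = y - y.mean(dim=0, keepdim=True)
    sxx = (x.T @ x) / (n - 1) + eps * torch.eye(x.shape[1])
    syy = (y.T @ y) / (n - 1) + eps * torch.eye(y.shape[1])
    sxy = (x.T @ y) / (n - 1)
    # Whiten each view with the inverse Cholesky factor of its covariance;
    # the singular values of the whitened cross-covariance are the
    # canonical correlations.
    lx_inv = torch.linalg.inv(torch.linalg.cholesky(sxx))
    ly_inv = torch.linalg.inv(torch.linalg.cholesky(syy))
    t = lx_inv @ sxy @ ly_inv.T
    return torch.linalg.svdvals(t).sum()

def cccae_loss(model, wav_frames, head_motion, lam=0.1):
    z, recon = model(wav_frames)
    recon_loss = F.mse_loss(recon, wav_frames)
    # Subtracting the CCA score pushes the embedding towards head motion.
    return recon_loss - lam * cca_score(z, head_motion)

# Example usage with random stand-in data:
model = CCCAE()
wav = torch.randn(64, 800)     # a mini-batch of waveform frames
motion = torch.randn(64, 6)    # e.g. 6-dimensional head-pose parameters
loss = cccae_loss(model, wav, motion)
loss.backward()
```

After training, only the encoder would be kept: its output is the compact, motion-correlated feature that a downstream FNN or LSTM regression model maps to head motion.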
About the journal:
Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results.
The journal's primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.