Refining Synthesized Speech Using Speaker Information and Phone Masking for Data Augmentation of Speech Recognition

IF 5.1, CAS Zone 2 (Computer Science), JCR Q1 (Acoustics)

IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-09-03 DOI:10.1109/TASLP.2024.3451982

Sei Ueno;Akinobu Lee;Tatsuya Kawahara

Volume 32, pages 3924-3933. Full record: https://ieeexplore.ieee.org/document/10664004/

Citations: 0

Abstract

While end-to-end automatic speech recognition (ASR) has shown impressive performance, it requires a huge amount of speech and transcription data. The conversion of domain-matched text to speech (TTS) has been investigated as one approach to data augmentation. The quality and diversity of the synthesized speech are critical in this approach. To ensure quality, a neural vocoder is widely used to generate speech waveforms in conventional studies, but it requires a huge amount of computation and another conversion to spectral-domain features such as the log-Mel filterbank (lmfb) output typically used for ASR. In this study, we explore the direct refinement of these features. Unlike conventional speech enhancement, we can use information on the ground-truth phone sequences of the speech and designated speaker to improve the quality and diversity. This process is realized as a Mel-to-Mel network, which can be placed after a text-to-Mel synthesis system such as FastSpeech 2. These two networks can be trained jointly. Moreover, semantic masking is applied to the lmfb features for robust training. Experimental evaluations demonstrate the effect of phone information, speaker information, and semantic masking. For speaker information, x-vector performs better than the simple speaker embedding. The proposed method achieves even better ASR performance with a much shorter computation time than the conventional method using a vocoder.
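The abstract does not spell out how semantic masking is applied to the lmfb features, so the following is only a minimal sketch of one plausible reading: whole phone segments are selected at random and their frames are replaced by the utterance-level mean. The function name `mask_phone_segments`, the `mask_prob` parameter, the mean-fill choice, and the assumption that phone boundaries are available as `(start_frame, end_frame)` pairs are all hypothetical and are not taken from the paper.

```python
import random
from typing import List, Optional, Sequence, Tuple


def mask_phone_segments(
    lmfb: Sequence[Sequence[float]],
    segments: Sequence[Tuple[int, int]],
    mask_prob: float = 0.15,
    rng: Optional[random.Random] = None,
) -> List[List[float]]:
    """Return a copy of `lmfb` with randomly chosen phone segments masked.

    lmfb:      frames x mel-bins log-Mel filterbank features.
    segments:  (start_frame, end_frame) per phone, end exclusive
               (assumed to come from an alignment; hypothetical input).
    mask_prob: probability of masking each phone segment (hypothetical).
    """
    rng = rng or random.Random()
    num_frames = len(lmfb)
    if num_frames == 0:
        return []
    num_bins = len(lmfb[0])

    # Per-bin mean over the utterance, used as the fill value (an assumption).
    mean = [sum(frame[b] for frame in lmfb) / num_frames for b in range(num_bins)]

    # Copy so the caller's features are left untouched.
    masked = [list(frame) for frame in lmfb]
    for start, end in segments:
        if rng.random() < mask_prob:
            # Mask every frame of the selected phone, clipped to valid range.
            for t in range(max(start, 0), min(end, num_frames)):
                masked[t] = list(mean)
    return masked
```

Masking at the level of whole phones, rather than arbitrary time spans, is what distinguishes this kind of masking from generic time masking: the model has to recover an entire linguistic unit from its context.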

Source journal

IEEE/ACM Transactions on Audio, Speech, and Language Processing (ACOUSTICS; ENGINEERING, ELECTRICAL & ELECTRONIC)

CiteScore: 11.30

Self-citation rate: 11.10%

Articles published: 217

Journal description: The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.