利用动态深度学习规划文本到语音合成模型和数据集的开发

IF 5.2 2区 计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS
Hawraz A. Ahmad , Tarik A. Rashid
{"title":"利用动态深度学习规划文本到语音合成模型和数据集的开发","authors":"Hawraz A. Ahmad ,&nbsp;Tarik A. Rashid","doi":"10.1016/j.jksuci.2024.102131","DOIUrl":null,"url":null,"abstract":"<div><p>Synthesis of Text-to-speech (TTS) is a process that involves translating a natural language text into a speech. Speech synthesisers face a major challenge when recognizing the prosodic elements of written text, such as intonation (the rise and fall of the voice in speaking), and length. In contrast, continuous speech features are influenced by the personality and emotions of the artist. A database is maintained to store the synthesized speech pieces. Its output is determined by how similar the person utters the words and how capable they are of being implied. In the past few years, the field of text-to-speech synthesis has been heavily impacted by the emergence of deep learning, an AI technology that has gained widespread popularity. This review paper presents a taxonomy of models and architectures that are based on deep learning and discusses the various datasets that are utilised in the TTS process. It also covers the evaluation matrices that are commonly used. The paper ends with a look at the future directions of the system and reaches to some Deep learning models that give promising results in this field.</p></div>","PeriodicalId":48547,"journal":{"name":"Journal of King Saud University-Computer and Information Sciences","volume":"36 7","pages":"Article 102131"},"PeriodicalIF":5.2000,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1319157824002209/pdfft?md5=73c94f11cbc25ec7eb6841c1af93654a&pid=1-s2.0-S1319157824002209-main.pdf","citationCount":"0","resultStr":"{\"title\":\"Planning the development of text-to-speech synthesis models and datasets with dynamic deep learning\",\"authors\":\"Hawraz A. Ahmad ,&nbsp;Tarik A. Rashid\",\"doi\":\"10.1016/j.jksuci.2024.102131\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Synthesis of Text-to-speech (TTS) is a process that involves translating a natural language text into a speech. Speech synthesisers face a major challenge when recognizing the prosodic elements of written text, such as intonation (the rise and fall of the voice in speaking), and length. In contrast, continuous speech features are influenced by the personality and emotions of the artist. A database is maintained to store the synthesized speech pieces. Its output is determined by how similar the person utters the words and how capable they are of being implied. In the past few years, the field of text-to-speech synthesis has been heavily impacted by the emergence of deep learning, an AI technology that has gained widespread popularity. This review paper presents a taxonomy of models and architectures that are based on deep learning and discusses the various datasets that are utilised in the TTS process. It also covers the evaluation matrices that are commonly used. The paper ends with a look at the future directions of the system and reaches to some Deep learning models that give promising results in this field.</p></div>\",\"PeriodicalId\":48547,\"journal\":{\"name\":\"Journal of King Saud University-Computer and Information Sciences\",\"volume\":\"36 7\",\"pages\":\"Article 102131\"},\"PeriodicalIF\":5.2000,\"publicationDate\":\"2024-07-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S1319157824002209/pdfft?md5=73c94f11cbc25ec7eb6841c1af93654a&pid=1-s2.0-S1319157824002209-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of King Saud University-Computer and Information Sciences\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1319157824002209\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of King Saud University-Computer and Information Sciences","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1319157824002209","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

摘要

文本到语音合成(TTS)是一个将自然语言文本翻译成语音的过程。语音合成人员在识别书面文本的前音要素(如语调(说话时声音的起伏)和长度)时面临着巨大挑战。相比之下,连续语音特征会受到艺术家个性和情感的影响。合成语音片段会保存在一个数据库中。它的输出取决于发音人发音的相似程度以及发音人的暗示能力。在过去几年中,深度学习这一人工智能技术的出现对文本到语音合成领域产生了巨大影响,并得到了广泛普及。本综述论文介绍了基于深度学习的模型和架构分类法,并讨论了在 TTS 过程中使用的各种数据集。本文还介绍了常用的评估矩阵。文章最后展望了系统的未来发展方向,并介绍了一些在该领域取得良好效果的深度学习模型。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Planning the development of text-to-speech synthesis models and datasets with dynamic deep learning

Synthesis of Text-to-speech (TTS) is a process that involves translating a natural language text into a speech. Speech synthesisers face a major challenge when recognizing the prosodic elements of written text, such as intonation (the rise and fall of the voice in speaking), and length. In contrast, continuous speech features are influenced by the personality and emotions of the artist. A database is maintained to store the synthesized speech pieces. Its output is determined by how similar the person utters the words and how capable they are of being implied. In the past few years, the field of text-to-speech synthesis has been heavily impacted by the emergence of deep learning, an AI technology that has gained widespread popularity. This review paper presents a taxonomy of models and architectures that are based on deep learning and discusses the various datasets that are utilised in the TTS process. It also covers the evaluation matrices that are commonly used. The paper ends with a look at the future directions of the system and reaches to some Deep learning models that give promising results in this field.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
CiteScore
10.50
自引率
8.70%
发文量
656
审稿时长
29 days
期刊介绍: In 2022 the Journal of King Saud University - Computer and Information Sciences will become an author paid open access journal. Authors who submit their manuscript after October 31st 2021 will be asked to pay an Article Processing Charge (APC) after acceptance of their paper to make their work immediately, permanently, and freely accessible to all. The Journal of King Saud University Computer and Information Sciences is a refereed, international journal that covers all aspects of both foundations of computer and its practical applications.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信