Balinese text-to-speech dataset as digital cultural heritage

Impact factor: 1.0 | JCR quartile: Q3 | Category: Multidisciplinary Sciences
I Gusti Agung Gede Arya Kadyanan, Ngurah Agus Sanjaya ER, Anak Agung Istri Ngurah Eka Karyawati, I Gede Ngurah Arya Wira Putra, I Made Suma Gunawan, Ni Made Julia Budiantari, Hana Christine Octavia
Data in Brief, vol. 60, Article 111528. Published 2025-04-09.
DOI: 10.1016/j.dib.2025.111528
URL: https://www.sciencedirect.com/science/article/pii/S2352340925002604
Citations: 0

Abstract

The Balinese language has a complex and distinctive system of speech levels, yet it remains underrepresented in speech technologies such as text-to-speech (TTS) and speech recognition. Although Balinese is one of the linguistically rich regional languages, efforts to digitize it are still underdeveloped, which limits both natural language processing (NLP) research and the deployment of voice technologies for regional languages. The scarcity of Balinese speech datasets is a major obstacle to developing these technologies. This work therefore builds a dataset of audio recordings from native Balinese speakers, covering multiple speech levels, to support TTS systems, speech recognition, and voice-to-text applications. The data were collected by recording native speakers of the Badung dialect. The recordings were denoised to improve audio quality and then categorized by Balinese politeness level (Alus Singgih, Alus Sor, Alus Mider, Mider, and Andap); additional phrases and letters of the alphabet were included to broaden the dataset. The resulting dataset comprises 1187 recordings that reflect a wide range of social variation in Balinese. Beyond contributing to speech technology development, this resource supports the preservation of Balinese in the digital age and opens further NLP research opportunities for low-resource languages.
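As a rough illustration of the categorization step described above, the following Python sketch tallies recordings per politeness level. It is only a sketch under stated assumptions: the five level names come from the abstract, but the `Recording` fields, the "Phrases" and "Alphabet" labels, the file paths, and the sample entries are all hypothetical and do not reflect the actual layout of the published dataset.

```python
from collections import Counter
from dataclasses import dataclass

# The five politeness levels are named in the abstract. "Phrases" and
# "Alphabet" are assumed labels for the additional material it mentions.
CATEGORIES = (
    "Alus Singgih", "Alus Sor", "Alus Mider", "Mider", "Andap",
    "Phrases", "Alphabet",
)


@dataclass(frozen=True)
class Recording:
    path: str        # hypothetical relative path to a denoised audio file
    transcript: str  # text spoken in the recording
    category: str    # expected to be one of CATEGORIES


def validate(recordings):
    """Raise ValueError if any recording carries an unknown category."""
    for rec in recordings:
        if rec.category not in CATEGORIES:
            raise ValueError(f"unknown category {rec.category!r} for {rec.path}")


def count_by_category(recordings):
    """Return the number of recordings per category (zero if none)."""
    validate(recordings)
    counts = Counter(rec.category for rec in recordings)
    return {cat: counts.get(cat, 0) for cat in CATEGORIES}


# Placeholder entries for demonstration only, not real dataset content.
sample = [
    Recording("andap/0001.wav", "<transcript 1>", "Andap"),
    Recording("andap/0002.wav", "<transcript 2>", "Andap"),
    Recording("alus_sor/0001.wav", "<transcript 3>", "Alus Sor"),
]
counts = count_by_category(sample)
print(counts)
```

A per-category tally like this is one simple way to check how evenly the speech levels are covered before using such a dataset for TTS training.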
Source journal
Data in Brief (Multidisciplinary Sciences)
CiteScore: 3.10
Self-citation rate: 0.00%
Articles published: 996
Review time: 70 days
Journal description: Data in Brief provides a way for researchers to easily share and reuse each other's datasets by publishing data articles that thoroughly describe your data, facilitating reproducibility; make your data, which is often buried in supplementary material, easier to find; increase traffic towards associated research articles and data, leading to more citations; and open up doors for new collaborations. Because you never know what data will be useful to someone else, Data in Brief welcomes submissions that describe data from all research areas.