Holy Quran Kurdish Sorani translation dataset for language modelling

Impact factor: 1.0 · JCR quartile: Q3 (Multidisciplinary Sciences)
Muhammad Bamoki, Shakhawan Hares Wady, Soran Badawi
Data in Brief, volume 60, Article 111533. Published 2025-04-03.
DOI: 10.1016/j.dib.2025.111533
https://www.sciencedirect.com/science/article/pii/S2352340925002653

Abstract

The Holy Quran serves as a foundational text in Islamic theology and has been translated into numerous languages across the globe. This paper introduces a manual translation of the Holy Quran into the Kurdish language, specifically designed to aid natural language processing (NLP) research and linguistic analysis. The translation process employed a thorough methodology that combined advanced linguistic tools with the expertise of bilingual religious scholars, translators, and professional proofreaders over several years. Careful attention was given to maintaining both semantic accuracy and theological precision, ensuring a faithful representation of the original Arabic text. The dataset comprises two primary files: a raw translation and a refined linguistic version. We performed various statistical analyses, including the identification of the top 20 most frequent words, a comparative analysis of verse lengths between the Kurdish and Arabic versions, and an evaluation of unique word distributions in both the raw and processed texts. This Kurdish Quran translation dataset represents a significant resource for computational linguistics, particularly in the development of neural machine translation models and in linguistic research focused on under-resourced languages.
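The three analyses named in the abstract (most frequent words, verse-length comparison, unique word counts) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the whitespace tokenizer, and the assumption that each version is available as a list of verse strings aligned by index are all hypothetical, since the abstract does not describe the file layout or the preprocessing used.

```python
from collections import Counter


def tokenize(verse):
    # Assumption: plain whitespace splitting. The paper's actual
    # tokenization and normalization steps are not given in the abstract.
    return verse.split()


def top_k_words(verses, k=20):
    # Count every token across all verses and return the k most frequent
    # as (word, count) pairs, highest count first.
    counts = Counter(tok for verse in verses for tok in tokenize(verse))
    return counts.most_common(k)


def unique_word_count(verses):
    # Number of distinct tokens (vocabulary size) in a list of verses.
    return len({tok for verse in verses for tok in tokenize(verse)})


def verse_lengths(verses):
    # Length of each verse measured in tokens.
    return [len(tokenize(verse)) for verse in verses]


def length_ratio(kurdish_verses, arabic_verses):
    # Ratio of total Kurdish tokens to total Arabic tokens over verses
    # aligned by index (hypothetical alignment; both lists must match).
    if len(kurdish_verses) != len(arabic_verses):
        raise ValueError("verse lists must be aligned and of equal length")
    arabic_total = sum(verse_lengths(arabic_verses))
    if arabic_total == 0:
        raise ValueError("Arabic verse list contains no tokens")
    return sum(verse_lengths(kurdish_verses)) / arabic_total
```

Running `top_k_words` and `unique_word_count` on both the raw and the refined file would give the kind of raw-versus-processed comparison the abstract mentions; a morphology-aware tokenizer for Sorani would likely change the counts.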
Source journal: Data in Brief (Multidisciplinary Sciences)
CiteScore: 3.10 · Self-citation rate: 0.00% · Articles published: 996 · Review time: 70 days
About the journal: Data in Brief provides a way for researchers to easily share and reuse each other's datasets by publishing data articles that:
- Thoroughly describe your data, facilitating reproducibility.
- Make your data, which is often buried in supplementary material, easier to find.
- Increase traffic towards associated research articles and data, leading to more citations.
- Open up doors for new collaborations.
Because you never know what data will be useful to someone else, Data in Brief welcomes submissions that describe data from all research areas.