Holy Quran Kurdish Sorani translation dataset for language modelling

Impact factor: 1.0; JCR quartile: Q3; Category: Multidisciplinary Sciences

Data in Brief. Pub Date: 2025-04-03. DOI: 10.1016/j.dib.2025.111533

Muhammad Bamoki, Shakhawan Hares Wady, Soran Badawi

Data in Brief, volume 60, Article 111533. URL: https://www.sciencedirect.com/science/article/pii/S2352340925002653

Citations: 0

Abstract

The Holy Quran serves as a foundational text in Islamic theology and has been translated into numerous languages across the globe. This paper introduces a manual translation of the Holy Quran into the Kurdish language, specifically designed to aid natural language processing (NLP) research and linguistic analysis. The translation process employed a thorough methodology that combined advanced linguistic tools with the expertise of bilingual religious scholars, translators, and professional proofreaders over several years. Careful attention was given to maintaining both semantic accuracy and theological precision, ensuring a faithful representation of the original Arabic text. The dataset comprises two primary files: a raw translation and a refined linguistic version. We performed various statistical analyses, including the identification of the top 20 most frequent words, a comparative analysis of verse lengths between the Kurdish and Arabic versions, and an evaluation of unique word distributions in both the raw and processed texts. This Kurdish Quran translation dataset represents a significant resource for computational linguistics, particularly in the development of neural machine translation models and in linguistic research focused on under-resourced languages.
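The three statistics named in the abstract (most frequent words, verse-length comparison, and unique-word counts) can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not confirm: verses are available as aligned lists of strings, tokens are obtained by plain whitespace splitting, and verse length is measured in tokens. The function names and the toy placeholder verses are hypothetical and do not reflect the dataset's actual files or the authors' tooling.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Split on whitespace. Assumption: not necessarily the authors' tokenizer."""
    return text.split()


def top_words(verses: list[str], k: int = 20) -> list[tuple[str, int]]:
    """Return the k most frequent tokens across all verses."""
    counts = Counter(tok for verse in verses for tok in tokenize(verse))
    return counts.most_common(k)


def verse_length_pairs(kurdish: list[str], arabic: list[str]) -> list[tuple[int, int]]:
    """Pair the token length of each Kurdish verse with its Arabic counterpart."""
    if len(kurdish) != len(arabic):
        raise ValueError("verse lists must be aligned one-to-one")
    return [(len(tokenize(k)), len(tokenize(a))) for k, a in zip(kurdish, arabic)]


def unique_word_count(verses: list[str]) -> int:
    """Count distinct tokens, e.g. to compare the raw and refined files."""
    return len({tok for verse in verses for tok in tokenize(verse)})


if __name__ == "__main__":
    # Toy placeholder "verses" (not real data) standing in for the two languages.
    kurdish_toy = ["a b a", "b a c"]
    arabic_toy = ["x y", "z"]
    print(top_words(kurdish_toy, k=2))
    print(verse_length_pairs(kurdish_toy, arabic_toy))
    print(unique_word_count(kurdish_toy))
```

Running `unique_word_count` on both the raw and the refined file would give the kind of raw-versus-processed comparison the abstract mentions. A real analysis of Sorani text would likely need script-aware normalization before tokenizing, which this sketch omits.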




Source journal

Data in Brief (Multidisciplinary Sciences)

CiteScore: 3.10
Self-citation rate: 0.00%
Articles published: 996
Review time: 70 days

Journal description: Data in Brief provides a way for researchers to easily share and reuse each other's datasets by publishing data articles that:
- Thoroughly describe your data, facilitating reproducibility.
- Make your data, which is often buried in supplementary material, easier to find.
- Increase traffic towards associated research articles and data, leading to more citations.
- Open up doors for new collaborations.
Because you never know what data will be useful to someone else, Data in Brief welcomes submissions that describe data from all research areas.