KurdSum:一个新的库尔德文本摘要基准数据集

Natural Language Processing Journal Pub Date : 2023-12-01 DOI:10.1016/j.nlp.2023.100043

Soran Badawi

{"title":"KurdSum:一个新的库尔德文本摘要基准数据集","authors":"Soran Badawi","doi":"10.1016/j.nlp.2023.100043","DOIUrl":null,"url":null,"abstract":"<div><p>Summarizing a text is the process of condensing its content while still maintaining its essential information. With the abundance of digital information available, summarization has become a significant task in various fields, including information retrieval, NLP (Natural Language Processing), and machine learning. This task has been extensively studied in languages such as English and Chinese, but research on Kurdish language summarization is lacking. Therefore, we present the first-ever Kurdish summarization news dataset, KurdSum, which includes over 40,000 texts. We collected news articles from Kurdish websites, preprocessed the data, and manually created a summary for each article. We further assessed the performance of our benchmark dataset on four extractive systems (LEXRANK, TEXTRANK, ORACLE, and LEAD0-3) and three abstractive methods (Pointer-Generator, Sequence-to-Sequence and transformer-abstractive). Our experiments showed that the Pointer-Generator approach yielded superior ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores compared to other techniques and ORACLE outperformed other extractive methods. Our findings offer a promising direction for the summarization of Kurdish text and can contribute to developing NLP tools for processing the Kurdish language. Likewise, the dataset can serve as a benchmark dataset for Kurdish language summarization and a valuable resource for researchers interested in developing Kurdish summarization models.</p></div>","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"5 ","pages":"Article 100043"},"PeriodicalIF":0.0000,"publicationDate":"2023-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2949719123000407/pdfft?md5=521256c2a7cc54955d3efe4ae46c25e4&pid=1-s2.0-S2949719123000407-main.pdf","citationCount":"0","resultStr":"{\"title\":\"KurdSum: A new benchmark dataset for the Kurdish text summarization\",\"authors\":\"Soran Badawi\",\"doi\":\"10.1016/j.nlp.2023.100043\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Summarizing a text is the process of condensing its content while still maintaining its essential information. With the abundance of digital information available, summarization has become a significant task in various fields, including information retrieval, NLP (Natural Language Processing), and machine learning. This task has been extensively studied in languages such as English and Chinese, but research on Kurdish language summarization is lacking. Therefore, we present the first-ever Kurdish summarization news dataset, KurdSum, which includes over 40,000 texts. We collected news articles from Kurdish websites, preprocessed the data, and manually created a summary for each article. We further assessed the performance of our benchmark dataset on four extractive systems (LEXRANK, TEXTRANK, ORACLE, and LEAD0-3) and three abstractive methods (Pointer-Generator, Sequence-to-Sequence and transformer-abstractive). Our experiments showed that the Pointer-Generator approach yielded superior ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores compared to other techniques and ORACLE outperformed other extractive methods. Our findings offer a promising direction for the summarization of Kurdish text and can contribute to developing NLP tools for processing the Kurdish language. Likewise, the dataset can serve as a benchmark dataset for Kurdish language summarization and a valuable resource for researchers interested in developing Kurdish summarization models.</p></div>\",\"PeriodicalId\":100944,\"journal\":{\"name\":\"Natural Language Processing Journal\",\"volume\":\"5 \",\"pages\":\"Article 100043\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S2949719123000407/pdfft?md5=521256c2a7cc54955d3efe4ae46c25e4&pid=1-s2.0-S2949719123000407-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Natural Language Processing Journal\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2949719123000407\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Natural Language Processing Journal","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2949719123000407","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

摘要是在保持文本基本信息的同时，对文本内容进行浓缩的过程。随着数字信息的丰富，摘要已经成为各个领域的重要任务，包括信息检索、自然语言处理和机器学习。该任务已在英语和汉语等语言中进行了广泛的研究，但对库尔德语摘要的研究尚缺乏。因此，我们提出了第一个库尔德摘要新闻数据集，KurdSum，其中包括超过40,000个文本。我们从库尔德网站上收集新闻文章，对数据进行预处理，并手动为每篇文章创建摘要。我们进一步评估了基准数据集在四个提取系统(LEXRANK、TEXTRANK、ORACLE和LEAD0-3)和三个抽象方法(指针生成器、序列到序列和转换抽象)上的性能。我们的实验表明，与其他技术相比，指针生成器方法产生了更高的ROUGE(面向回忆的注册评估替补)分数，ORACLE优于其他提取方法。我们的研究结果为库尔德语文本的总结提供了一个有希望的方向，并有助于开发用于处理库尔德语的NLP工具。同样，该数据集可以作为库尔德语摘要的基准数据集和对开发库尔德语摘要模型感兴趣的研究人员的宝贵资源。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

KurdSum: A new benchmark dataset for the Kurdish text summarization

Summarizing a text is the process of condensing its content while still maintaining its essential information. With the abundance of digital information available, summarization has become a significant task in various fields, including information retrieval, NLP (Natural Language Processing), and machine learning. This task has been extensively studied in languages such as English and Chinese, but research on Kurdish language summarization is lacking. Therefore, we present the first-ever Kurdish summarization news dataset, KurdSum, which includes over 40,000 texts. We collected news articles from Kurdish websites, preprocessed the data, and manually created a summary for each article. We further assessed the performance of our benchmark dataset on four extractive systems (LEXRANK, TEXTRANK, ORACLE, and LEAD0-3) and three abstractive methods (Pointer-Generator, Sequence-to-Sequence and transformer-abstractive). Our experiments showed that the Pointer-Generator approach yielded superior ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores compared to other techniques and ORACLE outperformed other extractive methods. Our findings offer a promising direction for the summarization of Kurdish text and can contribute to developing NLP tools for processing the Kurdish language. Likewise, the dataset can serve as a benchmark dataset for Kurdish language summarization and a valuable resource for researchers interested in developing Kurdish summarization models.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Natural Language Processing Journal

自引率

0.00%

发文量