{"title":"KurdSum: A new benchmark dataset for the Kurdish text summarization","authors":"Soran Badawi","doi":"10.1016/j.nlp.2023.100043","DOIUrl":null,"url":null,"abstract":"Summarizing a text is the process of condensing its content while still maintaining its essential information. With the abundance of digital information available, summarization has become a significant task in various fields, including information retrieval, NLP (Natural Language Processing), and machine learning. This task has been extensively studied in languages such as English and Chinese, but research on Kurdish language summarization is lacking. Therefore, we present the first-ever Kurdish summarization news dataset, KurdSum, which includes over 40,000 texts. We collected news articles from Kurdish websites, preprocessed the data, and manually created a summary for each article. We further assessed the performance of our benchmark dataset on four extractive systems (LEXRANK, TEXTRANK, ORACLE, and LEAD0-3) and three abstractive methods (Pointer-Generator, Sequence-to-Sequence and transformer-abstractive). Our experiments showed that the Pointer-Generator approach yielded superior ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores compared to other techniques and ORACLE outperformed other extractive methods. Our findings offer a promising direction for the summarization of Kurdish text and can contribute to developing NLP tools for processing the Kurdish language. Likewise, the dataset can serve as a benchmark dataset for Kurdish language summarization and a valuable resource for researchers interested in developing Kurdish summarization models.","PeriodicalId":100944,"journal":{"name":"Natural Language Processing Journal","volume":"5","pages":"Article 100043"},"PeriodicalIF":0.0000,"publicationDate":"2023-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2949719123000407/pdfft?md5=521256c2a7cc54955d3efe4ae46c25e4&pid=1-s2.0-S2949719123000407-main.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Natural Language Processing Journal","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2949719123000407","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
KurdSum: A new benchmark dataset for the Kurdish text summarization
Summarizing a text means condensing its content while preserving its essential information. With the abundance of digital information available, summarization has become an important task in fields such as information retrieval, natural language processing (NLP), and machine learning. The task has been studied extensively for languages such as English and Chinese, but research on Kurdish summarization is lacking. We therefore present KurdSum, the first Kurdish news summarization dataset, which contains over 40,000 texts. We collected news articles from Kurdish websites, preprocessed the data, and manually created a summary for each article. We then evaluated four extractive systems (LEXRANK, TEXTRANK, ORACLE, and LEAD0-3) and three abstractive methods (Pointer-Generator, Sequence-to-Sequence, and transformer-abstractive) on the dataset. In our experiments, the Pointer-Generator approach yielded higher ROUGE (Recall-Oriented Understudy for Gisting Evaluation) scores than the other techniques, and ORACLE outperformed the other extractive methods. Our findings offer a promising direction for summarizing Kurdish text and can contribute to the development of NLP tools for the Kurdish language. The dataset can also serve as a benchmark for Kurdish summarization and as a resource for researchers developing Kurdish summarization models.
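To make the evaluation setup more concrete, the following is a minimal Python sketch of how a lead-style baseline, a greedy oracle, and a ROUGE-1 score might be computed. It is an illustration under stated assumptions, not the paper's implementation: whitespace tokenization, the choice of k = 3, the greedy oracle strategy, and the helper names `rouge_1`, `lead_k`, and `greedy_oracle` are all hypothetical, and the English sample sentences stand in for Kurdish news text.

```python
from collections import Counter


def rouge_1(candidate: str, reference: str) -> dict:
    """ROUGE-1 precision, recall, and F1 from clipped unigram overlap.

    Assumption: tokens are obtained by whitespace splitting; a real
    Kurdish pipeline would likely need a language-specific tokenizer.
    """
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


def lead_k(sentences: list, k: int = 3) -> str:
    """Lead baseline: use the first k sentences as the summary (k is assumed)."""
    return " ".join(sentences[:k])


def greedy_oracle(sentences: list, reference: str, max_sentences: int = 3) -> str:
    """Greedy oracle: add the sentence that most improves ROUGE-1 F1.

    Assumption: this greedy selection is one common way to build an
    extractive upper bound; the paper may define ORACLE differently.
    """
    selected = []
    best = 0.0
    while len(selected) < max_sentences:
        scored = []
        for i in range(len(sentences)):
            if i in selected:
                continue
            trial = " ".join(sentences[j] for j in sorted(selected + [i]))
            scored.append((rouge_1(trial, reference)["f1"], i))
        if not scored:
            break
        score, idx = max(scored)
        if score <= best:  # stop when no sentence improves the score
            break
        best = score
        selected.append(idx)
    return " ".join(sentences[j] for j in sorted(selected))


# Hypothetical sample data, for illustration only.
article = [
    "the council approved the new budget",
    "the vote was close",
    "opposition members walked out",
    "the weather was sunny",
]
reference = "the council approved the budget after a close vote"

lead_scores = rouge_1(lead_k(article, 3), reference)
oracle_scores = rouge_1(greedy_oracle(article, reference), reference)
```

Because the oracle picks sentences by looking at the reference summary, it is not a usable summarizer; it only indicates how well an extractive system could score at best under the chosen metric.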