Automatic Arabic text summarization using clustering and keyphrase extraction

Proceedings of the 6th International Conference on Information Technology and Multimedia. Pub Date: 2014-11-01. DOI: 10.1109/ICIMU.2014.7066647

Hamzah Noori Fejer, N. Omar


Citations: 20

Abstract

As the number of electronic documents grows rapidly, faster techniques are needed to assess their relevance. A summary is a concise representation of the underlying text. Forming an ideal summary requires a full understanding of the document, but achieving such understanding is difficult or impossible for computers. Therefore, the most common technique in automated text summarization is to select important sentences from the original text and present them as the summary. This paper proposes a hybrid clustering method (partitioning and hierarchical) to group many Arabic documents into several clusters. A keyphrase extraction module is then applied to extract important keyphrases from each cluster, which helps identify the most important sentences and find similar sentences based on several similarity algorithms. Similarity is used to extract one sentence from each group of similar sentences while ignoring the rest (i.e., sentences whose similarity exceeds a predefined threshold). The model is designed for both single- and multi-document Arabic text summarization. The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric was used for the evaluation. The Essex Arabic Summaries Corpus, which contains many topic-based articles with multiple human summaries, was used as the summarization dataset. The model achieved an accuracy of 80% for single-document and 62% for multi-document summarization.
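The redundancy step described in the abstract (keep one sentence from a group of similar sentences and ignore those above a similarity threshold) can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine similarity over whitespace-split bag-of-words vectors, the function names `cosine_similarity` and `filter_redundant`, and the threshold value 0.7 are all assumptions made for illustration. The abstract only says that "several similarity algorithms" and a "predefined threshold" are used. The sketch also assumes the sentences arrive already ranked by importance, and it uses English text for readability; a real Arabic pipeline would likely need normalization and stemming before comparison.

```python
import math
from collections import Counter
from typing import List


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two sentences.

    Tokenization by whitespace is an assumption made for this sketch.
    """
    va, vb = Counter(a.split()), Counter(b.split())
    # Dot product over the words the two sentences share.
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def filter_redundant(ranked_sentences: List[str], threshold: float = 0.7) -> List[str]:
    """Greedily keep sentences that are not too similar to any kept sentence.

    `ranked_sentences` is assumed to be sorted from most to least important,
    so the first sentence of each group of near-duplicates is the one kept.
    The default threshold is a hypothetical value, not taken from the paper.
    """
    kept: List[str] = []
    for sentence in ranked_sentences:
        # A sentence is dropped when its similarity to a kept one exceeds the threshold.
        if all(cosine_similarity(sentence, k) <= threshold for k in kept):
            kept.append(sentence)
    return kept


if __name__ == "__main__":
    candidates = [
        "the cat sat on the mat",
        "the cat sat on a mat",   # near-duplicate of the first sentence
        "dogs bark loudly",
    ]
    print(filter_redundant(candidates))
```

Because the filter is greedy, the result depends on the input order, which is why the sentences are assumed to be ranked beforehand (for example, by keyphrase coverage). Raising the threshold keeps more near-duplicates; lowering it yields a shorter, more diverse summary.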
