Automatic Arabic text summarization using clustering and keyphrase extraction

Hamzah Noori Fejer, N. Omar

Proceedings of the 6th International Conference on Information Technology and Multimedia, 2014-11-01
DOI: 10.1109/ICIMU.2014.7066647 (https://doi.org/10.1109/ICIMU.2014.7066647)
Citations: 20

Abstract

As the number of electronic documents grows rapidly, faster techniques are needed to assess the relevance of these documents. A summary is a concise representation of the underlying text. Forming an ideal summary requires a full understanding of the document, but achieving such understanding is difficult or impossible for computers. Therefore, the most common technique in automated text summarization is to select important sentences from the original text and present them as the summary. This paper proposes a hybrid clustering method (partitioning and hierarchical) to group many Arabic documents into several clusters. A keyphrase extraction module is then applied to extract important keyphrases from each cluster, which helps identify the most important sentences and find similar sentences based on several similarity algorithms. One sentence is extracted from each group of similar sentences, and the other similar sentences (i.e., those whose similarity exceeds a predefined threshold) are ignored. The model is designed for both single- and multi-document Arabic text summarization. The Recall-Oriented Understudy for Gisting Evaluation (ROUGE) metric was used for evaluation. The summarization dataset was the Essex Arabic Summaries Corpus, which contains many topic-based articles with multiple human summaries. The model achieved an accuracy of 80% for single-document and 62% for multi-document summarization.
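The redundancy step and the recall-oriented evaluation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say which similarity algorithms or threshold value were used, so the bag-of-words cosine similarity, the `threshold=0.7` default, the whitespace tokenization, and all function names below are assumptions chosen for illustration. Real Arabic text would also need proper tokenization and normalization, which this sketch omits.

```python
import math
from collections import Counter


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between whitespace-tokenized bag-of-words vectors.

    Assumption: the paper's actual similarity measures are not specified.
    """
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def drop_redundant(ranked_sentences, threshold=0.7):
    """Keep one sentence per group of near-duplicates.

    `ranked_sentences` is assumed to be ordered from most to least
    important; a sentence is kept only if its similarity to every
    already-kept sentence does not exceed `threshold` (hypothetical value).
    """
    kept = []
    for sentence in ranked_sentences:
        if all(cosine_similarity(sentence, k) <= threshold for k in kept):
            kept.append(sentence)
    return kept


def rouge1_recall(candidate: str, reference: str) -> float:
    """Unigram recall: share of reference tokens (with multiplicity)
    that also appear in the candidate summary."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum(min(count, cand[token]) for token, count in ref.items())
    total = sum(ref.values())
    return overlap / total if total else 0.0


if __name__ == "__main__":
    ranked = [
        "the cat sat on the mat",
        "the cat sat on a mat",   # near-duplicate of the first sentence
        "dogs bark loudly",
    ]
    print(drop_redundant(ranked))
    print(rouge1_recall("the cat sat", "the cat sat on the mat"))
```

Processing sentences in importance order means that the highest-ranked member of each group of similar sentences is the one that survives, which matches the abstract's description of extracting one sentence and ignoring the rest.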