Automatic Arabic Text Summarization for Large Scale Multiple Documents Using Genetic Algorithm and MapReduce
Authors: R. Baraka, Sulaiman N. Al Breem
Published in: 2017 Palestinian International Conference on Information and Communication Technology (PICICT), 2017-05-08
DOI: 10.1109/PICICT.2017.32 (https://doi.org/10.1109/PICICT.2017.32)
Citations: 3
Abstract
Multi-document summarization focuses on extracting the most significant information from a collection of textual documents. Most summarization techniques require the data to be centralized, which is often infeasible because of computational and storage limitations. The rapid growth of data, driven by technological progress and a growing number of sources, makes automatic summarization of large-scale data a challenging task. We propose an approach for automatic summarization of large-scale collections of Arabic documents using a genetic algorithm and the MapReduce parallel programming model. The approach ensures scalability, speed, and accuracy in summary generation. It eliminates sentence redundancy and increases readability and cohesion between the sentences of the summaries. The experiments yielded acceptable precision and recall scores, which indicates that the system successfully identifies the most important sentences. In addition, the approach achieved a speedup of up to 10x over a single machine, so it can handle large-scale datasets successfully. Finally, the efficiency score of the proposed approach indicates that, on the large dataset, it utilizes up to 62% of the available resources.
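The abstract reports a speedup and an efficiency score without defining them. The sketch below assumes the conventional parallel-computing definitions (speedup as single-machine time divided by parallel time, efficiency as speedup divided by the number of workers); the paper may define them differently. The function names and the timings in the example are hypothetical illustrations, not measurements from the paper.

```python
def speedup(t_serial: float, t_parallel: float) -> float:
    """Conventional speedup: single-machine time / parallel time (assumed definition)."""
    if t_serial <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_serial / t_parallel


def efficiency(t_serial: float, t_parallel: float, workers: int) -> float:
    """Conventional efficiency: speedup / number of workers (assumed definition)."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return speedup(t_serial, t_parallel) / workers


if __name__ == "__main__":
    # Hypothetical timings in seconds, chosen only to illustrate the arithmetic.
    t1, tp, n = 1200.0, 150.0, 16
    print(f"speedup    = {speedup(t1, tp):.2f}x")
    print(f"efficiency = {efficiency(t1, tp, n):.1%}")
```

Under these definitions, an efficiency below 100% means that adding workers gives less than proportional gains, typically because of communication and scheduling overhead between the map and reduce phases.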