Improved Classification of Arabic Unstructured Documents Based on Automated Domain Dictionary Construction

2013 23rd International Conference on Computer Theory and Applications (ICCTA). Pub Date: 2013-10-29. DOI: 10.1109/ICCTA32607.2013.9529514

W. Aly, M. Youssef, Wafaa Hanna Sharaby, Hany Atef Kelleny


Citations: 1

Abstract

This paper presents a system for handling unstructured Arabic documents. The system classifies such documents using two new components: an Automated Domain Dictionary Construction (ADDC) algorithm and a classifier algorithm. It explores incoming Arabic documents, identifies them, indexes them, and then classifies them automatically based on their contents. The system works in three consecutive stages. In the first stage, it builds domain-based dictionaries from an input set of already classified documents: the documents are explored, tokenized, and preprocessed, and then passed to the ADDC algorithm, which generates the target dictionaries. In the second stage, a set of unclassified Arabic documents is preprocessed, and term weights for the resulting documents are computed with a normalized term weighting technique together with the previously generated dictionaries. In the third stage, a newly developed classifier algorithm is applied to the outputs of the previous two stages to classify the Arabic data set. The proposed system achieved an overall accuracy of about 95%.
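The abstract does not give the details of the ADDC algorithm, the weighting formula, or the classifier, so the following Python sketch is only an illustration of how a three-stage pipeline of this shape could fit together. Everything in it is an assumption made for the example: the function names (`tokenize`, `build_dictionaries`, `normalized_weights`, `classify`), the regex tokenizer, the use of relative term frequency as the "normalized" weight, and the overlap-based scoring rule. It should not be read as the authors' method.

```python
import re
from collections import Counter

# Hypothetical tokenizer: keeps runs of Arabic letters (U+0621..U+064A).
# The paper's actual preprocessing steps are not described in the abstract.
TOKEN_RE = re.compile(r"[\u0621-\u064A]+")


def tokenize(text):
    """Split text into Arabic word tokens."""
    return TOKEN_RE.findall(text)


def build_dictionaries(labeled_docs, top_k=50):
    """Stage 1 (assumed): build one term -> weight dictionary per domain.

    labeled_docs is an iterable of (domain, text) pairs. Each weight is the
    term's share of all tokens seen in that domain; only the top_k most
    frequent terms per domain are kept.
    """
    counts = {}
    for domain, text in labeled_docs:
        counts.setdefault(domain, Counter()).update(tokenize(text))
    dictionaries = {}
    for domain, counter in counts.items():
        total = sum(counter.values())
        dictionaries[domain] = {
            term: count / total for term, count in counter.most_common(top_k)
        }
    return dictionaries


def normalized_weights(text):
    """Stage 2 (assumed): term frequency divided by document length."""
    tokens = tokenize(text)
    if not tokens:
        return {}
    return {term: count / len(tokens) for term, count in Counter(tokens).items()}


def classify(text, dictionaries):
    """Stage 3 (assumed): return the domain with the highest overlap score.

    The score is the sum, over the document's terms, of document weight
    times dictionary weight. Returns None when no domain shares any term.
    """
    weights = normalized_weights(text)
    best_domain, best_score = None, 0.0
    for domain, dictionary in dictionaries.items():
        score = sum(w * dictionary.get(t, 0.0) for t, w in weights.items())
        if score > best_score:
            best_domain, best_score = domain, score
    return best_domain


if __name__ == "__main__":
    training = [
        ("sports", "كرة القدم مباراة فريق"),
        ("economy", "اقتصاد بنك سوق مال"),
    ]
    domain_dicts = build_dictionaries(training)
    print(classify("مباراة كرة", domain_dicts))
```

In this toy setup the two training documents stand in for the "input set of classified documents", and the final call plays the role of an unclassified document passing through stages two and three. A real system would need proper Arabic normalization, stop-word removal, and stemming before any of these steps.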

