W. Aly, M. Youssef, Wafaa Hanna Sharaby, Hany Atef Kelleny
Improved Classification of Arabic Unstructured Documents Based on Automated Domain Dictionary Construction
2013 23rd International Conference on Computer Theory and Applications (ICCTA), published 2013-10-29
DOI: 10.1109/ICCTA32607.2013.9529514
Citations: 1
Abstract
This paper develops a system for handling unstructured Arabic documents. It classifies these documents using a new automated domain-based dictionary construction (ADDC) algorithm together with a classifier algorithm. The proposed system explores the received Arabic documents, identifies and indexes them, and then classifies them automatically based on their contents. It works in three consecutive stages. In the first stage, the system builds automated domain-based dictionaries corresponding to an input set of classified documents: it explores, tokenizes, and preprocesses these documents, then passes them to the ADDC algorithm, which generates the target dictionaries. In the second stage, a set of unclassified Arabic documents is preprocessed, and the resulting documents are weighted using a normalized term weighting technique together with the previously generated dictionaries. In the third stage, a newly developed classifier algorithm operates on the outputs of the previous two stages to classify the Arabic data set. The proposed system achieved an overall accuracy of about 95%.
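The three stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' ADDC algorithm or classifier, whose details the abstract does not give. Every name here (`tokenize`, `build_dictionaries`, `term_weights`, `classify`, `top_k`) is hypothetical, and the choices of frequency-based dictionaries, length-normalized term frequencies, and a weighted-overlap score are assumptions made purely for illustration.

```python
import re
from collections import Counter, defaultdict

# Assumed tokenizer: keeps runs of basic Arabic letters (U+0621 to U+064A).
# Real preprocessing (stop-word removal, stemming, etc.) is not specified
# in the abstract and is omitted here.
TOKEN_RE = re.compile(r"[\u0621-\u064A]+")


def tokenize(text):
    return TOKEN_RE.findall(text)


def build_dictionaries(labeled_docs, top_k=50):
    """Stage 1 (assumed): one dictionary per domain, holding the top_k most
    frequent terms of that domain with weights that sum to 1."""
    counts = defaultdict(Counter)
    for label, text in labeled_docs:
        counts[label].update(tokenize(text))
    dictionaries = {}
    for label, counter in counts.items():
        top = counter.most_common(top_k)
        total = sum(count for _, count in top)
        dictionaries[label] = {term: count / total for term, count in top}
    return dictionaries


def term_weights(text):
    """Stage 2 (assumed): term frequencies normalized by document length."""
    counter = Counter(tokenize(text))
    total = sum(counter.values())
    return {term: count / total for term, count in counter.items()} if total else {}


def classify(text, dictionaries):
    """Stage 3 (assumed): choose the domain whose dictionary has the largest
    weighted overlap with the document's normalized term weights."""
    weights = term_weights(text)
    scores = {
        label: sum(w * dictionary.get(term, 0.0) for term, w in weights.items())
        for label, dictionary in dictionaries.items()
    }
    return max(scores, key=scores.get)


# Tiny made-up training set: (domain label, document text).
training = [
    ("sports", "فريق كرة مباراة فريق"),
    ("economy", "بنك سوق اقتصاد بنك"),
]
dictionaries = build_dictionaries(training)
predicted = classify("مباراة فريق", dictionaries)  # shares terms only with "sports"
```

In this toy run the unlabeled document shares terms only with the sports dictionary, so that domain gets the highest score. A realistic setup would need far more training text per domain and a defined fallback for documents that match no dictionary.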