Improved Classification of Arabic Unstructured Documents Based on Automated Domain Dictionary Construction

W. Aly, M. Youssef, Wafaa Hanna Sharaby, Hany Atef Kelleny
2013 23rd International Conference on Computer Theory and Applications (ICCTA), published 2013-10-29
DOI: 10.1109/ICCTA32607.2013.9529514
https://doi.org/10.1109/ICCTA32607.2013.9529514
Citation count: 1

Abstract

This paper develops a system for handling unstructured Arabic documents. The system classifies these documents using a new automated domain-based dictionary construction (ADDC) algorithm and a new classifier algorithm. It explores the incoming Arabic documents, identifies and indexes them, and then classifies them automatically based on their contents. The system works in three consecutive stages. In the first stage, it builds domain-based dictionaries from an input set of already classified documents: the documents are explored, tokenized, and preprocessed, and the ADDC algorithm then generates the target dictionaries from them. In the second stage, a set of unclassified Arabic documents is preprocessed, and the resulting documents are weighted using a normalized term weighting technique together with the previously generated dictionaries. In the third stage, the new classifier algorithm is applied to the outputs of the previous two stages to classify the Arabic data set. The proposed system achieved an overall accuracy of about 95%.
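The three stages described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the ADDC algorithm, the exact term weighting formula, or the classifier, so everything below is an assumption made for illustration: the function names, the "most frequent terms per domain" dictionary rule, the length-normalized term frequency, the overlap-based scoring, and the toy documents are all hypothetical and are not the authors' method.

```python
from collections import Counter, defaultdict


def tokenize(text):
    # Whitespace tokenization only. A real system would add Arabic-specific
    # normalization, stop-word removal, and stemming (omitted here).
    return text.split()


def build_dictionaries(labeled_docs, top_k=50):
    # Stage 1 (assumed rule, not the paper's ADDC algorithm):
    # keep the top_k most frequent terms seen in each domain.
    counts = defaultdict(Counter)
    for label, text in labeled_docs:
        counts[label].update(tokenize(text))
    return {
        label: {term for term, _ in counter.most_common(top_k)}
        for label, counter in counts.items()
    }


def normalized_weights(text):
    # Stage 2 (assumed formula): term frequency divided by document length,
    # so the weights of one document sum to 1.
    tokens = tokenize(text)
    if not tokens:
        return {}
    total = len(tokens)
    return {term: n / total for term, n in Counter(tokens).items()}


def classify(text, dictionaries):
    # Stage 3 (assumed scoring): choose the domain whose dictionary
    # covers the largest share of the document's term weight.
    weights = normalized_weights(text)
    scores = {
        label: sum(w for term, w in weights.items() if term in terms)
        for label, terms in dictionaries.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy training set: (domain label, document text).
training = [
    ("sports", "كرة القدم مباراة فريق هدف"),
    ("sports", "فريق لاعب مباراة بطولة"),
    ("economy", "سوق بنك استثمار أسهم"),
    ("economy", "بنك اقتصاد سوق تضخم"),
]

dictionaries = build_dictionaries(training)
label, scores = classify("فاز فريق في مباراة", dictionaries)
print(label, scores)
```

In this toy run, two of the four tokens in the unclassified document appear in the sports dictionary and none in the economy dictionary, so the sports domain receives the higher score. The sketch only shows how the stages feed into one another; it says nothing about how the reported accuracy was obtained.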