Automated text document categorization

2015 IEEE Seventh International Conference on Intelligent Computing and Information Systems (ICICIS). Pub Date: 2015-12-01. DOI: 10.1109/INTELCIS.2015.7397271

R. Yasotha, E. Charles

Pages: 522-528

Citations: 8

Abstract

Over the last two decades, the number of text documents in digital form has grown enormously. To make retrieval easy, documents need to be categorized into topics and subtopics. Manual categorization can be done only by experts and is time-consuming, so the ability to organize and classify documents automatically is of great practical importance. Two approaches, rule-based and machine learning-based, are used to automate the classification task, and both have limitations. In rule-based approaches, rules may conflict with one another and must be reconstructed when the target domain changes. Machine learning approaches require proper training data and are not accountable for their classification results. Motivated by these limitations, this paper proposes a Latent Dirichlet Allocation (LDA) based approach to classify text documents automatically. To develop and test the proposed approach in a realistic setup, the ACM (Association for Computing Machinery) Computing Classification System (CCS) was selected as the target platform, and 9100 computer science articles categorized under ACM-CCS were selected. The experimental results show that the proposed approach is effective for classifying text documents and is applicable to a domain with a large number of categories at multiple levels.
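
The abstract does not say how LDA is turned into a classifier, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes that topic distributions are inferred with a small collapsed Gibbs sampler and that a new document is assigned the category whose mean topic vector is nearest. The function names `fit_lda` and `classify`, the hyperparameters, the toy corpus, and the use of two category names as labels are all illustrative choices.

```python
import random
from collections import Counter


def fit_lda(docs, num_topics, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents.

    Returns one topic distribution (list of floats) per document.
    All hyperparameter defaults are illustrative assumptions.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]         # topic counts per document
    topic_word = [Counter() for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                       # token counts per topic
    assign = []                                          # topic of every token

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(num_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample its topic from the conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed, normalized topic proportions per document.
    return [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]


def classify(theta_labeled, labels, theta_new):
    """Pick the label whose mean topic vector is closest to theta_new
    (squared Euclidean distance). Assumed rule, for illustration only."""
    groups = {}
    for vec, lab in zip(theta_labeled, labels):
        groups.setdefault(lab, []).append(vec)
    centroids = {
        lab: [sum(col) / len(vecs) for col in zip(*vecs)]
        for lab, vecs in groups.items()
    }

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(centroids, key=lambda lab: dist(centroids[lab], theta_new))


# Toy corpus; the labels are used only as example category names.
labeled = [
    ("database query index transaction sql".split(), "Information systems"),
    ("query index database storage sql".split(), "Information systems"),
    ("neural network training gradient learning".split(), "Computing methodologies"),
    ("learning gradient network classifier training".split(), "Computing methodologies"),
]
new_doc = "sql database transaction index".split()

docs = [tokens for tokens, _ in labeled] + [new_doc]
labels = [lab for _, lab in labeled]
theta = fit_lda(docs, num_topics=2)
predicted = classify(theta[:-1], labels, theta[-1])
print(predicted)
```

In this sketch the labeled and unlabeled documents are fitted together so that they share one topic space. A multi-level scheme such as ACM-CCS could, in principle, repeat the same step within each chosen parent category, but that extension is likewise an assumption rather than something the abstract states.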
