Ontology-Based Mapping for Automated Document Management: A Concept-Based Technique for Word Mismatch and Ambiguity Problems in Document Clustering

ACM Trans. Manag. Inf. Syst. | Pub Date: 2015-04-03 | DOI: 10.1145/2688488

Yen-Hsien Lee, P. H. Hu, Ching-Yi Tu

Citations: 4

Abstract

Document clustering is crucial to automated document management, especially for the fast-growing volume of textual documents available digitally. Traditional lexicon-based approaches depend on document content analysis and measure the overlap of the feature vectors representing different documents, so they cannot effectively address word mismatch or ambiguity problems. Alternative query expansion and local context discovery approaches have been developed but suffer from limited efficiency and effectiveness, because the large number of expanded terms creates noise and increases the dimensionality and complexity of the overall feature space. Several techniques extend lexicon-based analysis by incorporating latent semantic indexing but produce less comprehensible clustering results and questionable performance. We instead propose a concept-based document representation and clustering (CDRC) technique and empirically examine its effectiveness using 433 articles concerning information systems and technology, randomly selected from a popular digital library. Our evaluation includes two widely used benchmark techniques and shows that CDRC outperforms them. Overall, our results reveal that clustering documents at an ontology-based, concept-based level is more effective than techniques using lexicon-based document features and can generate more comprehensible clustering results.
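
The word mismatch problem described in the abstract can be illustrated with a small sketch. This is not the CDRC technique itself, whose details the abstract does not give; it is a toy comparison under stated assumptions. The mapping `TOY_ONTOLOGY`, the helper names, and the two sample documents are all hypothetical, and the sketch covers only word mismatch (synonyms), not ambiguity.

```python
import math
from collections import Counter

# Hypothetical toy ontology: surface term -> concept label.
# Illustrative only; not taken from the paper.
TOY_ONTOLOGY = {
    "car": "VEHICLE",
    "automobile": "VEHICLE",
    "buy": "ACQUIRE",
    "purchase": "ACQUIRE",
}


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexical_vector(text: str) -> Counter:
    """Lexicon-based features: raw term counts."""
    return Counter(text.lower().split())


def concept_vector(text: str) -> Counter:
    """Concept-level features: each term is replaced by its concept
    when the toy ontology knows it, otherwise kept unchanged."""
    return Counter(TOY_ONTOLOGY.get(t, t) for t in text.lower().split())


doc_a = "buy automobile"
doc_b = "purchase car"

# The two documents share no surface terms, so lexical overlap is zero.
lexical_sim = cosine(lexical_vector(doc_a), lexical_vector(doc_b))

# After mapping to concepts, both documents have identical feature vectors.
concept_sim = cosine(concept_vector(doc_a), concept_vector(doc_b))

print(f"lexical similarity: {lexical_sim:.2f}")
print(f"concept similarity: {concept_sim:.2f}")
```

In this toy case the lexical vectors have no terms in common while the concept vectors coincide, which is the kind of gap a concept-level representation is meant to close. A real ontology would also need to resolve ambiguous terms, which a flat dictionary like this one cannot do.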
