Document similarity based on concept tree distance

Proceedings of the nineteenth ACM conference on Hypertext and hypermedia. Pub Date: 2008-06-19. DOI: 10.1145/1379092.1379118

Praveen Lakkaraju, Susan Gauch, M. Speretta

Citations: 79

Abstract

The Web is quickly moving from the era of search engines to the era of discovery engines. Whereas search engines help you find information you are looking for, discovery engines help you find things that you never knew existed. A common discovery technique is to automatically identify and display objects similar to ones previously viewed by the user. Core to this approach is an accurate method to identify similar documents. In this paper, we present a new approach to identifying similar documents based on a conceptual tree-similarity measure. We represent each document as a concept tree using the concept associations obtained from a classifier. Then, we employ a tree-similarity measure based on a tree edit distance to compute similarities between concept trees. Experiments on documents from the CiteSeer collection showed that our algorithm performed significantly better than document similarity based on the traditional vector space model.
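To make the idea of comparing concept trees by edit distance concrete, the following is a minimal sketch and not the paper's implementation. It assumes unit costs for inserting, deleting, and relabeling a node, treats sibling order as significant, and normalizes by the combined tree size. None of these choices is specified in the abstract, and the names `tree`, `forest_distance`, `tree_edit_distance`, and `similarity` are hypothetical. The concept labels in the usage example are likewise illustrative only.

```python
from functools import lru_cache

# Assumption: a concept tree is a nested tuple (label, children),
# where children is a tuple of subtrees. Tuples are hashable, so
# intermediate results can be memoized.
def tree(label, *children):
    return (label, tuple(children))

def size(forest):
    """Count the nodes in a forest (a tuple of trees)."""
    return sum(1 + size(children) for _, children in forest)

@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Ordered edit distance between two forests with unit costs (assumed)."""
    if not f and not g:
        return 0
    if not f:
        return size(g)  # insert every node of g
    if not g:
        return size(f)  # delete every node of f
    v_label, v_children = f[-1]  # rightmost root of f
    w_label, w_children = g[-1]  # rightmost root of g
    # Delete v: its children move up and take its place in the forest.
    delete = forest_distance(f[:-1] + v_children, g) + 1
    # Insert w: the symmetric case on the other forest.
    insert = forest_distance(f, g[:-1] + w_children) + 1
    # Match v with w: compare their children and the remaining forests.
    match = (forest_distance(v_children, w_children)
             + forest_distance(f[:-1], g[:-1])
             + (0 if v_label == w_label else 1))
    return min(delete, insert, match)

def tree_edit_distance(a, b):
    return forest_distance((a,), (b,))

def similarity(a, b):
    """Map the distance into [0, 1]; 1.0 means identical trees.

    The normalization by combined size is an assumption of this sketch.
    """
    total = size((a,)) + size((b,))
    return 1.0 - tree_edit_distance(a, b) / total

if __name__ == "__main__":
    # Hypothetical concept trees for three documents.
    doc_a = tree("CS", tree("IR", tree("Search")), tree("AI"))
    doc_b = tree("CS", tree("IR", tree("Clustering")), tree("AI"))
    doc_c = tree("CS", tree("IR"))
    print(tree_edit_distance(doc_a, doc_b), similarity(doc_a, doc_b))
    print(tree_edit_distance(doc_a, doc_c), similarity(doc_a, doc_c))
```

In this sketch, two documents whose concept trees differ in a single leaf concept are one relabel apart, while a document missing whole branches is farther away. A faithful reproduction of the paper would need its actual cost model and its concept weights from the classifier, which the abstract does not give.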


