Document similarity based on concept tree distance

Praveen Lakkaraju, Susan Gauch, M. Speretta
Published in: Proceedings of the nineteenth ACM conference on Hypertext and hypermedia, 2008-06-19
DOI: https://doi.org/10.1145/1379092.1379118
Citations: 79

Abstract

The Web is quickly moving from the era of search engines to the era of discovery engines. Whereas search engines help you find information you are looking for, discovery engines help you find things that you never knew existed. A common discovery technique is to automatically identify and display objects similar to ones previously viewed by the user. Core to this approach is an accurate method to identify similar documents. In this paper, we present a new approach to identifying similar documents based on a conceptual tree-similarity measure. We represent each document as a concept tree using the concept associations obtained from a classifier. Then, we employ a tree-similarity measure based on a tree edit distance to compute similarities between concept trees. Experiments on documents from the CiteSeer collection showed that our algorithm performed significantly better than document similarity based on the traditional vector space model.
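
The pipeline the abstract outlines (classifier output, then a concept tree, then a tree edit distance) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Everything specific here is an assumption made for illustration: the slash-separated path format for classifier output, the propagation of each concept weight to its ancestors, the cost model (inserting or deleting a node costs its weight, changing a weight costs the absolute difference), and the normalization into a similarity score. Nodes are matched by label from the root down, which is a restricted form of tree edit distance rather than a general algorithm such as Zhang-Shasha.

```python
from dataclasses import dataclass, field


@dataclass
class ConceptNode:
    """One concept in a document's concept tree (hypothetical structure)."""
    label: str
    weight: float = 0.0
    children: dict = field(default_factory=dict)  # child label -> ConceptNode


def build_concept_tree(concept_weights):
    """Build a tree from {"A/B/C": weight} pairs.

    Assumption: the classifier returns slash-separated concept paths, and
    each weight is also credited to every ancestor on the path.
    """
    root = ConceptNode("ROOT")
    for path, weight in concept_weights.items():
        node = root
        for label in path.split("/"):
            node = node.children.setdefault(label, ConceptNode(label))
            node.weight += weight
    return root


def subtree_cost(node):
    """Cost of inserting or deleting a whole subtree: the sum of its weights."""
    return node.weight + sum(subtree_cost(c) for c in node.children.values())


def edit_distance(a, b):
    """Top-down edit distance between two trees whose roots are matched."""
    cost = abs(a.weight - b.weight)  # cost of changing the node weight
    for label in a.children.keys() | b.children.keys():
        child_a = a.children.get(label)
        child_b = b.children.get(label)
        if child_a is None:
            cost += subtree_cost(child_b)  # insert subtree missing from a
        elif child_b is None:
            cost += subtree_cost(child_a)  # delete subtree missing from b
        else:
            cost += edit_distance(child_a, child_b)
    return cost


def similarity(a, b):
    """Map the distance to [0, 1]; 1.0 means identical concept trees."""
    total = subtree_cost(a) + subtree_cost(b)
    if total == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / total


if __name__ == "__main__":
    ml = build_concept_tree({"CS/AI/ML": 1.0})
    nlp = build_concept_tree({"CS/AI/NLP": 1.0})
    algebra = build_concept_tree({"Math/Algebra": 1.0})
    print(similarity(ml, nlp), similarity(ml, algebra))
```

Under these assumptions, two documents classified into sibling concepts share their ancestor nodes and so score higher than two documents in unrelated branches. A flat comparison of leaf concepts alone would treat both pairs as equally dissimilar.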