Document similarity based on concept tree distance
Praveen Lakkaraju, Susan Gauch, M. Speretta
Proceedings of the nineteenth ACM conference on Hypertext and hypermedia, 2008-06-19
DOI: https://doi.org/10.1145/1379092.1379118
Citation count: 79
Abstract
The Web is quickly moving from the era of search engines to the era of discovery engines. Whereas search engines help you find information you are looking for, discovery engines help you find things that you never knew existed. A common discovery technique is to automatically identify and display objects similar to ones previously viewed by the user. Core to this approach is an accurate method to identify similar documents. In this paper, we present a new approach to identifying similar documents based on a conceptual tree-similarity measure. We represent each document as a concept tree using the concept associations obtained from a classifier. Then, we employ a tree-similarity measure based on a tree edit distance to compute similarities between concept trees. Experiments on documents from the CiteSeer collection showed that our algorithm performed significantly better than document similarity based on the traditional vector space model.
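The abstract does not give the edit operations or their costs, so the following Python sketch is an illustration of the general idea and not the authors' algorithm. It assumes a classifier has already assigned each document a set of weighted concepts, written here as hypothetical slash-separated paths in a fixed concept hierarchy. Because both trees are drawn from the same hierarchy, nodes can be aligned by path, and the edit distance is simplified to the total weight that must be inserted, deleted, or changed to turn one tree into the other. The function names and the normalization are assumptions made for this example.

```python
from collections import defaultdict


def build_concept_tree(concept_weights):
    """Turn {"A/B/C": weight} classifier output into a weighted concept tree.

    Each node is keyed by its path from the root, and each concept's weight
    is propagated to all of its ancestors (an assumption of this sketch).
    """
    tree = defaultdict(float)
    for path, weight in concept_weights.items():
        parts = tuple(path.split("/"))
        for depth in range(1, len(parts) + 1):
            tree[parts[:depth]] += weight
    return dict(tree)


def tree_distance(tree_a, tree_b):
    """Simplified edit distance: the cost of an edit is the weight it changes.

    A node present in only one tree counts as an insertion or deletion of
    its full weight; a node present in both costs the weight difference.
    """
    nodes = set(tree_a) | set(tree_b)
    return sum(abs(tree_a.get(n, 0.0) - tree_b.get(n, 0.0)) for n in nodes)


def tree_similarity(tree_a, tree_b):
    """Map the distance into [0, 1], where 1 means identical trees."""
    total = sum(tree_a.values()) + sum(tree_b.values())
    if total == 0:
        return 1.0
    return 1.0 - tree_distance(tree_a, tree_b) / total


if __name__ == "__main__":
    # Hypothetical classifier output for three documents.
    doc_a = build_concept_tree({"CS/IR/Search": 0.6, "CS/AI/ML": 0.4})
    doc_b = build_concept_tree({"CS/IR/Retrieval": 0.5, "CS/AI/ML": 0.5})
    doc_c = build_concept_tree({"Math/Algebra": 1.0})
    print(tree_similarity(doc_a, doc_b))
    print(tree_similarity(doc_a, doc_c))
```

In this sketch, documents that share ancestor concepts (such as `doc_a` and `doc_b` under `CS/IR`) receive partial credit even when their leaf concepts differ. A flat vector over leaf concepts would treat those leaves as unrelated.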