High Quality Graph-Based Similarity Search

Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. Pub Date: 2015-08-09. DOI: 10.1145/2766462.2767720

Weiren Yu, J. McCann


Citations: 30

Abstract


High Quality Graph-Based Similarity Search

SimRank is an influential link-based similarity measure that has been used in many fields of Web search and sociometry. The best-of-breed method by Kusumoto et al., however, does not always deliver high-quality results, since it fails to accurately obtain its diagonal correction matrix D. SimRank is also limited by an unwanted "connectivity trait": increasing the number of paths between nodes a and b often incurs a decrease in score s(a,b). The best-known solution, SimRank++, cannot resolve this problem, since a revised score will be zero if a and b have no common in-neighbors. In this paper, we consider high-quality similarity search. Our scheme, SR#, is efficient and semantically meaningful: (1) We first formulate the exact D, and devise a "varied-D" method to accurately compute SimRank in linear memory. Moreover, by grouping computation, we also reduce the time from quadratic to linear in the number of iterations. (2) We design a "kernel-based" model to improve the quality of SimRank, and circumvent the "connectivity trait" issue. (3) We give mathematical insights into the semantic difference between SimRank and its variant, and correct an argument: "if D is replaced by a scaled identity matrix, top-K rankings will not be affected much". The experiments confirm that SR# can accurately extract high-quality scores, and is much faster than the state-of-the-art competitors.
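The "connectivity trait" can be reproduced on a toy graph. The following is a minimal sketch of the classic iterative SimRank definition (not SR#, and not the authors' implementation). The function name `simrank`, the decay factor `c=0.6`, the iteration count, and both example graphs are assumptions chosen for illustration.

```python
from itertools import product


def simrank(in_neighbors, c=0.6, iterations=10):
    """Naive iterative SimRank over a dict: node -> list of in-neighbors.

    Hypothetical helper for illustration only; c and iterations are assumed values.
    """
    nodes = list(in_neighbors)
    # Start from the identity: every node is fully similar to itself only.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        new = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new[(a, b)] = 1.0
                continue
            ia, ib = in_neighbors[a], in_neighbors[b]
            if not ia or not ib:
                # A node without in-neighbors is similar to no other node.
                new[(a, b)] = 0.0
                continue
            # Average similarity over all pairs of in-neighbors, damped by c.
            total = sum(sim[(i, j)] for i in ia for j in ib)
            new[(a, b)] = c * total / (len(ia) * len(ib))
        sim = new
    return sim


# Toy graph 1: a and b share one in-neighbor x (one path a <- x -> b).
one_path = {"x": [], "a": ["x"], "b": ["x"]}
# Toy graph 2: a and b share two in-neighbors x and y (two such paths).
two_paths = {"x": [], "y": [], "a": ["x", "y"], "b": ["x", "y"]}

s1 = simrank(one_path)[("a", "b")]
s2 = simrank(two_paths)[("a", "b")]
print(s1, s2)
```

In this sketch, adding the second common in-neighbor lowers s(a,b), because the average now includes the cross pairs (x,y) and (y,x), whose similarity is zero. This is the behavior the abstract calls the "connectivity trait".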
