Detection of Near Duplicates over Graph Datasets Using Pruning

2020 IEEE India Council International Subsections Conference (INDISCON). Pub Date: 2020-10-01. DOI: 10.1109/INDISCON50162.2020.00068

P. Naveena, P. S. Rao


Citations: 0

Abstract


Detection of Near Duplicates over Graph Datasets Using Pruning

Graphs are a widely used formalism for modeling data in domains such as natural language processing, chemoinformatics, computer vision, information retrieval, and software engineering. Finding similar graphs is essential for many applications in these domains. Graph isomorphism identifies exact duplicate graphs, but it cannot quantify similarity and is computationally expensive. To overcome both bottlenecks, a number of graph similarity measures have been proposed. Graph Similarity Self-Join (GSSJ) is the problem of finding all pairs of graphs whose similarity score is above a predefined threshold. For a dataset of n graphs, the naive solution computes the similarity score for all n(n-1)/2 pairs of graphs, so the problem is both compute- and data-intensive. Existing algorithms for this problem support only graph edit distance as the similarity measure. The overarching goal of this research is to develop algorithms for graph similarity self-join that support multiple graph similarity measures. The major contributions of this research will be better indexing mechanisms for graphs and tight bounds on graph similarity.
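The self-join problem and the role of a similarity bound can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not name the similarity measures or bounds used, so the sketch assumes Jaccard similarity over edge sets, with each graph given as a set of (u, v) edge tuples over a shared node-label space. The function names edge_jaccard, size_upper_bound, and similarity_self_join are hypothetical. The bound used for pruning is the standard size filter: the intersection of two sets is no larger than the smaller set and the union is no smaller than the larger set, so the Jaccard score cannot exceed min(|A|, |B|) / max(|A|, |B|).

```python
from itertools import combinations


def edge_jaccard(a, b):
    """Jaccard similarity of two edge sets (an assumed measure, not from the paper)."""
    if not a and not b:
        return 1.0  # two empty graphs are treated as identical
    return len(a & b) / len(a | b)


def size_upper_bound(a, b):
    """Cheap upper bound on edge_jaccard computed from the set sizes alone."""
    small, large = sorted((len(a), len(b)))
    return 1.0 if large == 0 else small / large


def similarity_self_join(graphs, threshold, prune=True):
    """Find all pairs whose similarity is above the threshold.

    Returns (pairs, exact_computations), where pairs holds (i, j, score)
    tuples and exact_computations counts calls to the exact measure.
    """
    pairs = []
    exact_computations = 0
    # The naive join visits all n(n-1)/2 unordered pairs.
    for (i, a), (j, b) in combinations(enumerate(graphs), 2):
        # Safe to skip: score <= bound <= threshold means the pair cannot qualify.
        if prune and size_upper_bound(a, b) <= threshold:
            continue
        exact_computations += 1
        score = edge_jaccard(a, b)
        if score > threshold:
            pairs.append((i, j, score))
    return pairs, exact_computations


if __name__ == "__main__":
    graphs = [
        frozenset({(1, 2), (2, 3), (3, 4)}),
        frozenset({(1, 2), (2, 3), (3, 5)}),
        frozenset({(1, 2)}),
    ]
    print(similarity_self_join(graphs, 0.4))
    print(similarity_self_join(graphs, 0.4, prune=False))
```

On these three toy graphs with a threshold of 0.4, both variants report the same single pair (graphs 0 and 1, score 0.5), but the pruned join evaluates the exact measure once instead of three times. Tighter bounds and an index over the graphs, the two contributions the abstract anticipates, would aim to avoid enumerating most pairs in the first place.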
