Detection of Near Duplicates over Graph Datasets Using Pruning
P. Naveena, P. S. Rao
2020 IEEE India Council International Subsections Conference (INDISCON), published 2020-10-01
https://doi.org/10.1109/INDISCON50162.2020.00068
Graphs are a widely used formalism for modeling data in domains such as natural language processing, chemoinformatics, computer vision, information retrieval, and software engineering. Finding similar graphs is essential for many applications in these domains. Graph isomorphism detects exact duplicate graphs, but it cannot quantify similarity and is computationally expensive. To overcome both of these bottlenecks, a number of graph similarity measures have been proposed. Graph Similarity Self-Join (GSSJ) is the problem of finding all pairs of graphs in a dataset whose similarity score is above a predefined threshold. For a dataset of n graphs, the naive solution computes a similarity score for all C(n, 2) = n(n-1)/2 pairs of graphs, which makes the problem both compute and data intensive. Existing algorithms for this problem support only graph edit distance as the similarity measure. The overarching goal of this research is to develop algorithms for graph similarity self-join that support multiple graph similarity measures. The major contributions of this research will be better indexing mechanisms for graphs and tight bounds on graph similarity.
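To make the self-join and the role of bounds concrete, here is a minimal Python sketch. It is not the authors' algorithm. It assumes each graph is represented as a set of labeled edges, uses Jaccard similarity over edge sets as a stand-in similarity measure, treats "above the threshold" as "greater than or equal to", and prunes with a simple size-based upper bound. All function names are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two edge sets (stand-in measure, an assumption)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def size_upper_bound(a, b):
    """Cheap upper bound on Jaccard similarity.

    |A & B| <= min(|A|, |B|) and |A | B| >= max(|A|, |B|),
    so the ratio of the two sizes can never be below the true score.
    """
    if not a and not b:
        return 1.0
    return min(len(a), len(b)) / max(len(a), len(b))


def naive_self_join(graphs, threshold):
    """Score all n(n-1)/2 pairs and keep those meeting the threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(graphs), 2)
        if jaccard(a, b) >= threshold
    ]


def pruned_self_join(graphs, threshold):
    """Same result as the naive join, but skip pairs the bound rules out.

    Returns the matching pairs and the number of pairs actually scored.
    """
    results, scored = [], 0
    for (i, a), (j, b) in combinations(enumerate(graphs), 2):
        if size_upper_bound(a, b) < threshold:
            continue  # the true score cannot reach the threshold
        scored += 1
        if jaccard(a, b) >= threshold:
            results.append((i, j))
    return results, scored


# Four toy graphs, each given as a set of (source, target) edges.
graphs = [
    {("a", "b"), ("b", "c"), ("c", "d")},
    {("a", "b"), ("b", "c"), ("c", "e")},
    {("a", "b")},
    {("a", "b"), ("b", "c"), ("c", "d"), ("d", "e")},
]

pairs, scored = pruned_self_join(graphs, 0.5)
print(pairs, scored)  # [(0, 1), (0, 3)] 3
```

On this toy input the bound eliminates three of the six candidate pairs before any similarity score is computed, and the result matches the naive join. A tighter bound, or an index that avoids enumerating all pairs in the first place, would prune more; the abstract names both as intended contributions.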