Authors: Lihua Liu, Mao Wang, Kaiming Xiao, Haiwen Chen
Published in: Proceedings of the 2022 5th International Conference on Algorithms, Computing and Artificial Intelligence
Publication date: 2022-12-23
DOI: 10.1145/3579654.3579686 (https://doi.org/10.1145/3579654.3579686)
Citations: 0
Abstract
A method for accelerating relations search over big data scenario
We propose a method for accelerating relation search in big data scenarios. Relation search is in essence an induced-subgraph search over a graph data model: given a vertex set, find every edge whose two endpoints both fall in that set. For example, in risk control for anti-fraud, once a gang of swindlers with highly similar fraud behaviors has been identified, uncovering all relations among its members can greatly help in precisely targeting similar gangs. In-memory induced-subgraph search requires many random accesses; in big data scenarios, however, relation data tend to be stored on disk because memory is limited, and the corresponding random lookups on disk incur huge overhead. Moreover, even a single disk access can be costly, since a relation usually carries many property values. Hence, existing methods suffer considerable performance bottlenecks when applied to data on disk. We build the graph over a disk-based edge index and propose an induced-subgraph acceleration method based on multiple BFS (Breadth-First Search) level gaps, which keeps performance efficient when memory is limited. The edge-index-based data organization avoids imbalanced query-evaluation performance and further reduces redundant I/O accesses. Our BFS level-gap framework filters out most invalid edge queries, that is, queries that would return no result. The key observation is that, for any single BFS, if the BFS levels of two vertices differ by more than 1, no edge can exist between them. Extensive experiments on real-world graph datasets confirm the effectiveness of our method: more than 97% of invalid edge queries are filtered out, which greatly improves system performance. Hence, encoding multiple BFS level gaps over vertices filters most no-result edge queries and improves relation search performance. Our method also reduces redundant disk I/O while keeping performance stable.
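The level-gap observation in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`bfs_levels`, `may_have_edge`, `induced_edges`) are hypothetical, the graph is assumed to be undirected, and an in-memory adjacency dictionary stands in for the disk-based edge index, with a counter in place of real disk lookups.

```python
from collections import deque
from itertools import combinations


def bfs_levels(adj, root):
    """Return {vertex: BFS level} for every vertex reachable from root."""
    levels = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in levels:
                levels[v] = levels[u] + 1
                queue.append(v)
    return levels


def may_have_edge(u, v, level_maps):
    """Return False only when some BFS proves u and v cannot be adjacent."""
    for levels in level_maps:
        lu, lv = levels.get(u), levels.get(v)
        if (lu is None) != (lv is None):
            return False  # only one endpoint is reachable from this root
        if lu is not None and abs(lu - lv) > 1:
            return False  # level gap larger than 1 (undirected graph assumed)
    return True


def induced_edges(adj, vertex_set, level_maps):
    """Return (edges of the induced subgraph, number of edge lookups issued)."""
    edges, lookups = [], 0
    for u, v in combinations(sorted(vertex_set), 2):
        if not may_have_edge(u, v, level_maps):
            continue  # filtered without touching the edge store
        lookups += 1  # stand-in for one query against the disk-based edge index
        if v in adj[u]:
            edges.append((u, v))
    return edges, lookups


# Hypothetical 4-cycle: 0-1, 0-2, 1-3, 2-3.
adj = {0: {1, 2}, 1: {0, 3}, 2: {0, 3}, 3: {1, 2}}
one_bfs = [bfs_levels(adj, 0)]
two_bfs = [bfs_levels(adj, 0), bfs_levels(adj, 1)]
print(induced_edges(adj, {1, 2, 3}, one_bfs))
print(induced_edges(adj, {1, 2, 3}, two_bfs))
```

In this toy graph, the BFS rooted at vertex 0 places vertices 1 and 2 on the same level and so cannot rule out the pair (1, 2). A second BFS rooted at vertex 1 puts them two levels apart, so that lookup is skipped. This is the intuition behind combining several BFS runs: each extra root can only remove more candidate pairs, at the cost of storing one more level per vertex.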