Representative and Back-In-Time Sampling from Real-World Hypergraphs

IF 4.8 | Zone 3, Computer Science | JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS

ACM Transactions on Knowledge Discovery from Data | Pub Date: 2024-03-19 | DOI: 10.1145/3653306

Minyoung Choe, Jaemin Yoo, Geon Lee, Woonsung Baek, U Kang, Kijung Shin

Citations: 0

Abstract

Graphs are widely used for representing pairwise interactions in complex systems. Since such real-world graphs are large and often ever-growing, sampling subgraphs is useful for various purposes, including simulation, visualization, stream processing, representation learning, and crawling. However, many complex systems consist of group interactions (e.g., collaborations of researchers and discussions on online Q&A platforms) and thus are represented more naturally and accurately by hypergraphs than by ordinary graphs. Motivated by the prevalence of large-scale hypergraphs, we study the problem of sampling from real-world hypergraphs, aiming to answer two questions: (Q1) how can we measure the goodness of sub-hypergraphs, and (Q2) how can we efficiently find a “good” sub-hypergraph? Regarding Q1, we distinguish between two goals: (a) representative sampling, which aims to capture the characteristics of the input hypergraph, and (b) back-in-time sampling, which aims to closely approximate a past snapshot of the input time-evolving hypergraph. To evaluate the similarity of the sampled sub-hypergraph to the target (i.e., the input hypergraph or its past snapshot), we consider 10 graph-level, hyperedge-level, and node-level statistics. Regarding Q2, we first conduct a thorough analysis of various intuitive approaches using 11 real-world hypergraphs. Then, based on this analysis, we propose MiDaS and MiDaS-B, designed for representative sampling and back-in-time sampling, respectively. Regarding representative sampling, we demonstrate through extensive experiments that MiDaS, which employs a sampling bias towards high-degree nodes in hyperedge selection, is (a) Representative: finding overall the most representative samples among 15 considered approaches, (b) Fast: several orders of magnitude faster than the strongest competitors, and (c) Automatic: automatically tuning the degree of sampling bias. Regarding back-in-time sampling, we demonstrate that MiDaS-B inherits the strengths of MiDaS despite an additional challenge: the unavailability of the target (i.e., the past snapshot). It effectively handles this challenge by focusing on replicating universal evolutionary patterns, rather than directly replicating the target.
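
The idea of hyperedge selection biased towards high-degree nodes, together with one node-level comparison between a sample and its target, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not state how the bias is computed, so the weighting below (minimum node degree in the hyperedge, raised to an exponent `alpha`) is an assumption made for illustration, and the function names `node_degrees`, `degree_biased_sample`, and `degree_ks_distance` are hypothetical. The automatic tuning of the bias that the abstract mentions is not covered here.

```python
import random
from bisect import bisect_right
from collections import Counter


def node_degrees(hyperedges):
    """Count how many hyperedges each node appears in."""
    degrees = Counter()
    for edge in hyperedges:
        degrees.update(set(edge))
    return degrees


def degree_biased_sample(hyperedges, k, alpha, seed=None):
    """Draw k distinct non-empty hyperedges, favoring those with high-degree nodes.

    ASSUMPTION (not stated in the abstract): the weight of a hyperedge is
    (minimum degree among its nodes) ** alpha; alpha = 0 gives uniform sampling.
    """
    if not 0 <= k <= len(hyperedges):
        raise ValueError("k must be between 0 and the number of hyperedges")
    rng = random.Random(seed)
    degrees = node_degrees(hyperedges)
    remaining = list(range(len(hyperedges)))
    weights = [min(degrees[v] for v in hyperedges[i]) ** alpha for i in remaining]
    chosen = []
    for _ in range(k):
        # Weighted draw without replacement: pick one position, then drop it.
        pos = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pos))
        weights.pop(pos)
    return [hyperedges[i] for i in chosen]


def degree_ks_distance(original, sample):
    """Largest gap between the empirical node-degree CDFs of two hypergraphs."""
    a = sorted(node_degrees(original).values())
    b = sorted(node_degrees(sample).values())
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in set(a) | set(b)
    )


if __name__ == "__main__":
    H = [(1, 2, 3), (2, 3), (3, 4), (5, 6), (1, 3, 4)]
    S = degree_biased_sample(H, k=3, alpha=1.0, seed=0)
    print(S, degree_ks_distance(H, S))
```

A smaller distance means the degree distribution of the sample is closer to that of the input; the degree distribution is used here only as one plausible example of a node-level statistic, not as the paper's exact list of 10.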

Source journal

ACM Transactions on Knowledge Discovery from Data (COMPUTER SCIENCE, INFORMATION SYSTEMS; COMPUTER SCIENCE, SOFTWARE ENGINEERING)

CiteScore: 6.70

Self-citation rate: 5.60%

Articles published: 172

Review time: 3 months

About the journal: TKDD welcomes papers on a full range of research in the knowledge discovery and analysis of diverse forms of data. Such subjects include, but are not limited to: scalable and effective algorithms for data mining and big data analysis, mining brain networks, mining data streams, mining multi-media data, mining high-dimensional data, mining text, Web, and semi-structured data, mining spatial and temporal data, data mining for community generation, social network analysis, and graph structured data, security and privacy issues in data mining, visual, interactive and online data mining, pre-processing and post-processing for data mining, robust and scalable statistical methods, data mining languages, foundations of data mining, KDD framework and process, and novel applications and infrastructures exploiting data mining technology including massively parallel processing and cloud computing platforms. TKDD encourages papers that explore the above subjects in the context of large distributed networks of computers, parallel or multiprocessing computers, or new data devices. TKDD also encourages papers that describe emerging data mining applications that cannot be satisfied by the current data mining technology.