The VLDB Journal | Pub Date: 2024-09-16 | DOI: 10.1007/s00778-024-00875-8
Yiran Li, Gongyao Guo, Jieming Shi, Renchi Yang, Shiqi Shen, Qing Li, Jun Luo
{"title":"A versatile framework for attributed network clustering via K-nearest neighbor augmentation","authors":"Yiran Li, Gongyao Guo, Jieming Shi, Renchi Yang, Shiqi Shen, Qing Li, Jun Luo","doi":"10.1007/s00778-024-00875-8","DOIUrl":"https://doi.org/10.1007/s00778-024-00875-8","url":null,"abstract":"<p>Attributed networks containing entity-specific information in node attributes are ubiquitous in modeling social networks, e-commerce, bioinformatics, etc. Their inherent network topology ranges from simple graphs to hypergraphs with high-order interactions and multiplex graphs with separate layers. An important graph mining task is node clustering, aiming to partition the nodes of an attributed network into <i>k</i> disjoint clusters such that intra-cluster nodes are closely connected and share similar attributes, while inter-cluster nodes are far apart and dissimilar. It is highly challenging to capture multi-hop connections via nodes or attributes for effective clustering on multiple types of attributed networks. In this paper, we first present <span>AHCKA</span> as an efficient approach to <i>attributed hypergraph clustering</i> (AHC). <span>AHCKA</span> includes a carefully-crafted <i>K</i>-nearest neighbor augmentation strategy for the optimized exploitation of attribute information on hypergraphs, a joint hypergraph random walk model to devise an effective AHC objective, and an efficient solver with speedup techniques for the objective optimization. The proposed techniques are extensible to various types of attributed networks, and thus, we develop <span>ANCKA</span> as a versatile attributed network clustering framework, capable of <i>attributed graph clustering</i>, <i>attributed multiplex graph clustering</i>, and AHC. Moreover, we devise <span>ANCKA-GPU</span> with algorithmic designs tailored for GPU acceleration to boost efficiency. 
We have conducted extensive experiments to compare our methods with 19 competitors on 8 attributed hypergraphs, 16 competitors on 6 attributed graphs, and 16 competitors on 3 attributed multiplex graphs, all demonstrating the superb clustering quality and efficiency of our methods.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The VLDB Journal | Pub Date: 2024-08-24 | DOI: 10.1007/s00778-024-00871-y
Yizhang He, Kai Wang, Wenjie Zhang, Xuemin Lin, Ying Zhang
{"title":"Discovering critical vertices for reinforcement of large-scale bipartite networks","authors":"Yizhang He, Kai Wang, Wenjie Zhang, Xuemin Lin, Ying Zhang","doi":"10.1007/s00778-024-00871-y","DOIUrl":"https://doi.org/10.1007/s00778-024-00871-y","url":null,"abstract":"<p>Bipartite networks model relationships between two types of vertices and are prevalent in real-world applications. The departure of vertices in a bipartite network reduces the connections of other vertices, triggering their departures as well. This may lead to a breakdown of the bipartite network and undermine any downstream applications. Such cascading vertex departure can be captured by the <span>(α, β)</span>-core, a cohesive subgraph model on bipartite networks that maintains the minimum engagement levels of vertices. Based on the <span>(α, β)</span>-core, we aim to ensure the vertices are highly engaged with the bipartite network from two perspectives. (1) From a pre-emptive perspective, we study the anchored <span>(α, β)</span>-core problem, which aims to maximize the size of the <span>(α, β)</span>-core by including some “anchor” vertices. (2) From a defensive perspective, we study the collapsed <span>(α, β)</span>-core problem, which aims to identify the critical vertices whose departure can lead to the largest shrinkage of the <span>(α, β)</span>-core. We prove the NP-hardness of these problems and resort to heuristic algorithms that choose the best anchor/collapser iteratively under a filter-verification framework. Filter-stage optimizations are proposed to reduce “dominated” candidates and allow computation-sharing. In the verification stage, we select multiple candidates for improved efficiency.
Extensive experiments on 18 real-world datasets and a billion-scale synthetic dataset validate the effectiveness and efficiency of our proposed techniques.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"35 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142193154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
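To make the (α, β)-core model in the abstract above concrete, here is a minimal Python sketch of the usual peeling computation, assuming the common definition: the maximal subgraph in which every upper-layer vertex has degree at least α and every lower-layer vertex has degree at least β. It is not the authors' anchoring or collapsing algorithm, and the function name `alpha_beta_core` and the edge-list input are assumptions made for this illustration.

```python
from collections import defaultdict, deque


def alpha_beta_core(edges, alpha, beta):
    """Return (upper, lower) vertex sets of the (alpha, beta)-core.

    edges: iterable of (u, v) pairs with u in the upper layer and v in the
    lower layer (hypothetical input format chosen for this sketch).
    """
    adj_u = defaultdict(set)  # upper vertex -> lower neighbours
    adj_v = defaultdict(set)  # lower vertex -> upper neighbours
    for u, v in edges:
        adj_u[u].add(v)
        adj_v[v].add(u)

    removed_u, removed_v = set(), set()
    queue = deque()
    # Seed the queue with every vertex that already violates its threshold.
    for u, nbrs in adj_u.items():
        if len(nbrs) < alpha:
            removed_u.add(u)
            queue.append(("U", u))
    for v, nbrs in adj_v.items():
        if len(nbrs) < beta:
            removed_v.add(v)
            queue.append(("L", v))

    # Cascading departure: removing a vertex lowers its neighbours' degrees.
    while queue:
        side, x = queue.popleft()
        if side == "U":
            for v in adj_u[x]:
                adj_v[v].discard(x)
                if v not in removed_v and len(adj_v[v]) < beta:
                    removed_v.add(v)
                    queue.append(("L", v))
        else:
            for u in adj_v[x]:
                adj_u[u].discard(x)
                if u not in removed_u and len(adj_u[u]) < alpha:
                    removed_u.add(u)
                    queue.append(("U", u))

    return set(adj_u) - removed_u, set(adj_v) - removed_v
```

In the edge list `[("a", 1), ("a", 2), ("b", 1), ("b", 2), ("c", 2), ("c", 3)]` with α = β = 2, lower vertex 3 departs first, which then drags upper vertex c out of the core. This is the kind of cascade that the anchored and collapsed problems reason about.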
The VLDB Journal | Pub Date: 2024-08-21 | DOI: 10.1007/s00778-024-00874-9
Zeyu Wang, Qitong Wang, Peng Wang, Themis Palpanas, Wei Wang
{"title":"DumpyOS: A data-adaptive multi-ary index for scalable data series similarity search","authors":"Zeyu Wang, Qitong Wang, Peng Wang, Themis Palpanas, Wei Wang","doi":"10.1007/s00778-024-00874-9","DOIUrl":"https://doi.org/10.1007/s00778-024-00874-9","url":null,"abstract":"<p>Data series indexes are necessary for managing and analyzing the increasing amounts of data series collections that are nowadays available. These indexes support both exact and approximate similarity search, with approximate search providing high-quality results within milliseconds, which makes it very attractive for certain modern applications. Reducing the pre-processing (i.e., index building) time and improving the accuracy of search results are two major challenges. DSTree and the iSAX index family are state-of-the-art solutions for this problem. However, DSTree suffers from long index building times, while iSAX suffers from low search accuracy. In this paper, we identify two problems of the iSAX index family that adversely affect the overall performance. First, we observe the presence of a <i>proximity-compactness trade-off</i> related to the index structure design (i.e., the node fanout degree), significantly limiting the efficiency and accuracy of the resulting index. Second, a skewed data distribution will negatively affect the performance of iSAX. To overcome these problems, we propose Dumpy, an index that employs a novel multi-ary data structure with an adaptive node splitting algorithm and an efficient building workflow. Furthermore, we devise Dumpy-Fuzzy as a variant of Dumpy which further improves search accuracy by proper duplication of series. To fully leverage the potential of modern hardware including multicore CPUs and Solid State Drives (SSDs), we parallelize Dumpy to DumpyOS with sophisticated indexing and pruning-based querying algorithms. 
An optimized approximate search algorithm, DumpyOS-F, which markedly improves the search accuracy without violating the index, is also proposed. Experiments with a variety of large, real datasets demonstrate that the Dumpy solutions achieve considerably better efficiency, scalability and search accuracy than their competitors. DumpyOS further improves on Dumpy by delivering several times faster index building and querying, and DumpyOS-F improves the search accuracy of Dumpy-Fuzzy without the additional space cost of Dumpy-Fuzzy. This paper is an extension of the previously published SIGMOD paper [81].</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142193185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The VLDB Journal | Pub Date: 2024-08-07 | DOI: 10.1007/s00778-024-00873-w
Zhuochen Fan, Bowen Ye, Ziwei Wang, Zheng Zhong, Jiarui Guo, Yuhan Wu, Haoyu Li, Tong Yang, Yaofeng Tu, Zirui Liu, Bin Cui
{"title":"Enabling space-time efficient range queries with REncoder","authors":"Zhuochen Fan, Bowen Ye, Ziwei Wang, Zheng Zhong, Jiarui Guo, Yuhan Wu, Haoyu Li, Tong Yang, Yaofeng Tu, Zirui Liu, Bin Cui","doi":"10.1007/s00778-024-00873-w","DOIUrl":"https://doi.org/10.1007/s00778-024-00873-w","url":null,"abstract":"<p>A range filter is a data structure to answer range membership queries. Range queries are common in modern applications, and range filters have gained rising attention for improving the performance of range queries by ruling out empty range queries. However, state-of-the-art range filters, such as SuRF and Rosetta, suffer from either a high false positive rate or low throughput. In this paper, we propose a novel range filter, called REncoder. It organizes all prefixes of keys into a segment tree, and locally encodes the segment tree into a Bloom filter to accelerate queries. REncoder supports diverse workloads by adaptively choosing how many levels of the segment tree to store. In addition, we propose a customized blacklist optimization for it to further improve the accuracy of multi-round queries. We theoretically prove that the error of REncoder is bounded and derive the asymptotic space complexity under the bounded error. We conduct extensive experiments on both synthetic datasets and real datasets.
The experimental results show that REncoder outperforms all state-of-the-art range filters, and the proposed blacklist optimization can effectively further reduce the false positive rate.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"52 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141948995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
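The idea of answering range membership queries from stored key prefixes, as outlined in the abstract above, can be sketched as follows. This toy Python example treats each binary prefix of a fixed-width integer key as one segment-tree node and stores the nodes in a plain `set`, which stands in for a Bloom filter and therefore never yields false positives. It does not implement REncoder's local encoding, adaptive level selection, or blacklist; the class name `PrefixRangeFilter` and its parameters are assumptions made for this illustration.

```python
class PrefixRangeFilter:
    """Toy range filter over unsigned integer keys of a fixed bit width."""

    def __init__(self, keys, bits=8):
        self.bits = bits
        self.prefixes = set()  # stand-in for a Bloom filter (assumption)
        for k in keys:
            for length in range(1, bits + 1):
                # (length, top `length` bits of k) names one segment-tree node.
                self.prefixes.add((length, k >> (bits - length)))

    def may_contain_range(self, lo, hi):
        """Return True if some stored key may lie in the closed range [lo, hi]."""
        return self._visit(0, 0, lo, hi)

    def _visit(self, length, prefix, lo, hi):
        # The node (length, prefix) covers the keys node_lo..node_hi.
        shift = self.bits - length
        node_lo = prefix << shift
        node_hi = node_lo + (1 << shift) - 1
        if node_hi < lo or node_lo > hi:
            return False  # node does not overlap the query range
        if length == 0:
            present = bool(self.prefixes)  # the root exists iff any key exists
        else:
            present = (length, prefix) in self.prefixes
        if not present:
            return False  # no key under this node, so prune the subtree
        if lo <= node_lo and node_hi <= hi:
            return True  # node lies inside the range and holds a key
        # Partial overlap: descend into the two children.
        return (self._visit(length + 1, prefix << 1, lo, hi)
                or self._visit(length + 1, (prefix << 1) | 1, lo, hi))
```

With keys 5 and 200 stored, a query for [6, 199] is rejected without touching the underlying data, which is the "ruling out empty range queries" role the abstract describes. Replacing the `set` with a real Bloom filter would trade exactness for space and introduce false positives.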
The VLDB Journal | Pub Date: 2024-07-30 | DOI: 10.1007/s00778-024-00872-x
Xinle Wu, Xingjian Wu, Bin Yang, Lekui Zhou, Chenjuan Guo, Xiangfei Qiu, Jilin Hu, Zhenli Sheng, Christian S. Jensen
{"title":"AutoCTS++: zero-shot joint neural architecture and hyperparameter search for correlated time series forecasting","authors":"Xinle Wu, Xingjian Wu, Bin Yang, Lekui Zhou, Chenjuan Guo, Xiangfei Qiu, Jilin Hu, Zhenli Sheng, Christian S. Jensen","doi":"10.1007/s00778-024-00872-x","DOIUrl":"https://doi.org/10.1007/s00778-024-00872-x","url":null,"abstract":"<p>Sensors in cyber-physical systems often capture interconnected processes and thus emit correlated time series (CTS), the forecasting of which enables important applications. Recent deep learning based forecasting methods show strong capabilities at capturing both the temporal dynamics of time series and the spatial correlations among time series, thus achieving impressive accuracy. In particular, automated CTS forecasting, where a deep learning architecture is configured automatically, enables forecasting accuracy that surpasses what has been achieved by manual approaches. However, automated CTS forecasting remains in its infancy, as existing proposals are only able to find optimal architectures for predefined hyperparameters and for specific datasets and forecasting settings (e.g., short vs. long term forecasting). These limitations hinder real-world industrial application, where forecasting faces diverse datasets and forecasting settings. We propose AutoCTS++, a zero-shot, joint search framework, to efficiently configure effective CTS forecasting models (including both neural architectures and hyperparameters), even when facing unseen datasets and forecasting settings. Specifically, we propose an architecture-hyperparameter joint search space by encoding a candidate architecture and its accompanying hyperparameters into a graph representation. We then introduce a zero-shot Task-aware Architecture-Hyperparameter Comparator (T-AHC) to rank architecture-hyperparameter pairs according to different tasks (i.e., datasets and forecasting settings).
We propose zero-shot means to train T-AHC, enabling it to rank architecture-hyperparameter pairs given unseen datasets and forecasting settings. A final forecasting model is then selected from the top-ranked pairs. Extensive experiments involving multiple benchmark datasets and forecasting settings demonstrate that AutoCTS++ is able to efficiently devise forecasting models for unseen datasets and forecasting settings that are capable of outperforming existing manually designed and automated models.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141872488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The VLDB Journal | Pub Date: 2024-07-29 | DOI: 10.1007/s00778-024-00869-6
Zirui Liu, Fenghao Dong, Chengwu Liu, Xiangwei Deng, Tong Yang, Yikai Zhao, Jizhou Li, Bin Cui, Gong Zhang
{"title":"WavingSketch: an unbiased and generic sketch for finding top-k items in data streams","authors":"Zirui Liu, Fenghao Dong, Chengwu Liu, Xiangwei Deng, Tong Yang, Yikai Zhao, Jizhou Li, Bin Cui, Gong Zhang","doi":"10.1007/s00778-024-00869-6","DOIUrl":"https://doi.org/10.1007/s00778-024-00869-6","url":null,"abstract":"<p>Finding top-<i>k</i> items in data streams is a fundamental problem in data mining. Unbiased estimation is well acknowledged as an elegant and important property for top-<i>k</i> algorithms. In this paper, we propose a novel sketch algorithm, called WavingSketch, which is more accurate than existing unbiased algorithms. We theoretically prove that WavingSketch can provide unbiased estimation, and derive its error bound. WavingSketch is generic to measurement tasks, and we apply it to five applications: finding top-<i>k</i> frequent items, finding top-<i>k</i> heavy changes, finding top-<i>k</i> persistent items, finding top-<i>k</i> Super-Spreaders, and join-aggregate estimation. Our experimental results show that, compared with the state-of-the-art Unbiased Space-Saving, WavingSketch achieves <span>10×</span> faster speed and <span>10<sup>3</sup>×</span> smaller error on finding frequent items. For other applications, WavingSketch also achieves higher accuracy and faster speed. All related code is open-sourced on GitHub (https://github.com/WavingSketch/Waving-Sketch).</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141872410","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The VLDB Journal | Pub Date: 2024-07-26 | DOI: 10.1007/s00778-024-00865-w
Zhiyu Liang, Hongzhi Wang
{"title":"FedST: secure federated shapelet transformation for time series classification","authors":"Zhiyu Liang, Hongzhi Wang","doi":"10.1007/s00778-024-00865-w","DOIUrl":"https://doi.org/10.1007/s00778-024-00865-w","url":null,"abstract":"<p>This paper explores how to build a shapelet-based time series classification (TSC) model in the federated learning (FL) scenario, that is, using more data from multiple owners without actually sharing the data. We propose FedST, a novel federated TSC framework extended from a centralized shapelet transformation method. We recognize the federated shapelet search step as the kernel of FedST. Thus, we design a basic protocol for the FedST kernel that we prove to be secure and accurate. However, we identify that the basic protocol suffers from efficiency bottlenecks and the centralized acceleration techniques lose their efficacy due to the security issues. To speed up the federated protocol with security guarantee, we propose several optimizations tailored for the FL setting. Our theoretical analysis shows that the proposed methods are secure and more efficient. We conduct extensive experiments using both synthetic and real-world datasets. Empirical results show that our FedST solution is effective in terms of TSC accuracy, and the proposed optimizations can achieve three orders of magnitude of speedup.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141780662","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The VLDB Journal | Pub Date: 2024-07-22 | DOI: 10.1007/s00778-024-00870-z
Xingyi Zhang, Jinchao Huang, Fangyuan Zhang, Sibo Wang
{"title":"FICOM: an effective and scalable active learning framework for GNNs on semi-supervised node classification","authors":"Xingyi Zhang, Jinchao Huang, Fangyuan Zhang, Sibo Wang","doi":"10.1007/s00778-024-00870-z","DOIUrl":"https://doi.org/10.1007/s00778-024-00870-z","url":null,"abstract":"<p>Active learning for graph neural networks (GNNs) aims to select <i>B</i> nodes to label for the best possible GNN performance. Carefully selected labeled nodes can help improve GNN performance and hence motivate a line of research. Unfortunately, existing methods still provide inferior GNN performance or cannot scale to large networks. Motivated by these limitations, in this paper, we present <i>FICOM</i>, an effective and scalable GNN active learning framework. Firstly, we formulate the node selection as an optimization problem where we consider the importance of a node from (i) the importance of a node during the feature propagation with a connection to the personalized PageRank (PPR), and (ii) the diversity a node brings in the embedding space generated by feature propagation. We show that the defined problem is submodular, and a greedy solution can provide a <span>(1-1/e)</span>-approximate solution. However, a standard greedy solution requires getting the node with the maximum marginal gain of the objective score in each iteration, which incurs a prohibitive running cost and cannot scale to large datasets. As our main contribution, we present FICOM, an efficient and scalable solution that provides a <span>(1-1/e)</span>-approximation guarantee and scales to graphs with millions of nodes on a single machine. The main idea is that we adaptively maintain lower and upper bounds on the marginal gain for each node <i>v</i>. In each iteration, we can first derive a small subset of candidate nodes and then compute the exact score for this subset of candidate nodes so that we can find the node with the maximum marginal gain efficiently.
Extensive experiments on six benchmark datasets using four GNNs, including GCN, SGC, APPNP, and GCNII, show that our FICOM consistently outperforms existing active learning approaches on semi-supervised node classification tasks using different GNNs. Moreover, our solution can finish within 5 h on a million-node graph.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141780565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
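The abstract above relies on two general facts: greedy selection gives a (1-1/e)-approximation for monotone submodular objectives, and bounds on marginal gains let most candidates be skipped in each iteration. The classic lazy-greedy technique illustrates both, and the Python sketch below applies it to a simple set-coverage objective. This is not FICOM's bounding scheme or its PPR-based objective; the function name `lazy_greedy_cover` and the coverage objective are stand-ins assumed for this illustration.

```python
import heapq


def lazy_greedy_cover(candidates, budget):
    """Greedily pick up to `budget` nodes maximizing the number of covered items.

    candidates: dict mapping a node (comparable, e.g. str) to the set of items
    it covers. Coverage is monotone submodular, so a gain computed in an
    earlier round is a valid upper bound on the current marginal gain.
    """
    covered = set()
    selected = []
    # Max-heap via negated gains; the third field records the round in which
    # the gain was computed (round = number of nodes selected so far).
    heap = [(-len(items), node, 0) for node, items in candidates.items()]
    heapq.heapify(heap)

    while heap and len(selected) < budget:
        neg_gain, node, stamp = heapq.heappop(heap)
        if stamp == len(selected):
            # The gain is exact and no other entry's upper bound exceeds it.
            if neg_gain == 0:
                break  # nothing left to gain
            selected.append(node)
            covered |= candidates[node]
        else:
            # Stale upper bound: recompute the exact marginal gain, reinsert.
            gain = len(candidates[node] - covered)
            heapq.heappush(heap, (-gain, node, len(selected)))
    return selected, covered
```

For `{"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}` with a budget of 2, the sketch selects `c` and then `a`, and never recomputes the gains of `b` or `d` because their stale upper bounds are already below the exact gain of `a`.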
The VLDB Journal | Pub Date: 2024-07-18 | DOI: 10.1007/s00778-024-00866-9
Hong Lin, Ke Chen, Dawei Jiang, Lidan Shou, Gang Chen
{"title":"Correction to: “Refiner: a reliable and efficient incentive-driven federated learning system powered by blockchain”","authors":"Hong Lin, Ke Chen, Dawei Jiang, Lidan Shou, Gang Chen","doi":"10.1007/s00778-024-00866-9","DOIUrl":"https://doi.org/10.1007/s00778-024-00866-9","url":null,"abstract":"","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":" 43","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141825444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The VLDB Journal | Pub Date: 2024-07-15 | DOI: 10.1007/s00778-024-00864-x
James Jie Pan, Jianguo Wang, Guoliang Li
{"title":"Survey of vector database management systems","authors":"James Jie Pan, Jianguo Wang, Guoliang Li","doi":"10.1007/s00778-024-00864-x","DOIUrl":"https://doi.org/10.1007/s00778-024-00864-x","url":null,"abstract":"<p>There are now over 20 commercial vector database management systems (VDBMSs), all produced within the past five years. But embedding-based retrieval has been studied for over ten years, and similarity search a staggering half century and more. Driving this shift from algorithms to systems are new data intensive applications, notably large language models, that demand vast stores of unstructured data coupled with reliable, secure, fast, and scalable query processing capability. A variety of new data management techniques now exist for addressing these needs; however, there is no comprehensive survey that thoroughly reviews these techniques and systems. We start by identifying five main obstacles to vector data management, namely the ambiguity of semantic similarity, large size of vectors, high cost of similarity comparison, lack of structural properties that can be used for indexing, and difficulty of efficiently answering “hybrid” queries that jointly search both attributes and vectors. Overcoming these obstacles has led to new approaches to query processing, storage and indexing, and query optimization and execution. For query processing, a variety of similarity scores and query types are now well understood; for storage and indexing, techniques include vector compression, namely quantization, and partitioning techniques based on randomization, learned partitioning, and “navigable” partitioning; for query optimization and execution, we describe new operators for hybrid queries, as well as techniques for plan enumeration, plan selection, distributed query processing, data manipulation queries, and hardware accelerated query execution.
These techniques lead to a variety of VDBMSs across a spectrum of design and runtime characteristics, including “native” systems that are specialized for vectors and “extended” systems that incorporate vector capabilities into existing systems. We then discuss benchmarks, and finally outline research challenges and point the direction for future work.</p>","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"37 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141719154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}