The VLDB Journal最新文献

A versatile framework for attributed network clustering via K-nearest neighbor augmentation 通过 K 近邻增强实现属性网络聚类的多功能框架

The VLDB Journal Pub Date : 2024-09-16 DOI: 10.1007/s00778-024-00875-8

Yiran Li, Gongyao Guo, Jieming Shi, Renchi Yang, Shiqi Shen, Qing Li, Jun Luo

{"title":"A versatile framework for attributed network clustering via K-nearest neighbor augmentation","authors":"Yiran Li, Gongyao Guo, Jieming Shi, Renchi Yang, Shiqi Shen, Qing Li, Jun Luo","doi":"10.1007/s00778-024-00875-8","DOIUrl":"https://doi.org/10.1007/s00778-024-00875-8","url":null,"abstract":"Attributed networks containing entity-specific information in node attributes are ubiquitous in modeling social networks, e-commerce, bioinformatics, etc. Their inherent network topology ranges from simple graphs to hypergraphs with high-order interactions and multiplex graphs with separate layers. An important graph mining task is node clustering, aiming to partition the nodes of an attributed network into k disjoint clusters such that intra-cluster nodes are closely connected and share similar attributes, while inter-cluster nodes are far apart and dissimilar. It is highly challenging to capture multi-hop connections via nodes or attributes for effective clustering on multiple types of attributed networks. In this paper, we first present AHCKA as an efficient approach to attributed hypergraph clustering (AHC). AHCKA includes a carefully-crafted K-nearest neighbor augmentation strategy for the optimized exploitation of attribute information on hypergraphs, a joint hypergraph random walk model to devise an effective AHC objective, and an efficient solver with speedup techniques for the objective optimization. The proposed techniques are extensible to various types of attributed networks, and thus, we develop ANCKA as a versatile attributed network clustering framework, capable of attributed graph clustering, attributed multiplex graph clustering, and AHC. Moreover, we devise ANCKA-GPU with algorithmic designs tailored for GPU acceleration to boost efficiency. We have conducted extensive experiments to compare our methods with 19 competitors on 8 attributed hypergraphs, 16 competitors on 6 attributed graphs, and 16 competitors on 3 attributed multiplex graphs, all demonstrating the superb clustering quality and efficiency of our methods.","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Discovering critical vertices for reinforcement of large-scale bipartite networks 发现关键顶点，强化大规模双方位网络

The VLDB Journal Pub Date : 2024-08-24 DOI: 10.1007/s00778-024-00871-y

Yizhang He, Kai Wang, Wenjie Zhang, Xuemin Lin, Ying Zhang

{"title":"Discovering critical vertices for reinforcement of large-scale bipartite networks","authors":"Yizhang He, Kai Wang, Wenjie Zhang, Xuemin Lin, Ying Zhang","doi":"10.1007/s00778-024-00871-y","DOIUrl":"https://doi.org/10.1007/s00778-024-00871-y","url":null,"abstract":"Bipartite networks model relationships between two types of vertices and are prevalent in real-world applications. The departure of vertices in a bipartite network reduces the connections of other vertices, triggering their departures as well. This may lead to a breakdown of the bipartite network and undermine any downstream applications. Such cascading vertex departure can be captured by ((alpha ,beta ))-core, a cohesive subgraph model on bipartite networks that maintains the minimum engagement levels of vertices. Based on ((alpha ,beta ))-core, we aim to ensure the vertices are highly engaged with the bipartite network from two perspectives. (1) From a pre-emptive perspective, we study the anchored ((alpha ,beta ))-core problem, which aims to maximize the size of the ((alpha ,beta ))-core by including some “anchor” vertices. (2) From a defensive perspective, we study the collapsed ((alpha ,beta ))-core problem, which aims to identify the critical vertices whose departure can lead to the largest shrink of the ((alpha ,beta ))-core. We prove the NP-hardness of these problems and resort to heuristic algorithms that choose the best anchor/collapser iteratively under a filter-verification framework. Filter-stage optimizations are proposed to reduce “dominated” candidates and allow computation-sharing. In the verification stage, we select multiple candidates for improved efficiency. Extensive experiments on 18 real-world datasets and a billion-scale synthetic dataset validate the effectiveness and efficiency of our proposed techniques.","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"35 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142193154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

DumpyOS: A data-adaptive multi-ary index for scalable data series similarity search DumpyOS：用于可扩展数据序列相似性搜索的数据自适应多ary索引

The VLDB Journal Pub Date : 2024-08-21 DOI: 10.1007/s00778-024-00874-9

Zeyu Wang, Qitong Wang, Peng Wang, Themis Palpanas, Wei Wang

{"title":"DumpyOS: A data-adaptive multi-ary index for scalable data series similarity search","authors":"Zeyu Wang, Qitong Wang, Peng Wang, Themis Palpanas, Wei Wang","doi":"10.1007/s00778-024-00874-9","DOIUrl":"https://doi.org/10.1007/s00778-024-00874-9","url":null,"abstract":"Data series indexes are necessary for managing and analyzing the increasing amounts of data series collections that are nowadays available. These indexes support both exact and approximate similarity search, with approximate search providing high-quality results within milliseconds, which makes it very attractive for certain modern applications. Reducing the pre-processing (i.e., index building) time and improving the accuracy of search results are two major challenges. DSTree and the iSAX index family are state-of-the-art solutions for this problem. However, DSTree suffers from long index building times, while iSAX suffers from low search accuracy. In this paper, we identify two problems of the iSAX index family that adversely affect the overall performance. First, we observe the presence of a proximity-compactness trade-off related to the index structure design (i.e., the node fanout degree), significantly limiting the efficiency and accuracy of the resulting index. Second, a skewed data distribution will negatively affect the performance of iSAX. To overcome these problems, we propose Dumpy, an index that employs a novel multi-ary data structure with an adaptive node splitting algorithm and an efficient building workflow. Furthermore, we devise Dumpy-Fuzzy as a variant of Dumpy which further improves search accuracy by proper duplication of series. To fully leverage the potential of modern hardware including multicore CPUs and Solid State Drives (SSDs), we parallelize Dumpy to DumpyOS with sophisticated indexing and pruning-based querying algorithms. An optimized approximate search algorithm, DumpyOS-F that prominently improves the search accuracy without violating the index, is also proposed. Experiments with a variety of large, real datasets demonstrate that the Dumpy solutions achieve considerably better efficiency, scalability and search accuracy than its competitors. DumpyOS further improves on Dumpy, by delivering several times faster index building and querying, and DumpyOS-F improves the search accuracy of Dumpy-Fuzzy without the additional space cost of Dumpy-Fuzzy. This paper is an extension of the previously published SIGMOD paper [81].","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142193185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Enabling space-time efficient range queries with REncoder 利用 REncoder 实现时空高效范围查询

The VLDB Journal Pub Date : 2024-08-07 DOI: 10.1007/s00778-024-00873-w

Zhuochen Fan, Bowen Ye, Ziwei Wang, Zheng Zhong, Jiarui Guo, Yuhan Wu, Haoyu Li, Tong Yang, Yaofeng Tu, Zirui Liu, Bin Cui

{"title":"Enabling space-time efficient range queries with REncoder","authors":"Zhuochen Fan, Bowen Ye, Ziwei Wang, Zheng Zhong, Jiarui Guo, Yuhan Wu, Haoyu Li, Tong Yang, Yaofeng Tu, Zirui Liu, Bin Cui","doi":"10.1007/s00778-024-00873-w","DOIUrl":"https://doi.org/10.1007/s00778-024-00873-w","url":null,"abstract":"A range filter is a data structure to answer range membership queries. Range queries are common in modern applications, and range filters have gained rising attention for improving the performance of range queries by ruling out empty range queries. However, state-of-the-art range filters, such as SuRF and Rosetta, suffer either high false positive rate or low throughput. In this paper, we propose a novel range filter, called REncoder. It organizes all prefixes of keys into a segment tree, and locally encodes the segment tree into a Bloom filter to accelerate queries. REncoder supports diverse workloads by adaptively choosing how many levels of the segment tree to store. In addition, we also propose a customized blacklist optimization for it to further improve the accuracy of multi-round queries. We theoretically prove that the error of REncoder is bounded and derive the asymptotic space complexity under the bounded error. We conduct extensive experiments on both synthetic datasets and real datasets. The experimental results show that REncoder outperforms all state-of-the-art range filters, and the proposed blacklist optimization can effectively further reduce the false positive rate.\u0000","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"52 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-08-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141948995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

AutoCTS++: zero-shot joint neural architecture and hyperparameter search for correlated time series forecasting AutoCTS++：用于相关时间序列预测的零点联合神经架构和超参数搜索

The VLDB Journal Pub Date : 2024-07-30 DOI: 10.1007/s00778-024-00872-x

Xinle Wu, Xingjian Wu, Bin Yang, Lekui Zhou, Chenjuan Guo, Xiangfei Qiu, Jilin Hu, Zhenli Sheng, Christian S. Jensen

{"title":"AutoCTS++: zero-shot joint neural architecture and hyperparameter search for correlated time series forecasting","authors":"Xinle Wu, Xingjian Wu, Bin Yang, Lekui Zhou, Chenjuan Guo, Xiangfei Qiu, Jilin Hu, Zhenli Sheng, Christian S. Jensen","doi":"10.1007/s00778-024-00872-x","DOIUrl":"https://doi.org/10.1007/s00778-024-00872-x","url":null,"abstract":"Sensors in cyber-physical systems often capture interconnected processes and thus emit correlated time series (CTS), the forecasting of which enables important applications. Recent deep learning based forecasting methods show strong capabilities at capturing both the temporal dynamics of time series and the spatial correlations among time series, thus achieving impressive accuracy. In particular, automated CTS forecasting, where a deep learning architecture is configured automatically, enables forecasting accuracy that surpasses what has been achieved by manual approaches. However, automated CTS forecasting remains in its infancy, as existing proposals are only able to find optimal architectures for predefined hyperparameters and for specific datasets and forecasting settings (e.g., short vs. long term forecasting). These limitations hinder real-world industrial application, where forecasting faces diverse datasets and forecasting settings. We propose AutoCTS++, a zero-shot, joint search framework, to efficiently configure effective CTS forecasting models (including both neural architectures and hyperparameters), even when facing unseen datasets and foreacsting settings. Specifically, we propose an architecture-hyperparameter joint search space by encoding candidate architecture and accompanying hyperparameters into a graph representation. We then introduce a zero-shot Task-aware Architecture-Hyperparameter Comparator (T-AHC) to rank architecture-hyperparameter pairs according to different tasks (i.e., datasets and forecasting settings). We propose zero-shot means to train T-AHC, enabling it to rank architecture-hyperparameter pairs given unseen datasets and forecasting settings. A final forecasting model is then selected from the top-ranked pairs. Extensive experiments involving multiple benchmark datasets and forecasting settings demonstrate that AutoCTS++ is able to efficiently devise forecasting models for unseen datasets and forecasting settings that are capable of outperforming existing manually designed and automated models.","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141872488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

WavingSketch: an unbiased and generic sketch for finding top-k items in data streams WavingSketch：用于在数据流中查找前 k 项的无偏通用草图

The VLDB Journal Pub Date : 2024-07-29 DOI: 10.1007/s00778-024-00869-6

Zirui Liu, Fenghao Dong, Chengwu Liu, Xiangwei Deng, Tong Yang, Yikai Zhao, Jizhou Li, Bin Cui, Gong Zhang

引用次数: 0

FedST: secure federated shapelet transformation for time series classification FedST：用于时间序列分类的安全联合小形变换

The VLDB Journal Pub Date : 2024-07-26 DOI: 10.1007/s00778-024-00865-w

Zhiyu Liang, Hongzhi Wang

引用次数: 0

FICOM: an effective and scalable active learning framework for GNNs on semi-supervised node classification FICOM：半监督节点分类中有效且可扩展的 GNN 主动学习框架

The VLDB Journal Pub Date : 2024-07-22 DOI: 10.1007/s00778-024-00870-z

Xingyi Zhang, Jinchao Huang, Fangyuan Zhang, Sibo Wang

{"title":"FICOM: an effective and scalable active learning framework for GNNs on semi-supervised node classification","authors":"Xingyi Zhang, Jinchao Huang, Fangyuan Zhang, Sibo Wang","doi":"10.1007/s00778-024-00870-z","DOIUrl":"https://doi.org/10.1007/s00778-024-00870-z","url":null,"abstract":"Active learning for graph neural networks (GNNs) aims to select B nodes to label for the best possible GNN performance. Carefully selected labeled nodes can help improve GNN performance and hence motivates a line of research works. Unfortunately, existing methods still provide inferior GNN performance or cannot scale to large networks.Motivated by these limitations, in this paper, we present FICOM, an effective and scalable GNN active learning framework. Firstly, we formulate the node selection as an optimization problem where we consider the importance of a node from (i) the importance of a node during the feature propagation with a connection to the personalized PageRank (PPR), and (ii) the diversity of a node brings in the embedding space generated by feature propagation. We show that the defined problem is submodular, and a greedy solution can provide a ((1-1/e))-approximate solution.However, a standard greedy solution requires getting the node with the maximum marginal gain of the objective score in each iteration, which incurs a prohibitive running cost and cannot scale to large datasets. As our main contribution, we present FICOM, an efficient and scalable solution that provides ((1-1/e))-approximation guarantee and scales to graphs with millions of nodes on a single machine. The main idea is that we adaptively maintain the lower- and upper-bound of the marginal gain for each node v. In each iteration, we can first derive a small subset of candidate nodes and then compute the exact score for this subset of candidate nodes so that we can find the node with the maximum marginal gain efficiently. Extensive experiments on six benchmark datasets using four GNNs, including GCN, SGC, APPNP, and GCNII, show that our FICOM consistently outperforms existing active learning approaches on semi-supervised node classification tasks using different GNNs. Moreover, our solution can finish within 5 h on a million-node graph.","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141780565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Correction to: “Refiner: a reliable and efficient incentive-driven federated learning system powered by blockchain” 更正："Refiner：由区块链驱动的可靠高效的激励驱动联合学习系统"

The VLDB Journal Pub Date : 2024-07-18 DOI: 10.1007/s00778-024-00866-9

Hong Lin, Ke Chen, Dawei Jiang, Lidan Shou, Gang Chen

引用次数: 0

Survey of vector database management systems 矢量数据库管理系统调查

The VLDB Journal Pub Date : 2024-07-15 DOI: 10.1007/s00778-024-00864-x

James Jie Pan, Jianguo Wang, Guoliang Li

{"title":"Survey of vector database management systems","authors":"James Jie Pan, Jianguo Wang, Guoliang Li","doi":"10.1007/s00778-024-00864-x","DOIUrl":"https://doi.org/10.1007/s00778-024-00864-x","url":null,"abstract":"There are now over 20 commercial vector database management systems (VDBMSs), all produced within the past five years. But embedding-based retrieval has been studied for over ten years, and similarity search a staggering half century and more. Driving this shift from algorithms to systems are new data intensive applications, notably large language models, that demand vast stores of unstructured data coupled with reliable, secure, fast, and scalable query processing capability. A variety of new data management techniques now exist for addressing these needs, however there is no comprehensive survey to thoroughly review these techniques and systems. We start by identifying five main obstacles to vector data management, namely the ambiguity of semantic similarity, large size of vectors, high cost of similarity comparison, lack of structural properties that can be used for indexing, and difficulty of efficiently answering “hybrid” queries that jointly search both attributes and vectors. Overcoming these obstacles has led to new approaches to query processing, storage and indexing, and query optimization and execution. For query processing, a variety of similarity scores and query types are now well understood; for storage and indexing, techniques include vector compression, namely quantization, and partitioning techniques based on randomization, learned partitioning, and “navigable” partitioning; for query optimization and execution, we describe new operators for hybrid queries, as well as techniques for plan enumeration, plan selection, distributed query processing, data manipulation queries, and hardware accelerated query execution. These techniques lead to a variety of VDBMSs across a spectrum of design and runtime characteristics, including “native” systems that are specialized for vectors and “extended” systems that incorporate vector capabilities into existing systems. We then discuss benchmarks, and finally outline research challenges and point the direction for future work.","PeriodicalId":501532,"journal":{"name":"The VLDB Journal","volume":"37 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141719154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0