PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Yanli Zhao, A. Gu, R. Varma, Liangchen Luo, Chien-chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, Alban Desmaison, Can Balioglu, Bernard Nguyen, Geeta Chauhan, Y. Hao, Shen Li
Proc. VLDB Endow. Pub Date: 2023-04-21. DOI: 10.48550/arXiv.2304.11277
Abstract: It is widely acknowledged that large models have the potential to deliver superior performance across a broad range of domains. Despite the remarkable progress made in the field of machine learning systems research, which has enabled the development and exploration of large models, such abilities remain confined to a small group of advanced users and industry leaders, resulting in an implicit technical barrier for the wider community to access and leverage these technologies. In this paper, we introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training. FSDP has been closely co-designed with several key PyTorch core components including Tensor implementation, dispatcher system, and CUDA memory caching allocator, to provide non-intrusive user experiences and high training efficiency. Additionally, FSDP natively incorporates a range of techniques and settings to optimize resource utilization across a variety of hardware configurations. The experimental results demonstrate that FSDP is capable of achieving comparable performance to Distributed Data Parallel while providing support for significantly larger models with near-linear scalability in terms of TFLOPS.
Similarity search in the blink of an eye with compressed indices
C. Aguerrebere, Ishwar Bhati, Mark Hildebrand, Mariano Tepper, Ted L. Willke
Proc. VLDB Endow. Pub Date: 2023-04-07. DOI: 10.48550/arXiv.2304.04759
Abstract: Nowadays, data is represented by vectors. Retrieving, from among millions or billions of vectors, those that are similar to a given query is a ubiquitous problem known as similarity search, relevant to a wide range of applications. Graph-based indices are currently the best performing techniques for billion-scale similarity search. However, their random-access memory pattern presents challenges to realizing their full potential. In this work, we present new techniques and systems for creating faster and smaller graph-based indices. To this end, we introduce a novel vector compression method, Locally-adaptive Vector Quantization (LVQ), that uses per-vector scaling and scalar quantization to improve search performance with fast similarity computations and a reduced effective bandwidth, while decreasing memory footprint and barely impacting accuracy. LVQ, when combined with a new high-performance computing system for graph-based similarity search, establishes the new state of the art in terms of performance and memory footprint. For billions of vectors, LVQ outcompetes the second-best alternatives: (1) in the low-memory regime, by up to 20.7x in throughput with up to a 3x memory footprint reduction, and (2) in the high-throughput regime, by 5.8x with 1.4x less memory.
{"title":"Fine-Grained Re-Execution for Efficient Batched Commit of Distributed Transactions","authors":"Zhiyuan Dong, Zhaoguo Wang, Xiaodong Zhang, Xian Xu, Changgeng Zhao, Haibo Chen, Aurojit Panda, Jinyang Li","doi":"10.14778/3594512.3594523","DOIUrl":"https://doi.org/10.14778/3594512.3594523","url":null,"abstract":"Distributed transaction systems incur extensive cross-node communication to execute and commit serializable OLTP transactions. As a result, their performance greatly suffers. Caching data at nodes that execute transactions can cut down remote reads. Batching transactions for validation and persistence can amortize the communication cost during committing. However, caching and batching can significantly increase the likelihood of conflicts, causing expensive aborts.\u0000 \u0000 In this paper, we develop Hackwrench to address the challenge of caching and batching. Instead of aborting conflicted transactions, Hackwrench tries to repair them using\u0000 fine-grained re-execution\u0000 by tracking the dependencies of operations among a batch of transactions. Tracked dependencies allow Hackwrench to selectively invalidate and re-execute only those operations necessary to \"fix\" the conflict, which is cheaper than aborting and executing an entire batch of transactions. Evaluations using TPC-C and other micro-benchmarks show that Hackwrench can outperform existing commercial and research systems including FoundationDB, Calvin, COCO, and Sundial under comparable settings.\u0000","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82550975","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Benchmarking the Utility of w-event Differential Privacy Mechanisms - When Baselines Become Mighty Competitors","authors":"Christine Schäler, Thomas Hütter, Martin Schäler","doi":"10.14778/3594512.3594515","DOIUrl":"https://doi.org/10.14778/3594512.3594515","url":null,"abstract":"\u0000 The\u0000 w\u0000 -event framework is the current standard for ensuring differential privacy on continuously monitored data streams. Following the proposition of\u0000 w\u0000 -event differential privacy, various mechanisms to implement the framework are proposed. Their comparability in empirical studies is vital for both practitioners to choose a suitable mechanism, and researchers to identify current limitations and propose novel mechanisms. By conducting a literature survey, we observe that the results of existing studies are hardly comparable and partially intrinsically inconsistent.\u0000 \u0000 \u0000 To this end, we formalize an empirical study of\u0000 w\u0000 -event mechanisms by re-occurring elements found in our survey. We introduce requirements on these elements that ensure the comparability of experimental results. Moreover, we propose a benchmark that meets all requirements and establishes a new way to evaluate existing and newly proposed mechanisms. Conducting a large-scale empirical study, we gain valuable new insights into the strengths and weaknesses of existing mechanisms. An unexpected - yet explainable - result is a baseline supremacy, i.e., using one of the two baseline mechanisms is expected to deliver good or even the best utility. Finally, we provide guidelines for practitioners to select suitable mechanisms and improvement options for researchers.\u0000","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78674664","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Computing Graph Edit Distance via Neural Graph Matching","authors":"Chengzhi Piao, Tingyang Xu, Xiangguo Sun, Yu Rong, Kangfei Zhao, Hongtao Cheng","doi":"10.14778/3594512.3594514","DOIUrl":"https://doi.org/10.14778/3594512.3594514","url":null,"abstract":"\u0000 Graph edit distance (GED) computation is a fundamental NP-hard problem in graph theory. Given a graph pair (\u0000 G\u0000 1\u0000 ,\u0000 G\u0000 2\u0000 ), GED is defined as the minimum number of primitive operations converting\u0000 G\u0000 1\u0000 to\u0000 G\u0000 2\u0000 . Early studies focus on search-based inexact algorithms such as A*-beam search, and greedy algorithms using bipartite matching due to its NP-hardness. They can obtain a sub-optimal solution by constructing an edit path (the sequence of operations that converts\u0000 G\u0000 1\u0000 to\u0000 G\u0000 2\u0000 ). Recent studies convert the GED between a given graph pair (\u0000 G\u0000 1\u0000 ,\u0000 G\u0000 2\u0000 ) into a similarity score in the range (0, 1) by a well designed function. Then machine learning models (mostly based on graph neural networks) are applied to predict the similarity score. They achieve a much higher numerical precision than the sub-optimal solutions found by classical algorithms. However, a major limitation is that these machine learning models cannot generate an edit path. They treat the GED computation as a pure regression task to bypass its intrinsic complexity, but ignore the essential task of converting\u0000 G\u0000 1\u0000 to\u0000 G\u0000 2\u0000 . This severely limits the interpretability and usability of the solution.\u0000 \u0000 \u0000 In this paper, we propose a novel deep learning framework that solves the GED problem in a two-step manner: 1) The proposed graph neural network GEDGNN is in charge of predicting the GED value and a matching matrix; and 2) A post-processing algorithm based on\u0000 k\u0000 -best matching is used to derive\u0000 k\u0000 possible node matchings from the matching matrix generated by GEDGNN. The best matching will finally lead to a high-quality edit path. Extensive experiments are conducted on three real graph data sets and synthetic power-law graphs to demonstrate the effectiveness of our framework. Compared to the best result of existing GNN-based models, the mean absolute error (MAE) on GED value prediction decreases by 4.9% ~ 74.3%. Compared to the state-of-the-art searching algorithm Noah, the MAE on GED value based on edit path reduces by 53.6% ~ 88.1%.\u0000","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73153903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards Migration-Free Just-In-Case Data Archival for Future Cloud Data Lakes
Eugenio Marinelli, Yiqing Yan, V. Magnone, Charlotte Dumargne, P. Barbry, T. Heinis, Raja Appuswamy
Proc. VLDB Endow. Pub Date: 2023-04-01. DOI: 10.14778/3594512.3594522
Abstract: Given the growing adoption of AI, cloud data lakes face the need to support cost-effective "just-in-case" data archival over long time periods to meet regulatory compliance requirements. Unfortunately, current media technologies suffer from fundamental issues that will soon, if not already, make cost-effective data archival infeasible. In this paper, we present a vision for redesigning the archival tier of cloud data lakes around a novel, obsolescence-free storage medium: synthetic DNA. In doing so, we make two contributions: (i) we highlight the challenges in using DNA for data archival and list several open research problems, and (ii) we outline OligoArchive-DSM (OA-DSM), an end-to-end DNA storage pipeline that we are developing to demonstrate the feasibility of our vision.
Accelerating Similarity Search for Elastic Measures: A Study and New Generalization of Lower Bounding Distances
John Paparrizos, Kaize Wu, Aaron J. Elmore, C. Faloutsos, M. Franklin
Proc. VLDB Endow. Pub Date: 2023-04-01. DOI: 10.14778/3594512.3594530
Abstract: Similarity search is a core analytical task, and its performance critically depends on the choice of distance measure. For time-series querying, elastic measures achieve state-of-the-art accuracy but are computationally expensive. Thus, fast lower bounding (LB) measures prune unnecessary comparisons with elastic distances to accelerate similarity search. Despite decades of attention, there has never been a study assessing progress in this area. In addition, the research has disproportionately focused on one popular elastic measure, while other accurate measures have received little or no attention. Therefore, there is merit in developing a framework to accumulate knowledge from previously developed LBs and eliminate the notoriously challenging task of designing a separate LB for each elastic measure. In this paper, we perform the first comprehensive study of 11 LBs spanning 5 elastic measures using 128 datasets. We identify four properties that constitute the effectiveness of LBs and propose the Generalized Lower Bounding (GLB) framework to satisfy all desirable properties. GLB creates cache-friendly summaries, adaptively exploits summaries of both query and target time series, and captures boundary distances in an unsupervised manner. GLB outperforms all LBs in speedup (e.g., up to 13.5x faster than the strongest LB in terms of pruning power), establishes new state-of-the-art results for the 5 elastic measures, and provides the first LBs for 2 elastic measures with no previously known LBs. Overall, GLB enables the effective development of LBs to facilitate fast similarity search.
{"title":"Efficient framework for operating on data sketches","authors":"Jakub Lemiesz","doi":"10.14778/3594512.3594526","DOIUrl":"https://doi.org/10.14778/3594512.3594526","url":null,"abstract":"We study the problem of analyzing massive data streams based on concise data sketches. Recently, a number of papers have investigated how to estimate the results of set-theory operations based on sketches. In this paper we present a framework that allows to estimate the result of any sequence of set-theory operations.\u0000 \u0000 The starting point for our solution is the solution from 2021. Compared to this solution, the newly presented sketching algorithm is much more computationally efficient as it requires on average\u0000 O\u0000 (log\u0000 n\u0000 ) rather than\u0000 O\u0000 (\u0000 n\u0000 ) comparisons for\u0000 n\u0000 stream elements. We also show that the estimator dedicated to sketches proposed in that reference solution is, in fact, a maximum likelihood estimator.\u0000","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79484598","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learned Index: A Comprehensive Experimental Evaluation","authors":"Zhaoyan Sun, Xuanhe Zhou, Guoliang Li","doi":"10.14778/3594512.3594528","DOIUrl":"https://doi.org/10.14778/3594512.3594528","url":null,"abstract":"Indexes can improve query-processing performance by avoiding full table scans. Although traditional indexes (e.g., B+-tree) have been widely used, learned indexes are proposed to adopt machine learning models to reduce the query latency and index size. However, existing learned indexes are (1) not thoroughly evaluated under the same experimental framework and are (2) not comprehensively compared with different settings (e.g., key lookup, key insert, concurrent operations, bulk loading). Moreover, it is hard to select appropriate learned indexes for practitioners in different settings. To address those problems, this paper detailedly reviews existing learned indexes and discusses the design choices of key components in learned indexes, including key lookup (position inference which predicts the position of a key, and position refinement which re-searches the position if the predicted position is incorrect), key insert, concurrency, and bulk loading. Moreover, we provide a testbed to facilitate the design and test of new learned indexes for researchers. We compare state-of-the-art learned indexes in the same experimental framework, and provide findings to select suitable learned indexes under various practical scenarios.","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77554100","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Longshot: Indexing Growing Databases using MPC and Differential Privacy","authors":"Yanping Zhang, Johes Bater, Kartik Nayak, Ashwin Machanavajjhala","doi":"10.14778/3594512.3594529","DOIUrl":"https://doi.org/10.14778/3594512.3594529","url":null,"abstract":"\u0000 In this work, we propose Longshot, a novel design for secure outsourced database systems that supports ad-hoc queries through the use of secure multi-party computation and differential privacy. By combining these two techniques, we build and maintain data structures (i.e., synopses, indexes, and stores) that improve query execution efficiency while maintaining strong privacy and security guarantees. As new data records are uploaded by data owners, these data structures are continually updated by Longshot using novel algorithms that leverage bounded information leakage to minimize the use of expensive cryptographic protocols. Furthermore, Long-shot organizes the data structures as a hierarchical tree based on when the update occurred, allowing for update strategies that provide logarithmic error over time. Through this approach, Longshot introduces a tunable three-way trade-off between privacy, accuracy, and efficiency. Our experimental results confirm that our optimizations are not only asymptotic improvements but also observable in practice. In particular, we see a 5x efficiency improvement to update our data structures even when the number of updates is less than 200. Moreover, the data structures significantly improve query runtimes over time, about ~10\u0000 3\u0000 x faster compared to the baseline after 20 updates.\u0000","PeriodicalId":20467,"journal":{"name":"Proc. VLDB Endow.","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2023-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85094874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}