基于组合同步提高并发GPU B+树的性能和QoS

Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming Pub Date : 2023-02-25 DOI:10.1145/3572848.3577474

Weihua Zhang, Chuanlei Zhao, Lu Peng, Yuzhe Lin, Fengzhe Zhang, Yunping Lu

{"title":"基于组合同步提高并发GPU B+树的性能和QoS","authors":"Weihua Zhang, Chuanlei Zhao, Lu Peng, Yuzhe Lin, Fengzhe Zhang, Yunping Lu","doi":"10.1145/3572848.3577474","DOIUrl":null,"url":null,"abstract":"Concurrent B+trees have been widely used in many systems. With the scale of data requests increasing exponentially, the systems are facing tremendous performance pressure. GPU has shown its potential to accelerate concurrent B+trees performance. When many concurrent requests are processed, the conflicts should be detected and resolved. Prior methods guarantee the correctness of concurrent GPU B+trees through lock-based or software transactional memory (STM)-based approaches. However, these methods complicate the request processing logic, increase the number of memory accesses and bring execution path divergence. They lead to performance degradation and variance in response time increasing. Moreover, previous methods do not guarantee linearizability among concurrent requests. In this paper, we design a combined-based concurrency control framework, called Eirene, for GPU B+tree to reduce the overhead of conflict detection and resolution. First, a combining-based synchronization method is designed to combine and issue requests. It combines the requests with the same key, constructs their dependence, decides the issued request, and determines their return values. Since only one request for each key is issued, key conflicts are eliminated. Then, an optimistic STM method is used to reduce structure conflicts. The query and the update requests are partitioned into different kernels. For the update kernels, STM is involved only when the number of the retry reaches a threshold. Finally, a locality-aware warp reorganization optimization is proposed to improve memory behavior and reduce conflicts by exploiting the locality among requests. Evaluations on an NVIDIA A100 GPU show that Eirene is efficient (a throughput of 2.4 billion per second) and can guarantee linearizability. Compared to the state-of-the-art GPU B+tree, it can achieve a speedup of 7.43X and reduce the response time variance from 36% to 5%.","PeriodicalId":233744,"journal":{"name":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","volume":"17 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-02-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Boosting Performance and QoS for Concurrent GPU B+trees by Combining-Based Synchronization\",\"authors\":\"Weihua Zhang, Chuanlei Zhao, Lu Peng, Yuzhe Lin, Fengzhe Zhang, Yunping Lu\",\"doi\":\"10.1145/3572848.3577474\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Concurrent B+trees have been widely used in many systems. With the scale of data requests increasing exponentially, the systems are facing tremendous performance pressure. GPU has shown its potential to accelerate concurrent B+trees performance. When many concurrent requests are processed, the conflicts should be detected and resolved. Prior methods guarantee the correctness of concurrent GPU B+trees through lock-based or software transactional memory (STM)-based approaches. However, these methods complicate the request processing logic, increase the number of memory accesses and bring execution path divergence. They lead to performance degradation and variance in response time increasing. Moreover, previous methods do not guarantee linearizability among concurrent requests. In this paper, we design a combined-based concurrency control framework, called Eirene, for GPU B+tree to reduce the overhead of conflict detection and resolution. First, a combining-based synchronization method is designed to combine and issue requests. It combines the requests with the same key, constructs their dependence, decides the issued request, and determines their return values. Since only one request for each key is issued, key conflicts are eliminated. Then, an optimistic STM method is used to reduce structure conflicts. The query and the update requests are partitioned into different kernels. For the update kernels, STM is involved only when the number of the retry reaches a threshold. Finally, a locality-aware warp reorganization optimization is proposed to improve memory behavior and reduce conflicts by exploiting the locality among requests. Evaluations on an NVIDIA A100 GPU show that Eirene is efficient (a throughput of 2.4 billion per second) and can guarantee linearizability. Compared to the state-of-the-art GPU B+tree, it can achieve a speedup of 7.43X and reduce the response time variance from 36% to 5%.\",\"PeriodicalId\":233744,\"journal\":{\"name\":\"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming\",\"volume\":\"17 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-02-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3572848.3577474\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3572848.3577474","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

并发B+树在许多系统中得到了广泛的应用。随着数据请求规模呈指数级增长，系统面临着巨大的性能压力。GPU已经显示出其加速并行B+树性能的潜力。当处理许多并发请求时，应该检测并解决冲突。先前的方法通过基于锁或基于软件事务性内存(STM)的方法来保证并发GPU B+树的正确性。然而，这些方法使请求处理逻辑复杂化，增加了内存访问的数量，并带来了执行路径的分歧。它们会导致性能下降和响应时间变化的增加。此外，以前的方法不能保证并发请求之间的线性化。本文针对GPU B+树设计了一种基于组合的并发控制框架Eirene，以减少冲突检测和解决的开销。首先，设计了一个基于组合的同步方法来组合和发出请求。它用相同的键组合请求，构造它们的依赖关系，决定发出的请求，并确定它们的返回值。由于每个键只发出一个请求，因此消除了键冲突。然后，采用一种乐观STM方法减少结构冲突。查询和更新请求被划分到不同的内核中。对于更新内核，仅当重试次数达到阈值时才涉及STM。最后，提出了一种位置感知的warp重组优化方法，通过利用请求间的局部性来改善内存行为并减少冲突。在NVIDIA A100 GPU上的评估表明，Eirene是高效的(每秒24亿的吞吐量)，并且可以保证线性化。与最先进的GPU B+树相比，它可以实现7.43X的加速，并将响应时间方差从36%减少到5%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Boosting Performance and QoS for Concurrent GPU B+trees by Combining-Based Synchronization

Concurrent B+trees have been widely used in many systems. With the scale of data requests increasing exponentially, the systems are facing tremendous performance pressure. GPU has shown its potential to accelerate concurrent B+trees performance. When many concurrent requests are processed, the conflicts should be detected and resolved. Prior methods guarantee the correctness of concurrent GPU B+trees through lock-based or software transactional memory (STM)-based approaches. However, these methods complicate the request processing logic, increase the number of memory accesses and bring execution path divergence. They lead to performance degradation and variance in response time increasing. Moreover, previous methods do not guarantee linearizability among concurrent requests. In this paper, we design a combined-based concurrency control framework, called Eirene, for GPU B+tree to reduce the overhead of conflict detection and resolution. First, a combining-based synchronization method is designed to combine and issue requests. It combines the requests with the same key, constructs their dependence, decides the issued request, and determines their return values. Since only one request for each key is issued, key conflicts are eliminated. Then, an optimistic STM method is used to reduce structure conflicts. The query and the update requests are partitioned into different kernels. For the update kernels, STM is involved only when the number of the retry reaches a threshold. Finally, a locality-aware warp reorganization optimization is proposed to improve memory behavior and reduce conflicts by exploiting the locality among requests. Evaluations on an NVIDIA A100 GPU show that Eirene is efficient (a throughput of 2.4 billion per second) and can guarantee linearizability. Compared to the state-of-the-art GPU B+tree, it can achieve a speedup of 7.43X and reduce the response time variance from 36% to 5%.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming

自引率

0.00%

发文量