Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming: Latest Papers (Page 2)

Provably Good Randomized Strategies for Data Placement in Distributed Key-Value Stores

Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming Pub Date : 2023-02-25 DOI: 10.1145/3572848.3577501

Zhe Wang, Jinhao Zhao, Kunal Agrawal, Heyu Liu, Meng Xu, Jing Li

Abstract: Distributed storage systems are used widely in clouds, databases, and file systems. These systems store a large amount of data across multiple servers. When a request to access data comes in, it is routed to the appropriate server, queued, and eventually processed. If the server's queue is full, then requests may be rejected. Thus, one important challenge when designing the algorithm for allocating data to servers is the fact that the request pattern may be unbalanced, unpredictable, and may change over time. If some servers get a large fraction of the requests, they are overloaded, leading to many rejects. In this paper, we analyze this problem theoretically under adversarial assumptions. In particular, we assume that the request sequence is generated by an adversarial process to maximize the number of rejects and analyze the performance of various algorithmic strategies in terms of the fraction of the requests rejected. We show that no deterministic strategy can perform well. On the other hand, a simple randomized strategy guarantees that at most a constant fraction of requests are rejected in expectation. We also show that moving data to load balance is essential if we want to reject a very small fraction (1/m where m is the number of servers) of requests. We design a strategy with randomization and data transfer to achieve this performance with speed augmentation. Finally, we conduct experiments and show that our algorithms perform well in practice.
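The contrast between deterministic and randomized placement described in this abstract can be illustrated with a toy simulation. This is only a sketch under assumptions that are not taken from the paper: every server serves one request per time step, queues hold at most `queue_cap` requests, and the adversary knows the deterministic rule `key % m` but cannot see the random choices. All names (`simulate`, `deterministic`, `randomized`) are hypothetical.

```python
import random

def simulate(place, keys_per_step, steps, m, queue_cap):
    """Return the fraction of rejected requests.

    Each step enqueues one request per key, then every server serves one request.
    """
    queues = [0] * m
    rejected = total = 0
    for _ in range(steps):
        for key in keys_per_step:
            total += 1
            server = place(key)
            if queues[server] >= queue_cap:
                rejected += 1          # queue full: request is rejected
            else:
                queues[server] += 1
        queues = [max(0, q - 1) for q in queues]  # each server serves one request
    return rejected / total

m, queue_cap, steps = 8, 4, 200

# Assumed adversary: picks m distinct keys that all collide under key % m.
adversarial_keys = [i * m for i in range(m)]

def deterministic(key):
    return key % m

rng = random.Random(0)   # fixed seed only to make the sketch repeatable
table = {}

def randomized(key):
    # Each key is assigned once to a uniformly random server.
    if key not in table:
        table[key] = rng.randrange(m)
    return table[key]

det_rate = simulate(deterministic, adversarial_keys, steps, m, queue_cap)
rand_rate = simulate(randomized, adversarial_keys, steps, m, queue_cap)
print(f"deterministic: {det_rate:.2f}  randomized: {rand_rate:.2f}")
```

Under these assumptions the deterministic rule sends every request to one server, so most requests are rejected, while random placement spreads the same keys over several servers. The sketch does not model data transfer or speed augmentation.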

Citations: 0

Efficient Direct Convolution Using Long SIMD Instructions

Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming Pub Date : 2023-02-25 DOI: 10.1145/3572848.3577435

Alexandre de Limas Santana, Adrià Armejach, Marc Casas

Abstract: This paper demonstrates that state-of-the-art proposals to compute convolutions on architectures with CPUs supporting SIMD instructions deliver poor performance for long SIMD lengths due to frequent cache conflict misses. We first discuss how to adapt the state-of-the-art SIMD direct convolution to architectures using long SIMD instructions and analyze the implications of increasing the SIMD length on the algorithm formulation. Next, we propose two new algorithmic approaches: the Bounded Direct Convolution (BDC), which adapts the amount of computation exposed to mitigate cache misses, and the Multi-Block Direct Convolution (MBDC), which redefines the activation memory layout to improve the memory access pattern. We evaluate BDC, MBDC, the state-of-the-art technique, and a proprietary library on an architecture featuring CPUs with 16,384-bit SIMD registers using ResNet convolutions. Our results show that BDC and MBDC achieve respective speed-ups of 1.44× and 1.28× compared to the state-of-the-art technique for ResNet-101, and 1.83× and 1.63× compared to the proprietary library.
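As a reference point for what a direct convolution computes, the following scalar sketch loops over the output directly, with the innermost loop running over a block of output channels that stands in for one SIMD vector operation. It is a generic illustration, not BDC or MBDC; the function name `direct_conv2d`, the block size `vlen`, and the layouts (`inp[c][y][x]`, `weights[k][c][r][s]`, stride 1, no padding) are assumptions made for this example.

```python
def direct_conv2d(inp, weights, vlen=4):
    """Direct 2D convolution: inp[c][y][x], weights[k][c][r][s] -> out[k][y][x]."""
    C, H, W = len(inp), len(inp[0]), len(inp[0][0])
    K, R, S = len(weights), len(weights[0][0]), len(weights[0][0][0])
    OH, OW = H - R + 1, W - S + 1
    out = [[[0.0] * OW for _ in range(OH)] for _ in range(K)]
    for k0 in range(0, K, vlen):                 # one block of output channels
        lanes = range(k0, min(k0 + vlen, K))     # plays the role of SIMD lanes
        for y in range(OH):
            for x in range(OW):
                for c in range(C):
                    for r in range(R):
                        for s in range(S):
                            a = inp[c][y + r][x + s]   # scalar broadcast to all lanes
                            for k in lanes:            # conceptually one vector FMA
                                out[k][y][x] += a * weights[k][c][r][s]
    return out

inp = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]            # 1 channel, 3x3 input
weights = [[[[1, 0], [0, 1]]], [[[1, 1], [1, 1]]]]   # 2 filters, each 1x2x2
out = direct_conv2d(inp, weights)
print(out)
```

With longer vectors, `vlen` grows and each block touches more weight and output data per step, which is the kind of pressure on the cache that the abstract refers to.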

Citations: 3

A Scalable Hybrid Total FETI Method for Massively Parallel FEM Simulations

Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming Pub Date : 2023-02-25 DOI: 10.1145/3572848.3577517

Kehao Lin, Chunbao Zhou, Y. Zeng, Ningming Nie, Jue Wang, Shigang Li, Yangde Feng, Yangang Wang, Kehan Yao, Tiechui Yao, Jilin Zhang, Jian Wan

Abstract: The Hybrid Total Finite Element Tearing and Interconnecting (HTFETI) method plays an important role in solving large-scale and complex engineering problems. This method needs to handle numerous matrix-vector multiplications. Directly calling the vendor-optimized library for general matrix-vector multiplication (gemv) on GPU leads to low performance, since it does not consider optimizations for the different matrix sizes in HTFETI, i.e., different row and column sizes. In addition, state-of-the-art graph partitioning methods cannot guarantee load balancing for HTFETI, since the matrix size is determined by the length of the subdomain boundary. To solve these problems, we first port gemv to a multi-stream pipeline scheme and develop a new batched kernel function on GPU, which bring a 15% to 30% throughput improvement and a 37% average GFLOPs improvement, respectively. We also propose a multi-grained load-balancing scheme based on graph repartitioning and work-stealing, which reduces the load imbalance ratio from 1.5 to between 1.05 and 1.09. We have successfully applied the scalable HTFETI method to simulate the whole core assembly of the China Experimental Fast Reactor (CEFR) for steady-state analysis, and the weak-scaling and strong-scaling efficiencies reach 78% and 72% on 12,288 GPUs, respectively. As far as we know, this is the first time that HTFETI has been used in large-scale and high-fidelity whole core assembly simulation.

Citations: 1

WISE

Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming Pub Date : 2023-02-25 DOI: 10.1145/3572848.3577506

Serif Yesil, Azin Heidarshenas, Adam Morrison, J. Torrellas

Citations: 5

AArch64 Atomics: Might They Be Harming Your Performance?

Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming Pub Date : 2023-02-25 DOI: 10.1145/3572848.3579838

Ricardo Jesus, M. Weiland

Abstract: Atomic operations are indivisible operations guaranteed to execute as a whole. One of the most important and widely used atomic operations is "compare-and-swap" (CAS), which allows threads to perform concurrent read-modify-write operations on the same memory location, free of data races. On recent Arm architectures, CAS operations can be implemented either directly via CAS instructions, or via load-linked/store-conditional (LL-SC) instruction pairs. In this work we explore the performance of the CAS and LL-SC approaches to implement CAS operations on recent high-performance AArch64 CPUs, namely the A64FX, ThunderX2 (TX2), and Graviton3. We observe that these instructions can lead to fundamentally different performance profiles. On A64FX, for example, the newer CAS instructions (often preferred by compilers over the older LL-SC pairs) can lead to a quadratic increase in average time per successful CAS operation as the number of threads increases, whereas the older LL-SC pairs show the expected linear increase. For high thread counts, this translates into LL-SC being more than 20x faster than CAS. On TX2 and Graviton3, LL-SC can bring more conservative (but still significant) 2x to 3x speedups. We characterise the conditions under which each approach delivers better performance on each CPU.
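The read-modify-write pattern that CAS enables, as described in this abstract, is usually written as a retry loop: read the current value, compute the new one, and publish it only if nobody changed the location in between. The sketch below emulates that semantics in Python. Python has no hardware CAS, so a `threading.Lock` stands in for the atomicity that a CAS instruction or an LL-SC pair would provide. The class `AtomicCell` and the function `worker` are hypothetical names, and the example says nothing about AArch64 performance.

```python
import threading

class AtomicCell:
    """Emulates a memory word that supports atomic load and compare-and-swap."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()   # stands in for hardware atomicity

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the cell still holds `expected`; report success."""
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False

def worker(cell, iterations, retries, idx):
    failed = 0
    for _ in range(iterations):
        while True:                      # CAS retry loop
            old = cell.load()
            if cell.compare_and_swap(old, old + 1):
                break                    # our update was published
            failed += 1                  # another thread won; read again and retry
    retries[idx] = failed                # each thread writes only its own slot

n_threads, iterations = 4, 1000
cell = AtomicCell()
retries = [0] * n_threads
threads = [threading.Thread(target=worker, args=(cell, iterations, retries, i))
           for i in range(n_threads)]
for t in threads:
    t.start()
for t in threads:
    t.join()
final = cell.load()
print(final, retries)
```

The number of failed attempts per successful update is the quantity that grows with contention. The abstract's measurements concern how the time per successful operation scales with thread count on real hardware.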

Citations: 0

The State-of-the-Art LCRQ Concurrent Queue Algorithm Does NOT Require CAS2

Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming Pub Date : 2023-02-25 DOI: 10.1145/3572848.3577485

Raed Romanov, N. Koval

Abstract: Concurrent queues are, arguably, one of the most important data structures in high-load applications, which require them to be extremely fast and scalable. Achieving these properties is non-trivial. The early solutions, such as the classic queue by Michael and Scott, store elements in a concurrent linked list. Reputedly, this design is non-scalable and memory-inefficient. Modern solutions utilize the Fetch-and-Add instruction to improve the algorithm's scalability and store elements in arrays to reduce the memory pressure. One of the best-known and fastest such algorithms is LCRQ. The main disadvantage of its design is that it relies on the atomic CAS2 instruction, which is unavailable in most modern programming languages, such as Java, Kotlin, or Go, let alone some architectures. This paper presents the LPRQ algorithm, a portable modification of the original LCRQ design that eliminates all CAS2 usages. Instead, it performs the synchronization utilizing only the standard Compare-and-Swap and Fetch-and-Add atomic instructions. Our experiments show that LPRQ provides the same performance as the classic LCRQ algorithm, outrunning the fastest of the existing solutions that do not use CAS2 by up to 1.6×.

Citations: 2

TGOpt

Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming Pub Date : 2023-02-25 DOI: 10.1145/3572848.3577490

Yufeng Wang, Charith Mendis

Abstract: Temporal Graph Neural Networks are gaining popularity in modeling interactions on dynamic graphs. Among them, Temporal Graph Attention Networks (TGAT) have gained adoption in predictive tasks, such as link prediction, in a range of application domains. Most optimizations and frameworks for Graph Neural Networks (GNNs) focus on GNN models that operate on static graphs. While a few of these optimizations exploit redundant computations on static graphs, they are either not applicable to the self-attention mechanism used in TGATs or do not exploit optimization opportunities that are tied to temporal execution behavior. In this paper, we explore redundancy-aware optimization opportunities that specifically arise from computations that involve temporal components in TGAT inference. We observe considerable redundancies in temporal node embedding computations, such as recomputing previously computed neighbor embeddings and time-encoding of repeated time delta values. To exploit these redundancy opportunities, we developed TGOpt which introduces optimization techniques based on deduplication, memoization, and precomputation to accelerate the inference performance of TGAT. Our experimental results show that TGOpt achieves a geomean speedup of 4.9× on CPU and 2.9× on GPU when performing inference on a wide variety of dynamic graphs, with up to 6.3× speedup for the Reddit Posts dataset on CPU.
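The redundancy in time-encoding repeated time delta values, mentioned in this abstract, is the kind of work that memoization removes. The sketch below shows the general idea with Python's `functools.lru_cache`. It is not TGOpt's implementation: the cosine-based encoding, the frequencies in `OMEGA`, and the function names are assumptions chosen only to keep the example small.

```python
import math
from functools import lru_cache

OMEGA = (1.0, 0.5, 0.25, 0.125)   # hypothetical encoding frequencies
calls = {"raw": 0}                # counts how often the encoding is really computed

def time_encode(delta):
    """Assumed cosine-based encoding of a time delta."""
    calls["raw"] += 1
    return tuple(math.cos(w * delta) for w in OMEGA)

@lru_cache(maxsize=None)
def time_encode_memo(delta):
    """Memoized wrapper: each distinct delta is encoded only once."""
    return time_encode(delta)

deltas = [0.0, 3.0, 3.0, 7.5, 0.0, 3.0, 7.5, 0.0]   # repeated deltas, as in the abstract

plain = [time_encode(d) for d in deltas]
calls_plain = calls["raw"]

calls["raw"] = 0
memo = [time_encode_memo(d) for d in deltas]
calls_memo = calls["raw"]

print(calls_plain, calls_memo)
```

The memoized path computes one encoding per distinct delta instead of one per occurrence, while returning the same values. Deduplication and precomputation, the other two techniques the abstract names, are not shown here.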

Citations: 3

CuPBoP

Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming Pub Date : 2023-02-25 DOI: 10.1145/3572848.3577504

Ruobing Han, Jun Chen, Bhanu Garg, Jeffrey S. Young, Jaewoong Sim, Hyesoon Kim

Citations: 1

iQAN: Fast and Accurate Vector Search with Efficient Intra-Query Parallelism on Multi-Core Architectures

Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming Pub Date : 2023-02-25 DOI: 10.1145/3572848.3577527

Zhen Peng, Minjia Zhang, K. Li, R. Jin, Bin Ren

Abstract: Vector search has drawn a rapid increase of interest in the research community due to its application in novel AI applications. Maximizing its performance is essential for many tasks but remains poorly understood. In this work, we investigate the root causes of the scalability bottleneck of using intra-query parallelism to speed up state-of-the-art graph-based vector search systems on multi-core architectures. Our in-depth analysis reveals several scalability challenges from both system and algorithm perspectives. Based on these insights, we propose iQAN, a parallel search algorithm with a set of optimizations that boost convergence, avoid redundant computations, and mitigate synchronization overhead. Our evaluation results on a wide range of real-world datasets show that iQAN achieves up to 37.7× and 76.6× lower latency than state-of-the-art sequential baselines on datasets ranging from a million to a hundred million data points. We also show that iQAN achieves outstanding scalability as the graph size or the accuracy target increases, allowing it to outperform the state-of-the-art baseline on two billion-scale datasets by up to 16.0× with up to 64 cores.
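For orientation, graph-based vector search of the kind this abstract builds on is commonly organized as a best-first traversal of a proximity graph that keeps a bounded list of the closest candidates seen so far. The sketch below is a generic sequential version of that idea, not iQAN and not any specific baseline from the paper. The function names, the `beam` parameter, and the toy chain graph are all assumptions.

```python
import heapq

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_first_search(points, neighbors, query, entry, k, beam):
    """Return the ids of up to k nearest points found by a bounded best-first search."""
    visited = {entry}
    d0 = dist2(points[entry], query)
    frontier = [(d0, entry)]     # min-heap of candidates not yet expanded
    best = [(-d0, entry)]        # max-heap (negated) of the `beam` closest seen so far
    while frontier:
        d, u = heapq.heappop(frontier)
        if len(best) >= beam and d > -best[0][0]:
            break                # closest open candidate cannot improve the result
        for v in neighbors[u]:
            if v in visited:
                continue
            visited.add(v)
            dv = dist2(points[v], query)
            if len(best) < beam or dv < -best[0][0]:
                heapq.heappush(frontier, (dv, v))
                heapq.heappush(best, (-dv, v))
                if len(best) > beam:
                    heapq.heappop(best)   # drop the farthest of the kept candidates
    ranked = sorted((-neg_d, v) for neg_d, v in best)
    return [v for _, v in ranked][:k]

# Toy proximity graph: ten points on a line, each linked to its immediate neighbors.
points = [(float(i), 0.0) for i in range(10)]
neighbors = {i: [j for j in (i - 1, i + 1) if 0 <= j < 10] for i in range(10)}

result = best_first_search(points, neighbors, query=(6.2, 0.0), entry=0, k=2, beam=3)
print(result)
```

Each expansion depends on the candidates found by earlier ones, which is one reason parallelizing a single query is harder than running many queries side by side.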

Citations: 1

2PLSF

Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming Pub Date : 2023-02-25 DOI: 10.1145/3572848.3577433

Pedro Ramalhete, Andreia Correia, P. Felber

Citations: 0