{"title":"MICCO: An Enhanced Multi-GPU Scheduling Framework for Many-Body Correlation Functions","authors":"Qihan Wang, Bin Ren, Jing Chen, R. Edwards","doi":"10.1109/ipdps53621.2022.00022","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00022","url":null,"abstract":"Calculation of many-body correlation functions is one of the critical kernels utilized in many scientific computing areas, especially in Lattice Quantum Chromodynamics (Lattice QCD). It is formalized as a sum of a large number of contraction terms each of which can be represented by a graph consisting of vertices describing quarks inside a hadron node and edges designating quark propagations at specific time intervals. Due to its computation- and memory-intensive nature, real-world physics systems (e.g., multi-meson or multi-baryon systems) explored by Lattice QCD prefer to leverage multi-GPUs. Different from general graph processing, many-body correlation function calculations show two specific features: a large number of computation-/data-intensive kernels and frequently repeated appearances of original and intermediate data. The former results in expensive memory operations such as tensor movements and evictions. The latter offers data reuse opportunities to mitigate the data-intensive nature of many-body correlation function calculations. However, existing graph-based multi-GPU schedulers cannot capture these data-centric features, thus resulting in a sub-optimal performance for many-body correlation function calculations. To address this issue, this paper presents a multi-GPU scheduling framework, MICCO, to accelerate contractions for correlation functions particularly by taking the data dimension (e.g., data reuse and data eviction) into account. This work first performs a comprehensive study on the interplay of data reuse and load balance, and designs two new concepts: local reuse pattern and reuse bound to study the opportunity of achieving the optimal trade-off between them. Based on this study, MICCO proposes a heuristic scheduling algorithm and a machine-learning-based regression model to generate the optimal setting of reuse bounds. Specifically, MICCO is integrated into a real-world Lattice QCD system, Redstar, for the first time running on multiple GPUs. The evaluation demonstrates MICCO outperforms other state-of-art works, achieving up to 2.25× speedup in synthesized datasets, and 1.49× speedup in real-world correlation functions.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121739382","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"StencilMART: Predicting Optimization Selection for Stencil Computations across GPUs","authors":"Qingxiao Sun, Yi Liu, Hailong Yang, Zhonghui Jiang, Zhongzhi Luan, D. Qian","doi":"10.1109/ipdps53621.2022.00090","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00090","url":null,"abstract":"Stencil computations are widely used in high performance computing (HPC) applications. Many HPC platforms utilize the high computation capability of GPUs to accelerate stencil computations. In recent years, stencils have become more diverse in terms of stencil order, memory accesses and computation patterns. To adapt diverse stencils to GPUs, a variety of optimization techniques have been proposed such as streaming and retiming. However, due to the diversity of stencil patterns and GPU architectures, no single optimization technique fits all stencils. Besides, it is challenging to choose the most cost-efficient GPU for accelerating target stencils. To address the above problems, we propose StencilMART, an automatic optimization selection framework that predicts the best optimization combination and execution time under a certain parameter setting for stencils on GPUs. Specifically, the StencilMART represents the stencil patterns as binary tensors and neighboring features through tensor assignment and feature extraction. In addition, the StencilMART implements various machine learning methods such as classification and regression that utilize stencil representation and hardware characteristics for execution time prediction. The experiment results show that the StencilMART can achieve accurate optimization selection and performance prediction for various stencils across GPUs.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129593810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Multi-Versioning Ordered Key-Value Stores with Persistent Memory Support","authors":"Bogdan Nicolae","doi":"10.1109/ipdps53621.2022.00018","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00018","url":null,"abstract":"Ordered key-value stores (or sorted maps/dictionaries) are a fundamental building block in a large variety of both sequential and parallel/distributed algorithms. However, most state-of-art approaches are either based on ephemeral in-memory representations that are difficult to persist and/or not scalable enough under concurrent access (e.g., red-black trees, skip lists), and/or not lightweight enough (e.g. database engines). Furthermore, there is an increasing need to provide versioning support, which is needed in a variety of scenarios: introspection, provenance tracking, revisiting previous intermediate results. To address these challenges, we propose a new lightweight dictionary data structure that simultaneously provides support for multi-versioning, persistency and scalability under concurrent access. We demonstrate its effectiveness through a series of experiments, in which it outperforms several state-of-art approaches, both in terms of vertical and horizontal scalability.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124585953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Excavating the Potential of Graph Workload on RDMA-based Far Memory Architecture","authors":"Jing Wang, Chao Li, Tao Wang, Lu Zhang, Pengyu Wang, Jun-Hua Mei, M. Guo","doi":"10.1109/ipdps53621.2022.00104","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00104","url":null,"abstract":"Disaggregated architecture brings new opportunities to memory -consuming applications like graph processing. It allows one to outspread memory access pressure from local to far memory, providing an attractive alternative to disk-based processing. Although existing works on general-purpose far mem-ory platforms show great potentials for application expansion, it is unclear how graph processing applications could benefit from disaggregated architecture, and how different optimization methods influence the overall performance. In this paper, we take the first step to analyze the impact of graph processing workload on disaggregated architecture by extending the GridGraph framework on top of the RDMA-based far memory system. We design Fargraph, a far memory coordi-nation strategy for enhancing graph processing workload. Specif-ically, Fargraph reduces the overall data movement through a well-crafted, graph-aware data segment offloading mechanism. In addition, we use optimal data segment splitting and asynchronous data buffering to achieve graph iteration-friendly far memory access. We show that Fargraph achieves near-oracle performance for typical in-local-memory graph processing systems. Fargraph shows up to 8.3 x speedup compared to Fastswap, the state-of-the-art, general-purpose far memory platform.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121159836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HTS: A Threaded Multilevel Sparse Hybrid Solver","authors":"J. Booth","doi":"10.1109/ipdps53621.2022.00010","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00010","url":null,"abstract":"Large shared-memory many-core nodes have become the norm in scientific computing, and therefore the sparse linear solver stack must adapt to the multilevel structure that exists on these nodes. One adaption is the development of hybrid-solvers at the node level. We present HTS as a hybrid threaded solver that aims to provide a finer-grain algorithm to keep an increased number of threads actively working on these larger shared-memory environments without the overheads of message passing implementations. Additionally, HTS aims at utilizing the additional shared memory that may be available to improve performance, i.e., reducing iteration counts when used as a preconditioner and speeding up calculations. HTS is built around the Schur complement framework that many other hybrid solver packages already use. However, HTS uses a multilevel structure in dealing with the Schur complement and allows for fill-in in certain off-diagonal submatrices to allow for a faster and more accurate solve phase. These modifications allow for a tasking thread library, namely Cilk, to be used to speed up performance while still reducing peak memory by more than 20% on average compared to an optimized direct factorization method. We show that HTS can outperform the MPI-based hybrid solver ShyLU on a suite of sparse matrices by as much as 2×, and show that HTS can scale well on three-dimensional finite difference problems.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"110 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127987381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accuracy vs. Cost in Parallel Fixed-Precision Low-Rank Approximations of Sparse Matrices","authors":"Robert Ernstbrunner, Viktoria Mayer, W. Gansterer","doi":"10.1109/ipdps53621.2022.00051","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00051","url":null,"abstract":"We study a randomized and a deterministic algorithm for the fixed-precision low-rank approximation problem of large sparse matrices. The Randomized QB Factorization (RandQB_EI) constructs a reduced and dense representation of the originally sparse matrix based on randomization. The representation resulting from the deterministic Truncated LU Factorization with Column and Row Tournament Pivoting (LU_CRTP) is sparse, but fill-in introduced in the factorization process can affect sparsity and performance. We therefore attempt to mitigate fill-in with an incomplete LU_CRTP variant with thresholding (ILUT_CRTP). We analyze this approach and identify potential problems that may arise in practice. We design parallel implementations of RandQB_EI, LU_CRTP and ILUT_CRTP. We experimentally evaluate strong scaling properties for different problems and the runtime required for achieving a given approximation quality. Our results show that LU_CRTP tends to be particularly competitive for low approximation quality. However, when a lot of fill-in occurs, LU_CRTP is outperformed by RandQB_EI especially for higher approximation quality. ILUT_CRTP outperforms both LU_CRTP and RandQB_EI and can achieve speedups up to 40 over LU_CRTP, depending on the amount of fill-in.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"20 7","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114030881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Topological Modeling and Parallelization of Multidimensional Data on Microelectrode Arrays","authors":"Olamide Timothy Tawose, Bin Li, Lei Yang, Feng Yan, Dongfang Zhao","doi":"10.1109/ipdps53621.2022.00082","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00082","url":null,"abstract":"Microelectrode arrays (MEAs) are physical devices widely used in various science and engineering fields. One common computational challenge when applying a high-density MEA (i.e., a larger number of wires, more accurate locations of abnormal cells) is how to efficiently compute those resistance values provided the nonlinearity of the system of equations with the unknown resistance values per the Kirchhoff law. This paper proposes an algebraic-topological model for MEAs such that we can identify the intrinsic parallelism that cannot be identified by conventional approaches. We implement a system prototype called Parma based on the proposed topological methodology. Experimental results show that Parma outperforms the state-of-the-practice in time, scalability and memory usage: the computation time is two orders of magnitude faster on up to 1,024 cores with almost linear scalability and the memory is much better utilized with proportionally less warm-up time with respect to the number of concurrent threads.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124075467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Degree-Aware Kernels for Computing Jaccard Weights on GPUs","authors":"Amro Alabsi Aljundi, Taha Atahan Akyildiz, K. Kaya","doi":"10.1109/ipdps53621.2022.00092","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00092","url":null,"abstract":"Graphs provide the ability to extract valuable met-rics from the structural properties of the underlying data they represent. One such metric is the Jaccard Weight of an edge, which is the ratio of the number of common neighbors of the edge's endpoints to the union of the endpoints' neighborhood. A naive implementation of Jaccard Weights computation has a complexity that scales with the number of edges in the graph times the square of the maximum degree. Recently, GPU-based parallel algorithms have been proposed for this problem. How-ever, these algorithms cannot overcome the structural variance within a graph, i.e., the sparsity pattern and degree imbalance, which directly translates to unbalanced work distribution across threads. In this work, we propose an optimized GPU-based algorithm with an ML-based work distribution model that mitigates the unbalanced work distribution. Our algorithm is shown to be up to 35x and on average 12x faster than the state of the art in practice while using less memory. In fact, we show that by manually tweaking the load distribution, a state-of-the-art implementation can be 5x faster. In addition, we propose a multi-core, shared-memory algorithm that applies a traditional but effective technique to improve the computation asymptotically and perform comparably to the GPU algorithms. Our code is available at https://github.com/SU-HPC/Jaccard-ML.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128909532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bounding the Flow Time in Online Scheduling with Structured Processing Sets","authors":"Louis-Claude Canon, Anthony Dugois, L. Marchal","doi":"10.1109/ipdps53621.2022.00072","DOIUrl":"https://doi.org/10.1109/ipdps53621.2022.00072","url":null,"abstract":"Replication in distributed key-value stores makes scheduling more challenging, as it introduces processing set restrictions, which limits the number of machines that can process a given task. We focus on the online minimization of the maximum response time in such systems, that is, we aim at bounding the latency of each task. When processing sets have no structure, Anand et al. (Algorithmica, 2017) derive a strong lower bound on the competitiveness of the problem: no online scheduling algorithm can have a competitive ratio smaller than $Omega(m)$, where $m$ is the number of machines. In practice, data replication schemes are regular, and structured processing sets may make the problem easier to solve. We derive new lower bounds for various common structures, including inclusive, nested or interval structures. In particular, we consider fixed sized intervals of machines, which mimic the standard replication strategy of key-value stores. We prove that EFT (Earliest Finish Time) scheduling is ($3-2/k$)-competitive when optimizing max-flow on disjoint intervals of size $k$. However, we show that the competitive ratio of EFT is at least $m-k+1$ when these intervals overlap, even when unit tasks are considered. We compare these two replication strategies in simulations and assess their efficiency when popularity biases are introduced, i.e., when some machines are accessed more frequently than others because they hold popular data. Even though overlapping intervals suffer from a bad worst-case in theory, they enable clusters to reach a maximum load that is up to 50% higher than with disjoint sets.","PeriodicalId":321801,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128855594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}