{"title":"DeepCAT: A Cost-Efficient Online Configuration Auto-Tuning Approach for Big Data Frameworks","authors":"Hui Dou, Yilun Wang, Yiwen Zhang, Pengfei Chen","doi":"10.1145/3545008.3545018","DOIUrl":"https://doi.org/10.1145/3545008.3545018","url":null,"abstract":"To support different application scenarios, big data frameworks usually provide a large number of performance-related configuration parameters. Online auto-tuning these parameters based on deep reinforcement learning to achieve a better performance has shown their advantages over search-based and machine learning-based approaches. Unfortunately, the time consumption during the online tuning phase of conventional DRL-based methods is still heavy, especially for big data applications. Therefore, in this paper, we propose DeepCAT, a cost-efficient deep reinforcement learning-based approach to achieve online configuration auto-tuning for big data frameworks. To reduce the total online tuning cost: 1) DeepCAT utilizes the TD3 algorithm instead of DDPG to alleviate value overestimation; 2) DeepCAT modifies the conventional experience replay to fully utilize the rare but valuable transitions via a novel reward-driven prioritized experience replay mechanism; 3) DeepCAT designs a Twin-Q Optimizer to estimate the execution time of each action without the costly configuration evaluation and optimize the sub-optimal ones to achieve a low-cost exploration-exploitation trade off. Experimental results based on a local 3-node Spark cluster and HiBench benchmark applications show that DeepCAT is able to speed up the best execution time by a factor of 1.45 × and 1.65 × on average respectively over CDBTune and OtterTune, while consuming up to 50.08% and 53.39% less total tuning time.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"160 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132277784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EasyView: Enabling and Scheduling Tensor Views in Deep Learning Compilers","authors":"Lijuan Jiang, Ping Xu, Qianchao Zhu, Xiuhong Li, Shengen Yan, Xingcheng Zhang, Dahua Lin, Wen-Jing Ma, Zhouyang Li, Jun Liu, Jinming Ma, Minxi Jin, Chao Yang","doi":"10.1145/3545008.3545037","DOIUrl":"https://doi.org/10.1145/3545008.3545037","url":null,"abstract":"In recent years, memory-intensive operations are becoming dominant in efficiency of running novel neural networks. Just-in-time operator fusion on accelerating devices like GPU proves an effective method for optimizing memory-intensive operations, and suits the numerous varying model structures. In particular, we find memory-intensive operations on tensor views are ubiquitous in neural network implementations. Tensors are the de facto representation for numerical data in deep learning areas, while tensor views cover a bunch of sophisticated syntax, which allow various interpretations on the underlying tensor data without memory copy. The support of views in deep learning compilers could greatly enlarge operator fusion scope, and appeal to optimizing novel neural networks. Nevertheless, mainstream solutions in state-of-the-art deep learning compilers exhibit imperfections either in view syntax representations or operator fusion. In this article, we propose EasyView, which enables and schedules tensor views in an end-to-end workflow from neural networks onto devices. Aiming at maximizing memory utilization and reducing data movement, we categorize various view contexts in high-level language, and lower views in accordance with different scenarios. Reference-semantic in terms of views are kept in the lowering from native high-level language features to intermediate representations. Based on the reserved reference-semantics, memory activities related to data dependence of read and write are tracked for further compute and memory optimization. Besides, ample operator fusion is applied to memory-intensive operations with views. In our tests, the proposed work could get average 5.63X, 2.44X, and 4.67X speedup compared with the XLA, JAX, and TorchScript, respectively for hotspot Python functions. In addition, operation fusion with views could bring 8.02% performance improvement in end-to-end neural networks.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128622295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mentha: Enabling Sparse-Packing Computation on Systolic Arrays","authors":"Minjin Tang, Mei Wen, Yasong Cao, Junzhong Shen, Jianchao Yang, Jiawei Fei, Yang Guo, Sheng Liu","doi":"10.1145/3545008.3545053","DOIUrl":"https://doi.org/10.1145/3545008.3545053","url":null,"abstract":"Generalized Sparse Matrix-Matrix Multiplication (SpGEMM) is a critical kernel in domains like graph analytic and scientific computation. As a kind of classical special-purpose architecture, systolic arrays were first used for complex computing problems, e.g., matrix multiplication. However, classical systolic arrays are not efficient enough when handling sparse matrices due to the fact that the PEs containing zero-valued entries perform unnecessary operations that do not contribute to the result. Accordingly, in this paper, we propose Mentha, a framework that enables systolic arrays to accelerate sparse matrix computation by employing a sparse-packing algorithm suitable for various dataflow of systolic array. Firstly, Mentha supports both online and offline methods. By packing the rows or columns of the sparse matrix, the zero-valued items in the matrix are significantly reduced and the density of the matrix is improved. In addition, acceleration benefits can be obtained by the adaptation scheme even with limited resources. Moreover, we reconfigure PEs in systolic arrays at a low cost (1.28x in area, 1.21x in power) and find that our method outperforms TPU-like systolic arrays by 1.2~3.3x in terms of SpMM and 1.3~4.4x in terms of SpGEMM when dealing with moderately sparse matrices (sparsity < 0.9), while its performance is at least 9.7x better than cuSPARSE. Furthermore, experimental results show a FLOPs reduction of roughly 3.4x in the neural network.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124396823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On the Parallelization of MCMC for Community Detection","authors":"Frank Wanye, Vitaliy Gleyzer, E. Kao, Wu-chun Feng","doi":"10.1145/3545008.3545058","DOIUrl":"https://doi.org/10.1145/3545008.3545058","url":null,"abstract":"The rapid growth in size of real-world graph datasets necessitates the design of parallel and scalable graph analytics algorithms for large graphs. Community detection is a graph analysis technique with use cases in many domains from bioinformatics to network security. Markov chain Monte Carlo (MCMC)-based methods for performing community detection, such as the stochastic block partitioning (SBP) algorithm, are robust to graphs with a complex structure, but have traditionally been difficult to parallelize due to the serial nature of the underlying MCMC algorithm. This paper presents hybrid SBP (H-SBP), a novel hybrid method to parallelize the inherently sequential computation within each MCMC chain, for SBP. H-SBP processes a fraction of the most influential graph vertices serially and the remaining majority of the vertices in parallel using asynchronous Gibbs. We empirically show that H-SBP speeds up the MCMC computations by up to 5.6 × on real-world graphs while maintaining accuracy.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126510969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BWA-MEM-SCALE: Accelerating Genome Sequence Mapping on Commodity Servers","authors":"Changdae Kim, Kwangwon Koh, Taehoon Kim, Daegyu Han, Jiwon Seo","doi":"10.1145/3545008.3545033","DOIUrl":"https://doi.org/10.1145/3545008.3545033","url":null,"abstract":"As advances in Next-Generation Sequencing have made genome sequence data generation faster and cheaper, the acceleration of genome sequence mapping to the reference genome becomes an increasingly important problem. Much effort has been made to improve the performance of the sequence mapping process. In this paper, we propose BWA-MEM-SCALE which offers software-based acceleration techniques that fully utilize system resources to speed up genome sequence mapping. BWA-MEM-SCALE has two optimization mechanisms that exploit the system memory resource; Exact Match Filter (EMF) finds the input reads that match in full-length to the reference genome so that the expensive mapping process is bypassed for those reads. FM-index Accelerator (FMA) skips the prefix of sequences in seed matching with pre-assembled data. Moreover, we fully utilize the CPU cores in the system by carefully pipelining the mapping process and using in-memory index store. We implement the proposed mechanisms on BWA-MEM2 which is the state-of-the-art sequence mapping software. The evaluation shows that BWA-MEM-SCALE achieves substantial speedup compared to BWA-MEM2 when the system has a sufficient amount of resources. For example, with additional 104GB of memory, BWA-MEM-SCALE gives up to 3.32X speedup over BWA-MEM2. Because we support partially deploying the acceleration techniques, BWA-MEM-SCALE speeds up the mapping performance in proportion to the available system resource. Source-code: https://github.com/etri/bwa-mem-scale","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"150 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121512274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Learning Mean-Field Control for Delayed Information Load Balancing in Large Queuing Systems","authors":"Anam Tahir, Kai Cui, H. Koeppl","doi":"10.1145/3545008.3545025","DOIUrl":"https://doi.org/10.1145/3545008.3545025","url":null,"abstract":"Recent years have seen a great increase in the capacity and parallel processing power of data centers and cloud services. To fully utilize the said distributed systems, optimal load balancing for parallel queuing architectures must be realized. Existing state-of-the-art solutions fail to consider the effect of communication delays on the behaviour of very large systems with many clients. In this work, we consider a multi-agent load balancing system, with delayed information, consisting of many clients (load balancers) and many parallel queues. In order to obtain a tractable solution, we model this system as a mean-field control problem with enlarged state-action space in discrete time through exact discretization. Subsequently, we apply policy gradient reinforcement learning algorithms to find an optimal load balancing solution. Here, the discrete-time system model incorporates a synchronization delay under which the queue state information is synchronously broadcasted and updated at all clients. We then provide theoretical performance guarantees for our methodology in large systems. Finally, using experiments, we prove that our approach is not only scalable but also shows good performance when compared to the state-of-the-art power-of-d variant of the Join-the-Shortest-Queue (JSQ) and other policies in the presence of synchronization delays.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"41 30","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133783718","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FedDRL: Deep Reinforcement Learning-based Adaptive Aggregation for Non-IID Data in Federated Learning","authors":"Nang Hung Nguyen, Phi-Le Nguyen, D. Nguyen, Trung Thanh Nguyen, Thuy-Dung Nguyen, H. Pham, Truong Thao Nguyen","doi":"10.1145/3545008.3545085","DOIUrl":"https://doi.org/10.1145/3545008.3545085","url":null,"abstract":"The uneven distribution of local data across different edge devices (clients) results in slow model training and accuracy reduction in federated learning. Naive federated learning (FL) strategy and most alternative solutions attempted to achieve more fairness by weighted aggregating deep learning models across clients. This work introduces a novel non-IID type encountered in real-world datasets, namely cluster-skew, in which groups of clients have local data with similar distributions, causing the global model to converge to an over-fitted solution. To deal with non-IID data, particularly the cluster-skewed data, we propose FedDRL, a novel FL model that employs deep reinforcement learning to adaptively determine each client’s impact factor (which will be used as the weights in the aggregation process). Extensive experiments on a suite of federated datasets confirm that the proposed FedDRL improves favorably against FedAvg and FedProx methods, e.g., up to 4.05% and 2.17% on average for the CIFAR-100 dataset, respectively.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131225922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed-Memory Parallel Contig Generation for De Novo Long-Read Genome Assembly","authors":"Giulia Guidi, Gabriel Raulet, D. Rokhsar, L. Oliker, K. Yelick, A. Buluç","doi":"10.1145/3545008.3545050","DOIUrl":"https://doi.org/10.1145/3545008.3545050","url":null,"abstract":"De novo genome assembly, i.e., rebuilding the sequence of an unknown genome from redundant and erroneous short sequences, is a key but computationally intensive step in many genomics pipelines. The exponential growth of genomic data is increasing the computational demand and requires scalable, high-performance approaches. In this work, we present a novel distributed memory algorithm that, from a string graph representation of the genome and using sparse matrices, generates the contig set, i.e., overlapping sequences that form a map representing a region of a chromosome. Using matrix abstraction, we mask branches in the string graph, and compute the connected component to group genomic sequences that belong to the same linear chain (i.e., contig). Then, we perform multiway number partitioning to minimize the load imbalance in local assembly, i.e., concatenation of sequences from a given contig. Based on the assignment obtained by partitioning, we compute the induce subgraph function to redistribute sequences between processes, resulting in a set of local sparse matrices. Finally, we traverse each matrix using depth-first search to concatenate sequences. Our algorithm shows good scaling with parallel efficiency up to 80% on 128 nodes, resulting in uniform genome coverage and showing promising results in terms of assembly quality. Our contig generation algorithm localizes the assembly process to significantly reduce the amount of computation spent on this step. Our work is a step forward for efficient de novo long read assembly of large genomes in a distributed memory.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117007494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FLOPs as a Discriminant for Dense Linear Algebra Algorithms","authors":"F. L'opez, L. Karlsson, P. Bientinesi","doi":"10.1145/3545008.3545072","DOIUrl":"https://doi.org/10.1145/3545008.3545072","url":null,"abstract":"Expressions that involve matrices and vectors, known as linear algebra expressions, are commonly evaluated through a sequence of invocations to highly optimised kernels provided in libraries such as BLAS and LAPACK. A sequence of kernels represents an algorithm, and in general, because of associativity, algebraic identities, and multiple kernels, one expression can be evaluated via many different algorithms. These algorithms are all mathematically equivalent (i.e., in exact arithmetic, they all compute the same result), but often differ noticeably in terms of execution time. When faced with a decision, high-level languages, libraries, and tools such as Julia, Armadillo, and Linnea choose by selecting the algorithm that minimises the FLOP count. In this paper, we test the validity of the FLOP count as a discriminant for dense linear algebra algorithms, analysing ”anomalies”: problem instances for which the fastest algorithm does not perform the least number of FLOPs. To do so, we focused on relatively simple expressions and analysed when and why anomalies occurred. We found that anomalies exist and tend to cluster into large contiguous regions. For one expression anomalies were rare, whereas for the other they were abundant. We conclude that FLOPs is not a sufficiently dependable discriminant even when building algorithms with highly optimised kernels. Plus, most of the anomalies remained as such even after filtering out the inter-kernel cache effects. We conjecture that combining FLOP counts with kernel performance models will significantly improve our ability to choose optimal algorithms.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"100 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132371670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A single-tree algorithm to compute the Euclidean minimum spanning tree on GPUs","authors":"A. Prokopenko, Piyush Sao, D. Lebrun-Grandié","doi":"10.1145/3545008.3546185","DOIUrl":"https://doi.org/10.1145/3545008.3546185","url":null,"abstract":"Computing the Euclidean minimum spanning tree (Emst) is a computationally demanding step of many algorithms. While work-efficient serial and multithreaded algorithms for computing Emst are known, designing an efficient GPU algorithm is challenging due to a complex branching structure, data dependencies, and load imbalances. In this paper, we propose a single-tree Borůvka-based algorithm for computing Emst on GPUs. We use an efficient nearest neighbor algorithm and reduce the number of the required distance calculations by avoiding traversing subtrees with leaf nodes in the same component. The developed algorithms are implemented in a performance portable way using ArborX, an open-source geometric search library based on the Kokkos framework. We evaluate the proposed algorithm on various 2D and 3D datasets, show and compare it with the current state-of-the-art open-source CPU implementations. We demonstrate 4-24 × speedup over the fastest multi-threaded implementation. We prove the portability of our implementation by providing results on a variety of hardware: AMD EPYC 7763, Nvidia A100 and AMD MI250X. We show scalability of the implementation, computing Emst for 37 million 3D cosmological dataset in under a 0.5 second on a single A100 Nvidia GPU.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122551484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}