{"title":"ETH: An Architecture for Exploring the Design Space of In-situ Scientific Visualization","authors":"G. Abram, Vignesh Adhinarayanan, W. Feng, D. Rogers, J. Ahrens","doi":"10.1109/IPDPS47924.2020.00060","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00060","url":null,"abstract":"As high-performance computing (HPC) moves towards the exascale era, large-scale scientific simulations are generating enormous datasets. Many techniques (e.g., in-situ methods, data sampling, and compression) have been proposed to help visualize these large datasets under various constraints such as storage, power, and energy. However, evaluating these techniques and understanding the trade-offs (e.g., performance, efficiency, and quality) remains a challenging task. To enable exploration of the design space across such trade-offs, we propose the Exploration Test Harness (ETH), an architecture for the early-stage exploration of visualization and rendering approaches, job layout, and visualization pipelines. ETH covers a broader parameter space than current large-scale visualization applications such as ParaView and VisIt. It also promotes the study of simulation-visualization coupling strategies through a data-centric approach, rather than requiring coupling with a specific scientific simulation code. Furthermore, with experimentation on an extensively instrumented supercomputer, we study more metrics of interest than was previously possible. Importantly, ETH will help answer what-if scenarios and trade-off questions in the early stages of pipeline development, helping scientists make informed choices about how to best couple a simulation code with visualization at extreme scale.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"48 1","pages":"515-526"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76329359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DAG-Aware Joint Task Scheduling and Cache Management in Spark Clusters","authors":"Yinggen Xu, Liu Liu, Zhijun Ding","doi":"10.1109/IPDPS47924.2020.00047","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00047","url":null,"abstract":"Data dependency, often presented as a directed acyclic graph (DAG), is a crucial application semantic for the performance of data analytic platforms such as Spark. Spark comes with two built-in schedulers, namely the FIFO and Fair schedulers, which do not take advantage of data dependency structures. Recently proposed DAG-aware task scheduling approaches, notably GRAPHENE, have achieved significant performance improvements but paid little attention to cache management. The resulting data access patterns interact poorly with the built-in LRU caching, leading to significant cache misses and performance degradation. On the other hand, DAG-aware caching schemes, such as Most Reference Distance (MRD), are designed for the FIFO scheduler instead of DAG-aware task schedulers. In this paper, we propose and develop Dagon, a middleware that leverages the complexity and heterogeneity of DAGs to jointly execute task scheduling and cache management. Dagon relies on three key mechanisms: DAG-aware task assignment, which considers dependency structure and heterogeneous resource demands to reduce potential resource fragmentation; sensitivity-aware delay scheduling, which prevents executors from waiting too long for tasks that are insensitive to locality; and priority-aware caching, which makes cache eviction and prefetching decisions based on the stage priority determined by DAG-aware task assignment. We have implemented Dagon in Apache Spark. Evaluation on a testbed shows that Dagon improves the job completion time by up to 42% and CPU utilization by up to 46%, compared to GRAPHENE plus MRD.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"16 1","pages":"378-387"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74819910","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
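Dagon's priority-aware caching evicts blocks belonging to low-priority stages rather than the least-recently-used block. A minimal sketch of that eviction idea in Python (an illustrative simplification, not Dagon's actual implementation; it omits prefetching, and the names `PriorityCache`/`access` are hypothetical):

```python
class PriorityCache:
    """Evict the cached block whose owning stage has the lowest priority,
    breaking ties by least-recent use."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = {}   # block_id -> (stage_priority, last_used_tick)
        self.clock = 0

    def access(self, block_id, priority):
        self.clock += 1
        if block_id not in self.blocks and len(self.blocks) >= self.capacity:
            # Victim: lowest stage priority first, then oldest access.
            victim = min(self.blocks, key=lambda b: self.blocks[b])
            del self.blocks[victim]
        self.blocks[block_id] = (priority, self.clock)

cache = PriorityCache(capacity=2)
cache.access("rdd_a", priority=5)
cache.access("rdd_b", priority=1)
cache.access("rdd_c", priority=3)  # evicts rdd_b: its stage priority is lowest
```

Under plain LRU, `rdd_a` (the oldest block) would have been evicted instead, even though its stage is the one most likely to need it again.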
{"title":"Inter-Job Scheduling of High-Throughput Material Screening Applications","authors":"Zhihui Du, Xinning Hui, Yurui Wang, Jun Jiang, Jason Liu, Baokun Lu, Chongyu Wang","doi":"10.1109/IPDPS47924.2020.00091","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00091","url":null,"abstract":"Material screening entails a large number of electronic structure simulations. Traditionally, these simulation runs are treated separately as solving independent Kohn-Sham (KS) equations. In this paper, we formulate material screening as an inter-job scheduling problem for solving a system of KS equations, which allows us to explore different scheduling methods that use the results of some equations to expedite the solution of others. We propose the concept of sharing iterative simulation and employ several optimization methods to initialize a simulation run using the distribution of particles from similar jobs as the initial condition. More specifically, we propose two similarity metrics, one qualitative and the other quantitative, to predict the simulation runtime of a material screening job based on its similarity to other jobs. Accordingly, we present two inter-job scheduling algorithms that make use of the qualitative and quantitative similarity information, respectively. We conducted extensive experiments on the Sunway TaihuLight supercomputer for a practical material screening problem to evaluate the performance of the two scheduling algorithms using the proposed similarity metrics. We show that the total time required to run the large number of material screening jobs can be significantly reduced, and that the algorithms are robust even with moderately inaccurate predictions of the simulation runtime. With its more accurate predictions, the quantitative algorithm achieves better results than the qualitative algorithm, yielding a more significant runtime reduction.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"7 1","pages":"841-852"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79321243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
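The scheduling idea above orders jobs so that each run can be warm-started from the result of a similar, already-completed job. A greedy sketch of that ordering (illustrative only; the paper's algorithms, similarity metrics, and tie-breaking are more involved, and `schedule_by_similarity` is a hypothetical name):

```python
def schedule_by_similarity(jobs, similarity):
    """Greedy inter-job schedule: run the first job cold, then always pick
    the pending job most similar to any completed job, so its simulation
    can be initialized from that job's converged result."""
    done = [jobs[0]]
    remaining = list(jobs[1:])
    while remaining:
        nxt = max(remaining,
                  key=lambda j: max(similarity(j, d) for d in done))
        done.append(nxt)
        remaining.remove(nxt)
    return done

# Toy usage: jobs are points on a line; similarity decreases with distance.
order = schedule_by_similarity([0, 7, 1, 2], lambda a, b: -abs(a - b))
```

Each scheduled job is then the one cheapest to warm-start, which is the mechanism behind the reported reduction in total screening time.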
{"title":"Harnessing Deep Learning via a Single Building Block","authors":"E. Georganas, K. Banerjee, Dhiraj D. Kalamkar, Sasikanth Avancha, Anand Venkat, Michael J. Anderson, G. Henry, Hans Pabst, A. Heinecke","doi":"10.1109/IPDPS47924.2020.00032","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00032","url":null,"abstract":"Deep learning (DL) is one of the most prominent branches of machine learning. Due to the immense computational cost of DL workloads, industry and academia have developed DL libraries with highly-specialized kernels for each workload/architecture, leading to numerous, complex code bases that strive for performance yet are hard to maintain and do not generalize. In this work, we introduce the batch-reduce GEMM kernel and show how the most popular DL algorithms can be formulated with this kernel as the basic building block. Consequently, DL library development reduces to the mere (potentially automatic) tuning of loops around this sole optimized kernel. Exploiting our new kernel, we implement Recurrent Neural Network, Convolutional Neural Network, and Multilayer Perceptron training and inference primitives in just 3K lines of high-level code. Our primitives outperform vendor-optimized libraries on multi-node CPU clusters, and we also provide proof-of-concept CNN kernels targeting GPUs. Finally, we demonstrate that the batch-reduce GEMM kernel within a tensor compiler yields high-performance CNN primitives, further demonstrating the viability of our approach.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"258 1","pages":"222-233"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74664023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
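The batch-reduce GEMM building block accumulates a batch of small matrix products into one output tile, C += Σᵢ AᵢBᵢ. A minimal NumPy sketch of the semantics (illustrative only; the paper's kernel is a highly optimized CPU implementation, not this loop):

```python
import numpy as np

def batch_reduce_gemm(A_blocks, B_blocks, C):
    """Semantics of the batch-reduce GEMM kernel: C += sum_i A_i @ B_i."""
    for A_i, B_i in zip(A_blocks, B_blocks):
        C += A_i @ B_i
    return C

# Toy usage: reduce three 4x8 by 8x4 products into one 4x4 output tile,
# as a convolution or RNN cell would reduce over input-channel blocks.
rng = np.random.default_rng(0)
A_blocks = [rng.standard_normal((4, 8)) for _ in range(3)]
B_blocks = [rng.standard_normal((8, 4)) for _ in range(3)]
C = np.zeros((4, 4))
batch_reduce_gemm(A_blocks, B_blocks, C)
```

The paper's point is that convolutions, RNN cells, and MLP layers all decompose into loops around exactly this accumulation, so only this one kernel needs architecture-specific tuning.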
{"title":"Learning Cost-Effective Sampling Strategies for Empirical Performance Modeling","authors":"M. Ritter, A. Calotoiu, S. Rinke, Thorsten Reimann, T. Hoefler, F. Wolf","doi":"10.1109/IPDPS47924.2020.00095","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00095","url":null,"abstract":"Identifying scalability bottlenecks in parallel applications is a vital but also laborious and expensive task. Empirical performance models have proven helpful for finding such limitations, though they require a set of experiments to gain valuable insights. Therefore, the experiment design determines the quality and cost of the models. Extra-P is an empirical modeling tool that uses small-scale experiments to assess the scalability of applications. Its current version requires an exponential number of experiments per model parameter. This makes the creation of empirical performance models very expensive, and in some situations even impractical. In this paper, we propose a novel parameter-value selection heuristic, which functions as a guideline for the experiment design, leveraging sparse performance modeling, a technique that needs only a polynomial number of experiments per model parameter. Using synthetic analysis and data from three different case studies, we show that our solution reduces the average modeling costs by about 85% while retaining 92% of the model accuracy.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"12 1","pages":"884-895"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76520550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
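Empirical performance modeling in the Extra-P style fits analytical functions of a parameter (e.g., process count p) to runtimes measured in small-scale experiments. A toy sketch fitting a single power term t(p) = c·pᵃ by least squares in log-log space (Extra-P itself searches a much richer space of power-and-logarithm terms; `fit_power_model` is a hypothetical name):

```python
import numpy as np

def fit_power_model(p, t):
    """Fit t(p) ~= c * p**a by linear least squares on log(t) vs log(p)."""
    a, log_c = np.polyfit(np.log(p), np.log(t), 1)
    return np.exp(log_c), a

# Synthetic measurements from small-scale runs of a t = 0.5 * p**1.5 kernel.
p = np.array([2.0, 4.0, 8.0, 16.0, 32.0])
t = 0.5 * p ** 1.5
c, a = fit_power_model(p, t)
# The fitted exponent a reveals the asymptotic scaling bottleneck;
# extrapolating t(p) to large p predicts behavior at scale.
```

The experiment-design question the paper addresses is which values of p (and of additional parameters) to measure: each extra parameter multiplies the number of required runs, which is what the sparse-modeling heuristic cuts down.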
{"title":"Scalable and Memory-Efficient Kernel Ridge Regression","authors":"Gustavo Chavez, Yang Liu, P. Ghysels, X. Li, E. Rebrova","doi":"10.1109/IPDPS47924.2020.00102","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00102","url":null,"abstract":"We present a scalable and memory-efficient framework for kernel ridge regression. We exploit the inherent rank deficiency of the kernel ridge regression matrix by constructing an approximation that relies on a hierarchy of low-rank factorizations of tunable accuracy, rather than on leverage scores or other subsampling techniques. Without ever decompressing the kernel matrix approximation, we propose factorization and solve methods to compute the weights for a given set of training and test data. We show that our method performs an optimal number of operations, O(r^2 n), with respect to the number of training samples (n) due to the underlying numerical low-rank (r) structure of the kernel matrix. Furthermore, each algorithm is also presented in the context of a massively parallel computer system, exploiting two levels of concurrency that take into account both shared-memory and distributed-memory inter-node parallelism. In addition, we present a variety of experiments using popular datasets, both small and large, to show that our approach provides sufficient accuracy in comparison with state-of-the-art methods and with the exact (i.e., non-approximated) kernel ridge regression method. For datasets on the order of 10^6 data points, we show that our framework strong-scales to 10^3 cores. Finally, we provide a Python interface to the scikit-learn library so that scikit-learn can leverage our high-performance solver library to achieve much better performance and a smaller memory footprint.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"39 1","pages":"956-965"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86265671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
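The exact (non-approximated) baseline the paper compares against solves the dense system (K + λI)w = y for the dual weights, which costs O(n³) and O(n²) memory, the costs their hierarchical low-rank factorization avoids. A dense NumPy sketch of that baseline with an RBF kernel (illustrative; function names and hyperparameters are ours, not the paper's):

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||x_i - y_j||^2)."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit(X, y, lam=1e-3, gamma=1.0):
    """Exact kernel ridge regression: solve (K + lam*I) w = y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, w, X_test, gamma=1.0):
    """Predict via k(x_test, X_train) @ w."""
    return rbf_kernel(X_test, X_train, gamma) @ w

# Toy usage: fit a noisy sine over 50 random 2-D points.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
w = krr_fit(X, y)
pred = krr_predict(X, w, X)
```

The rank deficiency the paper exploits shows up here as fast decay in the eigenvalues of K; replacing the dense solve with a rank-r factorization is what brings the cost down to O(r²n).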
{"title":"Evaluating Thread Coarsening and Low-cost Synchronization on Intel Xeon Phi","authors":"Hancheng Wu, M. Becchi","doi":"10.1109/ipdps47924.2020.00108","DOIUrl":"https://doi.org/10.1109/ipdps47924.2020.00108","url":null,"abstract":"Manycore processors such as GPUs and Intel Xeon Phis have become popular due to their massive parallelism and high power-efficiency. To achieve optimal performance, it is necessary to optimize the use of the compute cores and of the memory system available on these devices. Previous work has proposed techniques to improve the use of GPU resources. While Intel Xeon Phis can provide massive parallelism through their x86 cores and vector units, optimization techniques for these platforms have received less consideration. In this work, we study the benefits of thread coarsening and low-cost synchronization on applications running on Intel Xeon Phi processors and encoded in SIMT fashion. Specifically, we explore thread coarsening as a way to remap the work to the available cores and vector lanes. In addition, we propose low-overhead synchronization primitives, such as atomic operations and barriers, which transparently apply to threads mapped to the same or different VPUs and x86 cores. Finally, we consider the combined use of thread coarsening and our proposed synchronization primitives. We evaluate the effect of these techniques on the performance of two kinds of kernels: collaborative and non-collaborative ones, the former using scratchpad memory to explicitly control data sharing among threads. Our evaluation leads to the following results. First, while not always beneficial for non-collaborative kernels, thread coarsening consistently improves the performance of collaborative kernels by reducing the synchronization overhead. Second, our synchronization primitives outperform standard pthread APIs by a factor of up to 8x in real-world benchmarks. Lastly, the combined use of the proposed techniques leads to performance improvements, especially for collaborative kernels.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"23 1","pages":"1018-1029"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87690238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DELTA: Distributed Locality-Aware Cache Partitioning for Tile-based Chip Multiprocessors","authors":"N. Holtryd, M. Manivannan, P. Stenström, M. Pericàs","doi":"10.1109/IPDPS47924.2020.00066","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00066","url":null,"abstract":"Cache partitioning in tile-based CMP architectures is a challenging problem because of i) the need to determine capacity allocations with low computational overhead and ii) the need to place allocations close to where they are used, in order to reduce access latency. Although previous solutions have addressed the problem of reducing the computational overhead and incorporating locality-awareness, they suffer from the overheads of centrally determining allocations. In this paper, we propose DELTA, a novel distributed and locality-aware cache partitioning solution which works by exchanging asynchronous challenges among cores. The distributed nature of the algorithm, coupled with its low computational complexity, allows for frequent reconfigurations at negligible cost and for the scheme to be implemented directly in hardware. The allocation algorithm is supported by an enforcement mechanism which enables locality-aware placement of data. We evaluate DELTA on 16- and 64-core tiled CMPs with multi-programmed workloads. Our evaluation shows that DELTA improves performance by 9% and 16%, respectively, on average, compared to an unpartitioned shared last-level cache.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"48 1","pages":"578-589"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78777320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Study of Graph Analytics for Massive Datasets on Distributed Multi-GPUs","authors":"Vishwesh Jatala, Roshan Dathathri, G. Gill, Loc Hoang, V. K. Nandivada, K. Pingali","doi":"10.1109/IPDPS47924.2020.00019","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00019","url":null,"abstract":"There are relatively few studies of distributed GPU graph analytics systems in the literature, and they are limited in scope since they deal with small datasets, consider only a few applications, and do not consider the interplay between partitioning policies and optimizations for computation and communication. In this paper, we present the first detailed analysis of graph analytics applications for massive real-world datasets on a distributed multi-GPU platform and the first analysis of strong scaling on smaller real-world datasets. We use D-IrGL, the state-of-the-art distributed GPU graph analytics framework, in our study. Our evaluation shows that (1) the Cartesian vertex-cut partitioning policy is critical to scale computation out on GPUs even at a small scale, (2) static load imbalance is a key factor in performance since memory is limited on GPUs, (3) device-host communication is a significant portion of execution time and should be optimized to gain performance, and (4) asynchronous execution is not always better than bulk-synchronous execution.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"32 1","pages":"84-94"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83356740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Experience-Driven Computational Resource Allocation of Federated Learning by Deep Reinforcement Learning","authors":"Yufeng Zhan, Peng Li, Song Guo","doi":"10.1109/IPDPS47924.2020.00033","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00033","url":null,"abstract":"Federated learning is promising in enabling large-scale machine learning on massive mobile devices without exposing the raw data of users, who have strong privacy concerns. Existing work on federated learning strives to accelerate the learning process but ignores the energy efficiency that is critical for resource-constrained mobile devices. In this paper, we propose to improve the energy efficiency of federated learning by lowering the CPU-cycle frequency of mobile devices that are faster within the training group. Since all devices are synchronized by iterations, the federated learning speed is preserved as long as they complete the training before the slowest device in each iteration. Based on this idea, we formulate an optimization problem aiming to minimize the total system cost, defined as a weighted sum of training time and energy consumption. Due to the hardness of the nonlinear constraints and the unawareness of network quality, we design an experience-driven algorithm based on Deep Reinforcement Learning (DRL), which can converge to a near-optimal solution without knowledge of network quality. Experiments on a small-scale testbed and large-scale simulations are conducted to evaluate our proposed algorithm. The results show that it outperforms the state-of-the-art by up to 40%.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"23 1","pages":"234-243"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89406127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
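The core frequency-slack idea in this abstract can be stated in a few lines: since an iteration ends only when the slowest device finishes, every faster device can lower its frequency just enough to finish at the same moment, saving energy (dynamic CPU power grows superlinearly with frequency) at zero cost in iteration time. A minimal sketch of that baseline intuition (illustrative only; the paper's DRL algorithm handles unknown, time-varying network quality, which this static calculation ignores, and `slack_frequencies` is a hypothetical name):

```python
def slack_frequencies(work_cycles, f_max):
    """Given per-device training work (in CPU cycles) and a common maximum
    frequency, scale each device's frequency down so all devices finish
    together with the straggler, which runs at f_max."""
    t_round = max(work_cycles) / f_max   # iteration time set by the straggler
    return [w / t_round for w in work_cycles]

# Toy usage: three devices with unequal per-iteration work, f_max = 2 GHz.
freqs = slack_frequencies([8e9, 4e9, 2e9], f_max=2e9)
# The device with the most work keeps f_max; the others scale down
# proportionally, and all three finish the iteration simultaneously.
```

With dynamic power roughly proportional to f³, energy per cycle falls roughly as f², so the scaled-down devices save energy while the round time, and hence the learning speed, is unchanged.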