{"title":"Adapting Batch Scheduling to Workload Characteristics: What Can We Expect From Online Learning?","authors":"Arnaud Legrand, D. Trystram, Salah Zrigui","doi":"10.1109/IPDPS.2019.00077","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00077","url":null,"abstract":"Despite the impressive growth and size of super-computers, the computational power they provide still cannot match the demand. Efficient and fair resource allocation is a critical task. Super-computers use Resource and Job Management Systems to schedule applications, which is generally done by relying on generic index policies such as First Come First Served and Shortest Processing time First in combination with Backfilling strategies. Unfortunately, such generic policies often fail to exploit specific characteristics of real workloads. In this work, we focus on improving the performance of online schedulers. We study mixed policies, which are created by combining multiple job characteristics in a weighted linear expression, as opposed to classical pure policies which use only a single characteristic. This larger class of scheduling policies aims at providing more flexibility and adaptability. We use space coverage and black-box optimization techniques to explore this new space of mixed policies and we study how can they adapt to the changes in the workload. We perform an extensive experimental campaign through which we show that (1) even the best pure policy is far from optimal and that (2) using a carefully tuned mixed policy would allow to significantly improve the performance of the system. (3) We also provide empirical evidence that there is no one size fits all policy, by showing that the rapid workload evolution seems to prevent classical online learning algorithms from being effective.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131104731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modelling DVFS and UFS for Region-Based Energy Aware Tuning of HPC Applications","authors":"Mohak Chadha, M. Gerndt","doi":"10.1109/IPDPS.2019.00089","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00089","url":null,"abstract":"Energy efficiency and energy conservation are one of the most crucial constraints for meeting the 20MW power envelope desired for exascale systems. Towards this, most of the research in this area has been focused on the utilization of user-controllable hardware switches such as per-core dynamic voltage frequency scaling (DVFS) and software controlled clock modulation at the application level. In this paper, we present a tuning plugin for the Periscope Tuning Framework which integrates fine-grained autotuning at the region level with DVFS and uncore frequency scaling (UFS). The tuning is based on a feed-forward neural network which is formulated using Performance Monitoring Counters (PMC) supported by x86 systems and trained using standardized benchmarks. Experiments on five standardized hybrid benchmarks show an energy improvement of 16.1% on average when the applications are tuned according to our methodology as compared to 7.8% for static tuning.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123478502","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Junqing Gu, Chentao Wu, Xin Xie, Han Qiu, Jie Li, M. Guo, Xubin He, Yuanyuan Dong, Yafei Zhao
{"title":"Optimizing the Parity Check Matrix for Efficient Decoding of RS-Based Cloud Storage Systems","authors":"Junqing Gu, Chentao Wu, Xin Xie, Han Qiu, Jie Li, M. Guo, Xubin He, Yuanyuan Dong, Yafei Zhao","doi":"10.1109/IPDPS.2019.00063","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00063","url":null,"abstract":"In large scale distributed systems such as cloud storage systems, erasure coding is a fundamental technique to provide high reliability at low monetary cost. Compared with the traditional disk arrays, cloud storage systems use an erasure coding scheme with both flexible fault tolerance and high scalability. Thus, Reed-Solomon (RS) Codes or RS-based codes are popular choices for cloud storage systems. However, the decoding performance for RS-based codes is not as good as XOR-based codes, which are optimized via investigating the relationships among different parity chains or reducing the computational complexity of matrix multiplications. Therefore, exploring an efficient decoding method is highly desired. To address the above problem, in this paper, we propose an Advanced Parity-Check Matrix (APCM) based approach, which is extended from the original Parity-Check Matrix based (PCM) approach. Instead of improving the decoding performance of XOR-based codes in PCM, APCM focuses on optimizing the decoding efficiency for RS-based codes. Furthermore, APCM avoids the matrix inversion computations and reduces the computational complexity of the decoding process. To demonstrate the effectiveness of the APCM, we conduct intensive experiments by using both RS-based and XOR-based codes under cloud storage environment. The results show that, compared to typical decoding methods, APCM improves the decoding speed by up to 32.31% in the Alibaba cloud storage system.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"9 7","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120902982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GraphTinker: A High Performance Data Structure for Dynamic Graph Processing","authors":"Wole Jaiyeoba, K. Skadron","doi":"10.1109/IPDPS.2019.00110","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00110","url":null,"abstract":"Interest in high performance analytics for dynamic (constantly evolving) graphs has been on the rise in the last decade, especially due to the prevalence and rapid growth of social networks today. The current state-of-the art data structures for dynamic graph processing rely on the adjacency list model of edgeblocks in updating graphs. This model suffers from long probe distances when following edges, leading to poor update throughputs. Furthermore, both current graph processing models—the static model that requires reprocessing the entire graph after every batch update, and the incremental model in which only the affected subset of edges need to be processed—suffer drawbacks. In this paper, we present GraphTinker, a new, more scalable graph data structure for dynamic graphs. It uses a new hashing scheme to reduce probe distance and improve edge-update performance. It also better compacts edge data. These innovations improve performance for graph updates as well as graph analytics. In addition, we present a hybrid engine which improves the performance of dynamic graph processing by automatically selecting the most optimal execution model (static vs. incremental) for every iteration, surpassing the performance of both. Our evaluations of GraphTinker shows a throughput improvement of up to 3.3X compared to the state-of-the-art data structure (STINGER) when used for graph updates. GraphTinker also demonstrates a performance improvement of up to 10X over STINGER when used to run graph analytics algorithms. In addition, our hybrid engine demonstrates up to 2X improvement over the incremental-compute model and up to 3X improvement over the static model.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126762861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Computation of Matrix Chain Products on Parallel Machines","authors":"Elad Weiss, O. Schwartz","doi":"10.1109/IPDPS.2019.00059","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00059","url":null,"abstract":"The Matrix Chain Ordering Problem is a well studied optimization problem, aiming at finding optimal parentheses assignment for minimizing the number of arithmetic operations required when computing a chain of matrix multiplications. Existing algorithms include the O(N^3) dynamic programming of Godbole (1973) and the faster O(NlogN) algorithm of Hu and Shing (1982). We show that both may result in sub-optimal parentheses assignment for modern machines as they do not take into account inter-processor communication costs that often dominate the running time. Further, the optimal solution may change when using fast matrix multiplication algorithms. We adapt the O(N^3) dynamic programming algorithm to provide optimal solutions for modern machines and modern matrix multiplication algorithms, and obtain an adaption of the O(NlogN) algorithm that guarantees a constant approximation.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126318901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IPDPS 2019 Technical Program","authors":"Vinod E. F. Rebello, Lawrence Rauchwerger","doi":"10.1109/ipdps.2019.00008","DOIUrl":"https://doi.org/10.1109/ipdps.2019.00008","url":null,"abstract":": In 2001, as early high-speed networks were deployed, George Gilder observed that “when the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” Two decades later, our networks are 1,000 times faster, our appliances are increasingly specialized, and our computer systems are indeed disintegrating. As hardware acceleration overcomes speed-of-light delays, time and space merge into a computing continuum. Familiar questions like “where should I compute,” “for what workloads should I design computers,” and \"where should I place my computers” seem to allow for a myriad of new answers that are exhilarating but also daunting. Are there concepts that can help guide us as we design applications and computer systems in a world that is untethered from familiar landmarks like center, cloud, edge? I propose some ideas and report on experiments in coding the continuum. Abstract: Parallel computers have come of age and need parallel software to justify their usefulness. There are two major avenues to get programs to run in parallel: parallelizing compilers and parallel languages and/or libraries. In this talk we present our latest results using both approaches and draw some conclusions about their relative effectiveness and potential. In the first part we introduce the Hybrid Analysis (HA) compiler framework that can seamlessly integrate static and run-time analysis of memory references into a single framework capable of full automatic loop level parallelization. Experimental results on 26 benchmarks show full program speedups superior to those obtained by the Intel Fortran compilers. In the second part of this talk we present the Standard Template Adaptive Parallel Library (STAPL) based approach to parallelizing code. STAPL is a collection of generic data structures and algorithms that provides a high productivity, parallel programming infrastructure analogous to the C++ Standard Template Library (STL). In this talk, we provide an overview of the major STAPL components with particular emphasis on graph algorithms. We then present scalability results of real codes using peta scale machines such as IBM BG/Q and Cray. Finally we present some of our ideas for future work in this area. Abstract: The trends in hardware architecture are paving the road towards Exascale. However, these trends are also increasing the complexity of design and development of the software developer environment that is deployed on modern supercomputers. Moreover, the scale and complexity of high-end systems creates a new set of challenges for application developers. Computational scientists are facing system characteristics that will significantly impact the programmability and scalability of applications. In order to address these issues, software architects need to take a holistic view of the entire system and deliver a high-level programming environment that can help maximize programmability, while not l","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123541068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Constantino Gómez, Francesc Martínez, Adrià Armejach, Miquel Moretó, F. Mantovani, Marc Casas
{"title":"Design Space Exploration of Next-Generation HPC Machines","authors":"Constantino Gómez, Francesc Martínez, Adrià Armejach, Miquel Moretó, F. Mantovani, Marc Casas","doi":"10.1109/IPDPS.2019.00017","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00017","url":null,"abstract":"The landscape of High Performance Computing (HPC) system architectures keeps expanding with new technologies and increased complexity. With the goal of improving the efficiency of next-generation large HPC systems, designers require tools for analyzing and predicting the impact of new architectural features on the performance of complex scientific applications at scale. We simulate five hybrid (MPI+OpenMP) applications over 864 architectural proposals based on state-of-the-art and emerging HPC technologies, relevant both in industry and research. This paper significantly extends our previous work with MUltiscale Simulation Approach (MUSA) enabling accurate performance and power estimations of large-scale HPC systems. We reveal that several applications present critical scalability issues mostly due to the software parallelization approach. Looking at speedup and energy consumption exploring the design space (i.e., changing memory bandwidth, number of cores, and type of cores), we provide evidence-based architectural recommendations that will serve as hardware and software co-design guidelines.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130266150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DYRS: Bandwidth-Aware Disk-to-Memory Migration of Cold Data in Big-Data File Systems","authors":"Simbarashe Dzinamarira, Florin Dinu, T. Ng","doi":"10.1109/IPDPS.2019.00069","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00069","url":null,"abstract":"Migrating data into memory can significantly accelerate big-data applications by hiding low disk throughput. While prior work has mostly targeted caching frequently used data, the techniques employed do not benefit jobs that read cold data. For these jobs, the file system has to pro-actively migrate the inputs into memory. Successfully migrating cold inputs can result in a large speedup for many jobs, especially those that spend a significant part of their execution reading inputs. In this paper, we use data from the Google cluster trace to make the case that the conditions in production workloads are favorable for migration. We then design and implement DYRS, a framework for migrating cold data in big-data file systems. DYRS can adapt to match the available bandwidth on storage nodes, ensuring all nodes are fully utilized throughout the migration. In addition to balancing the load, DYRS optimizes the placement of each migration to maximize the number of successful migrations and eliminate stragglers at the end of a job. We evaluate DYRS using several Hive queries, a trace-based workload from Facebook, and the Sort application. Our results show that DYRS successfully adapts to bandwidth heterogeneity and effectively migrates data. DYRS accelerates Hive queries by up to 48%, and by 36% on average. Jobs in a trace-based workload experience a speedup of 33% on average. The mapper tasks in this workload have an even greater speedup of 46%. DYRS accelerates sort jobs by up to 20%.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115784112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
P. Berenbrink, Tom Friedetzky, Dominik Kaaser, Peter Kling
{"title":"Tight & Simple Load Balancing","authors":"P. Berenbrink, Tom Friedetzky, Dominik Kaaser, Peter Kling","doi":"10.1109/IPDPS.2019.00080","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00080","url":null,"abstract":"We consider the following load balancing process for m tokens distributed arbitrarily among n nodes connected by a complete graph. In each time step a pair of nodes is selected uniformly at random. Let ℓ_1 and ℓ_2 be their respective number of tokens. The two nodes exchange tokens such that they have ⌈(ℓ_1 + ℓ_2)/2⌉ and ⌈(ℓ_1 + ℓ_2)/2⌉ tokens, respectively. We provide a simple analysis showing that this process reaches almost perfect balance within O(n log n + n log Δ) steps with high probability, where Δ is the maximal initial load difference between any two nodes. This bound is asymptotically tight.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126936332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}