{"title":"SymSig: A low latency interconnection topology for HPC clusters","authors":"Dhananjay Brahme, Onkar Bhardwaj, V. Chaudhary","doi":"10.1109/HiPC.2013.6799144","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799144","url":null,"abstract":"This paper presents the underlying theory and the performance of a cluster using a new 2-hop network topology. This topology is constructed using a symmetric equation and Singer Difference Sets and is called SymSig. The degree of connections at each node with SymSig is about half compared to previous methods using Singer Difference Sets. A comparison with a cluster of Clos topology shows significant advantages. The worst case congestion in SymSig topology for unicast permutation is 2, where as in Clos it is proportional to the radix of the building block switches used. The number of switches required is smaller by about 25%, the size of the cluster is larger by about 15% and the worst bandwidth is better by about 50% for SymSig. These advantages are retained for peta and exascale systems. Its performance on a set of collectives like exchange-all, shift-all, broadcast-all and all-to-all send/receive shows improvements ranging from 39% to 83%. Its performance on a molecular dynamics application GROMMACS shows improvement of upto 33%. This network is particularly suitable for applications that require global all to all communications. 
The low latency of this network makes it scalable and an attractive alternative for building peta- and exascale systems.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"176 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130502534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
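The Singer-difference-set property that underlies the congestion bound above — every nonzero residue appearing exactly once as a difference — can be checked in miniature. The following is an illustrative brute-force sketch (function names are made up; the paper's symmetric-equation construction is not reproduced here), practical only for tiny q:

```python
from itertools import combinations

def is_perfect_difference_set(s, v):
    """True iff every nonzero residue mod v arises exactly once as a
    difference of two distinct elements of s (a (v, k, 1) difference set)."""
    diffs = sorted((a - b) % v for a in s for b in s if a != b)
    return diffs == list(range(1, v))

def find_difference_set(q):
    """Brute-force a (q^2+q+1, q+1, 1) Singer difference set."""
    v, k = q * q + q + 1, q + 1
    for cand in combinations(range(v), k):
        if is_perfect_difference_set(cand, v):
            return cand
    return None

# q = 2: a 7-node example with degree-3 connectivity; the pairwise
# differences of {0, 1, 3} cover 1..6 mod 7 exactly once.
print(find_difference_set(2))
```

The "exactly once" coverage is what lets a 2-hop topology route any pair through a near-unique intermediate, keeping worst-case congestion constant.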
{"title":"Can GPUs sort strings efficiently?","authors":"A. Deshpande, P J Narayanan","doi":"10.1109/HiPC.2013.6799129","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799129","url":null,"abstract":"String sorting or variable-length key sorting has lagged in performance on the GPU even as the fixed-length key sorting has improved dramatically. Radix sorting is the fastest on the GPUs. In this paper, we present a fast and efficient string sort on the GPU that is built on the available radix sort. Our method sorts strings from left to right in steps, moving only indexes and small prefixes for efficiency. We reduce the number of sort steps by adaptively consuming maximum string bytes based on the number of segments in each step. Performance is improved by using Thrust primitives for most steps and by removing singleton segments from consideration. Over 70% of the string sort time is spent on Thrust primitives. This provides high performance along with high adaptability to future GPUs. We achieve speed of up to 10 over current GPU methods, especially on large datasets. We also scale to much larger input sizes. We present results on easy and difficult strings defined using their after-sort tie lengths.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129967868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Approximation algorithms for energy minimization in Cloud service allocation under reliability constraints","authors":"Olivier Beaumont, Philippe Duchon, Paul Renaud-Goud","doi":"10.1109/HiPC.2013.6799123","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799123","url":null,"abstract":"We consider allocation problems that arise in the context of service allocation in Clouds. More specifically, we assume on the one part that each computing resource is associated with a capacity, that can be chosen using the Dynamic Voltage and Frequency Scaling (DVFS) method, and with a probability of failure. On the other hand, we assume that the services run as a set of independent instances of identical Virtual Machines (VMs). Moreover, there exists a Service Level Agreement (SLA) between the Cloud provider and the client that can be expressed as follows: the client comes with a minimal number of service instances that must be alive at anytime, and the Cloud provider offers a list of pairs (price, compensation), the compensation having to be paid by the Cloud provider if it fails to keep alive the required number of services. On the Cloud provider side, each pair actually corresponds to a guaranteed reliability of fulfilling the constraint on the minimal number of instances. In this context, given a minimal number of instances and a probability of success, the question for the Cloud provider is to find the number of necessary resources, their clock frequency and an allocation of the instances (possibly using replication) onto machines. This solution should satisfy all types of constraints (both capacity and reliability constraints). Moreover, it should remain valid during a time period (with a given reliability in presence of failures) while minimizing the energy consumption of used resources. We assume in this paper that this time period, that typically takes place between two redistributions, is fixed and known in advance. 
We prove deterministic approximation ratios on the consumed energy for algorithms that provide guaranteed reliability, and we present an extensive set of simulations showing that homogeneous solutions are close to optimal.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"PP 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-02-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126535072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
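The SLA trade-off described in this abstract — pick a replication level whose survival probability meets a guaranteed reliability — reduces, for independent failures, to a binomial tail computation. A minimal sketch assuming identical machines and independent failures (the function names and example numbers are illustrative, not from the paper):

```python
from math import comb

def survival_prob(n, m, p_fail):
    """P(at least m of n independent instances survive) when each
    instance fails with probability p_fail (binomial tail)."""
    p = 1.0 - p_fail
    return sum(comb(n, k) * p**k * p_fail**(n - k) for k in range(m, n + 1))

def min_instances(m, p_fail, target):
    """Smallest replication level n >= m whose survival probability
    meets the SLA reliability target."""
    n = m
    while survival_prob(n, m, p_fail) < target:
        n += 1
    return n

# Illustrative numbers: keep 10 instances alive with 99.9% reliability
# when each machine fails with probability 0.05 over the period.
print(min_instances(10, 0.05, 0.999))
```

Energy then enters as in the paper by weighing each feasible replication level and DVFS frequency by its power draw; the sketch stops at the reliability constraint.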
{"title":"SCORPIO: A scalable two-phase parallel I/O library with application to a large scale subsurface simulator","authors":"S. Sreepathi, Vamsi Sripathiy, R. Mills, Glenn Hammondz, G. Mahinthakumar","doi":"10.1145/2148600.2148635","DOIUrl":"https://doi.org/10.1145/2148600.2148635","url":null,"abstract":"Inefficient parallel I/O is known to be a major bottleneck among scientific applications employed on supercomputers as the number of processor cores grows into the thousands. Our prior experience indicated that parallel I/O libraries such as HDF5 that rely on MPI-IO do not scale well beyond 10K processor cores, especially on parallel file systems (like Lustre) with single point of resource contention. Our previous optimization efforts for a massively parallel multi-phase and multi-component subsurface simulator (PFLOTRAN) led to a two-phase I/O approach at the application level where a set of designated processes participate in the I/O process by splitting the I/O operation into a communication phase and a disk I/O phase. The designated I/O processes are created by splitting the MPI global communicator into multiple sub-communicators. The root process in each sub-communicator is responsible for performing the I/O operations for the entire group and then distributing the data to rest of the group. This approach resulted in over 25X speedup in HDF I/O read performance and 3X speedup in write performance for PFLOTRAN at over 100K processor cores on the ORNL Jaguar supercomputer. This research describes the design and development of a general purpose parallel I/O library called Scorpio that incorporates our optimized two-phase I/O approach. The library provides a simplified higher level abstraction to the user, sitting atop existing parallel I/O libraries (such as HDF5) and implements optimized I/O access patterns that can scale on larger number of processors. 
Performance results with standard benchmark problems and PFLOTRAN indicate that our library is able to maintain the same speedups as before, with the added flexibility of being applicable to a wider range of I/O-intensive applications.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116672891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
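The two-phase pattern this abstract describes — gather within a sub-communicator, then a single write by its root — can be simulated without MPI. An illustrative sketch (list slices stand in for sub-communicators and `b"".join` for the disk write; this is not the Scorpio API):

```python
def two_phase_write(rank_buffers, group_size):
    """Two-phase pattern in miniature: ranks are split into sub-groups;
    each group's root gathers its members' buffers (communication phase)
    and issues a single write (disk I/O phase)."""
    writes = []                                   # one entry per disk op
    for start in range(0, len(rank_buffers), group_size):
        gathered = rank_buffers[start:start + group_size]  # gather to root
        writes.append(b"".join(gathered))                  # root writes once
    return writes

# 8 "ranks" with aggregator groups of 4: 2 disk operations instead of 8,
# and the bytes reaching "disk" are unchanged.
buffers = [bytes([r]) * 4 for r in range(8)]
ops = two_phase_write(buffers, 4)
assert len(ops) == 2 and b"".join(ops) == b"".join(buffers)
```

The design point is that the number of processes touching the file system drops from the full core count to the number of groups, which is what relieves the Lustre contention mentioned above.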
{"title":"Compiler generation and autotuning of communication-avoiding operators for geometric multigrid","authors":"P. Basu, Anand Venkat, Mary W. Hall, Samuel Williams, B. V. Straalen, L. Oliker","doi":"10.1109/HiPC.2013.6799131","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799131","url":null,"abstract":"This paper describes a compiler approach to introducing communication-avoiding optimizations in geometric multigrid (GMG), one of the most popular methods for solving partial differential equations. Communication-avoiding optimizations reduce vertical communication through the memory hierarchy and horizontal communication across processes or threads, usually at the expense of introducing redundant computation. We focus on applying these optimizations to the smooth operator, which successively reduces the error and accounts for the largest fraction of the GMG execution time. Our compiler technology applies both novel and known transformations to derive an implementation comparable to manually-tuned code. To make the approach portable, an underlying autotuning system explores the tradeoff between reduced communication and increased computation, as well as tradeoffs in threading schemes, to automatically identify the best implementation for a particular architecture and at each computation phase. Results show that we are able to quadruple the performance of the smooth operation on the finest grids while attaining performance within 94% of manually-tuned code. 
Overall, we improve the multigrid solve time by 2.5× without sacrificing programmer productivity.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122623720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
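The communication-avoiding trade described in this abstract — redundant ghost-zone computation in exchange for fewer exchanges — can be seen in a 1-D Jacobi toy: exchange a depth-s halo once, smooth s times with no further communication, and the owned region matches the naive exchange-every-sweep result exactly. A sketch, not the paper's compiler-generated code (names are illustrative):

```python
def jacobi(u, f, h):
    """One Jacobi sweep for -u'' = f on a 1-D grid (endpoints held fixed)."""
    return ([u[0]] +
            [0.5 * (u[i - 1] + u[i + 1] + h * h * f[i])
             for i in range(1, len(u) - 1)] +
            [u[-1]])

def smooth_comm_avoiding(u, f, h, p, s):
    """Deep-halo smoothing: split u among p 'processes', give each a
    ghost zone of depth s, then run s sweeps with NO further exchange.
    Redundant ghost-zone work buys 1 halo exchange instead of s."""
    n = len(u)
    chunk = n // p
    out = list(u)
    for rank in range(p):
        lo = rank * chunk
        hi = (rank + 1) * chunk if rank < p - 1 else n
        glo, ghi = max(0, lo - s), min(n, hi + s)   # one deep exchange
        local_u, local_f = u[glo:ghi], f[glo:ghi]
        for _ in range(s):                          # communication-free sweeps
            local_u = jacobi(local_u, local_f, h)
        out[lo:hi] = local_u[lo - glo:hi - glo]     # owned region is exact
    return out
```

Stale ghost values pollute at most one grid point per sweep, so a depth-s halo keeps the owned region exact through s sweeps; the autotuner's job in the paper is choosing where this extra flops-for-messages trade pays off.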
{"title":"Parallel branch-and-bound for two-stage stochastic integer optimization","authors":"Akhil Langer, Ramprasad Venkataraman, U. Palekar, L. Kalé","doi":"10.1109/HiPC.2013.6799130","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799130","url":null,"abstract":"Many real-world planning problems require searching for an optimal solution in the face of uncertain input. One approach to is to express them as a two-stage stochastic optimization problem where the search for an optimum in one stage is informed by the evaluation of multiple possible scenarios in the other stage. If integer solutions are required, then branch-and-bound techniques are the accepted norm. However, there has been little prior work in parallelizing and scaling branch-and-bound algorithms for stochastic optimization problems. In this paper, we explore the parallelization of a two-stage stochastic integer program solved using branch-and-bound. We present a range of factors that influence the parallel design for such problems. Unlike typical, iterative scientific applications, we encounter several interesting characteristics that make it challenging to realize a scalable design. We present two design variations that navigate some of these challenges. Our designs seek to increase the exposed parallelism while delegating sequential linear program solves to existing libraries. We evaluate the scalability of our designs using sample aircraft allocation problems for the US airfleet. It is important that these problems be solved quickly while evaluating large number of scenarios. Our attempts result in strong scaling to hundreds of cores for these datasets. 
We believe similar results are not common in the literature, and that our experiences will usefully inform further research on this topic.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131886327","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
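The two-stage structure this abstract describes — a first-stage integer decision whose quality is evaluated against many second-stage scenarios — can be shown with a toy branch-and-bound. A sketch with a made-up fleet-allocation instance; a closed-form shortfall penalty replaces the paper's per-scenario LP solves, and all numbers are illustrative:

```python
from statistics import mean

def solve(costs, caps, demands, penalty):
    """Toy two-stage stochastic B&B: the first stage picks a 0/1 fleet
    allocation; the second stage pays a shortfall penalty under each
    equally likely demand scenario. The bound pretends all undecided
    aircraft fly for free, so it never exceeds the true cost."""
    n = len(costs)
    best = [float("inf"), None]          # [incumbent value, incumbent x]

    def expected_recourse(cap):
        return penalty * mean(max(0, d - cap) for d in demands)

    def bnb(i, x, cost_so_far, cap_so_far):
        free_cap = sum(caps[i:])                     # optimistic capacity
        bound = cost_so_far + expected_recourse(cap_so_far + free_cap)
        if bound >= best[0]:
            return                                   # prune this subtree
        if i == n:                                   # leaf: all decided
            total = cost_so_far + expected_recourse(cap_so_far)
            if total < best[0]:
                best[0], best[1] = total, x[:]
            return
        for take in (1, 0):                          # branch on aircraft i
            x.append(take)
            bnb(i + 1, x, cost_so_far + take * costs[i],
                cap_so_far + take * caps[i])
            x.pop()

    bnb(0, [], 0.0, 0)
    return best[0], best[1]
```

Every node evaluation touches all scenarios, which is why, as the paper notes, exposing parallelism across both the tree and the scenario evaluations matters for scaling.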