Title: Diagnosing performance bottlenecks in emerging petascale applications
Authors: Nathan R. Tallent, J. Mellor-Crummey, L. Adhianto, M. Fagan, Mark W. Krentel
DOI: https://doi.org/10.1145/1654059.1654111
Abstract: Cutting-edge science and engineering applications require petascale computing. It is, however, a significant challenge to use petascale computing platforms effectively. Consequently, there is a critical need for performance tools that enable scientists to understand impediments to performance on emerging petascale systems. In this paper, we describe HPCToolkit, a suite of multi-platform tools that supports sampling-based analysis of application performance on emerging petascale platforms. HPCToolkit uses sampling to pinpoint and quantify both scaling and node performance bottlenecks. We study several emerging petascale applications on the Cray XT and IBM BlueGene/P platforms and use HPCToolkit to identify specific source lines, in their full calling context, associated with performance bottlenecks in these codes. Such information is exactly what application developers need to know to improve their applications to take full advantage of the power of petascale systems.

{"title":"Efficient band approximation of Gram matrices for large scale kernel methods on GPUs","authors":"Mohamed E. Hussein, W. Abd-Almageed","doi":"10.1145/1654059.1654091","DOIUrl":"https://doi.org/10.1145/1654059.1654091","url":null,"abstract":"Kernel-based methods require O(N2) time and space complexities to compute and store non-sparse Gram matrices, which is prohibitively expensive for large scale problems. We introduce a novel method to approximate a Gram matrix with a band matrix. Our method relies on the locality preserving properties of space filling curves, and the special structure of Gram matrices. Our approach has several important merits. First, it computes only those elements of the Gram matrix that lie within the projected band. Second, it is simple to parallelize. Third, using the special band matrix structure makes it space efficient and GPU-friendly. We developed GPU implementations for the Affinity Propagation (AP) clustering algorithm using both our method and the COO sparse representation. Our band approximation is about 5 times more space efficient and faster to construct than COO. AP gains up to 6x speedup using our method without any degradation in its clustering performance.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128882969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: The cat is out of the bag: cortical simulations with 10^9 neurons, 10^13 synapses
Authors: R. Ananthanarayanan, Steven K. Esser, H. Simon, D. Modha
DOI: https://doi.org/10.1145/1654059.1654124
Abstract: In the quest for cognitive computing, we have built a massively parallel cortical simulator, C2, that incorporates a number of innovations in computation, memory, and communication. Using C2 on LLNL's Dawn Blue Gene/P supercomputer with 147,456 CPUs and 144 TB of main memory, we report two cortical simulations, at unprecedented scale, that effectively saturate the entire memory capacity and refresh it at least every simulated second. The first simulation consists of 1.6 billion neurons and 8.87 trillion synapses with experimentally-measured gray matter thalamocortical connectivity. The second simulation has 900 million neurons and 9 trillion synapses with probabilistic connectivity. We demonstrate nearly perfect weak scaling and attractive strong scaling. The simulations, which incorporate phenomenological spiking neurons, individual learning synapses, axonal delays, and dynamic synaptic channels, exceed the scale of the cat cortex, marking the dawn of a new era in the scale of cortical simulations.

{"title":"Indexing genomic sequences on the IBM Blue Gene","authors":"A. Ghoting, K. Makarychev","doi":"10.1145/1654059.1654122","DOIUrl":"https://doi.org/10.1145/1654059.1654122","url":null,"abstract":"With advances in sequencing technology and through aggressive sequencing efforts, DNA sequence data sets have been growing at a rapid pace. To gain from these advances, it is important to provide life science researchers with the ability to process and query large sequence data sets. For the past three decades, the suffix tree has served as a fundamental data structure in processing sequential data sets. However, tree construction times on large data sets have been excessive. While parallel suffix tree construction is an obvious solution to reduce execution times, poor locality of reference has limited parallel performance. In this paper, we show that through careful parallel algorithm design, this limitation can be removed, allowing tree construction to scale to massively parallel systems like the IBM Blue Gene. We demonstrate that the entire Human genome can be indexed on 1024 processors in under 15 minutes.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125669667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal real number codes for fault tolerant matrix operations","authors":"Zizhong Chen","doi":"10.1145/1654059.1654089","DOIUrl":"https://doi.org/10.1145/1654059.1654089","url":null,"abstract":"It has been demonstrated recently that single fail-stop process failure in ScaLAPACK matrix multiplication can be tolerated without checkpointing. Multiple simultaneous processor failures can be tolerated without checkpointing by encoding matrices using a real-number erasure correcting code. However, the floating-point representation of a real number in today's high performance computer architecture introduces round off errors which can be enlarged and cause the loss of precision of possibly all effective digits during recovery when the number of processors in the system is large. In this paper, we present a class of Reed-Solomon style real-number erasure correcting codes which have optimal numerical stability during recovery. We analytically construct the numerically best erasure correcting codes for 2 erasures and develop an approximation method to computationally construct numerically good codes for 3 or more erasures. Experimental results demonstrate that the proposed codes are numerically much more stable than existing codes.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122700098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FALCON: a system for reliable checkpoint recovery in shared grid environments","authors":"T. Islam, S. Bagchi, R. Eigenmann","doi":"10.1145/1654059.1654110","DOIUrl":"https://doi.org/10.1145/1654059.1654110","url":null,"abstract":"In Fine-Grained Cycle Sharing (FGCS) systems, machine owners voluntarily share their unused CPU cycles with guest jobs, as long as their performance degradation is tolerable. However, unpredictable evictions of guest jobs lead to fluctuating completion times. Checkpoint-recovery is an attractive mechanism for recovering from such ”failures”. Today's FGCS systems often use expensive, high-performance dedicated checkpoint servers. However, in geographically distributed clusters, this may incur high checkpoint transfer latencies. In this paper we present a system called FALCON that uses available disk resources of the FGCS machines as shared checkpoint repositories. However, an unavailable storage host may lead to loss of checkpoint data. Therefore, we model failures of storage hosts and develop a prediction algorithm for choosing reliable checkpoint repositories. We experiment with FALCON in the university-wide Condor testbed at Purdue and show improved and consistent performance for guest jobs in the presence of irregular resource availability.","PeriodicalId":371415,"journal":{"name":"Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133902478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Early performance evaluation of a "Nehalem" cluster using scientific and engineering applications
Authors: S. Saini, Andrey Naraikin, R. Biswas, D. Barkai, T. Sandstrom
DOI: https://doi.org/10.1145/1654059.1654084
Abstract: In this paper, we present an early performance evaluation of a 624-core cluster based on the Intel Xeon Processor 5560 (code named "Nehalem-EP", and referred to as Xeon 5560 in this paper), the third-generation quad-core architecture from Intel. This is the first processor from Intel with a non-uniform memory access (NUMA) architecture managed by an on-chip integrated memory controller. It employs a point-to-point interconnect, the Intel QuickPath Interconnect (QPI), between processors and to the input/output (I/O) hub. It also brings to a quad-core architecture both Intel's Hyper-Threading technology (simultaneous multi-threading, "SMT") and Intel Turbo Boost Technology ("Turbo mode"), which automatically allows processor cores to run faster than the base operating frequency while the processor operates below its rated power, temperature, and current specification limits; it can be engaged with any number of cores or logical processors enabled and active. We critically evaluate these features using the High Performance Computing Challenge (HPCC) benchmarks, NAS Parallel Benchmarks (NPB), and four full-scale scientific applications. We compare and contrast the results of the Xeon 5560 cluster with an SGI Altix ICE 8200EX cluster of quad-core Intel Xeon 5472 Processors ("Xeon 5472" from here on) and another cluster of Intel Xeon 5462 Processors ("Xeon 5462"; the Xeon 5400 Series Processors are previous-generation quad-core Intel processors, code named Harpertown).

Title: A scalable method for ab initio computation of free energies in nanoscale systems
Authors: M. Eisenbach, C.-G. Zhou, D. Nicholson, G. Brown, J. Larkin, T. Schulthess
DOI: https://doi.org/10.1145/1654059.1654125
Abstract: Calculating the thermodynamics of nanoscale systems presents challenges in the simultaneous treatment of the electronic structure, which determines the interactions between atoms, and the statistical fluctuations that become ever more important at shorter length scales. Here we present a highly scalable method that combines an ab initio electronic structure technique, the Locally Self-Consistent Multiple Scattering (LSMS) method, with the Wang-Landau (WL) algorithm to compute free energies and other thermodynamic properties of nanoscale systems. The combined WL-LSMS code is targeted at the study of nanomagnetic systems that have anywhere from about one hundred to a few thousand atoms. The code scales very well on the Cray XT5 system at ORNL, sustaining 1.03 Petaflop/s in double precision on 147,464 cores.

Title: Automating the generation of composed linear algebra kernels
Authors: Geoffrey Belter, E. Jessup, I. Karlin, Jeremy G. Siek
DOI: https://doi.org/10.1145/1654059.1654119
Abstract: Memory bandwidth limits the performance of important kernels in many scientific applications. Such applications often use sequences of Basic Linear Algebra Subprograms (BLAS), and highly efficient implementations of those routines enable scientists to achieve high performance at little cost. However, tuning the BLAS in isolation misses opportunities for memory optimization that result from composing multiple subprograms. Because it is not practical to create a library of all BLAS combinations, we have developed a domain-specific compiler that generates them on demand. In this paper, we describe a novel algorithm for compiling linear algebra kernels and searching for the best combination of optimization choices. We also present a new hybrid analytic/empirical method for quickly evaluating the profitability of each optimization. We report experimental results showing speedups of up to 130% relative to the GotoBLAS on an AMD Opteron and up to 137% relative to MKL on an Intel Core 2.

Title: PFunc: modern task parallelism for modern high performance computing
Authors: P. Kambadur, Anshul Gupta, A. Ghoting, H. Avron, A. Lumsdaine
DOI: https://doi.org/10.1145/1654059.1654103
Abstract: HPC today faces new challenges due to paradigm shifts in both hardware and software. The ubiquity of multi-cores, many-cores, and GPGPUs is forcing traditional serial as well as distributed-memory parallel applications to be parallelized for these architectures. Emerging applications in areas such as informatics are placing unique requirements on parallel programming tools that have not yet been addressed. Although task parallelism appears to be the most promising of the available parallel programming models for meeting these new challenges, current solutions for task parallelism are inadequate. In this paper, we introduce PFunc, a new library for task parallelism that extends the feature set of current solutions with custom task scheduling, task priorities, task affinities, multiple completion notifications, and task groups. These features enable PFunc to naturally and efficiently parallelize a wide variety of modern HPC applications and to support the SPMD model of parallel programming. We present three case studies, demand-driven DAG execution, frequent pattern mining, and iterative sparse solvers, to demonstrate the utility of PFunc's new features.
