{"title":"High-Performance and Scalable MPI over InfiniBand with Reduced Memory Usage: An In-Depth performance Analysis","authors":"S. Sur, Matthew J. Koop, D. Panda","doi":"10.1145/1188455.1188565","DOIUrl":"https://doi.org/10.1145/1188455.1188565","url":null,"abstract":"InfiniBand is an emerging HPC interconnect being deployed in very large scale clusters, with even larger InfiniBand-based clusters expected to be deployed in the near future. The message passing interface (MPI) is the programming model of choice for scientific applications running on these large scale clusters. Thus, it is very critical for the MPI implementation used to be based on a scalable and high-performance design. We analyze the performance and scalability aspects of MVAPICH, a popular open-source MPI implementation on InfiniBand, from an application standpoint. We analyze the performance and memory requirements of the MPI library while executing several well-known applications and benchmarks, such as NAS, SuperLU, NAMD, and HPL on a 64-node InfiniBand cluster. Our analysis reveals that latest design of MVAPICH requires an order of magnitude less internal MPI memory (average per process) and yet delivers the best possible performance. Further, we observe that for these benchmarks and applications evaluated, the internal memory requirement of MVAPICH remains nearly constant at around 5-10 MB as the number of processes increase, indicating that the MVAPICH design is highly scalable","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130761396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Blue Matter: Approaching the Limits of Concurrency for Classical Molecular Dynamics","authors":"B. Fitch, A. Rayshubskiy, M. Eleftheriou, T. Ward, M. Giampapa, M. Pitman, R. Germain","doi":"10.1145/1188455.1188547","DOIUrl":"https://doi.org/10.1145/1188455.1188547","url":null,"abstract":"This paper describes a novel spatial-force decomposition for N-body simulations for which we observe O(sqrt(p)) communication scaling. This has enabled Blue Matter to approach the effective limits of concurrency for molecular dynamics using particle-mesh (FFT-based) methods for handling electrostatic interactions. Using this decomposition, Blue Matter running on Blue Gene/L has achieved simulation rates in excess of 1000 time steps per second and demonstrated significant speed-ups to O(1) atom per node. Blue Matter employs a communicating sequential process (CSP) style model with application communication state machines compiled to hardware interfaces. The scalability achieved has enabled methodologically rigorous biomolecular simulations on biologically interesting systems, such as membrane-bound proteins, whose time scales dwarf previous work on those systems. Major scaling improvements require exploration of alternative algorithms for treating the long range electrostatics","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128964947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluation of a Workflow Scheduler Using Integrated Performance Modelling and Batch Queue Wait Time Prediction","authors":"Daniel Nurmi, A. Mandal, J. Brevik, C. Koelbel, R. Wolski, K. Kennedy","doi":"10.1145/1188455.1188579","DOIUrl":"https://doi.org/10.1145/1188455.1188579","url":null,"abstract":"Large-scale distributed systems offer computational power at unprecedented levels. In the past, HPC users typically had access to relatively few individual supercomputers and, in general, would assign a one-to-one mapping of applications to machines. Modern HPC users have simultaneous access to a large number of individual machines and are beginning to make use of all of them for single-application execution cycles. One method that application developers have devised in order to take advantage of such systems is to organize an entire application execution cycle as a workflow. The scheduling of such workflows has been the topic of a great deal of research in the past few years and, although very sophisticated algorithms have been devised, a very specific aspect of these distributed systems, namely that most supercomputing resources employ batch queue scheduling software, has therefore been omitted from consideration, presumably because it is difficult to model accurately. In this work, we augment an existing workflow scheduler through the introduction of methods which make accurate predictions of both the performance of the application on specific hardware, and the amount of time individual workflow tasks would spend waiting in batch queues. Our results show that although a workflow scheduler alone may choose correct task placement based on data locality or network connectivity, this benefit is often compromised by the fact that most jobs submitted to current systems must wait in overcommitted batch queues for a significant portion of time. However, incorporating the enhancements we describe improves workflow execution time in settings where batch queues impose significant delays on constituent workflow tasks","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"195 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122515577","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Memory Model for Scientific Algorithms on Graphics Processors","authors":"N. Govindaraju, S. Larsen, J. Gray, Dinesh Manocha","doi":"10.1145/1188455.1188549","DOIUrl":"https://doi.org/10.1145/1188455.1188549","url":null,"abstract":"We present a memory model to analyze and improve the performance of scientific algorithms on graphics processing units (GPUs). Our memory model is based on texturing hardware, which uses a 2D block-based array representation to perform the underlying computations. We incorporate many characteristics of GPU architectures including smaller cache sizes, 2D block representations, and use the 3C's model to analyze the cache misses. Moreover, we present techniques to improve the performance of nested loops on GPUs. In order to demonstrate the effectiveness of our model, we highlight its performance on three memory-intensive scientific applications - sorting, fast Fourier transform and dense matrix-multiplication. In practice, our cache-efficient algorithms for these applications are able to achieve memory throughput of 30-50 GB/s on a NVIDIA 7900 GTX GPU. We also compare our results with prior GPU-based and CPU-based implementations on high-end processors. In practice, we are able to achieve 2-5x performance improvement","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"92 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120869714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward Real-Time Image Guided Neurosurgery Using Distributed and Grid Computing","authors":"N. Chrisochoides, Andrey Fedorov, A. Kot, N. Archip, P. Black, O. Clatz, A. Golby, R. Kikinis, S. Warfield","doi":"10.1145/1188455.1188536","DOIUrl":"https://doi.org/10.1145/1188455.1188536","url":null,"abstract":"Neurosurgical resection is a therapeutic intervention in the treatment of brain tumors. Precision of the resection can be improved by utilizing magnetic resonance imaging (MRI) as an aid in decision making during image guided neurosurgery (IGNS). Image registration adjusts pre-operative data according to intra-operative tissue deformation. Some of the approaches increase the registration accuracy by tracking image landmarks through the whole brain volume. High computational cost used to render these techniques inappropriate for clinical applications. In this paper we present a parallel implementation of a state of the art registration method, and a number of needed incremental improvements. Overall, we reduced the response time for registration of an average dataset from about an hour and for some cases more than an hour to less than seven minutes, which is within the time constraints imposed by neurosurgeons. For the first time in clinical practice we demonstrated, that with the help of distributed computing non-rigid MRI registration based on volume tracking can be computed intra-operatively","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130255700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"End-System Aware, Rate-Adaptive Protocol for Network Transport in LambdaGrid Environments","authors":"P. Datta, W. Feng, Sushant Sharma","doi":"10.1145/1188455.1188572","DOIUrl":"https://doi.org/10.1145/1188455.1188572","url":null,"abstract":"Next-generation e-Science applications would require the ability to transfer information at high data rates between distributed computing centers and data repositories. A LambdaGrid offers dedicated, optical, circuit-switched, point-to-point connections that can be reserved exclusively for such applications. These dedicated high-speed connections eliminate network congestion as seen in traditional Internet, but they effectively push the network congestion to the end systems, as processing speeds cannot keep up with networking speeds. Thus, developing an efficient transport protocol over such high-speed dedicated circuits is of critical importance. We propose the idea of a end-system aware, rate-adaptive protocol for network transport, based on end-system performance monitoring. Our proposed protocol significantly improves the performance of data transfer over LambdaGrids by intelligently adapting the sending rate based on end-system constraints. We demonstrate the effectiveness of our proposed protocol and illustrate the performance gains achieved via wide-area network emulation","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131954925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Architectures and APIs: Assessing Requirements for Delivering FPGA Performance to Applications","authors":"K. Underwood, K. Hemmert, C. Ulmer","doi":"10.1145/1188455.1188571","DOIUrl":"https://doi.org/10.1145/1188455.1188571","url":null,"abstract":"Reconfigurable computing leveraging field programmable gate arrays (FPGAs) is one of many accelerator technologies that are being investigated for application to high performance computing (HPC). Like most accelerators, FPGAs are very efficient at both dense matrix multiplication and FFT computations, but two important aspects of how to deliver that performance to applications have received too little attention. First, the standard API for important compute kernels hides parallelism from the system. Second, the issue of system architecture is virtually never addressed. This paper explores both issues and their implications for applications. We find that high bandwidth, low latency connectivity can be important, but the right API can be even more important","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132961477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Designing a Highly-Scalable Operating System: The Blue Gene/L Story","authors":"José E. Moreira, Michael Brutman, J. Castaños, Thomas Engelsiepen, M. Giampapa, Thomas Gooding, R. Haskin, T. Inglett, D. Lieber, P. McCarthy, M. Mundy, Jeff Parker, Brian P. Wallenfelt","doi":"10.1145/1188455.1188578","DOIUrl":"https://doi.org/10.1145/1188455.1188578","url":null,"abstract":"Blue Gene/L, is currently the world's fastest and most scalable supercomputer. It has demonstrated essentially linear scaling all the way to 131,072 processors in several benchmarks and real applications. The operating systems for the compute and I/O nodes of Blue Gene/L are among the components responsible for that scalability. Compute nodes are dedicated to running application processes, whereas I/O nodes are dedicated to performing system functions. The operating systems adopted for each of these nodes reflect this separation of junction. Compute nodes run a lightweight operating system called the compute node kernel. I/O nodes run a port of the Linux operating system. This paper discusses the architecture and design of this solution for Blue Gene/L in the context of the hardware characteristics that led to the design decisions. It also explains and demonstrates how those decisions are instrumental in achieving the performance and scalability for which Blue Gene/L is famous","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114669819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward a Doctrine of Containment: Grid Hosting with Adaptive Resource Control","authors":"L. Ramakrishnan, David E. Irwin, Laura E. Grit, Aydan R. Yumerefendi, Adriana Iamnitchi, J. Chase","doi":"10.1145/1188455.1188561","DOIUrl":"https://doi.org/10.1145/1188455.1188561","url":null,"abstract":"Grid computing environments need secure resource control and predictable service quality in order to be sustainable. We propose a grid hosting model in which independent, self-contained grid deployments run within isolated containers on shared resource provider sites. Sites and hosted grids interact via an underlying resource control plane to manage a dynamic binding of computational resources to containers. We present a prototype grid hosting system, in which a set of independent globus grids share a network of cluster sites. Each grid instance runs a coordinator that leases and configures cluster resources for its grid on demand. Experiments demonstrate adaptive provisioning of cluster resources and contrast job-level and container-level resource management in the context of two grid application managers","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123573837","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating Grid Portal Security","authors":"D. Vecchio, Victor Hazlewood, M. Humphrey","doi":"10.1145/1188455.1188574","DOIUrl":"https://doi.org/10.1145/1188455.1188574","url":null,"abstract":"Grid portals are an increasingly popular mechanism for creating customizable, Web-based interfaces to grid services and resources. Due to the powerful, general-purpose nature of grid technology, the security of any portal or entry point to such resources cannot be taken lightly. This is particularly true if the portal is running inside of a trusted perimeter, such as a science gateway running on an SDSC machine for access to the TeraGrid. To evaluate the current state of grid portal security, we undertake a comparative analysis of the three most popular grid portal frameworks that are being pursued as frontends to the TeraGrid: GridSphere, OGCE and clarens. We explore general challenges that grid portals face in the areas of authentication (including user identification), authorization, auditing (logging) and session management then contrast how the different grid portal implementations address these challenges. We find that although most grid portals address these security concerns to a certain extent, there is still room for improvement, particularly in the areas of secure default configurations and comprehensive logging and auditing support. We conclude with specific recommendations for designing, implementing and configuring secure grid portals","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129170044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}