{"title":"Preliminary Investigation of Advanced Electrostatics in Molecular Dynamics on Reconfigurable Computers","authors":"R. Scrofano, V. Prasanna","doi":"10.1145/1188455.1188550","DOIUrl":"https://doi.org/10.1145/1188455.1188550","url":null,"abstract":"Scientific computing is marked by applications with very high performance demands. As technology has improved, reconfigurable hardware has become a viable platform to provide application acceleration, even for floating-point-intensive scientific applications. Now, reconfigurable computers - computers with general purpose microprocessors, reconfigurable hardware, memory, and high performance interconnect - are emerging as platforms that allow complete applications to be partitioned into parts that execute in software and parts that are accelerated in hardware. In this paper, we study molecular dynamics simulation. Specifically, we study the use of the smooth particle mesh Ewald technique in a molecular dynamics simulation program that takes advantage of the hardware acceleration capabilities of a reconfigurable computer. We demonstrate a 2.7-2.9× speed-up over the corresponding software-only simulation program. Along the way, we note design issues and techniques related to the use of reconfigurable computers for scientific computing in general","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122798525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Benchmarking XML Processors for Applications in Grid Web Services","authors":"Michael R. Head, M. Govindaraju, Robert A. van Engelen, Wei Zhang","doi":"10.1145/1188455.1188581","DOIUrl":"https://doi.org/10.1145/1188455.1188581","url":null,"abstract":"Web services based specifications have emerged as the underlying architecture for core grid services and standards, such as WSRF. XML is inextricably intertwined with Web services based specifications, and as a result the design and implementation of XML processing tools plays a significant role in grid applications. These applications use XML in a wide variety of ways, including workflow specifications, WS-Security based documents, service descriptions in WSDL, and on-the-wire format in SOAP-based communication. The application characteristics also vary widely in the use of XML messages in their performance, memory, size, and processing requirements. Numerous XML processing tools exist today, each of which is optimized for specific features. To make the right decisions, grid application and middleware developers must thus understand the complex dependencies between XML features and the application. We propose a standard benchmark suite for quantifying, comparing, and contrasting the performance of XML processors under a wide range of representative use cases. The benchmarks are defined by a set of XML schemas and conforming documents. To demonstrate the utility of the benchmarks and to provide a snapshot of the current XML implementation landscape, we report the performance of many different XML implementations, on the benchmarks, and draw conclusions about their current performance characteristics. We also present a brief analysis on the current shortcomings and required critical design changes for multi-threaded XML processing tools to run efficiently on emerging multi-core architectures","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128718832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Grid Capacity Planning with Negotiation-based Advance Reservation for Optimized QoS","authors":"M. Siddiqui, A. Villazón, T. Fahringer","doi":"10.1145/1188455.1188563","DOIUrl":"https://doi.org/10.1145/1188455.1188563","url":null,"abstract":"Advance reservation of grid resources can play a key role in enabling grid middleware to deliver on-demand resource provision with significantly improved quality-of-service (QoS). However, in the grid, advance reservation has been largely ignored due to the dynamic grid behavior, underutilization concerns, multi-constrained applications, and lack of support for agreement enforcement. These issues force the grid middleware to make resource allocations at run-time with reduced QoS. To remedy these, we introduce a new, 3-layered negotiation protocol for advance reservation of the grid resources. We model resource allocation as an online strip packing problem and introduce a new mechanism that optimizes resource utilization and QoS constraints while generating the contention-free solutions. The mechanism supports open reservations to deal with the dynamic grid and provides a practical solution for agreement enforcement. We have implemented a prototype and performed experiments to demonstrate the effectiveness of our approach","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131411813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Large Image Correction and Warping in a Cluster Environment","authors":"Vijay S. Kumar, B. Rutt, T. Kurç, Ümit V. Çatalyürek, J. Saltz, S. Chow, S. Lamont, M. Martone","doi":"10.1145/1188455.1188539","DOIUrl":"https://doi.org/10.1145/1188455.1188539","url":null,"abstract":"This paper is concerned with efficient execution of a pipeline of data processing operations on very large images obtained from confocal microscopy instruments. We describe parallel, out-of-core algorithms for each operation in this pipeline. One of the challenging steps in the pipeline is the warping operation using inverse mapping based methods. We propose and investigate a set of algorithms to handle the warping computations on storage clusters. Our experimental results show that the proposed approaches are scalable both in terms of number of processors and the size of images","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"98 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124728078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Estimating Query Result Sizes for Proxy Caching in Scientific Database Federations","authors":"T. Malik, R. Burns, N. Chawla, A. Szalay","doi":"10.1145/1188455.1188562","DOIUrl":"https://doi.org/10.1145/1188455.1188562","url":null,"abstract":"In a proxy cache for federations of scientific databases it is important to estimate the size of a query before making a caching decision. With accurate estimates, near-optimal cache performance can be obtained. On the other extreme, inaccurate estimates can render the cache totally ineffective. We present classification and regression over templates (CAROT), a general method for estimating query result sizes, which is suited to the resource-limited environment of proxy caches and the distributed nature of database federations. CAROT estimates query result sizes by learning the distribution of query results, not by examining or sampling data, but from observing workload. We have integrated CAROT into the proxy cache of the National Virtual Observatory (NVO) federation of astronomy databases. Experiments conducted in the NVO show that CAROT dramatically outperforms conventional estimation techniques and provides near-optimal cache performance","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114557779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Potential Energy Efficiency of Vector Acceleration","authors":"Christophe Lemuet, Jack Sampson, Jean-Francois Collard, Norm Jouppi","doi":"10.1145/1188455.1188537","DOIUrl":"https://doi.org/10.1145/1188455.1188537","url":null,"abstract":"Energy efficiency of computation is quickly becoming a key problem from the chip through the data center. This paper presents the first quantitative study of the potential energy efficiency of vector accelerators. We propose and study a vector accelerator architecture suitable for implementation in a 70 nm technology. The vector architecture has a high-bandwidth on-chip cache system coupled to 16 independent memory channels. We show that such an accelerator can achieve speedups of 10X or more on loop kernels in comparison to a quad-issue superscalar uniprocessor, while using less energy. We also introduce run-ahead lanes, a complexity and energy efficient means of tolerating variable latency from crossbar contention, cache bank conflicts, cache misses, and the memory system. Run-ahead lanes only synchronize on dependencies or when explicitly directed","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129749757","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Modeling and Optimization of a High Energy Colliding Beam Simulation Code","authors":"H. Shan, E. Strohmaier, J. Qiang, D. Bailey, K. Yelick","doi":"10.1145/1188455.1188557","DOIUrl":"https://doi.org/10.1145/1188455.1188557","url":null,"abstract":"An accurate modeling of the beam-beam interaction is essential to maximizing the luminosity in existing and future colliders. BeamBeam3D was the first parallel code that can be used to study this interaction fully self-consistently on high-performance computing platforms. Various all-to-all personalized communication (AAPC) algorithms dominate its communication patterns, for which we developed a sequence of performance models using a series of micro-benchmarks. We find that for SMP based systems the most important performance constraint is node-adapter contention, while for 3D-torus topologies good performance models are not possible without considering link contention. The best average model prediction error is very low on SMP based systems, with errors of 3% to 7%. On torus based systems the errors are higher, at 29%, but optimized performance can again be predicted within 8% in some cases. These excellent results across five different systems indicate that this methodology for performance modeling can be applied to a large class of algorithms","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133862355","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design and Implementation of a One-Sided Communication Interface for the IBM eServer Blue Gene","authors":"M. Blocksome, C. Archer, T. Inglett, P. McCarthy, M. Mundy, J. Ratterman, A. Sidelnik, B. Smith, G. Almási, J. Castaños, D. Lieber, J. Moreira, S. Krishnamoorthy, V. Tipparaju, J. Nieplocha","doi":"10.1145/1188455.1188580","DOIUrl":"https://doi.org/10.1145/1188455.1188580","url":null,"abstract":"This paper discusses the design and implementation of a one-sided communication interface for the IBM Blue Gene/L supercomputer. This interface facilitates ARMCI and the Global Arrays toolkit and can be used by other one-sided communication libraries. New protocols, interrupt driven communication, and compute node kernel enhancements were required to enable these libraries. Three possible methods for enabling ARMCI on the Blue Gene/L software stack are discussed. A detailed look into the development process shows how the implementation of the one-sided communication interface was completed. This was accomplished on a compressed time scale with the collaboration of various organizations within IBM and open source communities. In addition to enabling the one-sided libraries, bandwidth enhancements were made for communication along a diagonal on the Blue Gene/L torus network. The maximum bandwidth improved by a factor of three. This work will enable a variety of one-sided applications to run on Blue Gene/L","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130024356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Performance Comparison Through Benchmarking and Modeling of Three Leading Supercomputers: Blue Gene/L, Red Storm, and Purple","authors":"A. Hoisie, Greg Johnson, D. Kerbyson, M. Lang, S. Pakin","doi":"10.1145/1188455.1188534","DOIUrl":"https://doi.org/10.1145/1188455.1188534","url":null,"abstract":"This work provides a performance analysis of three leading supercomputers that have recently been deployed: Purple, Red Storm and Blue Gene/L. Each of these machines is architecturally diverse, with very different performance characteristics. Each contains over 10,000 processors and has a system peak of over 40 Teraflops. We analyze each system using a range of micro-benchmarks which include communication performance as well as quantifying the impact of the operating system. The achievable application performance is compared across the systems. The application performance is confirmed via the use of detailed application models which use the underlying performance characteristics as measured by the micro-benchmarks. We also compare the machines in a realistic production scenario in which each machine is used so as to maximize its memory usage with the applications executed in a weak-scaling mode. The results also help illustrate that achievable performance is not directly related to the peak performance","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123858989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling Pulse Propagation and Scattering in a Dispersive Medium: Performance of MPI/OpenMP Hybrid Code","authors":"R. Rosenberg, G. Norton, J. Novarini, W. Anderson, M. Lanzagorta","doi":"10.1145/1188455.1188555","DOIUrl":"https://doi.org/10.1145/1188455.1188555","url":null,"abstract":"Accurate modeling of pulse propagation and scattering is of great importance to the Navy. In a non-dispersive medium a fourth order in time and space 2-D finite difference time domain (FDTD) scheme representation of the linear wave equation can be used. However when the medium is dispersive one is required to take into account the frequency dependent attenuation and phase velocity. Using a theory first proposed by Blackstock, the linear wave equation has been modified by adding an additional term (the derivative of the convolution between the causal time domain propagation factor and the acoustic pressure) that takes into account the dispersive nature of the medium. This additional term transforms the calculation from one suitable to a workstation into one very much suited to a large-scale computational platform, both in terms of computation and memory. With appropriate distribution of data, good scaling can be achieved up to thousands of processors. Due to the simple structure of the code, it is easily parallelized using three different techniques: pure MPI, pure OpenMP and a hybrid MPI/OpenMP. We use this real life application to evaluate the performance of the latest multi-cpu/multicore platforms available from the DoD HPCMP","PeriodicalId":333909,"journal":{"name":"ACM/IEEE SC 2006 Conference (SC'06)","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115177025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}