M. Yokokawa, K. Itakura, Atsuya Uno, T. Ishihara, Y. Kaneda
{"title":"16.4-Tflops Direct Numerical Simulation of Turbulence by a Fourier Spectral Method on the Earth Simulator","authors":"M. Yokokawa, K. Itakura, Atsuya Uno, T. Ishihara, Y. Kaneda","doi":"10.1109/SC.2002.10052","DOIUrl":"https://doi.org/10.1109/SC.2002.10052","url":null,"abstract":"The high-resolution direct numerical simulations (DNSs) of incompressible turbulence with numbers of grid points up to 4096^3 have been executed on the Earth Simulator (ES). The DNSs are based on the Fourier spectral method, so that the equation for mass conservation is accurately solved. In DNS based on the spectral method, most of the computation time is consumed in calculating the three-dimensional (3D) Fast Fourier Transform (FFT), which requires huge-scale global data transfer and has been the major stumbling block that has prevented truly high-performance computing. By implementing new methods to efficiently perform the 3D-FFT on the ES, we have achieved DNS at 16.4 Tflops on 2048^3 grid points. The DNS yields an energy spectrum exhibiting a wide inertial subrange, in contrast to previous DNSs with lower resolutions, and therefore provides valuable data for the study of the universal features of turbulence at large Reynolds number.","PeriodicalId":302800,"journal":{"name":"ACM/IEEE SC 2002 Conference (SC'02)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115532058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MPI and OpenMP Paradigms on Cluster of SMP Architectures: The Vacancy Tracking Algorithm for Multi-Dimensional Array Transposition","authors":"Yun He, C. Ding","doi":"10.12694/SCPE.V5I2.276","DOIUrl":"https://doi.org/10.12694/SCPE.V5I2.276","url":null,"abstract":"We investigate remapping multi-dimensional arrays on cluster-of-SMP architectures under OpenMP, MPI, and hybrid paradigms. The traditional method of array transpose needs an auxiliary array of the same size and a copy-back stage. We recently developed an in-place method using vacancy tracking cycles. The vacancy tracking algorithm outperforms the traditional 2-array method as demonstrated by extensive comparisons. The independence of vacancy tracking cycles allows efficient parallelization of the in-place method on SMP architectures at node level. Performance of multi-threaded parallelism using OpenMP is tested with different scheduling methods and different numbers of threads. The vacancy tracking method is parallelized using several parallel paradigms. At node level, pure OpenMP outperforms pure MPI by a factor of 2.76. Across the entire cluster of SMP nodes, the hybrid MPI/OpenMP implementation outperforms pure MPI by a factor of 4.44, demonstrating the validity of the parallel paradigm of mixing MPI with OpenMP.","PeriodicalId":302800,"journal":{"name":"ACM/IEEE SC 2002 Conference (SC'02)","volume":"767 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120882880","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gerald Baumgartner, D. Bernholdt, D. Cociorva, R. Harrison, S. Hirata, Chi-Chung Lam, M. Nooijen, R. Pitzer, J. Ramanujam, P. Sadayappan
{"title":"A High-Level Approach to Synthesis of High-Performance Codes for Quantum Chemistry","authors":"Gerald Baumgartner, D. Bernholdt, D. Cociorva, R. Harrison, S. Hirata, Chi-Chung Lam, M. Nooijen, R. Pitzer, J. Ramanujam, P. Sadayappan","doi":"10.1109/SC.2002.10056","DOIUrl":"https://doi.org/10.1109/SC.2002.10056","url":null,"abstract":"This paper discusses an approach to the synthesis of high-performance parallel programs for a class of computations encountered in quantum chemistry and physics. These computations are expressible as a set of tensor contractions and arise in electronic structure modeling. An overview is provided of the synthesis system, that transforms a high-level specification of the computation into high-performance parallel code, tailored to the characteristics of the target architecture. An example from computational chemistry is used to illustrate how different code structures are generated under different assumptions of available memory on the target computer.","PeriodicalId":302800,"journal":{"name":"ACM/IEEE SC 2002 Conference (SC'02)","volume":"154 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116634028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
N. Furmento, William Lee, A. Mayer, S. Newhouse, J. Darlington
{"title":"ICENI: An Open Grid Service Architecture Implemented with Jini","authors":"N. Furmento, William Lee, A. Mayer, S. Newhouse, J. Darlington","doi":"10.1109/SC.2002.10027","DOIUrl":"https://doi.org/10.1109/SC.2002.10027","url":null,"abstract":"The move towards Service Grids, where services are composed to meet the requirements of a user community within constraints specified by the resource provider, presents many challenges to service provision and description. To support our research activities in the autonomous composition of services to form a Semantic Service Grid, we describe the adoption within ICENI of web services to enable interoperability with the recently proposed Open Grid Services Architecture.","PeriodicalId":302800,"journal":{"name":"ACM/IEEE SC 2002 Conference (SC'02)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123632650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Web Service Discovery Architecture","authors":"Wolfgang Hoschek","doi":"10.1109/SC.2002.10033","DOIUrl":"https://doi.org/10.1109/SC.2002.10033","url":null,"abstract":"In this paper, we propose the Web Service Discovery Architecture (WSDA). At runtime, Grid applications can use this architecture to discover and adapt to remote services. WSDA promotes an interoperable web service discovery layer by defining appropriate services, interfaces, operations and protocol bindings, based on industry standards. It is unified because it subsumes an array of disparate concepts, interfaces and protocols under a single semi-transparent umbrella. It is modular because it defines a small set of orthogonal multi-purpose communication primitives (building blocks) for discovery. These primitives cover service identification, service description retrieval, data publication as well as minimal and powerful query support. The architecture is open and flexible because each primitive can be used, implemented, customized and extended in many ways. It is powerful because the individual primitives can be combined and plugged together by specific clients and services to yield a wide range of behaviors and emerging synergies.","PeriodicalId":302800,"journal":{"name":"ACM/IEEE SC 2002 Conference (SC'02)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124768894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Decoupled Scheduling Approach for the GrADS Program Development Environment","authors":"H. Dail, H. Casanova, F. Berman","doi":"10.1109/SC.2002.10009","DOIUrl":"https://doi.org/10.1109/SC.2002.10009","url":null,"abstract":"Program development environments are instrumental in providing users with easy and efficient access to parallel computing platforms. While a number of such environments have been widely accepted and used for traditional HPC systems, there are currently no widely used environments for Grid programming. The goal of the Grid Application Development Software (GrADS) project is to develop a coordinated set of tools, libraries and run-time execution facilities for Grid program development. In this paper, we describe a Grid scheduler component that is integrated as part of the GrADS software system. Traditionally, application-level schedulers (e.g. AppLeS) have been tightly integrated with the application itself and were not easily applied to other applications. Our design is generic: we decouple the scheduler core (the search procedure) from the application-specific (e.g. application performance models) and platform-specific (e.g. collection of resource information) components used by the search procedure. We provide experimental validation of our approach for two representative regular, iterative parallel programs in a variety of real-world Grid testbeds. Our scheduler consistently outperforms static and user-driven scheduling methods.","PeriodicalId":302800,"journal":{"name":"ACM/IEEE SC 2002 Conference (SC'02)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124877110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scaling the Unscalable: A Case Study on the AlphaServer SC","authors":"P. Worley","doi":"10.1109/SC.2002.10035","DOIUrl":"https://doi.org/10.1109/SC.2002.10035","url":null,"abstract":"A case study of the optimization of a climate modeling application on the Compaq AlphaServer SC at the Pittsburgh Supercomputer Center is used to illustrate tools and techniques that are important to achieving good performance scaling.","PeriodicalId":302800,"journal":{"name":"ACM/IEEE SC 2002 Conference (SC'02)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125988203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"UPC Performance and Potential: A NPB Experimental Study","authors":"T. El-Ghazawi, François Cantonnet","doi":"10.1109/SC.2002.10034","DOIUrl":"https://doi.org/10.1109/SC.2002.10034","url":null,"abstract":"UPC, or Unified Parallel C, is a parallel extension of ANSI C. UPC follows a distributed shared memory programming model aimed at leveraging the ease of programming of the shared memory paradigm, while enabling the exploitation of data locality. UPC incorporates constructs that allow placing data near the threads that manipulate them to minimize remote accesses. This paper gives an overview of the concepts and features of UPC and establishes, through extensive performance measurements of NPB workloads, the viability of the UPC programming language compared to the other popular paradigms. Further, through performance measurements we identify the challenges, the remaining steps and the priorities for UPC. It will be shown that with proper hand tuning and optimized collective operations libraries, UPC performance will be comparable to that of MPI. Furthermore, by incorporating such improvements into automatic compiler optimizations, UPC will compare quite favorably to message passing in ease of programming.","PeriodicalId":302800,"journal":{"name":"ACM/IEEE SC 2002 Conference (SC'02)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116793424","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 29.5 Tflops Simulation of Planetesimals in Uranus-Neptune Region on GRAPE-6","authors":"J. Makino, E. Kokubo, T. Fukushige, H. Daisaka","doi":"10.1109/SC.2002.10022","DOIUrl":"https://doi.org/10.1109/SC.2002.10022","url":null,"abstract":"As an entry for the 2002 Gordon Bell performance prize, we report the performance achieved on the GRAPE-6 system for a simulation of the early evolution of the protoplanet-planetesimal system of the Uranus-Neptune region. GRAPE-6 is a special-purpose computer for astrophysical N-body calculations. The present configuration has 2048 custom pipeline chips, each containing six pipeline processors for the calculation of gravitational interactions between particles. Its theoretical peak performance is 63.4 Tflops. The actual performance obtained was 29.5 Tflops, for a simulation of the early evolution of outer Solar system with 1.8 million planetesimals and two massive protoplanets.","PeriodicalId":302800,"journal":{"name":"ACM/IEEE SC 2002 Conference (SC'02)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132690162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Separated High-Bandwidth and Low-Latency Communication in the Cluster Interconnect Clint","authors":"H. Eberle, N. Gura","doi":"10.1109/SC.2002.10042","DOIUrl":"https://doi.org/10.1109/SC.2002.10042","url":null,"abstract":"An interconnect for a high-performance cluster has to be optimized with respect to both high throughput and low latency. To avoid the tradeoff between throughput and latency, the cluster interconnect Clint has a segregated architecture that provides two physically separate transmission channels: A bulk channel optimized for high-bandwidth traffic and a quick channel optimized for low-latency traffic. Different scheduling strategies are applied. The bulk channel uses a scheduler that globally allocates time slots on the transmission paths before packets are sent off. This way collisions as well as blockages are avoided. In contrast, the quick channel takes a best-effort approach by sending packets whenever they are available thereby risking collisions and retransmissions. Simulation results clearly show the performance advantages of the segregated architecture. The carefully scheduled bulk channel can be loaded nearly to its full capacity without exhibiting head-of-line blocking that limits many networks while the quick channel provides low-latency communication even in the presence of high-bandwidth traffic.","PeriodicalId":302800,"journal":{"name":"ACM/IEEE SC 2002 Conference (SC'02)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2002-11-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133574569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}