Proceedings of the IEEE/ACM SC98 Conference: Latest Publications

Multi-processor Performance on the Tera MTA
Proceedings of the IEEE/ACM SC98 Conference | Pub Date: 1998-11-07 | DOI: 10.1109/SC.1998.10049
A. Snavely, L. Carter, J. Boisseau, A. Majumdar, K. Gatlin, N. Mitchell, J. Feo, Brian D. Koblenz
{"title":"Multi-processor Performance on the Tera MTA","authors":"A. Snavely, L. Carter, J. Boisseau, A. Majumdar, K. Gatlin, N. Mitchell, J. Feo, Brian D. Koblenz","doi":"10.1109/SC.1998.10049","DOIUrl":"https://doi.org/10.1109/SC.1998.10049","url":null,"abstract":"The Tera MTA is a revolutionary commercial computer based on a multithreaded processor architecture. In contrast to many other parallel architectures, the Tera MTA can effectively use high amounts of parallelism on a single processor. By running multiple threads on a single processor, it can tolerate memory latency and to keep the processor saturated. If the computation is sufficiently large, it can benefit from running on multiple processors. A primary architectural goal of the MTA is that it provide scalable performance over multiple processors. This paper is a preliminary investigation of the first multi-processor Tera MTA. In a previous paper [1] we reported that on the kernel NAS 2 benchmarks [2], a single-processor MTA system running at the architected clock speed would be similar in performance to a single processor of the Cray T90. We found that the compilers of both machines were able to find the necessary threads or vector operations, after making standard changes to the random number generator. In this paper we update the single-processor results in two ways: we use only actual clock speeds, and we report improvements given by further tuning of the MTA codes. We then investigate the performance of the best single-processor codes when run on a two-processor MTA, making no further tuning effort. The parallel efficiency of the codes range from 77% to 99%. An analysis shows that the \"serial bottlenecks\" -- unparallelized code sections and the cost of allocating and freeing the parallel hardware resources -- account for less than a percent of the runtimes. Thus, Amdahl's Law needn't take effect on the NAS benchmarks until there are hundreds of processors running thousands of threads. Instead, the major source of inefficiency appears to be an imperfect network connecting the processors to the memory. Ideally, the network can support one memory reference per instruction. The current hardware has defects that reduce the throughput to about 85% of this rate. Except for the EP benchmark, the tuned codes issue memory references at nearly the peak rate of one per instruction. Consequently, the network can support the memory references issued by one, but not two, processors. As a result, the parallel efficiency of EP is near- perfect, but the others are reduced accordingly. Another reason for imperfect speedup pertains to the compiler. While the definition of a thread in a single processor or multi-processor mode is essentially the same, there is a different implementation and an associated overhead with running on multiple processors. 
We characterize the overhead of running \"frays\" (a collection of threads running on a single processor) and \"crews\" (a collection of frays, one per processor.)","PeriodicalId":113978,"journal":{"name":"Proceedings of the IEEE/ACM SC98 Conference","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128010301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 71
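The abstract's claim about serial bottlenecks can be checked against Amdahl's Law directly; the following is our own worked restatement (the sub-1% serial fraction comes from the abstract, the formulas are standard):

```latex
% Amdahl's Law: speedup on p processors with serial fraction s,
% and the corresponding parallel efficiency E(p) = S(p)/p.
\[
  S(p) = \frac{1}{s + \dfrac{1-s}{p}}, \qquad
  E(p) = \frac{S(p)}{p} = \frac{1}{1 + s\,(p-1)}
\]
% With s < 0.01, the asymptotic speedup bound 1/s exceeds 100, so
% serial code alone cannot cap scaling until processor counts reach
% the hundreds -- consistent with the abstract's observation that the
% memory network, not Amdahl's Law, limits two-processor efficiency.
```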
User-Space Communication: A Quantitative Study
Proceedings of the IEEE/ACM SC98 Conference | Pub Date: 1998-11-07 | DOI: 10.1109/SC.1998.10038
S. Araki, A. Bilas, C. Dubnicki, J. Edler, K. Konishi, J. Philbin
{"title":"User-Space Communication: A Quantitative Study","authors":"S. Araki, A. Bilas, C. Dubnicki, J. Edler, K. Konishi, J. Philbin","doi":"10.1109/SC.1998.10038","DOIUrl":"https://doi.org/10.1109/SC.1998.10038","url":null,"abstract":"Powerful commodity systems and networks offer a promising direction for high performance computing because they are inexpensive and they closely track technology progress. However, high, raw-hardware performance is rarely delivered to the end user. Previous work has shown that the bottleneck in these architectures is the overheads imposed by the software communication layer. To reduce these overheads, researchers have proposed a number of user-space communication models. The common feature of these models is that applications have direct access to the network, bypassing the operating system in the common case and thus avoiding the cost of send/receive system calls. In this paper we examine five user-space communication layers, that represent different points in the configuration space: Generic AM, BIP-0.92, FM-2.02, PM-1.2, and VMMC-2. Although these systems support different communication paradigms and employ a variety of different implementation tradeoffs, we are able to quantitatively compare them on a single testbed consisting of a cluster of high-end PCs connected by a Myrinet network. We find that all five communication systems have very low latency for small messages, in the range of 5 to 17 s. Not surprisingly, this range is strongly influenced by the functionality offered by each system. We are encouraged, however, to find that features such as protected and reliable communication at user level and multiprogramming can be provided at very low cost. Bandwidth, however, depends primarily on how data is transferred between host memory and the network. Most of the investigated libraries support zero-copy protocols for certain types of data transfers, but differ significantly in the bandwidth delivered to end users. The highest bandwidth, between 95 and 125 MBytes/s for long message transfers, is delivered by libraries that use DMA on both send and receive sides and avoid all data copies. Libraries that perform additional data copies or use programmed I/O to send data to the network achieve lower maximum bandwidth, in the range of 60-70 MBytes/s.","PeriodicalId":113978,"journal":{"name":"Proceedings of the IEEE/ACM SC98 Conference","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126867215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 58
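The latency/bandwidth split in the abstract is often summarized with a first-order linear cost model; the sketch below is a standard textbook model, not taken from the paper:

```latex
% Time to transfer an n-byte message with per-message latency L
% and asymptotic bandwidth B:
\[
  T(n) = L + \frac{n}{B}
\]
% The half-power point n_{1/2} = L B is the message size at which
% half of B is achieved. For example, with L = 10 microseconds and
% B = 100 MBytes/s (mid-range values from the study), n_{1/2} is
% about 1000 bytes: short messages are latency-dominated, while long
% transfers are dominated by the copy/DMA strategy that sets B.
```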
A Case for Using MPI's Derived Datatypes to Improve I/O Performance
Proceedings of the IEEE/ACM SC98 Conference | Pub Date: 1998-11-07 | DOI: 10.1109/SC.1998.10006
R. Thakur, W. Gropp, E. Lusk
{"title":"A Case for Using MPI's Derived Datatypes to Improve I/O Performance","authors":"R. Thakur, W. Gropp, E. Lusk","doi":"10.1109/SC.1998.10006","DOIUrl":"https://doi.org/10.1109/SC.1998.10006","url":null,"abstract":"MPI-IO, the I/O part of the MPI-2 standard, is a promising new interface for parallel I/O. A key feature of MPI-IO is that it allows users to access several noncontiguous pieces of data from a file with a single I/O function call by defining file views with derived datatypes. We explain how critical this feature is for high performance, why users must create and use derived datatypes whenever possible, and how it enables implementations to perform optimizations. In particular, we describe two optimizations our MPI-IO implementation, ROMIO, performs: data sieving and collective I/O. We demonstrate the performance and portability of the approach with performance results on five different parallel machines: HP Exemplar, IBM SP, Intel Paragon, NEC SX-4, and SGI Origin2000.","PeriodicalId":113978,"journal":{"name":"Proceedings of the IEEE/ACM SC98 Conference","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126521722","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 89
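To illustrate the mechanism the abstract describes, the sketch below uses a derived datatype as a file view so that each process writes its noncontiguous column block of a 2-D array with a single collective call. It uses only standard MPI-2 routines; the array dimensions and file name are our own choices:

```c
#include <stdlib.h>
#include <mpi.h>

#define GSIZE 1024   /* global array is GSIZE x GSIZE doubles, row-major */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each process owns a block of columns: contiguous in its local
     * buffer, but strided (noncontiguous) in the shared file. */
    int cols = GSIZE / nprocs;   /* assume nprocs divides GSIZE */
    double *local = malloc((size_t)GSIZE * cols * sizeof(double));
    for (int i = 0; i < GSIZE * cols; i++)
        local[i] = rank;         /* dummy data */

    /* Derived datatype describing this process's region of the file. */
    int sizes[2]    = { GSIZE, GSIZE };
    int subsizes[2] = { GSIZE, cols  };
    int starts[2]   = { 0, rank * cols };
    MPI_Datatype filetype;
    MPI_Type_create_subarray(2, sizes, subsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);

    /* The file view exposes the access pattern to the library, which
     * can then apply data sieving or collective (two-phase) I/O. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "array.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
    MPI_File_write_all(fh, local, GSIZE * cols, MPI_DOUBLE,
                       MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Type_free(&filetype);
    free(local);
    MPI_Finalize();
    return 0;
}
```

Without the file view, each process would need GSIZE separate writes (one per row segment), which is exactly the per-call overhead the paper argues derived datatypes let the implementation optimize away.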
A Prototype Notebook-Based Environment for Computational Tools
Proceedings of the IEEE/ACM SC98 Conference | Pub Date: 1998-11-07 | DOI: 10.1109/SC.1998.10031
J. L. Skidmore, M. Sottile, J. Cuny, A. Malony
{"title":"A Prototype Notebook-Based Environment for Computational Tools Computational Tools","authors":"J. L. Skidmore, M. Sottile, J. Cuny, A. Malony","doi":"10.1109/SC.1998.10031","DOIUrl":"https://doi.org/10.1109/SC.1998.10031","url":null,"abstract":"The Virtual Notebook Environment (ViNE) is a platform-independent, web-based interface designed to support a range of scientific activities across distributed, heterogeneous computing platforms. ViNE provides scientists with a web-based version of the common paper-based lab notebook, but in addition, it provides support for collaboration and management of computational experiments. Collaboration is supported with the web-based approach, which makes notebook material generally accessible and with a hierarchy of security mechanisms that screen that access. ViNE provides uniform, system-transparent access to data, tools, and programs throughout the scientist's computing infrastructure. Computational experiments can be launched from ViNE using a visual specification language. The scientist is freed from concerns about inter-tool connectivity, data distribution, or data management details. ViNE also provides support for dynamically linking analysis results back into the notebook content. In this paper we present the ViNE system architecture and a case study of its use in neuropsychology research at the University of Oregon. Our case study with the Brain Electrophysiology Laboratory (BEL) addresses their need for data security and management, collaborative support, and distributed analysis processes. The current version of ViNE is a prototype system being tested with this and other scientific applications.","PeriodicalId":113978,"journal":{"name":"Proceedings of the IEEE/ACM SC98 Conference","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133774550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
The UCLA AGCM in High Performance Computing Environments
Proceedings of the IEEE/ACM SC98 Conference | Pub Date: 1998-11-07 | DOI: 10.1109/SC.1998.10020
C. Mechoso, L. A. Drummond, J. Farrara, J. A. Spahr
{"title":"The UCLA AGCM in High Performance Computing Environments","authors":"C. Mechoso, L. A. Drummond, J. Farrara, J. A. Spahr","doi":"10.1109/SC.1998.10020","DOIUrl":"https://doi.org/10.1109/SC.1998.10020","url":null,"abstract":"General Circulation Models (GCMs) are at the top of the hierarchy of numerical models that are used to study the Earth's climate. To increase the significance of predictions using GCMs requires ensembles of integrations that in turn demand large amounts of computing resources. GCMs codes are particularly difficult to optimize in view of their heterogeneity. In this paper we focus on code optimization for GCMs of the atmosphere (AGCMs), one of the major components of the climate system. In this paper, we present our efforts in optimizing the parallel UCLA AGCM code. The UCLA AGCM is a state-of-the-art finite-difference model of the global atmosphere. Our optimization efforts include the implementation of load balancing schemes, new physical parameterizations of atmospheric processes, code restructuring and use of special mathematical functions. At the beginning of this work, the overall execution time of the code was 459 seconds per simulated day in 256 nodes of a CRAY T3D. At present, the same model configuration requires 51 seconds per simulated day in 256 nodes of a CRAY T3E-900, which is approximately 9 times faster. The peak model performance is about 40 GFLOPs on 512 T3E-900 nodes. We present results in support of our conclusion that major advances in our ability to carry out longer and more detailed climate simulations depend primarily upon development of more powerful supercomputers and that code optimization, for a particular computer architecture, and development of more efficient algorithms can be nearly as important.","PeriodicalId":113978,"journal":{"name":"Proceedings of the IEEE/ACM SC98 Conference","volume":"153 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134225048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
High Performance Fortran Compilation Techniques for Parallelizing Scientific Codes
Proceedings of the IEEE/ACM SC98 Conference | Pub Date: 1998-11-07 | DOI: 10.1109/SC.1998.10034
Vikram S. Adve, G. Jin, J. Mellor-Crummey, Qing Yi
{"title":"High Performance Fortran Compilation Techniques for Parallelizing Scientific Codes","authors":"Vikram S. Adve, G. Jin, J. Mellor-Crummey, Qing Yi","doi":"10.1109/SC.1998.10034","DOIUrl":"https://doi.org/10.1109/SC.1998.10034","url":null,"abstract":"With current compilers for High Performance Fortran (HPF), substantial restructuring and hand- optimization may be required to obtain acceptable performance from an HPF port of an existing Fortran application. A key goal of the Rice dHPF compiler project is to develop optimization techniques that can provide consistently high performance for a broad spectrum of scientific applications with minimal restructuring of existing Fortran 77 or Fortran 90 applications. This paper presents four new optimization techniques we developed to support efficient parallelization of codes with minimal restructuring. These optimizations include computation partition selection for loop nests that use privatizable arrays, along with partial replication of boundary computations to reduce communication overhead; communication- sensitive loop distribution to eliminate inner-loop communications; interprocedural selection of computation partitions; and data availability analysis to eliminate redundant communications. We studied the effectiveness of the dHPF compiler, which incorporates these optimizations, in parallelizing serial versions of the NAS SP and BT application benchmarks. We present experimental results comparing the performance of hand-written MPI code for the benchmarks against code generated from HPF using the dHPF compiler and the Portland Group's pghpf compiler. Using the compilation techniques described in this paper we achieve performance within 15% of hand-written MPI code on 25 processors for BT and within 33% for SP. Furthermore, these results are obtained with HPF versions of the benchmarks that were created with minimal restructuring of the serial code (modifying only approximately 5% of the code).","PeriodicalId":113978,"journal":{"name":"Proceedings of the IEEE/ACM SC98 Conference","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130273767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 52
OpenMP on Networks of Workstations
Proceedings of the IEEE/ACM SC98 Conference | Pub Date: 1998-11-07 | DOI: 10.1109/SC.1998.10001
Honghui Lu, Charlie Hu, W. Zwaenepoel
{"title":"OpenMP on Networks of Workstations","authors":"Honghui Lu, Charlie Hu, W. Zwaenepoel","doi":"10.1109/SC.1998.10001","DOIUrl":"https://doi.org/10.1109/SC.1998.10001","url":null,"abstract":"We describe an implementation of a sizable subset of OpenMP on networks of workstations (NOWs). By extending the availability of OpenMP to NOWs, we overcome one of its primary drawbacks compared to MPI, namely lack of portability to environments other than hardware shared memory machines. In order to support OpenMP execution on NOWs, our compiler targets a software distributed shared memory system (DSM) which provides multi-threaded execution and memory consistency. This paper presents two contributions. First, we identify two aspects of the current OpenMP standard that make an implementation on NOWs hard, and suggest simple modifications to the standard that remedy the situation. These problems reflect differences in memory architecture between software and hardware shared memory and the high cost of synchronization on NOWs. Second, we present performance results of a prototype implementation of an OpenMP subset on a NOW, and compare them with hand-coded software DSM and MPI results for the same applications on the same platform. We use five applications (ASCI Sweep3d, NAS 3D- FFT, SPLASH-2 Water, QSORT, and TSP) exhibiting various styles of parallelization, including pipelined execution, data parallelism, coarse-grained parallelism, and task queues. The measurements show little difference between OpenMP and hand-coded software DSM, but both are still lagging behind MPI. Further work will concentrate on compiler optimization to reduce these differences.","PeriodicalId":113978,"journal":{"name":"Proceedings of the IEEE/ACM SC98 Conference","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129258188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 63
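As a reminder of the programming model being ported, here is a minimal OpenMP data-parallel loop of the kind such an implementation must map onto a software DSM (generic OpenMP, not code from the paper):

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N], b[N];
    double sum = 0.0;

    /* Iterations are divided among threads; the reduction and the
     * implicit barrier at loop end are where a NOW implementation
     * must insert software-DSM consistency and synchronization,
     * which the paper identifies as the expensive operations. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        a[i] = 0.5 * b[i] + 1.0;
        sum += a[i];
    }

    printf("sum = %g (max threads: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```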
A Hierarchical Load-Balancing Framework for Dynamic Multithreaded Computations
Proceedings of the IEEE/ACM SC98 Conference | Pub Date: 1998-11-07 | DOI: 10.1109/SC.1998.10047
V. Karamcheti, A. Chien
{"title":"A Hierarchical Load-Balancing Framework for Dynamic Multithreaded Computations","authors":"V. Karamcheti, A. Chien","doi":"10.1109/SC.1998.10047","DOIUrl":"https://doi.org/10.1109/SC.1998.10047","url":null,"abstract":"High-level parallel programming models supporting dynamic fine-grained threads in a global object space, are becoming increasingly popular for expressing irregular applications based on sophisticated adaptive algorithms and pointer-based data structures. However, implementing these multithreaded computations on scalable parallel machines poses significant challenges, particularly with respect to load-balancing. Load-balancing techniques must simultaneously incur low overhead to support fine-grained threads as well as be sophisticated enough to preserve data locality and thread execution priority. This paper presents a hierarchical framework which addresses these conflicting goals by viewing the computation as being made up of different thread subsets, each of which are load-balanced independently. In contrast to previous processor-centric approaches that have advocated the use of a uniform policy for load-balancing all threads in a computation, our framework allows each thread subset to be load-balanced using a policy most suited to its characteristics (e.g., locality or priority sensitivity). The framework consists of two parts: (i) language support which permits a programmer to tag different thread subsets with appropriate policies, and (ii) run-time support which synthesizes overall application load-balance by composing these individual policies. This framework has been implemented in the Illinois Concert runtime system, an execution platform for fine-grained concurrent object-oriented languages. Results for four large irregular applications on the Cray T3D and the SGI Origin 2000 demonstrate advantages of the hierarchical framework: performance improves by up to an order of magnitude as compared to using a uniform load-balancing policy.","PeriodicalId":113978,"journal":{"name":"Proceedings of the IEEE/ACM SC98 Conference","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127759147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 22
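The two-part framework (programmer-visible policy tags plus a runtime that dispatches through them) can be sketched as follows. This is a hypothetical illustration in C; the names and the policy set are ours, not the Illinois Concert API:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical policy tags a programmer attaches to thread subsets. */
typedef enum { LB_RANDOM, LB_LOCALITY, LB_PRIORITY } lb_policy_t;

typedef struct {
    const char *name;
    lb_policy_t policy;   /* load-balancing policy for this subset */
    int         home;     /* preferred processor (locality-tagged work) */
    int         priority; /* rank (priority-tagged work) */
} task_t;

/* Runtime side: each subset is balanced by its own policy, rather
 * than one uniform scheme for every thread in the computation. */
static int place_task(const task_t *t, int nprocs) {
    switch (t->policy) {
    case LB_LOCALITY: return t->home;               /* keep data local */
    case LB_PRIORITY: return t->priority % nprocs;  /* dedicated pool  */
    default:          return rand() % nprocs;       /* spread the load */
    }
}

int main(void) {
    task_t tasks[] = {
        { "mesh-update", LB_LOCALITY, 3, 0 },
        { "tree-search", LB_RANDOM,   0, 0 },
        { "root-solve",  LB_PRIORITY, 0, 1 },
    };
    for (int i = 0; i < 3; i++)
        printf("%-11s -> processor %d\n", tasks[i].name,
               place_task(&tasks[i], 8));
    return 0;
}
```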
An out-of-core implementation of the COLUMBUS massively-parallel multireference configuration interaction program
Proceedings of the IEEE/ACM SC98 Conference | Pub Date: 1998-11-07 | DOI: 10.1109/SC.1998.10027
H. Dachsel, J. Nieplocha, R. Harrison
{"title":"An out-of-core implementation of the COLUMBUS massively-parallel multireference configuration interaction program","authors":"H. Dachsel, J. Nieplocha, R. Harrison","doi":"10.1109/SC.1998.10027","DOIUrl":"https://doi.org/10.1109/SC.1998.10027","url":null,"abstract":"In this paper, we describe a novel parallelization approach we developed to solve the largest multireference configuration interaction (MRCI) problem ever attempted. From the mathematical perspective, the program solves the eigenvalue problem for a very large, sparse, symmetric Hamilton matrix. Using an out-of-core approach, shared memory programming model, improved data compression algorithms, and dynamic load balancing we were able to solve a problem six times larger than previously reported. The potential curve for the chromium dimer was calculated with a Hamilton matrix of dimension 1.3 billion (1,295,937,374). This task involved moving 1.5 terabytes of data between main memory and secondary storage per MRCI iteration. Furthermore, by employing Active Messages and user-level striping to combine multiple files on local disks on the IBM SP into a single logically-shared file, the execution time of the program was reduced by a factor of three, as compared to our initial implementation on top of the IBM PIOFS parallel filesystem.","PeriodicalId":113978,"journal":{"name":"Proceedings of the IEEE/ACM SC98 Conference","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130029489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
A new Lanczos method for electronic structure calculations
Proceedings of the IEEE/ACM SC98 Conference | Pub Date: 1998-11-07 | DOI: 10.5555/509058.509093
Kesheng Wu, A. Canning, H. Simon
{"title":"A new Lanczos method for electronic structure calculations","authors":"Kesheng Wu, A. Canning, H. Simon","doi":"10.5555/509058.509093","DOIUrl":"https://doi.org/10.5555/509058.509093","url":null,"abstract":"In the heart of most electronic structure simulation programs, there is a routine to find the solution of eigenvalue problems. Solving these eigenvalue problems usually dominates the computer time used for the whole simulation[6]. Because of their physical properties, these eigenvalue problems are always symmetric real or Hermitian. The dimensions of the matrices involved are usually very large and a large number of eigenvalues and their corresponding eigenvectors are needed to compute the desired physical quantities. To solve this type of problems, we introduce a variant of the Lanczos method called the thick-restart Lanczos method. In material science, this method is most appropriate for non-selfconsistent cases where the eigenvalue problems are linear and the number of required eigenvalues is relatively small compared to the size of the matrix.The Lanczos method is very simple and yet effective in finding eigenvalues. It is also well suited for parallel computing. There are two common ways of implementing the Lanczos method depending on whether the Lanczos vectors are stored. When the Lanczos vectors are not stored, they may lose orthogonality and the Lanczos method may generate spurious eigenvalues [2, 10]. Though spurious eigenvalues can be effectively identified, we still prefer not to deal with the spurious eigenvalues. When the Lanczos vectors are stored, the loss of orthogonality problem can be corrected by re-orthogonalization [4, 5, 7]. No spurious eigenvalue is generated in this case. However, because each Lanczos step generates one vector, a large amount of computer memory may be required to store all the Lanczos vectors. To limit the maximum amount of memory used, we typically restart the Lanczos algorithm after a certain number of steps. The restarted versions usually use considerably more matrix-vector multiplications than the non-restarted version. In recent years, newly developed restarting strategies have significantly reduced the number of matrix-vector multiplications used. The two most successful ones are the implicit restarting technique [1, 3, 8] and the dynamic thick-restart technique [9, 12]. For symmetric or Hermitian eigenvalue problems, these two schemes are equivalent. Because the thick-restart scheme is easier to implement and it is slightly more flexible than the implicit restarted scheme [9, 12], the new method described here uses the thick-restart scheme. Other thick-restart eigenvalue methods, e.g., the thick-restart Davidson method, can be applied on symmetric eigenvalue problems as well. Compared to them, the main advantage of the new scheme is that it uses less arithmetic operations by taking full advantage of the symmetry of the matrix [13].","PeriodicalId":113978,"journal":{"name":"Proceedings of the IEEE/ACM SC98 Conference","volume":"159 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1998-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114314725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
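For context, the recurrence underlying all of the variants discussed above is the standard symmetric Lanczos iteration (textbook notation, not the paper's):

```latex
% Lanczos three-term recurrence for a symmetric matrix A, building
% an orthonormal Krylov basis q_1, q_2, ...:
\[
  \beta_j\, q_{j+1} \;=\; A q_j \;-\; \alpha_j q_j \;-\; \beta_{j-1} q_{j-1},
  \qquad \alpha_j = q_j^{T} A q_j .
\]
% After m steps, A Q_m = Q_m T_m + beta_m q_{m+1} e_m^T with T_m
% tridiagonal (diagonal alpha_j, off-diagonal beta_j); eigenpairs of
% T_m yield Ritz approximations to extreme eigenpairs of A. A thick
% restart retains several converged Ritz vectors in the basis instead
% of restarting from a single vector, which bounds the memory needed
% for stored Lanczos vectors while keeping re-orthogonalization cheap.
```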