{"title":"Cost Effectiveness of an Adaptable Computing Cluster","authors":"K. Underwood, R. Sass, W. Ligon","doi":"10.1145/582034.582088","DOIUrl":"https://doi.org/10.1145/582034.582088","url":null,"abstract":"With a focus on commodity PC systems, Beowulf clusters traditionally lack the cutting edge network architectures, memory subsystems, and processor technologies found in their more expensive supercomputer counterparts. What Beowulf clusters lack in technology, they more than make up for with their significant cost advantage over traditional supercomputers. This paper presents the cost implications of an architectural extension that adds reconfigurable computing to the network interface of Beowulf clusters. A quantitative idea of cost-effectiveness is formulated to evaluate computing technologies. Here, cost-effectiveness is considered in the context of two applications: the 2D Fast Fourier transform (2D-FFT) and integer sorting.","PeriodicalId":325282,"journal":{"name":"ACM/IEEE SC 2001 Conference (SC'01)","volume":"150 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123237699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scientific Computing on the Itanium™ Processor","authors":"B. Greer, J. Harrison, G. Henry, Wei Li, P. T. P. Tang","doi":"10.1145/582034.582075","DOIUrl":"https://doi.org/10.1145/582034.582075","url":null,"abstract":"The 64-bit Intel® Itanium™ architecture is designed for high-performance scientific and enterprise computing, and the Itanium processor is its first silicon implementation. Features such as extensive arithmetic support, predication, speculation, and explicit parallelism can be used to provide a sound infrastructure for supercomputing. A large number of high-performance computer companies are offering Itanium™-based systems, some capable of peak performance exceeding 50 GFLOPS. In this paper we give an overview of the most relevant architectural features and provide illustrations of how these features are used in both low-level and high-level support for scientific and engineering computing, including transcendental functions and linear algebra kernels.","PeriodicalId":325282,"journal":{"name":"ACM/IEEE SC 2001 Conference (SC'01)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122051200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Load Balancing of SAMR Applications on Distributed Systems","authors":"Z. Lan, V. Taylor, G. Bryan","doi":"10.1145/582034.582070","DOIUrl":"https://doi.org/10.1145/582034.582070","url":null,"abstract":"Dynamic load balancing (DLB) for parallel systems has been studied extensively; however, DLB for distributed systems is relatively new. To efficiently utilize computing resources provided by distributed systems, an underlying DLB scheme must address both heterogeneous and dynamic features of distributed systems. In this paper, we propose a DLB scheme for Structured Adaptive Mesh Refinement (SAMR) applications on distributed systems. While the proposed scheme can take into consideration (1) the heterogeneity of processors and (2) the heterogeneity and dynamic load of the networks, the focus of this paper is on the latter. The load-balancing processes are divided into two phases: global load balancing and local load balancing. We also provide a heuristic method to evaluate the computational gain and redistribution cost for global redistribution. Experiments show that by using our distributed DLB scheme, the execution time can be reduced by 9-46% compared with a parallel DLB scheme that does not consider the heterogeneous and dynamic features of distributed systems.","PeriodicalId":325282,"journal":{"name":"ACM/IEEE SC 2001 Conference (SC'01)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130273133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Network and I/O Throttling for Fine-Grain Cycle Stealing","authors":"K. D. Ryu, J. Hollingsworth, P. Keleher","doi":"10.1145/582034.582037","DOIUrl":"https://doi.org/10.1145/582034.582037","url":null,"abstract":"This paper proposes and evaluates a new mechanism, rate windows, for I/O and network rate policing. The goal of the proposed system is to provide a simple, yet effective way to enforce resource limits on target classes of jobs in a system. This work was motivated by our Linger Longer infrastructure, which harvests idle cycles in networks of workstations. Network and I/O throttling is crucial because Linger Longer can leave guest jobs on non-idle nodes and machine owners should not be adversely affected. Our approach is quite simple. We use a sliding window of recent events to compute the average rate for a target resource. The assigned limit is enforced by the simple expedient of putting application processes to sleep when they issue requests that would bring their resource utilization out of the allowable profile. Our I/O system call intercept model makes the rate windows mechanism light-weight and highly portable. Our experimental results show that we are able to limit resource usage to within a few percent of target usages.","PeriodicalId":325282,"journal":{"name":"ACM/IEEE SC 2001 Conference (SC'01)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127612520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Matrix Multiplies Using Graphics Hardware","authors":"E. S. Larsen, David K. McAllister","doi":"10.1145/582034.582089","DOIUrl":"https://doi.org/10.1145/582034.582089","url":null,"abstract":"We present a technique for large matrix-matrix multiplies using low cost graphics hardware. The result is computed by literally visualizing the computations of a simple parallel processing algorithm. Current graphics hardware technology has limited precision and thus limits immediate applicability of our algorithm. We include results demonstrating proof of concept, correctness, speedup, and a simple application. This is therefore forward looking research: a technique ready for technology on the horizon.","PeriodicalId":325282,"journal":{"name":"ACM/IEEE SC 2001 Conference (SC'01)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127430208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Coastal Ocean Modeling of the U.S. West Coast with Multiblock Grid and Dual-Level Parallelism","authors":"P. Luong, Clay P. Breshears, L. Ly","doi":"10.1145/582034.582043","DOIUrl":"https://doi.org/10.1145/582034.582043","url":null,"abstract":"In coastal ocean modeling, a one-block rectangular grid for a large domain has large memory requirements and long processing times. With complicated coastlines, the number of grid points used in the calculation is often the same or smaller than the number of unused grid points. These problems have been a major concern for researchers in this field. Multiblock grid generation and dual-level parallel techniques are solutions that can overcome these problems. The Multiblock Grid Princeton Ocean Model (MGPOM) uses Message Passing Interface (MPI) to parallelize computations by assigning each grid block to a unique processor. Since not all grid blocks are of the same size, the workload between MPI processes varies. Pthreads is used to improve load balance. Performance results from the MGPOM model on a one-block grid and a 29-block grid simulation for the U.S. west coast demonstrate the efficacy of both the MPI-Only and MPI-Pthreads code versions.","PeriodicalId":325282,"journal":{"name":"ACM/IEEE SC 2001 Conference (SC'01)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131192919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Parallel Application Launch on Cplant™","authors":"R. Brightwell, L. Fisk","doi":"10.1145/582034.582074","DOIUrl":"https://doi.org/10.1145/582034.582074","url":null,"abstract":"This paper describes the components of a runtime system for launching parallel applications and presents performance results for starting a job on more than a thousand nodes of a workstation cluster. This runtime system was developed at Sandia National Laboratories as part of the Computational Plant (Cplant™) project, which is deploying large-scale parallel computing clusters using commodity hardware and the Linux operating system. We have designed and implemented a flexible runtime system that allows for launching parallel jobs on thousands of nodes in a matter of seconds. The interactions of the components are described, and the key issues that address the scalability and performance of the runtime system are discussed. We also present performance results of launching executables of varying sizes on more than a thousand nodes.","PeriodicalId":325282,"journal":{"name":"ACM/IEEE SC 2001 Conference (SC'01)","volume":"509 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134227409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On-the-Fly Calculation and Verification of Consistent Steering Transactions","authors":"David W. Miller, Jinhua Guo, Eileen T. Kraemer, Yin Xiong","doi":"10.1145/582034.582044","DOIUrl":"https://doi.org/10.1145/582034.582044","url":null,"abstract":"Interactive Steering can be a valuable tool for understanding and controlling a distributed computation in real-time. With Interactive Steering, the user may change the state of a computation by adjusting application parameters on-the-fly. In our system, we model both the program’s execution and steering actions in terms of transactions. We define a steering transaction as consistent if its vector time is not concurrent with the vector time of any program transaction. That is, consistent steering transactions occur \"between\" program transactions, at a point that represents a consistent cut. In this paper, we present an algorithm for verifying the consistency of steering transactions. The algorithm analyzes a record of the program transactions and compares it against the steering transaction; if the time at which the steering transaction was applied is inconsistent, the algorithm generates a vector representing the earliest consistent time at which the steering transaction could have been applied.","PeriodicalId":325282,"journal":{"name":"ACM/IEEE SC 2001 Conference (SC'01)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131950519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling and Detecting Performance Problems for Distributed and Parallel Programs with JavaPSL","authors":"T. Fahringer, Clovis Seragiotto","doi":"10.1145/582034.582069","DOIUrl":"https://doi.org/10.1145/582034.582069","url":null,"abstract":"In this paper we present JavaPSL, a Performance Specification Language that can be used for a systematic and portable specification of large classes of experiment-related data and performance properties for distributed and parallel programs. Performance properties are described in a generic and normalized way, thus interpretation and comparison of performance properties is largely alleviated. Moreover, JavaPSL provides meta-properties in order to describe new properties based on existing ones and to relate properties to each other. JavaPSL uses Java and its powerful mechanisms, in particular, polymorphism, abstract classes, and reflection to describe experiment-related data and performance properties. JavaPSL can also be considered as a performance information interface based on which sophisticated performance tools can be built or other tools can access performance data in a portable way. We have implemented a prototype performance tool that uses JavaPSL to automatically detect performance bottlenecks for MPI, OpenMP, and mixed OpenMP and MPI programs. Several experiments with realistic codes demonstrate the usefulness of JavaPSL.","PeriodicalId":325282,"journal":{"name":"ACM/IEEE SC 2001 Conference (SC'01)","volume":"267 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121409738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Applying Scheduling and Tuning to On-Line Parallel Tomography","authors":"Shava Smallen, H. Casanova, F. Berman","doi":"10.1145/582034.582046","DOIUrl":"https://doi.org/10.1145/582034.582046","url":null,"abstract":"Tomography is a popular technique to reconstruct the three-dimensional structure of an object from a series of two-dimensional projections. Tomography is resource-intensive and deployment of a parallel implementation onto Computational Grid platforms has been studied in previous work. In this work, we address on-line execution of the application where computation is performed as data is collected from an on-line instrument. The goal is to compute incremental 3-D reconstructions that provide quasi-real-time feedback to the user. We model on-line parallel tomography as a tunable application: trade-offs between resolution of the reconstruction and frequency of feedback can be used to accommodate various resource availabilities. We demonstrate that application scheduling/tuning can be framed as multiple constrained optimization problems and evaluate our methodology in simulation. Our results show that prediction of dynamic network performance is key to efficient scheduling and that tunability allows for production runs of on-line parallel tomography in Computational Grid environments.","PeriodicalId":325282,"journal":{"name":"ACM/IEEE SC 2001 Conference (SC'01)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2001-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129093274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}