{"title":"Using the NREN Testbed to Prototype a High-Performance Multicast Application","authors":"Marjory J. Johnson, M. C. Spence, L. Chao","doi":"10.1145/331532.331539","DOIUrl":"https://doi.org/10.1145/331532.331539","url":null,"abstract":"Development of the Next Generation Internet requires the development of revolutionary applications as well as advances in networking technologies. This paper presents experiences in using the NREN testbed to prototype a collaborative medical-imaging application. Technological requirements for this application include high-bandwidth reliable multicast and QoS provisioning. Engineering the network infrastructure for a demonstration in May 1999 was a challenging experience. We achieved high-bandwidth multicast for the demonstration; current efforts focus on satisfying application requirements for reliable multicast and QoS.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"2012 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114631873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High Performance Computing with the Array Package for Java: A Case Study using Data Mining","authors":"J. Moreira, S. Midkiff, M. Gupta, Richard D. Lawrence","doi":"10.1145/331532.331542","DOIUrl":"https://doi.org/10.1145/331532.331542","url":null,"abstract":"This paper discusses several techniques used in developing a parallel, production quality data mining application in Java. We started by developing three sequential versions of a product recommendation data mining application: (i) a Fortran 90 version used as a performance reference, (ii) a plain Java implementation that only uses the primitive array structures from the language, and (iii) a baseline Java implementation that uses our Array package for Java. This Array package provides parallelism at the level of individual Array and BLAS operations. Using this Array package, we also developed two parallel Java versions of the data mining application: one that relies entirely on the implicit parallelism provided by the Array package, and another that is explicitly parallel at the application level. We discuss the design of the Array package, as well as the design of the data mining application. We compare the trade-offs between performance and the abstraction level the different Java versions present to the application programmer. Our studies show that, although a plain Java implementation performs poorly, the Java implementation with the Array package is quite competitive in performance with Fortran. We achieve a single processor performance of 109 Mflops, or 91% of Fortran performance, on a 332 MHz PowerPC 604e processor. Both the implicitly and explicitly parallel forms of our Java implementations also parallelize well. On an SMP with four of those PowerPC processors, the implicitly parallel form achieves 290 Mflops with no effort from the application programmer, while the explicitly parallel form achieves 340 Mflops.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127059235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Online Performance Diagnosis by the Use of Historical Performance Data","authors":"K. Karavanic, B. Miller","doi":"10.1145/331532.331574","DOIUrl":"https://doi.org/10.1145/331532.331574","url":null,"abstract":"Accurate performance diagnosis of parallel and distributed programs is a difficult and time-consuming task. We describe a new technique that uses historical performance data, gathered in previous executions of an application, to increase the effectiveness of automated performance diagnosis. We incorporate several different types of historical knowledge about the application’s performance into an existing profiling tool, the Paradyn Parallel Performance Tool. We gather performance and structural data from previous executions of the same program, extract knowledge useful for diagnosis from this collection of data in the form of search directives, then input the directives to an enhanced version of Paradyn, which conducts a directed online diagnosis. Compared to existing approaches, incorporating historical data shortens the time required to identify bottlenecks, decreases the amount of unhelpful instrumentation, and improves the usefulness of the information obtained from a diagnostic session.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125372550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Newton-Krylov Methods for PDE-Constrained Optimization","authors":"G. Biros, O. Ghattas","doi":"10.1145/331532.331560","DOIUrl":"https://doi.org/10.1145/331532.331560","url":null,"abstract":"Large scale optimization of systems governed by partial differential equations (PDEs) is a frontier problem in scientific computation. The state-of-the-art for solving such problems is reduced-space quasi-Newton sequential quadratic programming (SQP) methods. These take full advantage of existing PDE solver technology and parallelize well. However, their algorithmic scalability is questionable; for certain problem classes they can be very slow to converge. In this paper we propose a full-space Newton-Krylov SQP method that uses the reduced-space quasi-Newton method as a preconditioner. The new method is fully parallelizable; exploits the structure of and available parallel algorithms for the PDE forward problem; and is quadratically convergent close to a local minimum. We restrict our attention to boundary value problems and we solve a model optimal flow control problem, with both Stokes and Navier-Stokes equations as constraints. Algorithmic comparisons, scalability results, and parallel performance on a Cray T3E-900 are presented. On the model problems solved, the new method is a factor of 5-10 faster than reduced-space quasi-Newton SQP, and is scalable provided a good forward preconditioner is available.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131602860","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Locality Optimizations for Multi-Level Caches","authors":"Gabriel Rivera, C. Tseng","doi":"10.1145/331532.331534","DOIUrl":"https://doi.org/10.1145/331532.331534","url":null,"abstract":"Compiler transformations can significantly improve data locality of scientific programs. In this paper, we examine the impact of multi-level caches on data locality optimizations. We find nearly all the benefits can be achieved by simply targeting the L1 (primary) cache. Most locality transformations are unaffected because they improve reuse for all levels of the cache; however, some optimizations can be enhanced. Inter-variable padding can take advantage of modular arithmetic to eliminate conflict misses and preserve group reuse on multiple cache levels. Loop fusion can balance increasing group reuse for the L2 (secondary) cache at the expense of losing group reuse at the smaller L1 cache. Tiling for the L1 cache also exploits locality available in the L2 cache. Experiments show enhanced algorithms are able to reduce cache misses, but performance improvements are rarely significant. Our results indicate existing compiler optimizations are usually sufficient to achieve good performance for multi-level caches.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"393 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121248482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mapping Irregular Applications to DIVA, a PIM-based Data-Intensive Architecture","authors":"Mary W. Hall, P. Kogge, J. Koller, P. Diniz, Jacqueline Chame, J. Draper, J. LaCoss, J. Granacki, J. Brockman, Apoorv Srivastava, W. Athas, V. Freeh, Jaewook Shin, Joonseok Park","doi":"10.1145/331532.331589","DOIUrl":"https://doi.org/10.1145/331532.331589","url":null,"abstract":"Processing-in-memory (PIM) chips that integrate processor logic into memory devices offer a new opportunity for bridging the growing gap between processor and memory speeds, especially for applications with high memory-bandwidth requirements. The Data-IntensiVe Architecture (DIVA) system combines PIM memories with one or more external host processors and a PIM-to-PIM interconnect. DIVA increases memory bandwidth through two mechanisms: (1) performing selected computation in memory, reducing the quantity of data transferred across the processor-memory interface; and (2) providing communication mechanisms called parcels for moving both data and computation throughout memory, further bypassing the processor-memory bus. DIVA uniquely supports acceleration of important irregular applications, including sparse-matrix and pointer-based computations. In this paper, we focus on several aspects of DIVA designed to effectively support such computations at very high performance levels: (1) the memory model and parcel definitions; (2) the PIM-to-PIM interconnect; and (3) requirements for the processor-to-memory interface. We demonstrate the potential of PIM-based architectures in accelerating the performance of three irregular computations: sparse conjugate gradient, a natural-join database operation, and an object-oriented database query.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125889881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data Organization and I/O in a Parallel Ocean Circulation Model","authors":"C. Ding, Yun He","doi":"10.1145/331532.331565","DOIUrl":"https://doi.org/10.1145/331532.331565","url":null,"abstract":"We describe an efficient and scalable parallel I/O strategy for writing out gigabytes of data generated hourly in the ocean model simulations on massively parallel distributed-memory architectures. Working with the Modular Ocean Model, using the netCDF file system, and implemented on the Cray T3E, the strategy speeds up I/O by a factor of 50 in the sequential case. In the parallel case, on 32 to 512 processors, our implementation writes out most model dynamic fields of 969 MB to a single netCDF file in 65 seconds, independent of the number of processors. The remap-and-write parallel strategy resolves the memory limitation problem and requires minimal collective I/O capability of the file system. Several critical optimizations on memory management and file access are carried out, ensuring scalability and speeding up numerical simulation due to the improved memory organization.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"139 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122913317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Papyrus: A System for Data Mining over Local and Wide Area Clusters and Super-Clusters","authors":"Stuart Bailey, R. Grossman, H. Sivakumar, Andrei L. Turinsky","doi":"10.1145/331532.331595","DOIUrl":"https://doi.org/10.1145/331532.331595","url":null,"abstract":"Data mining is the semi-automatic discovery of patterns, correlations, changes, associations, and anomalies in large data sets. Traditionally, in a broad sense, statistics has focused on the assumption-driven analysis of data, while data mining has focused on the discovery-driven analysis of data. By discovery-driven, we mean the automatic search or semi-automatic search for interesting patterns and models. With the explosion of the commodity internet and the emergence of wide area high performance networks, mining distributed data is becoming recognized as a fundamental scientific challenge. In this paper, we introduce a system called Papyrus for distributed data mining over commodity and high performance networks and give some preliminary experimental results about its performance. We are particularly interested in data mining over clusters of workstations, distributed clusters connected by high performance networks (super-clusters), and distributed clusters and super-clusters connected by commodity networks (meta-clusters). As a motivating example taken from [7], consider the problem of searching for correlations between twenty-five years of sunspot data archived on a server in Boulder and 80 years of Southern night marine air temperature data archived on a server in Maryland. The goal of this data mining query might be to understand whether sunspots are correlated with climatic shifts in temperature. Notice that","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"666 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132275745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multivariate Geographic Clustering in A Metacomputing Environment Using Globus","authors":"G. Mahinthakumar, F. Hoffman, W. Hargrove, N. Karonis","doi":"10.1145/331532.331537","DOIUrl":"https://doi.org/10.1145/331532.331537","url":null,"abstract":"The authors present a metacomputing application of multivariate, nonhierarchical statistical clustering to geographic environmental data from the 48 conterminous United States in order to produce maps of regions of ecological similarity, called ecoregions. These maps represent finer scale regionalizations than do those generated by the traditional technique: an expert with a marker pen. Several variables (e.g., temperature, organic matter, rainfall, etc.) thought to affect the growth of vegetation are clustered at resolutions as fine as one square kilometer (1 km²). These data can represent over 7.8 million map cells in an n-dimensional (n = 9 to 25) data space. A parallel version of the iterative statistical clustering algorithm is developed by the authors using the MPI (Message Passing Interface) message passing routines. The parallel algorithm uses a classical, self-scheduling, single-program, multiple data (SPMD) organization; performs dynamic load balancing for reasonable performance in heterogeneous metacomputing environments; and provides fault tolerance by saving intermediate results for easy restarts in case of hardware failure. The parallel algorithm was tested on various geographically distributed heterogeneous metacomputing configurations involving an IBM SP3™, an IBM SP2™, and two SGI Origin 2000™s. The tests were performed with minimal code modification, and were made possible by Globus™ (a metacomputing software toolkit) and the Globus-enabled version of MPI (MPICH-G). Our performance tests indicate that while the algorithm works reasonably well under the metacomputing environment for a moderate number of processors, the communication overhead can become prohibitive for large processor configurations.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127022808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Multigrid Solver for 3D Unstructured Finite Element Problems","authors":"M. Adams, J. Demmel","doi":"10.1145/331532.331559","DOIUrl":"https://doi.org/10.1145/331532.331559","url":null,"abstract":"Multigrid is a popular solution method for the systems of linear algebraic equations that arise from PDEs discretized with the finite element method. The application of multigrid to unstructured grid problems, however, is not well developed. We discuss a method that uses many of the same techniques as the finite element method itself to apply standard multigrid algorithms to unstructured finite element problems. We use maximal independent sets (MISs) as a mechanism to automatically coarsen unstructured grids; the inherent flexibility in the selection of an MIS allows for the use of heuristics to improve their effectiveness for a multigrid solver. We present parallel algorithms, based on geometric heuristics, to optimize the quality of MISs and the meshes constructed from them, for use in multigrid solvers for 3D unstructured problems. We conduct scalability studies that demonstrate the effectiveness of our methods on a problem in large deformation elasticity and plasticity of up to 40 million degrees of freedom on a 960-processor IBM PowerPC 4-way SMP cluster with about 60% parallel efficiency.","PeriodicalId":354898,"journal":{"name":"ACM/IEEE SC 1999 Conference (SC'99)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128991929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}