{"title":"Balancing Processor Loads and Exploiting Data Locality in N-Body Simulations","authors":"I. Banicescu, S. F. Hummel","doi":"10.1145/224170.224306","DOIUrl":"https://doi.org/10.1145/224170.224306","url":null,"abstract":"Although N-body simulation algorithms are amenable to parallelization, performance gains from execution on parallel machines are difficult to obtain due to load imbalances caused by irregular distributions of bodies. In general, there is a tension between balancing processor loads and maintaining locality, as the dynamic re-assignment of work necessitates access to remote data. Fractiling is a dynamic scheduling scheme that simultaneously balances processor loads and maintains locality by exploiting the self-similarity properties of fractals. Fractiling is based on a probabilistic analysis, and thus, accommodates load imbalances caused by predictable phenomena, such as irregular data, and unpredictable phenomena, such as data-access latencies. In experiments on a KSR1, performance of N-body simulation codes were improved by as much as 53% by fractiling. Performance improvements were obtained on uniform and nonuniform distributions of bodies, underscoring the need for a scheduling scheme that accommodates system induced variance. As the fractiling scheme is orthogonal to the N-body algorithm, we could use simple codes that discretize space into equal-size subrectangles (2-d) or subcubes (3-d) as the base algorithms.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129224142","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
John Plevyak, V. Karamcheti, Xingbin Zhang, A. Chien
{"title":"A Hybrid Execution Model for Fine-Grained Languages on Distributed Memory Multicomputers","authors":"John Plevyak, V. Karamcheti, Xingbin Zhang, A. Chien","doi":"10.1145/224170.224302","DOIUrl":"https://doi.org/10.1145/224170.224302","url":null,"abstract":"While fine-grained concurrent languages can naturally capture concurrency in many irregular and dynamic problems, their flexibility has generally resulted in poor execution effciency. In such languages the computation consists of many small threads which are created dynamically and synchronized implicitly. In order to minimize the overhead of these operations, we propose a hybrid execution model which dynamically adapts to runtime data layout, providing both sequential efficiency and low overhead parallel execution. This model uses separately optimized sequential and parallel versions of code. Sequential efficiency is obtained by dynamically coalescing threads via stack-based execution and parallel efficiency through latency hiding and cheap synchronization using heap-allocated activation frames. Novel aspects of the stack mechanism include handling return values for futures and executing forwarded messages (the responsibility to reply is passed along, like call/cc in Scheme) on the stack. In addition, the hybrid execution model is expressed entirely in C, and therefore is easily portable to many systems. Experiments with function-call intensive programs show that this model achieves sequential efficiency comparable to C programs. Experiments with regular and irregular application kernels on the CM5 and T3D demonstrate that it can yield 1.5 to 3 times better performance than code optimized for parallel execution alone.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126211182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
K. Seamons, Ying Chen, P. Jones, J. Jozwiak, M. Winslett
{"title":"Server-Directed Collective I/O in Panda","authors":"K. Seamons, Ying Chen, P. Jones, J. Jozwiak, M. Winslett","doi":"10.1145/224170.224371","DOIUrl":"https://doi.org/10.1145/224170.224371","url":null,"abstract":"We present the architecture and implementation results for Panda 2.0, a library for input and output of multidimensional arrays on parallel and sequential platforms. Panda achieves remarkable performance levels on the IBM SP2, showing excellent scalability as data size increases and as the number of nodes increases, and provides throughputs close to the full capacity of the AIX file system on the SP2 we used. We argue that this good performance can be traced to Panda's use of server-directed i/o (a logical-level version of disk-directed i/o [Kotz94b]) to perform array i/o using sequential disk reads and writes, a very high level interface for collective i/o requests, and built-in facilities for arbitrary rearrangements of arrays during i/o. Other advantages of Panda's approach are ease of use, easy application portability, and a reliance on commodity system software.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125820135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predicting Application Behavior in Large Scale Shared-Memory Multiprocessors","authors":"Karim Harzallah, K. Sevcik","doi":"10.1145/224170.224356","DOIUrl":"https://doi.org/10.1145/224170.224356","url":null,"abstract":"In this paper we present an analytical-based framework for parallel program performance prediction. The main thrust of this work is to provide a means for treating realistic applications within a single unified framework. Our approach is based upon the specification of a set of non-linear equations which describe the application, processor configuration, network and memory operations. These equations are solved iteratively since the application execution rate depends on the communication latencies. The iterative solution technique is found to be efficient as it typically requires only few iterations to reach convergence. Our modeling methodology achieves a good balance between abstraction and accuracy. This is attained by accounting for both time and space dimensions of memory references, while maintaining a simple description of the workload. We demonstrate both the practicality and the accuracy of our approach by comparing predicted results with measurements taken on a commercial multiprocessor system. We found the model to be faithful in reflecting changes in processor speed, and changes in the number and placement of allocated processors.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123660587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Browne, J. Dongarra, G. Fox, K. Hawick, K. Kennedy, R. Stevens, R. Olson, T. Rowan
{"title":"Distributed Information Management in the National HPCC Software Exchange","authors":"S. Browne, J. Dongarra, G. Fox, K. Hawick, K. Kennedy, R. Stevens, R. Olson, T. Rowan","doi":"10.1145/224170.224211","DOIUrl":"https://doi.org/10.1145/224170.224211","url":null,"abstract":"The National HPCC Software Exchange is a collaborative effort by member institutions of the Center for Research on Parallel Computation to provide network access to HPCC-related software, documents, and data. Challenges for the NHSE include identifying, organizing, filtering, and indexing the rapidly growing wealth of relevant information available on the Web. The large quantity of information necessitates performing these tasks using automatic techniques, many of which make use of parallel and distribution computation, but human intervention is needed for intelligent abstracting, analysis, and critical review tasks. Thus, major goals of NHSE research are to find the right mix of manual and automated techniques, and to leverage the results of manual efforts to the maximum extent possible. This paper describes our current information gathering and processing techniques, as well as our future plans for integrating the manual and automated approaches. The NHSE home page is accessible at http://www.netlib.org/nhse/.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126308056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kim Mills, Geoffrey C. Fox, P. Coddington, Barbara Mihalas, M. Podgorny, Barbara Shelly, Steven Bossert
{"title":"The Living Textbook and the K-12 Classroom of the Future","authors":"Kim Mills, Geoffrey C. Fox, P. Coddington, Barbara Mihalas, M. Podgorny, Barbara Shelly, Steven Bossert","doi":"10.1145/224170.224196","DOIUrl":"https://doi.org/10.1145/224170.224196","url":null,"abstract":"The Living Textbook creates a unique learning environment enabling teachers and students to use educational resources on multimedia information servers, supercomputers, parallel databases, and network testbeds. We have three innovative educational software applications running in our laboratory, and under test in the classroom. Our education-focused goal is to learn how new, learner-driven, explorative models of learning can be supported by these high bandwidth, interactive applications and ultimately how they will impact the classroom of the future.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131811320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributing a Chemical Process Optimization Application Over a Gigabit Network","authors":"R. Clay, P. Steenkiste","doi":"10.1145/224170.224310","DOIUrl":"https://doi.org/10.1145/224170.224310","url":null,"abstract":"We evaluate the impact of a gigabit network on the implementation of a distributed chemical process optimization application. The optimization problem is formulated as a stochastic Linear Assignment Problem and was solved using the Thinking Machines CM-2 (SIMD) and the Cray C-90 (vector) computers at PSC, and the Intel iWarp (MIMD) system at CMU, connected by the Gigabit Nectar testbed. We report our experience distributing the application across this heterogeneous set of systems and present measurements that show how the communication requirements of the application depend on the structure of the application. We use detailed traces to build an application performance model that can be used to estimate the elapsed time of the application for different computer system and network combinations. Our results show that the application benefits from the high-speed network, and that the need for high network throughput is increasing as computer systems get faster. We also observed that supporting high burst rates is critical, although structuring the application so that communication is overlapped with computation relaxes the bandwidth requirements.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122292906","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Surface Fitting Using GCV Smoothing Splines on Supercomputers","authors":"Alan Williams, K. Burrage","doi":"10.1145/224170.224192","DOIUrl":"https://doi.org/10.1145/224170.224192","url":null,"abstract":"The task of fitting smoothing spline surfaces to meteorological data such as temperature or rainfall observations is computationally intensive. The Generalised Cross Validation (GCV) smoothing algorithm is O(n³) computationally, and memory requirements are 0(n²). Fitting a spline to a moderately sized data set of, for example. 1080 observations and calculating an output surface grid of dimension 220 × 220 involves approximately 5 billion floating point operations, and takes approximately 19 minutes of execution time on a Sun SPARC2 workstation. Since fitting a surface to data collected from the whole of Australia could conceivably involve data sets with approximately 10000 points, and because it is desirable to be able to fit surfaces of at least 1000 data points in 1 to 5 seconds for use in interactive visualisations, it is crucial to be able to take advantage of supercomputing resources. This paper describes the adaptation of the surface fitting program to different supercomputing platforms, and the results achieved.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114994388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Modeling the Performance of a Fast Connected Components Algorithm on Parallel Machines","authors":"S. Lumetta, A. Krishnamurthy, D. Culler","doi":"10.1145/224170.224275","DOIUrl":"https://doi.org/10.1145/224170.224275","url":null,"abstract":"We present and analyze a portable, high-performance algorithm for finding connected components on modern distributed memory multiprocessors. The algorithm is a hybrid of the classic DFS on the subgraph local to each processor and a variant of the Shiloach-Vishkin PRAM algorithm on the global collection of subgraphs. We implement the algorithm in Split-C and measure performance on the the Cray T3D, the Meiko CS-2, and the Thinking Machines CM-5 using a class of graphs derived from cluster dynamics methods in computational physics. On a 256 processor Cray T3D, the implementation outperforms all previous solutions by an order of magnitude. A characterization of graph parameters allows us to select graphs that highlight key performance features. We study the effects of these parameters and machine characteristics on the balance of time between the local and global phases of the algorithm and find that edge density, surface-to-volume ratio, and relative communication cost dominate performance. By understanding the effect of machine characteristics on performance, the study sheds light on the impact of improvements in computational and/or communication performance on this challenging problem.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133309054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Large Eddy Simulation of a Spatially-Developing Boundary Layer","authors":"Xiaohua Wu, K. Squires, T. Lund","doi":"10.1145/224170.224408","DOIUrl":"https://doi.org/10.1145/224170.224408","url":null,"abstract":"A method for generation of a three-dimensional, time-dependent turbulent inflow condition for simulation of spatially-developing boundary layers is described. Assuming self-preservation of the boundary layer, a quasi-homogeneous coordinate is defined along which streamwise inhomogeneity is minimized (Spalart 1988). Using this quasi-homogeneous coordinate and decomposition of the velocity into a mean and periodic part, the velocity field at a location near the exit boundary of the computational domain is re-introduced at the in- flow boundary at each time step. The method was tested using large eddy simulations of a flat-plate boundary layer for momentum thickness Reynolds numbers ranging from 1470 to 1700. Subgrid scale stresses were modeled using the dynamic eddy viscosity model of Germano et al. (1991). Simulation results demonstrate that the essential features of spatially-developing turbulent boundary layers are reproduced using the present approach without the need for a prolonged and computationally expensive laminar-turbulent transition region. Boundary layer properties such as skin friction and shape factor as well as mean velocity profiles and turbulence intensities are in good agreement with experimental measurements and results from direct numerical simulation. Application of the method for calculation of spatially-developing complex turbulent boundary layers is also described.","PeriodicalId":269909,"journal":{"name":"Proceedings of the IEEE/ACM SC95 Conference","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132369789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}