{"title":"Embedding of k-ary complete trees into hypercubes with optimal load","authors":"Jan Trdlicka, P. Tvrdík","doi":"10.1109/SPDP.1996.570390","DOIUrl":"https://doi.org/10.1109/SPDP.1996.570390","url":null,"abstract":"The main result of the paper is an algorithm for embedding k-ary complete trees into hypercubes with optimal load and asymptotically optimal dilation. The algorithm is fully scalable: the dimension of the hypercube can be chosen independently of the arity and height of the complete tree. The basic property of the embedded tree is that both all the tree nodes at a given level and all the tree nodes together are uniformly distributed within equally-sized subcubes of the hypercube. This implies that no hypercube node is loaded with more than $\\lceil A_h/2^n \\rceil$ tree nodes and $\\lceil B_h/2^n \\rceil$ leaves of the tree, where $A_h$ is the number of all tree nodes, $B_h$ is the number of leaves of the k-ary complete tree of height h, and n is the dimension of the hypercube. The embedding enables optimal emulations of both divide and conquer computations on the k-ary complete tree, where only one level of nodes is active at a time, and general computations based on k-ary complete trees, where all tree nodes are active simultaneously. As a special case the authors obtain an algorithm for embedding the k-ary complete tree of height h into its optimal hypercube with load 1 and with dilation that is worse than the lower bound by only a small constant factor. This improves the best previous result by Shen et al. (1995), whose embedding has load 1 and nearly optimal dilation, but requires a hypercube much larger than the optimal one.","PeriodicalId":360478,"journal":{"name":"Proceedings of SPDP '96: 8th IEEE Symposium on Parallel and Distributed Processing","volume":"72 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121210858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Extending functional languages with stateful computations","authors":"Yung-Syau Chen, J. Gaudiot","doi":"10.1109/SPDP.1996.570381","DOIUrl":"https://doi.org/10.1109/SPDP.1996.570381","url":null,"abstract":"A new approach in which stateful computations can be performed within the framework of a functional programming language is presented. In most functional programming languages, programmers cannot easily express state-based computations, because the languages do not support them. To solve this problem, the authors propose to extend the Sisal language with special user-declared variables. This approach can greatly help users in writing programs, simplifying parallel compilation, and improving performance. Under this scheme, programmers are able to manipulate stateful computations. In the methodology, programmers are allowed to declare special variables, and the parallel threads can be identified according to the usage of special variables. When compared to \"pure\" functional languages, the extended Sisal has more expressive power due to the availability of stateful computations.","PeriodicalId":360478,"journal":{"name":"Proceedings of SPDP '96: 8th IEEE Symposium on Parallel and Distributed Processing","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124569685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A compiler address transformation for conflict-free access of memories and networks","authors":"M. Al-Mouhamed, L. Bic, Husam Abu-Haimed","doi":"10.1109/SPDP.1996.570378","DOIUrl":"https://doi.org/10.1109/SPDP.1996.570378","url":null,"abstract":"A method for mapping arrays into parallel memories to minimize serialization and network conflicts for lock-step systems is presented. Each array is associated with an arbitrary number of data access patterns that can be identified following compiler data-dependence analysis. Conditions for conflict-free access of parallel memories and network are derived for arbitrary power-of-2 data patterns and arbitrary multistage networks. The authors propose an efficient heuristic to synthesize a combined address transformation (an NP-complete problem) that applies to arbitrary linear patterns, arbitrary multistage networks, and an arbitrary number of power-of-2 memories. The method can be implemented as part of the address transformation (XOR and AND) or through compiler emulation. The performance of optimized storage schemes is presented for FFT, arbitrary sets of data patterns, non-power-of-2 stride access in vector processors, interleaving, and static row-column storages. Their approach is profitable in all the above cases and provides a systematic method for converting array-memory mapping and network aspects of algorithms from one network topology to another.","PeriodicalId":360478,"journal":{"name":"Proceedings of SPDP '96: 8th IEEE Symposium on Parallel and Distributed Processing","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116430567","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An empirical study of dynamic scheduling on rings of processors","authors":"M. E. Barrows, Dawn E. Gregory, Lixin Gao, A. Rosenberg, P. Cohen","doi":"10.1109/SPDP.1996.570370","DOIUrl":"https://doi.org/10.1109/SPDP.1996.570370","url":null,"abstract":"The authors empirically analyze and compare two distributed low-overhead policies for scheduling dynamic tree-structured computations on rings of identical PEs. The experiments show that both policies give significant parallel speedup on large classes of computations, and that one yields almost optimal speedup on moderate size rings. They believe that the methodology of experiment design and analysis will prove useful in other such studies.","PeriodicalId":360478,"journal":{"name":"Proceedings of SPDP '96: 8th IEEE Symposium on Parallel and Distributed Processing","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130565461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance of parallel algorithms for a fingerprint image comparison system","authors":"H. Ammar, Zhouhui Miao","doi":"10.1109/SPDP.1996.570362","DOIUrl":"https://doi.org/10.1109/SPDP.1996.570362","url":null,"abstract":"This paper addresses the problem of analyzing the performance of parallel algorithms for the training procedure of a neural network based fingerprint image comparison (FIC) system. The target architecture is assumed to be a coarse-grain distributed memory parallel architecture. Two types of parallelism, node parallelism and training set parallelism (TSP), are investigated. These algorithms are implemented on a 32-node CM-5. Theoretical analysis and experimental results comparing the performance of these algorithms are presented.","PeriodicalId":360478,"journal":{"name":"Proceedings of SPDP '96: 8th IEEE Symposium on Parallel and Distributed Processing","volume":"311 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124423713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An efficient parallel scheduling algorithm","authors":"Minyou Wu","doi":"10.1109/SPDP.1996.570342","DOIUrl":"https://doi.org/10.1109/SPDP.1996.570342","url":null,"abstract":"Most static scheduling algorithms that schedule parallel programs represented by directed acyclic graphs (DAGs) are sequential. This paper discusses the essential issues on parallelization of static scheduling algorithms. An efficient parallel scheduling algorithm, the HPMCP algorithm, is proposed. It produces high-quality scheduling and is much faster than existing algorithms.","PeriodicalId":360478,"journal":{"name":"Proceedings of SPDP '96: 8th IEEE Symposium on Parallel and Distributed Processing","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117043057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Impact of load balancing on unstructured adaptive grid computations for distributed-memory multiprocessors","authors":"A. Sohn, R. Biswas, H. Simon","doi":"10.1109/SPDP.1996.570313","DOIUrl":"https://doi.org/10.1109/SPDP.1996.570313","url":null,"abstract":"The computational requirements for an adaptive solution of unsteady problems change as the simulation progresses. This causes workload imbalance among processors on a parallel machine which, in turn, requires significant data movement at runtime. We present a new dynamic load-balancing framework, called JOVE, that balances the workload across all processors with a global view. Whenever the computational mesh is adapted, JOVE is activated to eliminate the load imbalance. JOVE has been implemented on an IBM SP2 distributed-memory machine in MPI for portability. Experimental results for two model meshes demonstrate that mesh adaption with load balancing gives more than a sixfold improvement over one without load balancing. We also show that JOVE gives a 24-fold speedup on 64 processors compared to sequential execution.","PeriodicalId":360478,"journal":{"name":"Proceedings of SPDP '96: 8th IEEE Symposium on Parallel and Distributed Processing","volume":"136 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127834379","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Measurement and simulation based performance analysis of parallel I/O in a high-performance cluster system","authors":"C. Natarajan, R. Iyer","doi":"10.1109/SPDP.1996.570351","DOIUrl":"https://doi.org/10.1109/SPDP.1996.570351","url":null,"abstract":"This paper presents a measurement and simulation based study of parallel I/O in a high-performance cluster system: the Pittsburgh Supercomputing Center (PSC) DEC Alpha Supercluster. The measurements were used to characterize the performance bottlenecks and the throughput limits at the compute and I/O nodes, and to provide realistic input parameters to PioSim, a simulation environment we have developed to investigate parallel I/O performance issues in cluster systems. PioSim was used to obtain a detailed characterization of parallel I/O performance in the high-performance cluster system for different regular access patterns and different system configurations. This paper also explores the use of local disks at the compute nodes for parallel I/O, and finds that the local disk architecture outperforms the traditional architecture of parallel I/O over remote I/O node disks, even when as much as 68-75% of the requests from each compute node go to remote disks.","PeriodicalId":360478,"journal":{"name":"Proceedings of SPDP '96: 8th IEEE Symposium on Parallel and Distributed Processing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132984432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}