{"title":"A localized dynamic load balancing strategy for highly parallel systems","authors":"M. Willebeek-LeMair, A. Reeves","doi":"10.1109/FMPC.1990.89487","DOIUrl":"https://doi.org/10.1109/FMPC.1990.89487","url":null,"abstract":"Two dynamic load-balancing strategies, a local diffusion (RID) and a global exchange (DEM) strategy, designed to support massively parallel systems are presented and compared. The effects of system size and task granularity are studied. Both strategies are implemented on a 32-processor iPSC/2 and a 256-processor IBM Victor. Even for low degrees of parallelism the performance of the DEM and RID strategies is very similar. The efficiency of the DEM strategy, however, depends heavily on the system interconnection topology. Furthermore, the system sizes tested were small in the context of massively parallel systems. The overhead costs of synchronization (scale as O(N)) for the DEM approach may cause a serious deterioration of performance. The RID strategy is easily embedded into simpler topologies, and can scale gracefully for larger systems. Finally, the RID scheme is able to maintain task locality, supporting a wider variety of applications that exhibit local communication dependencies between tasks. Therefore, the RID strategy may offer a superior performance when locality is important.<<ETX>>","PeriodicalId":193332,"journal":{"name":"[1990 Proceedings] The Third Symposium on the Frontiers of Massively Parallel Computation","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126343698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data optimization: minimizing residual interprocessor data motion on SIMD machines","authors":"K. Knobe, V. Natarajan","doi":"10.1109/FMPC.1990.89492","DOIUrl":"https://doi.org/10.1109/FMPC.1990.89492","url":null,"abstract":"Basic concepts in array layout are summarized, and unhonored preferences and residual data motion are discussed. A technique for minimizing such motion is presented. For each array the source program is divided into regions, each associated with a single home. This enables efficient handling of residual data motion. The partitioning into regions is based on control flow and data dependence. Preliminary results obtained with this technique show an order-of-magnitude improvement for certain classes of programs.<<ETX>>","PeriodicalId":193332,"journal":{"name":"[1990 Proceedings] The Third Symposium on the Frontiers of Massively Parallel Computation","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123814071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An optimal lookahead processor to prune search space","authors":"J. Gu","doi":"10.1109/FMPC.1990.89462","DOIUrl":"https://doi.org/10.1109/FMPC.1990.89462","url":null,"abstract":"The discrete relaxation algorithm (DRA) is an efficient computational technique for enforcing arc consistency (AC) in a consistent labeling problem (CLP). The original sequential AC-1 algorithm suffers from O(n/sup 3/m/sup 3/) time complexity for an n-object and m-label problem. Sample problem runs show that all these sequential algorithms are too slow to meet the need for any useful real-time CLP applications. An optimal parallel DRA5 algorithm that reaches the optimal lower bound, O(nm), for parallel AC algorithms (where the number of processors is polynomial in the problem size) is given. The algorithm has been implemented on a fine-grained, massively parallel hardware computer architecture. For problems of practical interest, 4 to 10 orders of magnitude of efficiency improvement can be reached on this hardware architecture.<<ETX>>","PeriodicalId":193332,"journal":{"name":"[1990 Proceedings] The Third Symposium on the Frontiers of Massively Parallel Computation","volume":"359 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122749293","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Korean character recognition using neural networks","authors":"J. Koh, G. S. Moon, K. Mehrotra, C. Mohan, Sanjay Ranka","doi":"10.1109/FMPC.1990.89454","DOIUrl":"https://doi.org/10.1109/FMPC.1990.89454","url":null,"abstract":"A neural network approach for recognizing printed Korean characters, based on a variant of the backpropagation algorithm, is presented. Implementation of the algorithms for neural networks with Hough transform inputs provided excellent recognition: about 81% of the training samples and 73% of the tested samples can be recognized.<<ETX>>","PeriodicalId":193332,"journal":{"name":"[1990 Proceedings] The Third Symposium on the Frontiers of Massively Parallel Computation","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128134159","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On single parameter characterization of parallelism","authors":"D. Marinescu, J. Rice","doi":"10.1109/FMPC.1990.89464","DOIUrl":"https://doi.org/10.1109/FMPC.1990.89464","url":null,"abstract":"Issues pertinent to performance analysis of massively parallel systems are discussed. Attention is focused on the average parallelism of a software structure, which has been proposed as a single-parameter characterization of parallel software. It is argued that single-parameter characterization of parallel software or of parallel hardware rarely provides insight into the complex interactions among the software and hardware components of a parallel system. In particular, bounds for the speedup based on simple models of parallelism are violated when a model ignores the effects of communication delays.<<ETX>>","PeriodicalId":193332,"journal":{"name":"[1990 Proceedings] The Third Symposium on the Frontiers of Massively Parallel Computation","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133347161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Designing the 3-LAP (three layers associative processor) for arithmetic and symbolic applications","authors":"C. Davarakis, D. Maritsas","doi":"10.1109/FMPC.1990.89471","DOIUrl":"https://doi.org/10.1109/FMPC.1990.89471","url":null,"abstract":"A variant of the MULTAP architecture, called 3-LAP, is presented. This three-layer machine is designed from the middle out, beginning with its finite-state-machine diagram and working toward its low-level processing element cell specification and its high-level algorithm applications definition. The 3-LAP's operating and control parts are defined, the estimated machine throughput performance is presented (over 100 GCOPS (giga complex operations per second)), the processing element cell is defined, and arithmetic and symbolic application primitives in 3-LAP instructions are described.<<ETX>>","PeriodicalId":193332,"journal":{"name":"[1990 Proceedings] The Third Symposium on the Frontiers of Massively Parallel Computation","volume":"159 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114567233","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance analysis of an implementation of the Beam and Warming implicit factored scheme on the NCUBE hypercube","authors":"P. J. Kominsky","doi":"10.1109/FMPC.1990.89447","DOIUrl":"https://doi.org/10.1109/FMPC.1990.89447","url":null,"abstract":"A production 3-D Beam and Warming implicit Navier Stokes code has been implemented on the NCUBE hypercube using the grid allocation scheme of J. Bruno and P.R. Capello (see Proc. 3rd Conf. on Hypercube Concurrent Computers and Applications, p.1073-87, 1988). Predicted (32-b) performance on 1024 nodes is 67.1 MFLOPS. Efficiencies of 70% are attainable for implicit algorithms, although constant-memory scaled performance is found to decrease with increasing number of nodes, unlike the case for explicit implementations.<<ETX>>","PeriodicalId":193332,"journal":{"name":"[1990 Proceedings] The Third Symposium on the Frontiers of Massively Parallel Computation","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123918509","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Index domain alignment: minimizing cost of cross-referencing between distributed arrays","authors":"Jingke Li, M. Chen","doi":"10.1109/FMPC.1990.89493","DOIUrl":"https://doi.org/10.1109/FMPC.1990.89493","url":null,"abstract":"The issue of data movement between processors due to cross-references between multiple distributed arrays is addressed. The problem of index domain alignment is formulated as finding a set of suitable alignment functions that map the index domains of the arrays into a common index domain so as to minimize the cost of data movement due to cross-references between the arrays. The cost function and the machine model used are abstractions of the current generation of distributed-memory machines. The problem as formulated is shown to be NP-complete. A heuristic algorithm is devised and shown to be efficient and to provide excellent results.<<ETX>>","PeriodicalId":193332,"journal":{"name":"[1990 Proceedings] The Third Symposium on the Frontiers of Massively Parallel Computation","volume":"58 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129762774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Simulation of neural networks on a massively parallel computer (DAP-510) using sparse matrix techniques","authors":"S.N. Gupta, M. Zubair, C. Grosch","doi":"10.1109/FMPC.1990.89486","DOIUrl":"https://doi.org/10.1109/FMPC.1990.89486","url":null,"abstract":"A parallel sparse matrix algorithm is proposed for the simulation of the modified Hopfield-Tank (MHT) network for solving the Traveling Salesman Problem (TSP). The MHT network using this sparse matrix algorithm has been implemented on a DAP-510, a massively parallel SIMD (single-instruction-steam, multiple-data-stream) computer consisting of 1024 processors. Problems of various sizes, ranging from eight cities up to 256 cities, have been simulated. The results show a very large speedup for the algorithm as compared with one using a standard dense matrix implementation.<<ETX>>","PeriodicalId":193332,"journal":{"name":"[1990 Proceedings] The Third Symposium on the Frontiers of Massively Parallel Computation","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128071476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Indirect addressing and load balancing for faster solution to Mandelbrot set on SIMD architectures","authors":"S. Tomboulian, M. Pappas","doi":"10.1109/FMPC.1990.89495","DOIUrl":"https://doi.org/10.1109/FMPC.1990.89495","url":null,"abstract":"The authors present a method for using local indirect addressing to achieve faster solutions for some problems with data-dependent convergence rates on SIMD (single-instruction-stream, multiple-data-stream) architectures. A class of problems characterized by computations on data points where the computation is identical but the convergence rate is data dependent is examined. In the absence of indirect addressing, algorithm time is governed by the maximum number of iterations. An algorithm using indirect addressing allows a processor to proceed to the next data point upon convergence. Thus the overall number of iterations will approach the mean convergence rate for a sufficiently large problem. Load-balancing techniques can be applied for additional performance improvement. These techniques are used for solving Mandelbrot sets on the MP-1 massively parallel computer.<<ETX>>","PeriodicalId":193332,"journal":{"name":"[1990 Proceedings] The Third Symposium on the Frontiers of Massively Parallel Computation","volume":"21 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-10-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125083014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}