{"title":"Adaptive Optics Calculations Using the Connection Machine","authors":"R. Firestone, Eric N. Opp","doi":"10.1109/DMCC.1991.633209","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633209","url":null,"abstract":"The performance of reflecting optical telescopes located on the surface of the earth is subject to distortions due to the force of gravity on the mirror and the turbulence of the atmosphere on the light path. Reflective optics are also planned for use in high-powered laser systems, where the intensity of the light itself is capable of producing distortions in the air within the instrument, thereby affecting the shape of the focused wavefront. A solution proposed by optical designers is the use of adaptive optics: an optical system in which the figure of the mirror is deformable to the extent necessary to correct for the distortions mentioned. An adaptive optical system uses a feedback loop concept, in which the distortions of the optical wavefront are measured, the necessary corrections are computed, and a set of actuators is moved to provide those corrections. The calculation of the corrections is computationally intense. Specifically, the measurement of the distortions provides a collection of phase differences between measuring points corresponding to the actuator positions. This set of phase differences is larger than the number of actuators, leading to an overdetermined problem. As physical systems have some amount of noise present, the technique of least-squares solution serves both to provide the best choice of actuator positions for this overdetermined problem and to suppress the noise in the measurements. The necessary algorithms for solving the computation portion of the adaptive optics problem consist of a matrix generator to derive the computational representation of the physical system, a matrix inversion routine, and a high-speed least-squares solver. 
In the optical astronomy paradigm, the computational requirement is for a small number of adjustments per second, due to the rate of atmospheric turbulence. For the laser system, with more stringent requirements, we demonstrate an improvement of 11 2 orders of magnitude, made possible only through the use of supercomputer methods. Extrapolation of these results indicates that even greater acceleration is possible if the interprocessor communication is minimized; in other words, supercomputer designers have not yet solved the problem of making interprocessor communication as efficient as that within processors (or, in the present case, between processors on a single chip).","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"229 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130910711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Flexible Interleaved Memory Design for Generalized Low Conflict Memory Access","authors":"L. S. Kaplan","doi":"10.1109/DMCC.1991.633349","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633349","url":null,"abstract":"High bandwidth delivery of data to the processor(s) is critical for good performance in highly parallel computer systems. To increase memory throughput, many systems make use of interleaved parallel memory banks. An implementation must provide uniform throughput with little or no contention at the memory banks for a wide variety of algorithms and access patterns. This paper proposes an implementation for an interleaved memory system that exhibits extremely low contention for the memory banks during virtually all patterned accesses. It also has the advantage that, due to its programmability, it imposes few requirements on the configuration of the machines in which it is used. The hardware to implement the design is discussed along with address space considerations. A variant of this design is currently in use on the BBN TC2000 (tm) parallel computer.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116122453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Visualization of SLALOM","authors":"D. Rover, M. B. Carter, J. Gustafson","doi":"10.1109/DMCC.1991.633313","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633313","url":null,"abstract":"Performance visualization provides insights about the complex operation of concurrent computer systems. SLALOM is a scalable, fixed-time computer benchmark. Each corresponds to a method of computer performance evaluation: monitoring and benchmarking, respectively. Whereas benchmark programs typically report single-number performance metrics for ease of comparison among different machines, a performance monitor (via instrumentation and visualization) gives a detailed account of the dynamics of program execution. Using software tools developed for the nCUBE 2 and the MasPar MP-1 distributed memory machines and applied to the SLALOM program, we demonstrate the utility of performance visualization for fine-tuning algorithms and understanding phenomena. The tools include PICL and ParaGraph and custom VISTA components.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129138244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Comparison of Particle Simulation Implementations on Two Different Parallel Architectures","authors":"J. Mcdonald, L. Dagum","doi":"10.1109/DMCC.1991.633198","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633198","url":null,"abstract":"Direct particle simulation is a powerful method for analyzing low density, hypersonic re-entry flows. The method involves following a large sample of representative gas molecules through motion and collision with other molecules or with surfaces in the simulated flow. In this paper, two very different parallel architectures are examined for their suitability for particle simulation computations, namely the Connection Machine CM-2 and the Intel iPSC/860. The difference in architectures has resulted in very different parallel decompositions. The two implementations are described and performance results are given. Both implementations achieve performance comparable to a single Cray-2 CPU; however, this performance is obtained at the cost of greatly increased programming complexity.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131751130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Communication Abstraction and Process Refinement","authors":"J. Yantchev","doi":"10.1109/DMCC.1991.633097","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633097","url":null,"abstract":"Concurrent systems are collections of data, processes, and communication channels. Top-down, hierarchical design of concurrent systems needs powerful abstraction facilities provided by the implementation language. While most languages provide some structuring mechanisms for data and process abstraction, none seems to provide any equivalent mechanisms for communication structuring. Communication channels are to communicate data and, therefore, all data structuring mechanisms provided by a programming language must be available to structure channels as well. In order to preserve behaviour through successive levels of design refinement, these means of communication structuring must preserve the abstraction of atomic transfers of values of arbitrary types. Introduction: Most concurrent programming languages [5, 6, 7, 11 support the abstraction of concurrent systems as collections of data, processes, and communication channels. However, while they provide some structuring mechanisms for data and process abstraction, none seems to provide any equivalent mechanisms for communication structuring. Interprocess communication is almost universally viewed as a synchronised atomic exchange of values between two concurrently active processes. This affects the whole design process and intervenes with the freedom and ease in the refinement of the process structure. The design transformation steps may be non-trivial in some cases and, therefore, difficult to arrive at and verify. In addition, the implementation may be less efficient, both in storage and speed, because of unnecessary data copying and context creation for process spawning. The data structuring mechanisms supported by the contemporary programming languages provide a uniform view on data and data types. 
Structured data types may consist of components of arbitrary types, including themselves, and values of such types are treated as wholes and may be passed as parameters, returned as results of functions, and assigned to variables. The same applies to processes [5, 7]. No distinction of kind need be made between systems with and without substructure and, indeed, a system which at one level of abstraction may be considered to consist of a process and the environment in which it evolves, may be considered as a single system at a higher level of abstraction. A process which for one purpose is taken to be atomic","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133378155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spare Allocation and Reconfiguration in a Fault Tolerant Hypercube with Direct Connect Capability","authors":"B. Izadi, F. Ozguner","doi":"10.1109/DMCC.1991.633360","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633360","url":null,"abstract":"This paper investigates hardware reconfiguration schemes to make the hypercube multicomputer fault tolerant. Two schemes are proposed: the Cluster Approach and the Enhanced Cluster Approach. The approaches are shown to be able to tolerate a large number of failures without any performance degradation. It is further demonstrated that no modification to either the existing communication or computational algorithm is needed. Finally, a gracefully degradable approach is presented to reconfigure when the number of faulty nodes is more than the available spares.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123649585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fault Tolerance of the Cyclic Buddy Subcube Location Scheme in Hypercubes","authors":"M. Livingston, Q. Stout","doi":"10.1109/DMCC.1991.633075","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633075","url":null,"abstract":"This paper examines the problem of locating large fault-free subcubes in multiuser hypercube systems. We analyze a new location strategy, the cyclic buddy system, and compare its performance to the buddy system, the gray-coded buddy system, and several variants of them. We show that the cyclic buddy system gives a striking improvement in expected fault tolerance over the above schemes and, since it can easily be implemented in parallel with little overhead, it provides an attractive alternative to these schemes. We also investigate the behavior of these location systems in the folded, or projective, hypercube, and find that the cyclic buddy system, which adapts naturally to this enhancement, significantly outperforms the other schemes. A combination of analytic techniques and simulation is used to examine both worst case and expected case performance.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121643148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance and Assembly Language Programming of the iPSC/860 System","authors":"D. Scott, G. Withers","doi":"10.1109/DMCC.1991.633312","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633312","url":null,"abstract":"In the world of supercomputers, the goal is higher and higher performance. To obtain the highest performance on a particular computational kernel, it is usually necessary to write assembly language. Compiler technology has not matured to the point of being able to take advantage of many of the features of the i860 automatically. To approach the peak performance of the chip, it is currently necessary to use custom assembly language code. It is important to know which combinations of assembly instructions offer the highest performance, and which combinations cannot run at full speed. This paper assumes that you are already acquainted with the basics of the i860 microprocessor assembly language, and concentrates on describing how to enhance the performance of your code using 860 assembly language.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. Proceedings","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116785116","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Communication Primitives on Circuit-Switched Hypercubes","authors":"Ching-Tien Ho, M. Raghunath","doi":"10.1109/DMCC.1991.633172","DOIUrl":"https://doi.org/10.1109/DMCC.1991.633172","url":null,"abstract":"We give practical algorithms, complexity analysis and implementation for all-to-all personalized communication and matrix transpose (with two-dimensional partitioning of the matrix) on hypercubes. We assume the following communication characteristics: circuit-switched e-cube routing and one-port communication model. For all-to-all personalized communication, we propose a hybrid algorithm that combines the well-known recursive doubling algorithm [22,12] and a direct-route algorithm [26,23]. Our hybrid algorithm balances between data transfer time and start-up time of these two algorithms, and its communication complexity is estimated to be better than the two previous algorithms for a range of machine parameters. For matrix transpose with two-dimensional partitioning of the matrix, our algorithm is measured to be better than the recursive transpose algorithm [8] by n nearest-neighbor communications [12]. Our algorithm takes advantage of circuit-switched routing and is congestion-free for a hypercube with e-cube routing. We also suggest a way of storing the matrix such that the transpose operation can take advantage of the routing of the machine.","PeriodicalId":313314,"journal":{"name":"The Sixth Distributed Memory Computing Conference, 1991. 
Proceedings","volume":"236 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1991-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121069900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}