{"title":"A design methodology for fixed-size systolic arrays","authors":"J. Bu, E. Deprettere, P. Dewilde","doi":"10.1109/ASAP.1990.145495","DOIUrl":"https://doi.org/10.1109/ASAP.1990.145495","url":null,"abstract":"The authors present a methodology to design fixed-size systolic arrays. It allows a systematic and hierarchical mapping of full-size arrays to fixed-size arrays. Two processor-clustering techniques are described. They can be used to achieve the following design objectives: (1) transforming inefficient arrays into efficient arrays, (2) reducing the size of an array, (3) reducing the dimension of an array, and (4) balancing local memory and external communication of processors. A technique is described to cluster processors in such a way that the number of I/O pins of the resulting processor is independent of the number of processors that are clustered. The approach presented unifies and generalizes array reduction techniques.","PeriodicalId":438078,"journal":{"name":"[1990] Proceedings of the International Conference on Application Specific Array Processors","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129521841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A processor-time minimal systolic array for transitive closure","authors":"C. Scheiman, P. Cappello","doi":"10.1109/ASAP.1990.145439","DOIUrl":"https://doi.org/10.1109/ASAP.1990.145439","url":null,"abstract":"A directed acyclic graph (DAG) model of algorithms is used. For a given DAG the authors focus on processor-time minimal multiprocessor schedules: time minimal multiprocessor schedules that use as few processors as possible. The Kung, Lo and Lewis (KLL) algorithm (S.-Y. Kung et al., 1987) for computing the transitive closure of a relation over a set of n elements requires at least 5n-4 steps. Their systolic array comprises n^2 processing elements. Here, it is first shown that any multiprocessor that achieves this 5n-4 time bound needs at least (n^2/3) processing elements. Then, a processor-time minimal systolic array realizing the KLL algorithm's DAG is constructed. Its (n^2/3) processing elements are organized as a cylindrically connected 2-D mesh when n ≡ 0 (mod 3). When n ≢ 0 (mod 3), the 2-D mesh is connected as a twisted torus.","PeriodicalId":438078,"journal":{"name":"[1990] Proceedings of the International Conference on Application Specific Array Processors","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125454784","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 3-D wafer scale architecture for early vision processing","authors":"S. T. Toborg","doi":"10.1109/ASAP.1990.145462","DOIUrl":"https://doi.org/10.1109/ASAP.1990.145462","url":null,"abstract":"A massively parallel SIMD cellular computer is designed for processing early vision algorithms based on regularization theory and Markov random field (MRF) models. Algorithmic requirements and implementation issues are reviewed in detail for edge detection/surface reconstruction. The development of 3-D wafer scale integration (WSI) technologies that offer an ideal medium for implementing many early vision algorithms is discussed. An edge detection algorithm is mapped to the 3-D WSI computer that consists of a 128×128 array of processors formed by stacking 15 four-inch CMOS wafers. This mapping is used as the basis for an enhanced array processor tailored for multiresolution MRF processing. Enhancements are proposed that would boost peak performance to over a trillion operations per second, using a stack of 40 wafers, with a total system volume of 820 cm^3 and consuming about 370 W.","PeriodicalId":438078,"journal":{"name":"[1990] Proceedings of the International Conference on Application Specific Array Processors","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125557165","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Embedding pyramids in array processors with pipelined busses","authors":"Zicheng Guo, R. Melhem","doi":"10.1109/ASAP.1990.145501","DOIUrl":"https://doi.org/10.1109/ASAP.1990.145501","url":null,"abstract":"The concept of pipelined buses for parallel architectures diverges from the conventional exclusive access buses and offers both possibilities and challenges for significantly improving the efficiency of interprocessor communications in parallel computers. The authors present an efficient embedding of pyramids in array processors with pipelined buses. The embedding has the property that all the neighboring nodes in the pyramid are mapped to the same bus. Thus, any two neighbors in the embedded pyramid can communicate with each other using a single bus cycle.","PeriodicalId":438078,"journal":{"name":"[1990] Proceedings of the International Conference on Application Specific Array Processors","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126226524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards the automated design of application specific array processors (ASAPs)","authors":"A. P. Marriott, A. Duller, R. Storer, A. Thomson, M. R. Pout","doi":"10.1109/ASAP.1990.145477","DOIUrl":"https://doi.org/10.1109/ASAP.1990.145477","url":null,"abstract":"The authors describe the architecture and VLSI design of GLiTCH, an associative processor array chip designed for computer vision applications. The design is built from a library of cells, which can be used in conjunction with high level functional specifications to rapidly design new application specific array processors. The objective is to design a system which will allow application specific associative array processors (ASAPs) to be defined, simulated and then produced in silicon automatically from high level description data. Using such techniques should reduce the design cycle time to the point where processor arrays optimized for a particular problem could be fabricated. The authors describe some of the VLSI design which has been done towards achieving the automatic layout of ASAPs. Specifically, the design decisions and trade-offs made in the implementation of a test chip are described and applied to the problem of producing ASAPs.","PeriodicalId":438078,"journal":{"name":"[1990] Proceedings of the International Conference on Application Specific Array Processors","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128017094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A graph-based approach to map matrix algorithms onto local-access processor arrays","authors":"J. Moreno, T. Lang","doi":"10.1109/ASAP.1990.145499","DOIUrl":"https://doi.org/10.1109/ASAP.1990.145499","url":null,"abstract":"The authors describe the application of the multi-mesh graph (MMG) method to the mapping of large matrix algorithms onto class-specific local-access processor arrays. These arrays consist of cells with large local memory (i.e., memory size proportional to the size of the problems) and low cell bandwidth (much smaller than the cell computation rate). The results given indicate that the MMG method allows the analysis of such issues as allocation of operations to cells, load balancing, scheduling, synchronization, and overhead in computations and data transfers. These aspects are illustrated by mapping the LU-decomposition algorithm onto a linear memory-linked array. Performance estimates indicate that mapping with the MMG method produces 94% utilization of cells in the target structure used. Therefore, the MMG is a suitable tool for mapping matrix algorithms onto pre-existing arrays.","PeriodicalId":438078,"journal":{"name":"[1990] Proceedings of the International Conference on Application Specific Array Processors","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134576576","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Domain flow and streaming architectures","authors":"E. T. L. Omtzigt","doi":"10.1109/ASAP.1990.145479","DOIUrl":"https://doi.org/10.1109/ASAP.1990.145479","url":null,"abstract":"The author introduces the main ideas of a system compiler for affine dependence algorithms. The first idea is a streaming architecture, which is a machine model for the compiler that reduces control overhead in comparison with an ensemble of von Neumann architectures. Such a streaming architecture is a dedicated architecture programmed with an incremental array instruction to be able to run any instance of the problem. The second idea is the domain flow model, which is a program representation that captures the communication of the algorithm. The structure of the compiler reflects the division between synthesis and code generation. A general front-end generates a domain flow graph. Both synthesis and code generation phases work off this data structure. However, each phase has its own back-end. For the synthesis phase the back-end is a design critic combined with an expert system which makes decisions about what to do next to satisfy the design goals. For the code generation phase the back-end iterates through different partitioning and code generation strategies.","PeriodicalId":438078,"journal":{"name":"[1990] Proceedings of the International Conference on Application Specific Array Processors","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114393341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A fault-tolerant two-dimensional sorting network","authors":"J. Krammer, H. Arif","doi":"10.1109/ASAP.1990.145469","DOIUrl":"https://doi.org/10.1109/ASAP.1990.145469","url":null,"abstract":"The authors evaluate a class of sorting algorithms which can be adapted to a faulty network with nearest neighbor interconnections by determining a suitable indexing scheme. A worst case sorting time of O(N) is proved for these sorters. Simulation results show that the average sorting time of the fault-tolerant sorters is only slightly higher than O(√N), and therefore is comparable to that of non-fault-tolerant sorting algorithms. This algorithmic approach does not require additional wiring for reconfiguration, and hence the amount of additional circuitry required for fault-tolerance is very small. An efficient procedure for calculating an indexing scheme is presented and simulation results are shown. Furthermore, an efficient strategy for testing the network is proposed.","PeriodicalId":438078,"journal":{"name":"[1990] Proceedings of the International Conference on Application Specific Array Processors","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114401002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Designing specific systolic arrays with the API15C chip","authors":"P. Frison, E. Gautrin, D. Lavenier, Jean-Luc Scharbarg","doi":"10.1109/ASAP.1990.145486","DOIUrl":"https://doi.org/10.1109/ASAP.1990.145486","url":null,"abstract":"The API15C processor, a building block for different systolic structures, is designed exclusively for single-instruction, multiple-data (SIMD) execution mode. To support this mode, the instruction set includes special control instructions. Three parallel I/O ports are available for different interconnection schemes. The API15C chip is designed in a CMOS 2-μm technology. It contains 45000 transistors on a 6-mm × 6.2-mm silicon area. The functionality of the circuit was tested successfully after the first run. It executes one instruction per clock phase of 100 ns, giving a global rate of 10 MIPS. To validate this processing element as a building block for systolic structures, a programmable interface and two single board machines were developed. The first is an 18 processor linear structure able to support a wide range of applications. The second is a 28 processor bidimensional structure for a specific application of string comparison. The instruction set is particularly well-suited for SIMD operation.","PeriodicalId":438078,"journal":{"name":"[1990] Proceedings of the International Conference on Application Specific Array Processors","volume":"158 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116119726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Two-level pipelined implementation of systolic block Householder transformation with application to RLS algorithm","authors":"K. J. Liu, S. Hsieh, K. Yao","doi":"10.1109/ASAP.1990.145510","DOIUrl":"https://doi.org/10.1109/ASAP.1990.145510","url":null,"abstract":"The authors propose a systolic block Householder transformation (SBHT) approach to implement the Householder transformation (HT) on a systolic array as well as its application to the recursive-least-squares (RLS) algorithm. Since the data are fetched in a block manner, vector operations are in general required for the vectorized array. However, by using a modified HT algorithm, a two-level pipelined implementation can be used to pipeline the SBHT systolic array both at the vector and word levels. The throughput can be as fast as that of the Givens rotation method. The approach makes the HT amenable for VLSI implementation as well as applicable to real-time high throughput applications of modern signal processing. The constrained RLS problem using the SBHT RLS systolic array is also considered.","PeriodicalId":438078,"journal":{"name":"[1990] Proceedings of the International Conference on Application Specific Array Processors","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1990-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129874672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}