{"title":"An empirical study of datapath, memory hierarchy, and network in SIMD array architectures","authors":"M. Herbordt, C. Weems","doi":"10.1109/ICCD.1995.528921","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528921","url":null,"abstract":"Although SIMD arrays have been built for 30 years, they have as a class been the subject of few empirical design studies. Using ENPASSANT, a simulation environment developed for that purpose, we analyze several aspects of SIMD array architecture with respect to a test suite of spatially mapped applications. Several surprising results are obtained. With respect to memory hierarchy, we find that adding a level of cache to current PE designs is likely to be advantageous, but that such a cache will look quite different than expected. In particular, we find that associativity has unusual significance and that performance varies inversely with block size. Router network results indicate the importance of support for local transfers, broadcast, and reduction even at the expense of arbitrary permutations. Other communication results point to the appropriate dimensionality of k-ary n-cube networks (2 or 3), and the criticality of supporting bidirectional transfers, even if the overall bandwidth remains unchanged.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123139497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance estimation for real-time distributed embedded systems","authors":"Ti-Yen Yen, W. Wolf","doi":"10.1109/ICCD.1995.528792","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528792","url":null,"abstract":"Many embedded computing systems are distributed systems: communicating processes executing on several CPUs/ASICs connected by communication links. This paper describes a new, efficient analysis algorithm to derive tight bounds on the execution time required for an application task executing on a distributed system. Tight bounds are essential to cosynthesis algorithms. Our bounding algorithms are valid for a general problem model: the system can contain several tasks with different periods; each task is partitioned into a set of processes related by data dependencies; the periods and the computation times of processes are bounded but not necessarily constant. Experimental results show that our algorithm can find tight bounds in small amounts of CPU time.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"61 13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114954478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Write buffer design for cache-coherent shared-memory multiprocessors","authors":"F. Mounes-Toussi, D. Lilja","doi":"10.1109/ICCD.1995.528915","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528915","url":null,"abstract":"We evaluate the performance impact of two different write-buffer configurations (one word per buffer entry and one block per buffer entry) and two different write policies (write-through and write-back), when using the partial block invalidation coherence mechanism in a shared-memory multiprocessor. Using an execution-driven simulator, we find that the one word per entry buffer configuration with a write-back policy is preferred for small write-buffer sizes when both buffers have an equal number of data words, and when they have equal hardware cost. Furthermore, when partial block invalidation is supported, we find that a write-through policy is preferred over a write-back policy due to its simpler cache hit detection mechanism, its elimination of write-back transactions, and its competitive performance when the write-buffer is relatively large.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115072042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Minimal self-correcting shift counters","authors":"A.M. Tokarnia, A. Peterson","doi":"10.1109/ICCD.1995.528925","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528925","url":null,"abstract":"In some applications of shift counters, self initialization is an advantage. It eliminates the need for complex initialization and guarantees the return to the original state sequence after a temporary failure. The low operating frequencies and large areas of the available self correcting shift counters, however, impose severe limitations to their use. This poor performance is partially due to a widely used design method. It consists of modifying the state diagram of a counter with the desired modulus until a single cycle is left. Due to the additional hardware required to change state transitions, the final circuit tends to be slow and large. The paper presents a technique for determining self correcting shift counters by selecting the feedback functions from a large set of functions. The set is searched for functions satisfying a minimization criterion. Self correcting shift counters with up to 10 stages have been determined. These counters are faster and smaller than the self correcting shift counters available from the literature. A table of self correcting shift counters with 6 stages is included in the paper.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122634417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Logic synthesis for a single large look-up table","authors":"R. Murgai, M. Fujita, F. Hirose","doi":"10.1109/ICCD.1995.528842","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528842","url":null,"abstract":"Logic synthesis for look-up tables (LUTs) has received much attention in the past few years, since Xilinx introduced its LUT-based field-programmable gate array (FPGA) architectures. An m-input LUT can implement any Boolean function of up to m inputs. So the goal of synthesis for such architectures has been to synthesize a circuit in which each function can be implemented by one m-LUT such that either the total number of functions or the number of levels of the circuit is minimized. In this work, we focus on a different though related problem: synthesize the given circuit on a single memory or LUT L, which has a capacity of M bits. In addition to satisfying the memory constraint M, we also wish to minimize the total number of functions to be implemented. The main motivation for the problem comes from the problem of minimizing the simulation time on a hardware accelerator for logic simulation. This accelerator uses memory as a logic primitive. In fact, the problem is also relevant in the context of compile-code or software simulation. Another situation where the problem arises is in synthesis for the FPGA architectures being proposed that have on-chip memory for storing programs and data. The unused memory locations can be used to store logic functions. We show that the existing LUT synthesis methods are inadequate to solve this problem. We propose techniques to solve the problem and present experimental evidence to demonstrate their effectiveness.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128516578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interrupt-based hardware support for profiling memory system performance","authors":"A. Goldberg, J. Trotter","doi":"10.1109/ICCD.1995.528917","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528917","url":null,"abstract":"Fueled by higher clock rates and superscalar technologies, growth in processor speed continues to outpace improvement in memory system performance. Reflecting this trend, architects are developing increasingly complex memory hierarchies to mask the speed gap, compiler writers are adding locality enhancing transformations to better utilize complex memory hierarchies, and applications programmers are recoding their algorithms to exploit memory systems. All of these groups need empirical data on memory system behavior to guide their optimizations. This paper describes how to combine simple hardware support and sampling techniques to obtain such data without appreciably perturbing system performance. The idea is implemented in the Mprof prototype that profiles data stall cycles, first level cache misses, and second level misses on the Sun Sparc 10/41.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128606823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DART: delay and routability driven technology mapping for LUT based FPGAs","authors":"A. Lu, E. Dagless, J. Saul","doi":"10.1109/ICCD.1995.528841","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528841","url":null,"abstract":"A two-phased approach for routability directed delay-optimal mapping of LUT based FPGAs is presented based on the results of stochastic routability analysis. First, delay-optimal mapping is performed which simultaneously minimizes area and delay. Then, the mapped circuits are restructured to alleviate potential routing congestion. Experimental results indicate that the first phase creates designs which require 17% fewer levels and 40% fewer LUTs than MIS-pga (delay), 11% fewer levels and 37% fewer LUTs than FlowMap-r, and 5% fewer levels and 39% fewer LUTs than TechMap-D. The success of the second phase is confirmed by running a vendor's layout tool APR. Once placed and routed, the resulting circuits are observed to be more routable and to have smaller final delays than those produced by other mappers.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124960093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiprocessor design verification for the PowerPC 620 microprocessor","authors":"C. Montemayor, M. Sullivan, Jen-Tien Yen, P. Wilson, R. Evers","doi":"10.1109/ICCD.1995.528809","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528809","url":null,"abstract":"Multiprocessor design verification for the PowerPC 620 microprocessor was challenging due to the 620 Bus protocol complexity, the highly concurrent bus and level 2 (L2) cache interfaces, and the extensive system configurability. In order to verify this functionality, a combination of random and deterministic approaches was used. The Random Test Program Generator (RTPG) and the newly developed Stochastic Concurrent Program Generator (SCPG) tools were used for random verification. On the deterministic front, testcases in C were written to verify specific scenarios. In creating SCPG, we dealt with the design complexity and frequent design changes by abstracting areas of concern as simple languages, writing tools to generate tests, and executing these in the standard verification environment. The added value of these tests is that they exercise true data sharing among processors, are self-checking and resemble commercial multiprocessor code.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"46 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114116798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The PowerPC 603e microprocessor: an enhanced, low-power, superscalar microprocessor","authors":"C. Montemayor, M. Sullivan, Jen-Tien Yen, P. Wilson, R. Evers, K. R. Kishore","doi":"10.1109/ICCD.1995.528810","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528810","url":null,"abstract":"The PowerPC 603e microprocessor is a high performance, low cost, low power microprocessor designed for use in portable computers. The 603e is an enhanced version of the PowerPC 603 microprocessor and extends the performance range of the PowerPC microprocessor family of portable products. The enhancements include increasing the frequency to 100 MHz, doubling the on-chip instruction and data caches to 16 Kbytes each, increasing the cache associativity to 4-way set-associative, adding an extra integer unit, and increasing the throughput of stores and misaligned accesses. Three new bus modes are added to allow for more flexibility in system design. The estimated performance of the 603e at 100 MHz is 120 SPECint92 and 105 SPECfp92. The 603e is fabricated in the same 3.3 volt, 0.5 micron, four-level metal technology as the 603 and contains 2.6 million transistors. The die size is 98 mm². The typical power consumption of the 603e at 100 MHz is 3 watts. Like the 603, the 603e provides three software controllable power-down modes to further extend power saving capability.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125252690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SSM-MP: more scalability in shared-memory multi-processor","authors":"Shigeaki Iwasa, Shu Shing, Hisashi Mogi, Hiroshi Nozuwe, Hiroo Hayashi, Osamu Wakamori, Takashi Ohmizo, Kuninori Tanaka, H. Sakai, M. Saito","doi":"10.1109/ICCD.1995.528923","DOIUrl":"https://doi.org/10.1109/ICCD.1995.528923","url":null,"abstract":"Bus-based shared-memory multi-processors (SM-MP) have successfully been used commercially, since implementation requires no drastic changes to the programming paradigm. In this paper we propose a memory structure called SSM-MP (Scalable shared-memory multi-processors), which aims to shorten the cache refill latency and to relax the bus bottleneck problem. In this machine, main memory consists of local memories dedicated to each of the processors and a component called MTag. MTag is a small piece of hardware that filters out bus traffic headed to the system bus and maintains cache coherency. A popular UNIX (SVR4 ES/MP) was ported. Original OS code works well due to its natural locality. Furthermore, by allocating tasks to the local memory, we were able to reduce the system bus traffic to nearly a quarter. SSM-MP is an effective approach in building a multi-processor system with a medium number (4-32) of processors.","PeriodicalId":281907,"journal":{"name":"Proceedings of ICCD '95 International Conference on Computer Design. VLSI in Computers and Processors","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1995-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132507700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}