{"title":"A high-speed asynchronous decompression circuit for embedded processors","authors":"Martin Benes, A. Wolfe, S. Nowick","doi":"10.1109/ARVLSI.1997.634856","DOIUrl":"https://doi.org/10.1109/ARVLSI.1997.634856","url":null,"abstract":"This paper describes the architecture and implementation of a high-speed decompression engine for embedded processors. The engine is targeted to processors where embedded programs are stored in compressed form, and decompressed at runtime during instruction cache refill. The decompression engine uses a unique asynchronous variable decompression rate architecture to process Huffman-encoded instructions. The resulting circuit is significantly smaller than comparable synchronous decoders, yet has a higher throughput rate than almost all existing designs. The 0.8 μm layout is all full-custom and contains predominantly dynamic domino logic. The top-level control, as well as several small state machines, are implemented using asynchronous logic. The design operates without a user-supplied clock. Simulations using Lsim show average throughput of 32 bits/45 ns on the output side, corresponding to about 480 Mbit/sec on the input side. The chip has been manufactured by MOSIS; tests show that the asynchronous implementation operates correctly, with an average throughput exceeding simulations: 32 bits/39 ns on the output side, corresponding to about 560 Mbit/sec on the input side. This speed is acceptable for our application. The area of the design (excluding the pad-frame overhead) is only 0.75 mm². The design is the first fabricated chip for an instruction decompression unit for embedded processors.","PeriodicalId":201675,"journal":{"name":"Proceedings Seventeenth Conference on Advanced Research in VLSI","volume":"117 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121264445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Trends of key advanced device technologies","authors":"B. C. Hwang","doi":"10.1109/ARVLSI.1997.634847","DOIUrl":"https://doi.org/10.1109/ARVLSI.1997.634847","url":null,"abstract":"Silicon CMOS technology has followed Moore's law over the past two decades. It is still on the predicted curve, and it appears that the trend will continue into the next decade. The SIA roadmap published by Sematech in 1994 predicted the progress of semiconductor technology fairly well. Expectations based on the SIA roadmap are now being exceeded; for example, as announced by many companies, the projected 0.25 μm production in 1998 will be met in 1997. Other technologies continue to make progress, along with silicon CMOS technology. The distinctive ones are Thin Film Silicon on insulator (TFSOI), Complementary Gallium Arsenide (CGaAs), and Graded-Channel CMOS (GCMOS). This paper will discuss the status, potential and hurdles of these technologies.","PeriodicalId":201675,"journal":{"name":"Proceedings Seventeenth Conference on Advanced Research in VLSI","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114951574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Pipelined multi-queue management in a VLSI ATM switch chip with credit-based flow-control","authors":"Georgios Kornaros, C. Kozyrakis, Panagiota Vatsolaki, M. Katevenis","doi":"10.1109/ARVLSI.1997.634851","DOIUrl":"https://doi.org/10.1109/ARVLSI.1997.634851","url":null,"abstract":"We describe the queue management block of ATLAS I, a single-chip ATM switch (router) with optional credit-based (backpressure) flow control. ATLAS I is a 4-million-transistor 0.35-micron CMOS chip, currently under development, offering 20 Gbit/s aggregate I/O throughput, sub-microsecond cut-through latency, 256-cell shared buffer containing multiple logical output queues, priorities, multicasting, and load monitoring. The queue management block of ATLAS I is a dual parallel pipeline that manages the multiple queues of ready cells, the per-flow-group credits, and the cells that are waiting for credits. All cells, in all queues, share one, common buffer space. These 3- and 4-stage pipelines handle events at the rate of one cell arrival or departure per clock cycle, and one credit arrival per clock cycle. The queue management block consists of two compiled SRAMs, pipeline bypass logic, and multi-port CAM and SRAM blocks that are laid out in full-custom and support special access operations. The full-custom part of queue management contains approximately 65 thousand transistors in logic and 14 Kbits in various special memories, it occupies 2.3 mm², it consumes 270 mW (worst case), and it operates at 80 MHz (worst case) versus 50 MHz which is the required clock frequency to support the 622 Mb/s switch link rate.","PeriodicalId":201675,"journal":{"name":"Proceedings Seventeenth Conference on Advanced Research in VLSI","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123671933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Circuits and technology for Digital's StrongARM and ALPHA microprocessors [CMOS technology]","authors":"D. Dobberpuhl","doi":"10.1109/ARVLSI.1997.634842","DOIUrl":"https://doi.org/10.1109/ARVLSI.1997.634842","url":null,"abstract":"Since the introduction of the first ALPHA microprocessor in 1992, Digital has maintained leadership in absolute CPU performance. During the past year, Digital's StrongARM processor has also achieved a leadership position as the fastest CPU capable of operating from a single AA battery cell. Some of the key techniques used to achieve this performance are described in this invited paper.","PeriodicalId":201675,"journal":{"name":"Proceedings Seventeenth Conference on Advanced Research in VLSI","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129621619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The design of an asynchronous MIPS R3000 microprocessor","authors":"Alain J. Martin, Andrew Lines, R. Manohar, M. Nyström, P. Pénzes, Robert Southworth, U. Cummings","doi":"10.1109/ARVLSI.1997.634853","DOIUrl":"https://doi.org/10.1109/ARVLSI.1997.634853","url":null,"abstract":"The design of an asynchronous clone of a MIPS R3000 microprocessor is presented. In 0.6 μm CMOS, we expect performance close to 280 MIPS, for a power consumption of 7 W. The paper describes the structure of a high-performance asynchronous pipeline, in particular precise exceptions, pipelined caches, arithmetic, and registers, and the circuit techniques developed to achieve high throughput.","PeriodicalId":201675,"journal":{"name":"Proceedings Seventeenth Conference on Advanced Research in VLSI","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127798239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fault scanner for reconfigurable logic","authors":"N. Shnidman, W. Mangione-Smith, M. Potkonjak","doi":"10.1109/ARVLSI.1997.634857","DOIUrl":"https://doi.org/10.1109/ARVLSI.1997.634857","url":null,"abstract":"We propose a technique for online built-in self-test of Field Programmable Gate Arrays (FPGAs). The goal of this system is to detect deviations from the intended functionality of an FPGA without using special-purpose hardware, hardware external to the device, and without interrupting system operation. A system that solves these problems would be useful for mission-critical applications with resource constraints. We present here a fault detection system which solves these problems through an online fault scanning methodology. Resources internal to the device are configured to test for faults. Testing scans across an FPGA, checking a section at a time. The viability and effectiveness of such a system is supported through simulation of the system on a model FPGA.","PeriodicalId":201675,"journal":{"name":"Proceedings Seventeenth Conference on Advanced Research in VLSI","volume":"30 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133731459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An embedded DRAM for CMOS ASICs","authors":"J. Poulton","doi":"10.1109/ARVLSI.1997.634861","DOIUrl":"https://doi.org/10.1109/ARVLSI.1997.634861","url":null,"abstract":"The growing gap between on-chip gates and off-chip I/O bandwidth argues for ever larger amounts of on-chip memory. Emerging portable consumer technology, such as digital cameras, will also require more memory than can be supported easily on logic-oriented ASIC processes. Most ASIC memory systems are P-load SRAM, but this circuit technology is neither dense nor power efficient. This paper describes development of a DRAM, compatible with a standard CMOS ASIC process, that provides a memory density at least 4× improved over P-load SRAM in the same layout rules. It runs at speeds comparable to logic in the same process and uses circuitry that is reasonably simple and portable. The design employs Vdd-precharge bit lines, half-capacitance full-voltage dummy cells, and a simple complementary sense amplifier. DRAM is organized as a number of small pages, allowing simple circuit design and low-power operation at modest expense in area overhead. The paper also describes a power-conserving low-voltage-swing bus design that interfaces multiple pages to full-voltage-swing circuitry. Circuit and layout details are provided, along with experimental results for a 100 MHz 786K-bit embedded DRAM in a 0.5 μm process.","PeriodicalId":201675,"journal":{"name":"Proceedings Seventeenth Conference on Advanced Research in VLSI","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133030690","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Circuits and microarchitecture for gigahertz VLSI designs","authors":"K. Nowka, H. P. Hofstee","doi":"10.1109/ARVLSI.1997.634860","DOIUrl":"https://doi.org/10.1109/ARVLSI.1997.634860","url":null,"abstract":"IBM founded the Austin Research Laboratory to investigate high-performance microprocessor-based systems. Initial efforts have focused on design for high frequency. This resulted in the completion of a prototype for a 64-bit PowerPC processor core early in 1997. The prototype is expected to run at 800 MHz in 0.25 micron CMOS technology. We discuss clocking strategy, circuit design, microarchitecture, methodology, and the testing strategy needed to achieve this frequency.","PeriodicalId":201675,"journal":{"name":"Proceedings Seventeenth Conference on Advanced Research in VLSI","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121076761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalability in computing for today and tomorrow","authors":"D. Parry","doi":"10.1109/ARVLSI.1997.634843","DOIUrl":"https://doi.org/10.1109/ARVLSI.1997.634843","url":null,"abstract":"Achieving scalability in computer systems without sacrificing usability requires a synergistic combination of system architecture, fundamental technologies, and implementation. This paper discusses how the Silicon Graphics Origin system utilizes these elements to create a truly scalable multiprocessor and makes predictions of how these elements will evolve to provide performance growth into the next century. The paper begins by reviewing current multiprocessor alternatives, and introducing the notion of a scalable SMP. The second part of the paper focuses on a particular instance of a scalable SMP, the Silicon Graphics Origin multiprocessor and its S²MP memory architecture. We give an overview of the Origin system architecture and then discuss some of the core technologies and key implementation components of the system. In the final section we examine how technology trends will impact system architecture and what key technologies and implementation strategies are implied by those trends. We go on to predict that clusters of scalable shared-memory multiprocessors will become the dominant multiprocessor architecture.","PeriodicalId":201675,"journal":{"name":"Proceedings Seventeenth Conference on Advanced Research in VLSI","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129351844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The hierarchical multi-bank DRAM: a high-performance architecture for memory integrated with processors","authors":"T. Yamauchi, Lance Hammond, K. Olukotun","doi":"10.1109/ARVLSI.1997.634862","DOIUrl":"https://doi.org/10.1109/ARVLSI.1997.634862","url":null,"abstract":"A microprocessor integrated with DRAM on the same die has the potential to improve system performance by reducing the memory latency and improving the memory bandwidth. However, a high performance microprocessor will typically send more accesses than the DRAM can handle due to the long cycle time of the embedded DRAM, especially in applications with significant memory requirements. A multi-bank DRAM can hide the long cycle time by allowing the DRAM to process multiple accesses in parallel, but it will incur a significant area penalty and will therefore restrict the density of the embedded DRAM main memory. In this paper we propose a hierarchical multi-bank DRAM architecture to achieve high system performance with a minimal area penalty. In this architecture, the independent memory banks are each divided into many semi-independent subbanks that share I/O and decoder resources. A hierarchical multi-bank DRAM with 4 main banks each composed of 32 subbanks occupies approximately the same area as a conventional 4 bank DRAM while performing like a 32 bank one, up to 65% better than a conventional 4 bank DRAM when integrated with a single-chip multiprocessor.","PeriodicalId":201675,"journal":{"name":"Proceedings Seventeenth Conference on Advanced Research in VLSI","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1997-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115186859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}