{"title":"Buffer library selection","authors":"J. Neves, Stephen T. Quay","doi":"10.1109/ICCD.2000.878289","DOIUrl":"https://doi.org/10.1109/ICCD.2000.878289","url":null,"abstract":"Buffer insertion has become a critical optimization technique in high performance design. Perhaps the most prevalent buffer insertion technique is Van Ginneken's dynamic programming algorithm. Although very effective, the algorithm has time complexity that is quadratic in terms of the input buffer library size. Consequently, to achieve an efficient algorithm, it is critical that the buffer library used by the tool be relatively small, containing a subset of the most effective buffers. We propose a new algorithm for selecting a buffer library from all the buffers available in the technology, thereby permitting efficient buffer insertion. We show that the smaller buffer libraries constructed by our algorithm result in little loss in solution quality while speeding up the buffer insertion algorithm by orders of magnitude.","PeriodicalId":437697,"journal":{"name":"Proceedings 2000 International Conference on Computer Design","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125955808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A selective temporal and aggressive spatial cache system based on time interval","authors":"Jung-Hoon Lee, Jang-Soo Lee, Shin-Dug Kim","doi":"10.1109/ICCD.2000.878298","DOIUrl":"https://doi.org/10.1109/ICCD.2000.878298","url":null,"abstract":"This paper proposes a new cache system that can increase the effect by temporal and spatial locality by using only simple hardware control without any locality detection hardware or compiler aid. The proposed cache system consists of two caches with different associativities and different block sizes, i.e., a direct-mapped cache with small block size and a fully associative spatial buffer with large block size as a multiple of small blocks. Therefore, the spatial locality can be exploited by aggressively fetching large blocks including any missed small block into the buffer, and the temporal locality can also be exploited by selectively storing small blocks that were referenced at the spatial buffer in the past. To determine the blocks to be stored at the direct-mapped cache, the proposed cache system uses a time interval-based selection mechanism. According to the simulation results, similar performance can be achieved by using four times smaller cache size compared with the conventional direct-mapped cache.","PeriodicalId":437697,"journal":{"name":"Proceedings 2000 International Conference on Computer Design","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127491715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An evaluation of move-based multi-way partitioning algorithms","authors":"Elie Yarack, J. Carletta","doi":"10.1109/ICCD.2000.878309","DOIUrl":"https://doi.org/10.1109/ICCD.2000.878309","url":null,"abstract":"This paper presents a thorough analytical and experimental comparison of four move-based multi-way partitioning algorithms. Modifications are considered to the algorithm with the best solution quality, partitioning by free moves. These modifications allow a tradeoff to be made between solution quality and execution time. Results are given for ISCAS and other benchmarks.","PeriodicalId":437697,"journal":{"name":"Proceedings 2000 International Conference on Computer Design","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128426486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AMULET3: a 100 MIPS asynchronous embedded processor","authors":"S. Furber, D. A. Edwards, J. Garside","doi":"10.1109/ICCD.2000.878304","DOIUrl":"https://doi.org/10.1109/ICCD.2000.878304","url":null,"abstract":"AMULET3 is a 32-bit asynchronous processor core that is fully instruction set compatible with the clocked ARM cores. It represents the culmination of ten years of research and development into asynchronous processor design at the University of Manchester, and is the first step into commercial use for this technology. AMULET3 shows that asynchronous technology is commercially viable, and is competitive in terms of performance, area and power-efficiency with clocked design. In addition, asynchronous design offers significant advantages in terms of reduced electromagnetic interference and unique power management capabilities.","PeriodicalId":437697,"journal":{"name":"Proceedings 2000 International Conference on Computer Design","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132647604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Advanced wiring RC timing design techniques for logic LSIs in gigahertz era and beyond","authors":"T. Hiyama, Yuko Ito, S. Isomura, Kazunobu Nojiri, Eijiro Maeda","doi":"10.1109/ICCD.2000.878340","DOIUrl":"https://doi.org/10.1109/ICCD.2000.878340","url":null,"abstract":"In this paper, we describe an advanced wiring RC timing design techniques for the gigahertz era. Our new technique of wiring capacitance extraction makes it possible to calculate more than 1 M nets within 3 hours as accurately as carrying out net-by-net 3-D simulations. Furthermore, we introduced the timing window for estimating crosstalk effects on delay time so as to distinguish harmful nets from harmless nets and reduce surplus design guard-bands.","PeriodicalId":437697,"journal":{"name":"Proceedings 2000 International Conference on Computer Design","volume":"248 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133658590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Static timing analysis with false paths","authors":"Haizhou Chen, B. Lu, D. Du","doi":"10.1109/ICCD.2000.878336","DOIUrl":"https://doi.org/10.1109/ICCD.2000.878336","url":null,"abstract":"Finding the longest path and the worst delay is the most important task in static timing analysis. But in almost every digital circuit, there exists false paths which are logically impossible or designers don't care about their delays. This paper presents a new method to calculate the worst delay of a circuit with known false paths. When searching for the longest path, it stores delays on nodes conditionally with false paths matched up to the node, thus reduces the number of cache entries and eliminates revisits. This method can be applied to incremental delay calculation with little change. Experiments show that the new method is significantly better than path enumeration without conditional cache.","PeriodicalId":437697,"journal":{"name":"Proceedings 2000 International Conference on Computer Design","volume":"122 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115165344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analysis of shared memory misses and reference patterns","authors":"J. Rothman, A. Smith","doi":"10.1109/ICCD.2000.878285","DOIUrl":"https://doi.org/10.1109/ICCD.2000.878285","url":null,"abstract":"Shared bus computer systems permit the relatively simple and efficient implementation of cache consistency algorithms, but the shared bus is a bottleneck which limits performance. False sharing can be an important source of unnecessary traffic for invalidation-based protocols, elimination of which can provide significant performance improvements. For many multiprocessor workloads, however, most misses are true sharing plus cold start misses. Regardless of the cause of cache misses, the largest fraction of bus traffic are words transferred between caches without being accessed, which we refer to as dead sharing. We establish here new methods for characterizing cache block reference patterns, and we measure how these patterns change with variation in workload and block size. Our results show that 42 percent of 64-byte cache blocks are invalidated before more than one word has been read from the block and that 58 percent of blocks that have been modified only have a single word modified before an invalidation to the block occurs. Approximately 50 percent of blocks written and subsequently read by other caches show no use of the newly written information before the block is again invalidated. In addition to our general analysis of reference patterns, we also present a detailed analysis of dead sharing for each shared memory multiprocessor program studied. We find that the worst 10 blocks (based on most total misses) from each of our traces contribute almost 50 percent of the false shearing misses and almost 20 percent of the true sharing misses (on average). A relatively simple restructuring of four of our workloads based on analysis of these 10 worst blocks leads to a 21 percent reduction in overall misses and a 15 percent reduction in execution time. Permitting the block size to vary (as could be accomplished with a sector cache) shows that bus traffic can be reduced by 88 percent (for 64-byte blocks) while also decreasing the miss ratio by 35 percent.","PeriodicalId":437697,"journal":{"name":"Proceedings 2000 International Conference on Computer Design","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125095464","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Delay constrained optimization by simultaneous fanout tree construction, buffer insertion/sizing and gate sizing","authors":"I-Min Liu, A. Aziz","doi":"10.1109/ICCD.2000.878287","DOIUrl":"https://doi.org/10.1109/ICCD.2000.878287","url":null,"abstract":"We present a novel algorithm for delay constrained optimization of combinational logic, extending the state-of-the-art sizing algorithm based on Lagrangian relaxation. We tightly integrate fanout tree construction, buffer insertion/sizing and gate sizing, thereby achieving more optimization than if they were performed independently. We consider the network in its entirety, thereby taking full advantage of the slacks available on the noncritical paths. We have implemented our algorithm and experimented with it on ISCAS-89 benchmark circuits; the results demonstrate that it is effective as well as fast.","PeriodicalId":437697,"journal":{"name":"Proceedings 2000 International Conference on Computer Design","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125818283","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unified fine-granularity buffering of index and data: approach and implementation","authors":"Q. Cao, J. Torrellas, H. Jagadish","doi":"10.1109/ICCD.2000.878284","DOIUrl":"https://doi.org/10.1109/ICCD.2000.878284","url":null,"abstract":"Disk I/O is recognized as a major performance bottleneck in many database applications. Consequently, a topic of considerable study in database systems has traditionally been buffer management. Recently, disk pages have been increasing in size, enabling more and more data to fit in a single page. Such a trend suggests that buffering the data at a grain size finer than a page may use memory better. As a result, there has been some interest in fine-granularity buffering. Past approaches to fine-granularity buffering have proposed buffering either data tuples alone or index entries alone. In this paper, we propose a scheme to support fine-granularity buffering of both index and data entries in a unified manner. The scheme, which we call Hot-Entry buffering, can be used in combination with conventional page-level buffering. Through the experimental evaluation of a simple system, we demonstrate the benefits of our scheme over conventional page-level buffering, and over index-only and data-only fine-granularity buffering. In particular, we show that, for a range of parameter values, our buffering scheme speeds-up query execution by 20-60% relative to page-level buffering only, and by 10-20% relative to the best of index-only or data-only fine-granularity buffering.","PeriodicalId":437697,"journal":{"name":"Proceedings 2000 International Conference on Computer Design","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122016111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An advanced instruction folding mechanism for a stackless Java processor","authors":"A. Kim, J. M. Chang","doi":"10.1109/ICCD.2000.878343","DOIUrl":"https://doi.org/10.1109/ICCD.2000.878343","url":null,"abstract":"In order to improve the execution speed of Java in hardware, a new advanced instruction folding technique has been developed. In this paper an instruction folding scheme based on an advanced Producer, Operator and Consumer (POC) model is proposed and demonstrates improvement in bytecode execution over the existing techniques. The proposed POC model is able to detect and fold all possible instruction sequence types dynamically in hardware, including a sequence that is separated by other bytecode instructions. SPEC JMV98 benchmark results show that the proposed POC model-based folder can save more than 90% of folding operations. In this research, the proposed instruction folding technique can eliminate most of the stack operations and the use of a physical operand stack, and can thereby achieve the performance of high-end RISC processors.","PeriodicalId":437697,"journal":{"name":"Proceedings 2000 International Conference on Computer Design","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2000-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127036322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}