{"title":"Energy-delay tradeoffs in 32-bit static shifter designs","authors":"Steve Huntzicker, Michael Dayringer, Justin Soprano, Anthony Weerasinghe, D. Harris, D. Patil","doi":"10.1109/ICCD.2008.4751926","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751926","url":null,"abstract":"This paper compares the energy-delay tradeoff curves of 32-bit static barrel and funnel shifters. The Stanford Circuit Optimization Tool (SCOT) is used to determine best transistor sizes in a 90 nm process. The paper evaluates the effect of multiplexer valency, circuit design, and physical placement. It also quantifies the costs of various shift operations. A funnel shifter using 4- and 8-input static multiplexer stages gives the best energy-delay tradeoff, with a knee at 440 ps (15 FO4 inverter delays) consuming 0.9 pJ per shift.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131919074","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy-aware opcode design","authors":"Balaji V. Iyer, Jason A. Poovey, T. Conte","doi":"10.1109/ICCD.2008.4751918","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751918","url":null,"abstract":"Embedded processors are required to achieve high performance while running on batteries. Thus, they must exploit all the possible means available to reduce energy consumption while not sacrificing performance. In this work, one technique to reduce energy is explored to intelligently design the instruction-opcodes of a processor based on a target-workload. The optimization is done using a heuristic that not only minimizes switching between adjacent instructions, but also simplifies the decoding to reduce latches to save dynamic energy. On average, an optimized opcode is able to be decoded using 40-60% fewer latches in the decoder. In addition, it is shown that a decoder optimized for algorithms that had similar program structure, similar data-types or similar behavior exhibited consistent patterns of energy reduction. The techniques presented in this paper yield an average 10% reduction in the total dynamic energy. It is also shown that this heuristic can be used to achieve similar results on different issue-width processors.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124535299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improved combined binary/decimal fixed-point multipliers","authors":"Brian J. Hickmann, M. Schulte, M. A. Erle","doi":"10.1109/ICCD.2008.4751845","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751845","url":null,"abstract":"Decimal multiplication is important in many commercial applications including banking, tax calculation, currency conversion, and other financial areas. This paper presents several combined binary/decimal fixed-point multipliers that use the BCD-4221 recoding for the decimal digits. This allows the use of binary carry-save hardware to perform decimal addition with a small correction. Our proposed designs contain several novel improvements over previously published designs. These include an improved reduction tree organization to reduce the area and delay of the multiplier and improved reduction tree components that leverage the redundant decimal encodings to help reduce delay. A novel split reduction tree architecture is also introduced that reduces the delay of the binary product with only a small increase in total area. Area and delay estimates are presented that show that the proposed designs have significant area improvements over separate binary and decimal multipliers while still maintaining similar latencies for both decimal and binary operations.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117349617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The 2D DBM: An attractive alternative to the simple 2D mesh topology for on-chip networks","authors":"R. Sabbaghi‐Nadooshan, M. Modarressi, H. Sarbazi-Azad","doi":"10.1109/ICCD.2008.4751905","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751905","url":null,"abstract":"In recent years, the 2D mesh network-on-chip has attracted much attention due to its suitability for VLSI implementation. The 2-dimensional de Bruijn topology for network-on-chip is introduced in this paper as an attractive alternative to the popular simple 2D mesh NoC. Its cost is equal to that of the simple 2D mesh but it has a logarithmic diameter. We compare the proposed network and the popular mesh network in terms of power consumption and network performance. Compared to the equal sized simple mesh NoC, the proposed de Bruijn-based network has better performance while consuming less energy.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122099603","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reversi: Post-silicon validation system for modern microprocessors","authors":"I. Wagner, V. Bertacco","doi":"10.1109/ICCD.2008.4751878","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751878","url":null,"abstract":"Verification remains an integral and crucial phase of today's microprocessor design and manufacturing process. Unfortunately, with soaring design complexities and decreasing time-to-market windows, today's verification approaches are incapable of fully validating a microprocessor before its release to the public. Increasingly, post-silicon validation is deployed to detect complex functional bugs in addition to exposing electrical and manufacturing defects. This is due to the significantly higher execution performance offered by post-silicon methods, compared to pre-silicon approaches. Validation in the post-silicon domain is predominantly carried out by executing constrained-random test instruction sequences directly on a hardware prototype. However, to identify errors, the state obtained from executing tests directly in hardware must be compared to the one produced by an architectural simulation of the design's golden model. Therefore, the speed of validation is severely limited by the necessity of a costly simulation step. In this work we address this bottleneck in the traditional flow and present a novel solution for post-silicon validation that exposes its native high performance. Our framework, called Reversi, generates random programs in such a way that their correct final state is known at generation time, eliminating the need for architectural simulations. Our experiments show that Reversi generates tests exposing more bugs faster, and can speed up post-silicon validation by 20x compared to traditional flows.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127939067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting producer patterns and L2 cache for timely dependence-based prefetching","authors":"C. Lim, G. Byrd","doi":"10.1109/ICCD.2008.4751935","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751935","url":null,"abstract":"This paper proposes an architecture that efficiently prefetches for loads whose effective addresses are directly dependent on previously-loaded values. This dependence-based prefetching scheme covers most frequently missed loads in programs that contain linked data structures (LDS). For timely prefetches, memory access patterns of producing loads are dynamically learned. These patterns (such as strides) are used to prefetch well ahead of the consumer load. The proposed prefetcher is placed near the processor core and targets L1 cache misses, because removing L1 cache misses has greater performance potential than removing L2 cache misses. We also examine how to capture pointers in LDS with pure hardware implementation. We find that the space requirement can be reduced, compared to previous work, if we selectively record patterns. Still, to make the prefetching scheme generally applicable, a large table is required for storing pointers. We show that storing the prefetch table in a partition of the L2 cache outperforms using the L2 cache conventionally.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128874884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy-precision tradeoffs in mobile Graphics Processing Units","authors":"Jeff Pool, A. Lastra, Montek Singh","doi":"10.1109/ICCD.2008.4751841","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751841","url":null,"abstract":"In mobile devices, limiting the Graphics Processing Unit's (GPU's) energy usage is of great importance to extending battery life. This paper focuses on the first stage of the graphics processor pipeline - the vertex transformation stage - and introduces an approach to lowering its switching activity by reducing the precision of arithmetic operations. As a result, the approach enables a tradeoff between energy efficiency and the quality of the rendered image. This paper makes the following specific contributions: 1) a transition-based energy model for quantifying energy consumed as a function of arithmetic precision, and 2) detailed simulation results on several real-world graphics applications to evaluate the tradeoff between energy and precision. In most examples, over 23% of the energy can be saved by lowering arithmetic precision while still maintaining a faithful reproduction of the full-precision image. Pushing the idea further, over 36% energy can be saved by further lowering the precision while preserving acceptable result accuracy. We assert that this represents a significant energy savings that warrants further investigation and extension of our approach to the remaining stages of the graphics processor pipeline.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121362716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Gate planning during placement for gated clock network","authors":"Weixiang Shen, Yici Cai, Xianlong Hong, Jiang Hu","doi":"10.1109/ICCD.2008.4751851","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751851","url":null,"abstract":"Clock gating is a popular technique for reducing power dissipation in clock networks. Although there have been numerous research efforts on clock gating, the previous approaches still have a significant weakness. That is, they usually construct a gated clock tree after cell placement, i.e., cell placement is performed without considering clock gating and may generate a solution unfriendly to subsequent gated clock tree construction. As a result, the control gates inserted in the tree construction are very likely to cause cell overlap. Even though the overlap can be eventually removed in placement legalization, remarkable wirelength/power overhead is incurred. In this paper, we propose a gate planning technique which is integrated with a partition-based cell placer. During cell placement, the planning judiciously inserts clock gates based on power estimation. In addition, pseudo edges are inserted between clock gates and registers in order to reduce clock wirelength and enable long shut-off periods. At the end, when a relatively detailed placement is obtained, a post-processing is performed to degrade the inefficient clock gates to clock buffers. We compared our approach with recent previous works on ISCAS89 benchmark circuits. Our method reduces the clock tree wirelength and power by 22.06% and 40.80%, respectively, with a very limited increase on signal nets wirelength and power compared with the conventional (register-oblivious) placement. The results also indicate that our algorithm outperforms the clock-gating-oblivious placement on power reduction and performance improvement.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125237064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A random and pseudo-gradient approach for analog circuit sizing with non-uniformly discretized parameters","authors":"Michael Pehl, Tobias Massier, H. Graeb, Ulf Schlichtmann","doi":"10.1109/ICCD.2008.4751860","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751860","url":null,"abstract":"Many methods for analog circuit sizing are available as commercial, in-house and academic tools. They are based on continuous optimization, e.g., of transistor geometries, although the subsequent layout step requires values on a pre-defined grid. In addition, sizing of transistors for bipolar and RF circuits frequently necessitates the use of multiples of predefined values for the design parameters. This paper presents a novel method for solving this type of discrete optimization problem. An iterative approach is presented, which is based on pseudo-gradients and a randomized calculation of search regions and steps. Experimental comparisons with simulated annealing and a continuous sizing approach with subsequent discretization clearly show the effectivity and efficiency of the presented method.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132641066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Is there always performance overhead for regular fabric?","authors":"Yi-Wei Lin, M. Marek-Sadowska, W. Maly, A. Pfitzner, D. Kasprowicz","doi":"10.1109/ICCD.2008.4751916","DOIUrl":"https://doi.org/10.1109/ICCD.2008.4751916","url":null,"abstract":"In this paper, we study the circuits built from super-regular, high-density transistor arrays that can be prefabricated and customized using an OPC-free interconnect manufacturing process. The super-regular layout style greatly enhances the chip's manufacturability. Unlike other regular fabrics that sacrifice area and performance to improve regularity, the new layout style, combined with a new 3-D geometry transistor, makes it possible to produce circuits with timing and power density comparable to or better than that of conventional CMOS circuits and using less chip area.","PeriodicalId":345501,"journal":{"name":"2008 IEEE International Conference on Computer Design","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131930787","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}