{"title":"Efficient provably good OPC modeling and its applications to interconnect optimization","authors":"Shih-Lun Huang, Chung-Wei Lin, Yao-Wen Chang","doi":"10.1109/ICCD.2010.5647713","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647713","url":null,"abstract":"Optical Proximity Correction (OPC) is the most popular technique to handle design shape distortions arising from subwavelength lithography. Existing OPC models are typically very computationally expensive and thus not efficient to be incorporated for layout optimization. In this paper, we present an efficient, yet sufficiently accurate OPC cost model which can predict the optimal location of a wire segment for OPC optimization and give an upper bound of the interference amount, guaranteeing that the interference amount is never underestimated. Based on this cost model, we propose an OPC-aware wire perturbation algorithm for post-layout interconnect optimization. We show that the effects of wire perturbation have the concavity or monotonicity property which can dramatically reduce the search space for finding the optimal location of each wire for OPC optimization. Further, we can incrementally update the OPC cost of a wire by recomputing only the affected wires because of the property of superposition of our model. Experimental results show that our algorithm can efficiently obtain much better OPC results than a state-of-the-art OPC-friendly router, based on a leading commercial OPC tool.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"329 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116123330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimal power/performance pipelining for error resilient processors","authors":"Nicolas Zea, J. Sartori, Ben Ahrens, Rakesh Kumar","doi":"10.1109/ICCD.2010.5647702","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647702","url":null,"abstract":"Timing speculation has been proposed as a technique for maximizing the energy efficiency of processors with minimal loss in performance. A typical implementation of timing speculation involves speculatively reducing the voltage of a processor to a point where errors are possible but rare, and employing an error recovery mechanism to ensure correct functionality. This allows significant energy savings with a small recovery overhead.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124032912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Helia: Heterogeneous Interconnect for Low Resolution Cache Access in snoop-based chip multiprocessors","authors":"Ali Shafiee, Narges Shahidi, A. Baniasadi","doi":"10.1109/ICCD.2010.5647589","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647589","url":null,"abstract":"In this work we introduce Heterogeneous Interconnect for Low Resolution Cache Access (Helia). Helia improves energy efficiency in snoop-based chip multiprocessors as it eliminates unnecessary activities in both interconnect and cache. This is achieved by using innovative snoop filtering mechanisms coupled with wire management techniques. Our optimizations rely on the observation that a high percentage of cache mismatches could be detected by utilizing a small but highly informative subset of the tag bits. Helia relies on the snoop controller to detect possible remote tag mismatches prior to tag array lookup. Power is reduced as a) our wire management techniques permit slow transmission of a subset of tag bits while tag mismatches are being detected and b) we avoid cache access for mismatches detected at the snoop controller. Our evaluation shows that Helia reduces power in interconnect (dynamic: 64% to 75%, static: 45% to 50%) and cache tag array (dynamic: 57% to 58%, static: 80%) while improving average performance up to 4.4%.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126527430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design and implementation of a special purpose embedded system for neural machine interface","authors":"Xiaorong Zhang, H. Huang, Qing Yang","doi":"10.1109/ICCD.2010.5647801","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647801","url":null,"abstract":"Our previous study has shown the potential of using a computer system to accurately decode electromyographic (EMG) signals for neural controlled artificial legs. Because of computation complexity of the training algorithm coupled with real time requirement of controlling artificial legs, traditional embedded systems generally cannot be directly applied to the system. This paper presents a new design of an FPGA-based neural-machine interface for artificial legs. Both the training algorithm and the real time controlling algorithm are implemented on an FPGA. A soft processor built on the FPGA is used to manage hardware components and direct data flows. The implementation and evaluation of this design are based on Altera Stratix II GX EP2SGX90 FPGA device on a PCI Express development board. Our performance evaluations indicate that a speedup of around 280X can be achieved over our previous software implementation with no sacrifice of computation accuracy. The results demonstrate the feasibility of a self-contained, low power, and high performance real-time neural-machine interface for artificial legs.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127514558","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Threads vs. caches: Modeling the behavior of parallel workloads","authors":"Zvika Guz, O. Itzhak, I. Keidar, A. Kolodny, A. Mendelson, U. Weiser","doi":"10.1109/ICCD.2010.5647747","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647747","url":null,"abstract":"A new generation of high-performance engines now combine graphics-oriented parallel processors with a cache architecture. In order to meet this new trend, new highly-parallel workloads are being developed. However, it is often difficult to predict how a given application would perform on a given architecture. This paper provides a new model capturing the behavior of such parallel workloads on different multi-core architectures. Specifically, we provide a simple analytical model, which, for a given application, describes its performance and power as a function of the number of threads it runs in parallel, on a range of architectures. We use our model (backed by simulations) to study both synthetic workloads and real ones from the PARSEC suite. Our findings recognize distinctly different behavior patterns for different application families and architectures.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132596453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LMS-based low-complexity game workload prediction for DVFS","authors":"Benedikt Dietrich, S. Nunna, Dip Goswami, S. Chakraborty, M. Gries","doi":"10.1109/ICCD.2010.5647675","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647675","url":null,"abstract":"While dynamic voltage and frequency scaling (DVFS) based power management has been widely studied for video processing, there is very little work on game power management. Recent work on proportional-integral-derivative (PID) controllers for predicting game workload used hand-tuned PID controller gains on relatively short game plays. This left open questions on the robustness of the PID controller and how sensitive the prediction quality is to the choice of the gain values, especially for long game plays involving different scenarios and scene changes. In this paper we propose a Least Mean Squares (LMS) Linear Predictor, which is a regression model commonly used for system parameter identification. Our results show that game workload variation can be estimated using a linear-in-parameters (LIP) model. This observation dramatically reduces the complexity of parameter estimation as the LMS Linear Predictor learns the relevant parameters of the model iteratively as the game progresses. The only parameter to be tuned by the system designer is the learning rate, which is relatively straightforward. Our experimental results using the LMS Linear Predictor show comparable power savings and game quality with those obtained from a highly-tuned PID controller.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131176025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SWIFT: A SWing-reduced interconnect for a Token-based Network-on-Chip in 90nm CMOS","authors":"T. Krishna, J. Postman, Christopher Edmonds, L. Peh, P. Chiang","doi":"10.1109/ICCD.2010.5647666","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647666","url":null,"abstract":"With the advent of chip multi-processors (CMPs), on-chip networks are critical for providing low-power communications that scale to high core counts. With this motivation, we present a 64-bit, 8×8 mesh Network-on-Chip in 90nm CMOS that: a) bypasses flit buffering in routers using Token Flow Control, thereby reducing buffer power along the control path, and b) uses low-voltage-swing crossbars and links to reduce interconnect energy in the data path. These approaches enable 38% power savings and 39% latency reduction, when compared with an equivalent baseline network. An experimental 2×2 core prototype, operating at 400 MHz, validates our design.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132428501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards cool and reliable digital systems: RT level CED techniques with runtime adaptability","authors":"Yu Liu, Kaijie Wu","doi":"10.1109/ICCD.2010.5647625","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647625","url":null,"abstract":"In response to the rising fault susceptibility of ICs due to aggressive device scaling, a number of concurrent error detection (CED) techniques have been proposed. Most existing techniques address the problem at device or logic level. To account for the significant process variations and device aging of today's nano-meter devices, these techniques must always aim at the worst case of fault susceptibility. Recognizing that the power consumption of the CED circuitry for different fault susceptibility varies significantly, these techniques could result in significant overhead. In this paper, we propose register transfer level CED techniques that can be adjusted at runtime according to the actual need. The proposed high-level synthesis technique ensures that the generated datapath consumes minimal power for any CED capability it has been turned to. The proposed approach is tested using known benchmarks.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134523426","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bandwidth optimization in asynchronous NoCs by customizing link wire length","authors":"Junbok You, Daniel Gebhardt, K. Stevens","doi":"10.1109/ICCD.2010.5647660","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647660","url":null,"abstract":"The bandwidth requirement for each link on a network-on-chip (NoC) may differ based on topology and traffic properties of the IP cores. Available bandwidth on an asynchronous NoC link will also vary depending on the wire length between sender and receiver. We explore the benefit to NoC performance when this property is used to increase bandwidth on specific links that carry the most traffic of an SoC design. Two methods are used to accomplish this: specifying router locations on the floorplan, and adding pipeline latches on long links. Energy and latency characteristics of an asynchronous NoC are compared to a similarly-designed synchronous NoC. The results indicate that the asynchronous network has lower energy, and link-specific bandwidth optimization has improved the average packet latency. Adding pipeline latches to congested links yields the most improvement. This link-specific optimization is applicable not only to the router and network we present here, but any asynchronous NoC used in a heterogeneous SoC.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133981198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A voting-based working set assessment scheme for dynamic cache resizing mechanisms","authors":"Masayuki Sato, Ryusuke Egawa, H. Takizawa, Hiroaki Kobayashi","doi":"10.1109/ICCD.2010.5647599","DOIUrl":"https://doi.org/10.1109/ICCD.2010.5647599","url":null,"abstract":"Considering the trade-off between performance and power consumption has become significantly important in multi-core processor design. Under this situation, one promising approach is to employ a power-aware dynamic cache partitioning mechanism. This mechanism individually manages activation of each cache way, and exclusively allocates the minimum number of required ways to each thread. In the mechanism, an appropriate number of ways for a thread is decided based on locality assessment. However, sampling results of cache accesses that are used for locality assessment are disturbed by exceptional behaviors of cache accesses, which happen in a very short period. Such sampling results may change locality assessment results to ones that do not follow the overall trend in a long access-sampling period. These assessment results will excessively adapt the cache to exceptional behaviors, and deteriorate energy efficiency. To avoid such excessive adaptation by the exceptional behaviors, this paper proposes a voting-based working set assessment scheme, in which the number of activated ways is adjusted based on majority voting of locality assessment of several short sampling periods. By using the majority voting, the proposed scheme can identify the periods including exceptional behaviors, and ignore the assessment results of these periods. As a result, the proposed scheme makes the cache resizing mechanism more stable and robust. The experimental results indicate that the proposed scheme can reduce energy consumption by up to 24%, and by 10% on average, without significant performance degradation in multi-thread execution on a 2-core CMP.","PeriodicalId":182350,"journal":{"name":"2010 IEEE International Conference on Computer Design","volume":"194 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121313551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}