{"title":"RExCache: Rapid exploration of unified last-level cache","authors":"S. Min, Haris Javaid, S. Parameswaran","doi":"10.1109/ASPDAC.2013.6509661","DOIUrl":"https://doi.org/10.1109/ASPDAC.2013.6509661","url":null,"abstract":"In this paper, we propose to explore design space of a unified last-level cache to improve system performance and energy efficiency. The challenge is to quickly estimate the execution time and energy consumption of the system with distinct cache configurations using minimal number of slow full-system cycle-accurate simulations. To this end, we propose a novel, simple yet highly accurate execution time estimator and a simple, reasonably accurate energy estimator. Our framework, RExCache, combines a cycle-accurate simulator and a trace-driven cache simulator with our novel execution time estimator and energy estimator to avoid cycle-accurate simulations of all the last-level cache configurations. Once execution time and energy estimates are available from the estimators, RExCache chooses minimum execution time or minimum energy consumption cache configuration. Our experiments with nine different applications from mediabench, and 330 last-level cache configurations show that the execution time and energy estimators had at least average absolute accuracy of 99.74% and 80.31% respectively. RExCache took only a few hours (21 hours for H.264enc) to explore last-level cache configurations compared to several days of traditional method (36 days for H.264enc) and cycle-accurate simulations (257 days for H.264enc), enabling quick exploration of the last-level cache. When 100 different real-time constraints on execution time and energy were used, all the cache configurations found by RExCache were similar to those from cycle-accurate simulations. On the other hand, the traditional method found correct cache configurations for only 69 out of 100 constraints. 
Thus, RExCache has better absolute accuracy than the traditional method while reducing the simulation time by at least 97%.","PeriodicalId":297528,"journal":{"name":"2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126892688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Curling-PCM: Application-specific wear leveling for phase change memory based embedded systems","authors":"Duo Liu, Tianzheng Wang, Yi Wang, Z. Shao, Qingfeng Zhuge, E. Sha","doi":"10.1109/ASPDAC.2013.6509609","DOIUrl":"https://doi.org/10.1109/ASPDAC.2013.6509609","url":null,"abstract":"Phase change memory (PCM) has been used as NOR flash replacement in embedded systems with its attractive features. However, the endurance of PCM keeps drifting down and greatly limits its adoption in embedded systems. As most embedded systems are application-oriented, we can better utilize PCM by exploring application-specific features such as fixed access patterns and update frequencies to prolong the lifetime of PCM. In this paper, we propose an application-specific wear leveling technique, called Curling-PCM, to evenly distribute write activities across the PCM chip in order to improve the endurance of PCM. The basic idea is to exploit application-specific features in embedded systems and periodically move the hot region across the whole PCM chip. To further reduce the overhead of moving the hot region and improve the performance of PCM-based embedded systems, a fine-grained partial wear leveling policy is proposed in Curling-PCM, by which only part of the hot region is moved during each request handling period. The experimental results show that Curling-PCM can effectively evenly distribute write traffic in PCM chips compared with previous work. 
We expect this work can serve as a first step towards the full exploration of application-specific features in PCM-based embedded systems.","PeriodicalId":297528,"journal":{"name":"2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128059359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimization of overdrive signoff","authors":"T. Chan, A. Kahng, Jiajia Li, S. Nath","doi":"10.1109/ASPDAC.2013.6509619","DOIUrl":"https://doi.org/10.1109/ASPDAC.2013.6509619","url":null,"abstract":"In modern SOC implementations, multi-mode design is commonly used to achieve better circuit performance and power across voltage-scaling, “turbo” and other operating modes. Although there are many tools for multi-mode circuit implementation, to our knowledge there is no available systematic analysis or methodology for the selection of associated signoff modes. We observe that the selection of signoff modes has significant impact on circuit area, power and performance. For example, incorrect choice of signoff voltages for required overdrive frequencies can result in a netlist with 15% suboptimality in power or 21% in area. In this paper, we propose a concept of mode dominance which can be used as a guideline for signoff mode selection. Further, we also propose efficient circuit implementation flows to optimize the selection of signoff modes within several distinct use cases. Our results show that our proposed methodology provides 5-7% improvement in performance compared to the traditional “signoff and scale” method. 
The signoff modes determined by our methods result in only 0.6% overhead in performance and 8% overhead in power after implementation, compared to the optimal signoff modes.","PeriodicalId":297528,"journal":{"name":"2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115977268","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-level synthesis of multiple dependent CUDA kernels on FPGA","authors":"S. Gurumani, Hisham Cholakkal, Yun Liang, K. Rupnow, Deming Chen","doi":"10.1109/ASPDAC.2013.6509613","DOIUrl":"https://doi.org/10.1109/ASPDAC.2013.6509613","url":null,"abstract":"High-level synthesis (HLS) tools provide automatic generation of hardware at the register transfer level (RTL) from algorithm descriptions written in high-level languages, enabling faster creation of custom accelerators for FPGA architectures. Existing HLS tools support a wide variety of input languages, and assist users in design space exploration through automation and feedback on designs' performance bottlenecks. This design space exploration applies techniques such as pipelining, partitioning and resource sharing in order to improve performance, and resource utilization. However, although automated exploration can find some inherent parallelism, data-parallel input source code is still superior for exposing a greater variety of parallelism. In prior work, we demonstrated automated design space exploration of GPU multi-threaded (CUDA) language source code for efficient RTL generation. In this paper, we examine the challenges in extending this automated design space exploration to multiple dependent CUDA kernels, demonstrate a step-by-step procedure for efficiently performing multi-kernel synthesis, and demonstrate the potential of this approach through a case study of a stereo matching algorithm. This study demonstrates that HLS of multiple dependent CUDA kernels can maintain performance parity with the GPU implementation, while consuming over 16X less energy than the GPU. 
Based on our manual procedure, we identify the key challenges in fully automating the synthesis of multi-kernel CUDA programs.","PeriodicalId":297528,"journal":{"name":"2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133861229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unconditionally stable explicit method for the fast 3-D simulation of on-chip power distribution network with through silicon via","authors":"T. Sekine, H. Asai","doi":"10.1109/ASPDAC.2013.6509550","DOIUrl":"https://doi.org/10.1109/ASPDAC.2013.6509550","url":null,"abstract":"The equivalent circuit of an on-chip power distribution network (PDN) has a fine 3-D grid structure due to the vias between equipotential conductors, and the vertical couplings between power and ground lines. In addition, a through silicon via is modeled with inductive and capacitive parasitic elements and appended to the PDN. Therefore, the circuit related to the 3-D IC technology tends to be a tightly coupled large network. For the simulation of this type of network, an explicit time marching scheme has an advantage in computational cost over conventional general-purpose circuit simulators such as SPICE. However, the explicit method has a strict numerical stability condition, which may limit the maximum time step size and increase the total cost. In this work, we propose a method that is explicit yet stable, with no stability condition. Additionally, the proposed unconditionally stable explicit method is further accelerated by combining it with an order reduction technique.","PeriodicalId":297528,"journal":{"name":"2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114567591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MIXSyn: An efficient logic synthesis methodology for mixed XOR-AND/OR dominated circuits","authors":"L. Amarù, P. Gaillardon, G. Micheli","doi":"10.1109/ASPDAC.2013.6509585","DOIUrl":"https://doi.org/10.1109/ASPDAC.2013.6509585","url":null,"abstract":"We present a new logic synthesis methodology, called MIXSyn, that produces area-efficient results for mixed XOR-AND/OR dominated logic functions. MIXSyn is a two-step synthesis process. The first step is a hybrid logic optimization that enables selective and distinct optimization of AND/OR and XOR-intensive portions of the logic circuit. The second step is a library-free technology mapping that enhances design flexibility with a tractable computational cost. MIXSyn has been tested on a set of large MCNC benchmarks. Experimental results indicate that MIXSyn produces CMOS circuits with 18.0% and 9.2% fewer devices, on average, with respect to state-of-the-art academic and commercial synthesis tools, respectively. MIXSyn is also capable of exploiting novel XOR implementations offered by the use of double-gate ambipolar devices. Experimental results show that MIXSyn can reduce the number of ambipolar transistors by 20.9% and 15.3%, on average, with respect to state-of-the-art academic and commercial synthesis tools, respectively.","PeriodicalId":297528,"journal":{"name":"2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114510995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Heterogeneous memory management for 3D-DRAM and external DRAM with QoS","authors":"L. Tran, F. Kurdahi, A. Eltawil, H. Homayoun","doi":"10.1109/ASPDAC.2013.6509676","DOIUrl":"https://doi.org/10.1109/ASPDAC.2013.6509676","url":null,"abstract":"This paper presents an innovative memory management approach to utilize both 3D-DRAM and external DRAM (ex-DRAM). Our approach dynamically allocates and relocates memory blocks between the 3D-DRAM and the ex-DRAM to exploit the high memory bandwidth and the low memory latency of the 3D-DRAM as well as the high capacity and the low cost of the ex-DRAM. Our simulation shows that in workloads that are not memory intensive, our memory management technique transfers all active memory blocks to the 3D-DRAM which runs faster than the ex-DRAM. In memory intensive workloads, our memory management technique utilizes both the 3D-DRAM and the ex-DRAM to increase the memory bandwidth to alleviate bandwidth congestion. Our approach supports Quality of Service (QoS) for “latency sensitive”, “bandwidth sensitive”, and “insensitive” applications. To improve the performance and satisfy a certain level of QoS, memory blocks of different application types are allocated differently. Compared to the scratchpad memory management mechanism, the average memory access latency of our approach decreases by 19% and 23%, while performance improves by up to 5% and 12% in single threaded benchmarks and multi-threaded benchmarks respectively. Moreover, using our approach, applications do not need to manage memory explicitly like in the scratchpad case. 
Our memory block relocation comes with negligible performance overhead, particularly for applications which have high spatial memory locality.","PeriodicalId":297528,"journal":{"name":"2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121929818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HS3DPG: Hierarchical simulation for 3D P/G network","authors":"Shuai Tao, Xiaoming Chen, Yu Wang, Yuchun Ma, Yiyu Shi, Hui Wang, Huazhong Yang","doi":"10.1109/ASPDAC.2013.6509647","DOIUrl":"https://doi.org/10.1109/ASPDAC.2013.6509647","url":null,"abstract":"As different chips are stacked together in 3D ICs, the power/ground (P/G) network simulation becomes more challenging than that of 2D cases. In this paper, we propose a hierarchical simulation method suitable for 3D P/G network (HS3DPG), which can ensure full parallelism and good scalability with the number of tiers. In the IR drop analysis, when there are 9 tiers, the hierarchical method can be 6.5 times faster than the direct full network simulation. The accuracy of HS3DPG has been verified by a 3D P/G network from the industrial design. Besides, we introduce the “locality” property into HS3DPG to further simplify the simulation. Finally, HS3DPG is used to analyze the voltage distribution of a 3D P/G network with clustered TSVs.","PeriodicalId":297528,"journal":{"name":"2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129725016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Application-specific fault-tolerant architecture synthesis for digital microfluidic biochips","authors":"M. Alistar, P. Pop, J. Madsen","doi":"10.1109/ASPDAC.2013.6509697","DOIUrl":"https://doi.org/10.1109/ASPDAC.2013.6509697","url":null,"abstract":"Microfluidic-based biochips are replacing conventional biochemical analyzers, and are able to integrate on-chip all the necessary functions for biochemical analysis using microfluidics. Digital microfluidic biochips are based on the manipulation of liquids not as a continuous flow, but as discrete droplets on an array of electrodes. Microfluidic operations, such as transport, mixing, and splitting, are performed on this array by routing the corresponding droplets on a series of electrodes. Researchers have proposed several approaches for the synthesis of digital microfluidic biochips. All previous work assumes that the biochip architecture is given, and most approaches consider a rectangular shape for the electrode array. However, non-regular application-specific architectures are common in practice. Hence, in this paper, we propose an approach to application-specific architecture synthesis. Our approach can also help the designer to increase the yield by introducing redundant electrodes to tolerate permanent faults. 
The proposed architecture synthesis algorithm has been evaluated using several benchmarks.","PeriodicalId":297528,"journal":{"name":"2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"146 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127511846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A sub-harmonic injection-locked frequency synthesizer with frequency calibration scheme for use in 60GHz TDD transceivers","authors":"T. Siriburanon, W. Deng, Ahmed Musa, K. Okada, A. Matsuzawa","doi":"10.1109/ASPDAC.2013.6509574","DOIUrl":"https://doi.org/10.1109/ASPDAC.2013.6509574","url":null,"abstract":"A 58.1-to-65.0 GHz frequency synthesizer using sub-harmonic injection-locking technique is presented. The synthesizer can generate all 60GHz channels defined by IEEE 802.15.3c, wirelessHD, IEEE 802.11ad, WiGig, and ECMA-387. A frequency calibration scheme is proposed to monitor frequency shift resulting from environmental variations. Implemented in a 65nm CMOS process, the synthesizer achieves a typical phase noise of -117 dBc/Hz @10MHz offset from a carrier frequency of 61.56 GHz.","PeriodicalId":297528,"journal":{"name":"2013 18th Asia and South Pacific Design Automation Conference (ASP-DAC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128995065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}