P. Otto, Maria Malik, N. Akhlaghi, Rebel Sequeira, H. Homayoun, S. Sikdar
{"title":"Power and performance characterization, analysis and tuning for energy-efficient edge detection on atom and ARM based platforms","authors":"P. Otto, Maria Malik, N. Akhlaghi, Rebel Sequeira, H. Homayoun, S. Sikdar","doi":"10.1109/ICCD.2015.7357153","DOIUrl":"https://doi.org/10.1109/ICCD.2015.7357153","url":null,"abstract":"The de facto standards for embedded platforms with medium to low computing demands are ARM with the Thumb ISA and Intel Atom with the X86 ISA with multiple cores. Operating these architectures in the milliwatt range while running real-time computer vision corner detection algorithms is a challenging problem. We present the analysis of power, performance and energy-efficiency measurements of Harris corner detection across a wide range of voltage and frequency settings, multicore/multithreading strategies, and compiler and application optimization parameters to find how the interplay of these parameters affects power, performance and energy-efficiency. Our measurement results on state-of-the-art embedded platforms demonstrate that a systematic cross-layer optimization at the application level (Sobel filter type, aperture size, number of image tiles), compiler level (branch prediction, function inlining) and system level (voltage and frequency setting, single core vs multicore implementation) significantly improves the energy-efficiency of corner detection, while meeting its real-time performance constraints. This cross-layer optimization improves the energy-efficiency of Harris corner on Atom and ARM by 89.5% and 87.2%, respectively.","PeriodicalId":129506,"journal":{"name":"2015 33rd IEEE International Conference on Computer Design (ICCD)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121644916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bicky Shakya, Ujjwal Guin, M. Tehranipoor, Domenic Forte
{"title":"Performance optimization for on-chip sensors to detect recycled ICs","authors":"Bicky Shakya, Ujjwal Guin, M. Tehranipoor, Domenic Forte","doi":"10.1109/ICCD.2015.7357116","DOIUrl":"https://doi.org/10.1109/ICCD.2015.7357116","url":null,"abstract":"IC recycling has become a grave problem in today's globalized semiconductor industry, with potential impact on critical infrastructures. In order to mitigate this problem, various Design-for-Anti-Counterfeit (DfAC) measures have been recently proposed. In this paper, we look at DfAC strategies based on recycling sensors, most notably the ones based on a pair of ring oscillators, which rely on integrated circuit aging phenomena to detect usage of ICs in the field. We introduce a novel optimization technique that generalizes to most recycling sensors suggested so far in the literature and gives manufacturers exact control over parameters that determine sensor performance, such as yield, misprediction and area overhead. A detailed analysis of various factors affecting recycling sensor performance is presented and an optimization problem is formulated and verified using simulations, in order to demonstrate the accuracy of the approach.","PeriodicalId":129506,"journal":{"name":"2015 33rd IEEE International Conference on Computer Design (ICCD)","volume":"294 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128102193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"InvArch: A hardware eficient architecture for Matrix Inversion","authors":"Umer I. Cheema, G. Nash, R. Ansari, A. Khokhar","doi":"10.1109/ICCD.2015.7357100","DOIUrl":"https://doi.org/10.1109/ICCD.2015.7357100","url":null,"abstract":"This paper proposes an efficient architecture (InvArch) for computing matrix inversion using the Gauss-Jordan elimination method. The proposed architecture exploits parallelism through pipelined floating-point computational units and reduces the number of floating-point multiplication units required compared with the existing pipelined implementations. The reduction in multiplication units results in over 80% reduction in hardware for floating-point computation units. The architecture performs in-place inversion and provides scalability across the rows and columns. Hardware efficiency is achieved by reaping benefit from regularity in computation and better utilization of pipelined computational resources. Multiple rows are normalized within an iteration of the Gauss-Jordan algorithm, which allows a reduction in the number of floating-point multiplication units in the elimination step. In addition to implementing the architecture, an analytical performance model is also developed for InvArch and some related works. InvArch achieves performance comparable to reference architectures in terms of clock cycles and throughput while using significantly fewer hardware resources.","PeriodicalId":129506,"journal":{"name":"2015 33rd IEEE International Conference on Computer Design (ICCD)","volume":"489 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115881786","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring early and late ALUs for single-issue in-order pipelines","authors":"Alen Bardizbanyan, P. Larsson-Edefors","doi":"10.1109/ICCD.2015.7357163","DOIUrl":"https://doi.org/10.1109/ICCD.2015.7357163","url":null,"abstract":"In-order processors are key components in energy-efficient embedded systems. One important design aspect of in-order pipelines is the sequence of pipeline stages: First, the position of the execute stage, in which arithmetic logic unit (ALU) operations and branch prediction are handled, impacts the number of stall cycles that are caused by data dependencies between data memory instructions and their consuming instructions and by address generation instructions that depend on an ALU result. Second, the position of the ALU inside the pipeline impacts the branch penalty. This paper considers the question of how best to make use of ALU resources inside a single-issue in-order pipeline. We begin by analyzing which is the most efficient way of placing a single ALU in an in-order pipeline. We then go on to evaluate what is the most efficient way to make use of two ALUs, one early and one late ALU, which is a technique that has revitalized commercial in-order processors in recent years. Our architectural simulations, which are based on 20 MiBench and 7 SPEC2000 integer benchmarks and a 65-nm post-layout netlist of a complete pipeline, show that utilizing two ALUs in different stages of the pipeline gives better performance and energy efficiency than any other pipeline configuration with a single ALU.","PeriodicalId":129506,"journal":{"name":"2015 33rd IEEE International Conference on Computer Design (ICCD)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116612730","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VLSI implementation of high-throughput, low-energy, configurable MIMO detector","authors":"P. Chuang, M. Sachdev, V. Gaudet","doi":"10.1109/ICCD.2015.7357162","DOIUrl":"https://doi.org/10.1109/ICCD.2015.7357162","url":null,"abstract":"This work focuses on a multi-core VLSI implementation of a multiple-input multiple-output (MIMO) detector utilizing a sphere-decoding algorithm. A complex-domain node traversal algorithm that achieves performance similar to that of an exhaustive-search algorithm, in which every node is checked and sorted, is also described. A 4×4, 64-QAM hard-output detector utilizing this VLSI design occupies 98k gates, and achieves near-ML performance with an average throughput of 1.22 Gb/s and an energy/bit of 23 pJ/b on a nominal 1.2 V supply in a 0.13 μm CMOS process. The hard-output design can be further expanded to provide soft-output capability, and achieves an average throughput of 0.65 Gb/s and reaches 10^-5 BER at an SNR of 19.7 dB.","PeriodicalId":129506,"journal":{"name":"2015 33rd IEEE International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129334725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy-optimal voltage model supporting a wide range of nodal switching rates for early design-space exploration","authors":"Doyun Kim, Jiangyi Li, Mingoo Seok","doi":"10.1109/ICCD.2015.7357129","DOIUrl":"https://doi.org/10.1109/ICCD.2015.7357129","url":null,"abstract":"This paper explores models of the energy-optimal voltage (VOPT) of near/sub-threshold digital VLSI circuits with a focus on the support for a wide range of nodal switching rates. The previous models can estimate the VOPT of circuits having relatively high nodal switching rates (VOPT,H), but can become inaccurate in finding the VOPT of circuits having low nodal switching rates. In this work, therefore, we develop models for finding (i) the VOPT of circuits having low nodal switching rates (VOPT,L) and (ii) the critical nodal switching rate point (αcrit) below which VOPT,L should be used. The models are verified with inverter chains and sub-threshold 10-transistor SRAM arrays in SPICE-level simulation. The model takes only process technology parameters to estimate VOPTs, and can be suitable for early-stage design-space exploration.","PeriodicalId":129506,"journal":{"name":"2015 33rd IEEE International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129496941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring the viability of stochastic computing","authors":"Joao Marcos de Aguiar, S. Khatri","doi":"10.1109/ICCD.2015.7357131","DOIUrl":"https://doi.org/10.1109/ICCD.2015.7357131","url":null,"abstract":"Recently, stochastic circuits have received significant attention from academia. Stochastic circuits claim to have a reduced energy consumption at the cost of accuracy and delay. In this paper, we explore the power, delay, energy and area of a stochastic circuit (a stochastic multiplier in particular), and compare these metrics with those of a regular multiplier, implemented using the Sum Of Products (SOP) approach. The SOP based multiplier is implemented using both a Kogge-Stone adder and a ripple-carry adder. Our results show that when the stochastic number generator (SNG) and counter are included in the stochastic multiplier (SM), even for 3 bits, the SM consumes more energy to finish one multiplication than an SOP based regular binary multiplier (RM), and this energy consumption grows exponentially as the number of bits increases. If we only consider the stochastic multiplier cell (SMC, which is simply a 2-input AND gate) and ignore the energy of the SNG and counter, the SMC has a better energy consumption for multiplications up to 12 bits. However, even for 3 bits, the SM (or the SMC) is slower by over 5x compared to the regular multiplier, and this delay increases exponentially as the number of bits increases. The area of the SM (including the area of the SNG and counter) is smaller for multipliers with more than 6 bits.","PeriodicalId":129506,"journal":{"name":"2015 33rd IEEE International Conference on Computer Design (ICCD)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127056828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Power management of pulsed-index communication protocols","authors":"Shahzad Muzaffar, I. Elfadel","doi":"10.1109/ICCD.2015.7357127","DOIUrl":"https://doi.org/10.1109/ICCD.2015.7357127","url":null,"abstract":"Pulsed-Index Communication (PIC) is a novel technique for single-channel, high-data-rate, low-power dynamic signaling that does not require any clock and data recovery (CDR). It is fully adapted to the simple yet robust communication needs of IoT devices and sensors. Prior work has focused on the power savings that this protocol can achieve as a result of the elimination of circuitry devoted to clock and data recovery. In this paper, we show that further power savings can be achieved using the duty cycle of the pulse as a power control parameter. This power control policy is applied to a single-wire link with significant power savings achieved above and beyond the savings due to the CDR elimination. These power savings are obtained without any impact on data rate. The pulse control policy is implemented using 45 nm CMOS technology and verified on various single-channel communication links.","PeriodicalId":129506,"journal":{"name":"2015 33rd IEEE International Conference on Computer Design (ICCD)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129197438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Ahari, Mojtaba Ebrahimi, Fabian Oboril, M. Tahoori
{"title":"Improving reliability, performance, and energy efficiency of STT-MRAM with dynamic write latency","authors":"A. Ahari, Mojtaba Ebrahimi, Fabian Oboril, M. Tahoori","doi":"10.1109/ICCD.2015.7357091","DOIUrl":"https://doi.org/10.1109/ICCD.2015.7357091","url":null,"abstract":"High write latency and high write energy are the major challenges in Spin Transfer Torque Magnetic Random Access Memory (STT-MRAM) design. The write operation in STT-MRAM is stochastic in nature. Therefore, it requires a very long timing margin to maintain an acceptable level of reliability and yield. Traditionally, Error Correction Codes (ECCs) are used to reduce the timing margin in STT-MRAM. However, they impose high storage and latency overheads. In this paper, we propose a low-cost architecture-level technique to significantly reduce the amount of required timing margin. This technique employs a handshaking protocol between the memory and its controller to dynamically determine the write latency at run-time. Our simulation infrastructure comprehensively models the combined effect of process variation and stochastic write behavior at circuit-level and abstracts it to architecture-level. The simulation results show that the proposed technique not only considerably reduces the write error rate but also improves the overall system performance on average by 15.4% compared to existing solutions.","PeriodicalId":129506,"journal":{"name":"2015 33rd IEEE International Conference on Computer Design (ICCD)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124877792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dawei Li, S. Joshi, S. Memik, J. Hoff, S. Jindariani, Tiehui Liu, J. Olsen, N. Tran
{"title":"A methodology for power characterization of associative memories","authors":"Dawei Li, S. Joshi, S. Memik, J. Hoff, S. Jindariani, Tiehui Liu, J. Olsen, N. Tran","doi":"10.1109/ICCD.2015.7357156","DOIUrl":"https://doi.org/10.1109/ICCD.2015.7357156","url":null,"abstract":"Content Addressable Memories (CAM) have become increasingly important in applications requiring high-speed memory search due to their inherent massively parallel processing architecture. We present a complete power analysis methodology for CAM systems to aid the exploration of their power-performance trade-offs in future systems. Our proposed methodology uses detailed transistor-level circuit simulation of power behavior and a handful of input data types to simulate full chip power consumption. Furthermore, we applied our power analysis methodology to a custom-designed associative memory test chip. This chip was developed by Fermilab for the purpose of developing high-performance real-time pattern recognition on high-volume data produced by a future large-scale scientific experiment. We applied our methodology to configure a power model for this test chip. Our model is capable of predicting the total average power within 4% of actual power measurements. Our power analysis methodology can be generalized and applied to other CAM-like memory systems to accurately characterize their power behavior.","PeriodicalId":129506,"journal":{"name":"2015 33rd IEEE International Conference on Computer Design (ICCD)","volume":"80 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114032075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}