{"title":"High level energy modeling of controller logic in data caches","authors":"P. Panda, Sourav Roy, Srikanth Chandrasekaran, Namita Sharma, Jasleen Kaur, Sarath Kumar Kandalam, N. Nagaraj","doi":"10.1145/2591513.2591590","DOIUrl":"https://doi.org/10.1145/2591513.2591590","url":null,"abstract":"In modern embedded processor caches, a significant amount of energy dissipation occurs in the controller logic part of the cache. Previous power/energy modeling tools have focused on the core memory part of the cache. We propose energy models for two such controller modules -- the write buffer and the replacement logic. Since this hardware is generally synthesized by designers, our power models are based on empirical data. We found a linear dependence of the per-access write buffer energy on the write buffer depth and write width. We validated our models on several different benchmark examples, using different technology nodes. Our models generate energy estimates that are within 4.2% of those measured by detailed power simulations, making the models valuable mechanisms for rapid energy estimates during architecture exploration.","PeriodicalId":272619,"journal":{"name":"ACM Great Lakes Symposium on VLSI","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116730122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Horizontal benchmark extension for improved assessment of physical CAD research","authors":"A. Kahng, Hyein Lee, Jiajia Li","doi":"10.1145/2591513.2591540","DOIUrl":"https://doi.org/10.1145/2591513.2591540","url":null,"abstract":"The rapid growth in complexity and diversity of IC designs, design flows and methodologies has resulted in a benchmark-centric culture for evaluation of performance and scalability in physical-design algorithm research. Landmark papers in the literature present vertical benchmarks that can be used across multiple design flow stages; artificial benchmarks with characteristics that mimic those of real designs; artificial benchmarks with known optimal solutions; as well as benchmark suites created by major companies from internal designs and/or open-source RTL. However, to our knowledge, there has been no work on horizontal benchmark creation, i.e., the creation of benchmarks that enable maximal, comprehensive assessments across commercial and academic tools at one or more specific design stages. Typically, the creation of horizontal benchmarks is limited by mismatches in data models, netlist formats, technology files, library granularity, etc. across different tools, technologies, and benchmark suites. In this paper, we describe a methodology and robust infrastructure for "horizontal benchmark extension" that permits maximal leverage of benchmark suites and technologies in "apples-to-apples" assessment of both industry and academic optimizers. We demonstrate horizontal benchmark extensions, and the assessments that are thus enabled, in two well-studied domains: place-and-route (four combinations of academic placers/routers, and two commercial P&R tools) and gate sizing (two academic sizers, and three commercial tools). We also point out several issues and precepts for horizontal benchmark enablement.","PeriodicalId":272619,"journal":{"name":"ACM Great Lakes Symposium on VLSI","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128230054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A performance enhancing hybrid locally mesh globally star NoC topology","authors":"T. S. Das, P. Ghosal, S. Mohanty, E. Kougianos","doi":"10.1145/2591513.2591544","DOIUrl":"https://doi.org/10.1145/2591513.2591544","url":null,"abstract":"With the rapid increase in chip density, Network-on-Chip (NoC) is becoming the prevalent architecture for today's complex chip multiprocessor (CMP) based systems. One of the major challenges of the NoC is to design an enhanced, parallel, communication-centric, scalable architecture for on-chip communication. In this paper, a hybrid mesh-based star topology is proposed to provide low latency, high throughput, and more evenly distributed traffic throughout the network. Simulation results show that the proposed topology achieves a maximum latency benefit of 62% (for size 8x8) and throughput benefits of 55% (for size 8x8) and 42% (for size 12x12) over mesh, with a small area overhead.","PeriodicalId":272619,"journal":{"name":"ACM Great Lakes Symposium on VLSI","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121220155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A parallel and reconfigurable architecture for efficient OMP compressive sensing reconstruction","authors":"A. Kulkarni, H. Homayoun, T. Mohsenin","doi":"10.1145/2591513.2591598","DOIUrl":"https://doi.org/10.1145/2591513.2591598","url":null,"abstract":"Compressive Sensing (CS) is a novel scheme, in which a signal that is sparse in a known transform domain can be reconstructed using fewer samples. However, the signal reconstruction techniques are computationally intensive and power consuming, which makes them impractical for embedded applications. This work presents a parallel and reconfigurable architecture for the Orthogonal Matching Pursuit (OMP) algorithm, one of the most popular CS reconstruction algorithms. In this paper, we propose the first reconfigurable OMP CS reconstruction architecture, which can take different image sizes with sparsity up to 32. The aim is to minimize hardware complexity, area, and power consumption, and to improve reconstruction latency while meeting the reconstruction accuracy. First, the accuracy of reconstructed images is analyzed for different sparsity values and fixed-point word length reduction. Next, efficient parallelization techniques are applied to reconstruct signals with varying signal lengths N. The OMP algorithm is mainly divided into three kernels, where each kernel is parallelized to reduce execution time, and efficient reuse of the matrix operators allows us to reduce area. The proposed architecture can reconstruct images of different sizes and measurements and is implemented on a Xilinx Virtex 7 FPGA. The results indicate that, for a 128x128 image reconstruction, the proposed reconfigurable architecture is 1.8x to 2.67x faster than previous non-reconfigurable work, which is less complex and uses a much smaller sparsity.","PeriodicalId":272619,"journal":{"name":"ACM Great Lakes Symposium on VLSI","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121427332","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A comparison of FinFET based FPGA LUT designs","authors":"M. Abusultan, S. Khatri","doi":"10.1145/2591513.2591596","DOIUrl":"https://doi.org/10.1145/2591513.2591596","url":null,"abstract":"The FinFET device has gained much traction in recent VLSI designs. In the FinFET device, the conduction channel is vertical, unlike a traditional bulk MOSFET, in which the conduction channel is planar. This yields several benefits, and as a consequence, it is expected that most VLSI designs will utilize FinFETs from the 20nm node and beyond. Despite the fact that several research papers have reported FinFET based circuit and layout realizations for popular circuit blocks, there has been no reported work on the use of FinFETs for Field Programmable Gate Array (FPGA) designs. The key circuit in the FPGA that enables programmability is the n-input Look-up Table (LUT). An n-input LUT can implement any logic function of up to n inputs. In this paper, we present an evaluation of several FPGA LUT designs. We compare these designs from a performance (delay, power, energy) as well as an area perspective. Comparisons are conducted with respect to a bulk based LUT as well. Our results demonstrate that all the FinFET based LUTs exhibit better delays and energy than the bulk based LUT. Based on our comparisons, we have two winning candidate LUTs, one for high-performance designs (3X faster than a bulk based LUT) and another for low-energy, area-constrained designs (83% energy and 58% area compared to a bulk based LUT).","PeriodicalId":272619,"journal":{"name":"ACM Great Lakes Symposium on VLSI","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115937627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Neural network-based accelerators for transcendental function approximation","authors":"Schuyler Eldridge, F. Raudies, D. Zou, A. Joshi","doi":"10.1145/2591513.2591534","DOIUrl":"https://doi.org/10.1145/2591513.2591534","url":null,"abstract":"The general-purpose approximate nature of neural network (NN) based accelerators has the potential to sustain the historic energy and performance improvements of computing systems. We propose the use of NN-based accelerators to approximate mathematical functions in the GNU C Library (glibc) that commonly occur in application benchmarks. Using our NN-based approach to approximate cos, exp, log, pow, and sin we achieve an average energy-delay product (EDP) that is 68x lower than that of traditional glibc execution. In applications, our NN-based approach has an EDP 78% of that of traditional execution at the cost of an average mean squared error (MSE) of 1.56.","PeriodicalId":272619,"journal":{"name":"ACM Great Lakes Symposium on VLSI","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131304069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reconfigurable STT-NV LUT-based functional units to improve performance in general-purpose processors","authors":"Adarsh Reddy Ashammagari, H. Mahmoodi, T. Mohsenin, H. Homayoun","doi":"10.1145/2591513.2591535","DOIUrl":"https://doi.org/10.1145/2591513.2591535","url":null,"abstract":"Unavailability of functional units is a major performance bottleneck in general-purpose processors (GPP). In a GPP with a limited number of functional units, one functional unit may be heavily utilized at times, creating a performance bottleneck, while the other functional units are under-utilized. We propose a novel idea for adapting functional units in the GPP architecture in order to overcome this challenge. For this purpose, a selected set of complex functional units that might be under-utilized, such as the multiplier and divider, are realized using a programmable look-up table-based fabric. This allows for run-time adaptation of functional units to improve performance. The programmable look-up tables are realized using magnetic tunnel junction (MTJ) based memories, which dissipate near-zero leakage and are CMOS compatible. We have applied this idea to a dual-issue architecture. The results show that, compared to a design with all-CMOS functional units, an average performance improvement of 18% is achieved for standard benchmarks. This comes with a 4.1% power increase in integer benchmarks and a 2.3% power decrease in floating point benchmarks, compared to a CMOS design.","PeriodicalId":272619,"journal":{"name":"ACM Great Lakes Symposium on VLSI","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115837651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Trade-off between energy and quality of service through dynamic operand truncation and fusion","authors":"Wenchao Qian, Robert Karam, S. Bhunia","doi":"10.1145/2591513.2591561","DOIUrl":"https://doi.org/10.1145/2591513.2591561","url":null,"abstract":"Energy efficiency has emerged as a major design concern for embedded and portable electronics. Conventional approaches typically impact performance and often require significant design-time modifications. In this paper, we propose a novel approach for improving energy efficiency through judicious fusion of operations. The proposed approach has two major distinctions: (1) the fusion is enabled by operand truncation, which allows representing multiple operations in a reasonably sized lookup table (LUT); and (2) it works for a large variety of functions. Most applications in the domain of digital signal processing (DSP) and graphics can tolerate some computation error without large degradation in output quality. Our approach improves energy efficiency with graceful degradation in quality. The proposed fusion approach can be applied to trade off energy efficiency against quality at run time and requires virtually no circuit- or architecture-level modifications in a processor. Using our software tool for automatic fusion and truncation, the effectiveness of the approach is studied for four common applications. Simulation results show promising improvements (19-90%) in energy-delay product with minimal impact on quality.","PeriodicalId":272619,"journal":{"name":"ACM Great Lakes Symposium on VLSI","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114866826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Forward-scaling, serially equivalent parallelism for FPGA placement","authors":"C. Fobel, G. Grewal, D. Stacey","doi":"10.1145/2591513.2591543","DOIUrl":"https://doi.org/10.1145/2591513.2591543","url":null,"abstract":"Placement run-times continue to dominate the FPGA design flow. Previous attempts at parallel placement methods either scale to only a few threads or result in a significant loss in solution quality as thread count is increased. We propose a novel method for generating large amounts of parallel work for placement, which scales with the size of the target architecture. Our experimental results show that we nearly reach the limit of the number of possible parallel swaps, while improving critical-path delay by 4.7% compared to VPR. While our proposed implementation currently utilizes a single thread, we still achieve speedups of 13.3x over VPR.","PeriodicalId":272619,"journal":{"name":"ACM Great Lakes Symposium on VLSI","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132961271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A current-mode CMOS/memristor hybrid implementation of an extreme learning machine","authors":"Cory E. Merkel, D. Kudithipudi","doi":"10.1145/2591513.2591572","DOIUrl":"https://doi.org/10.1145/2591513.2591572","url":null,"abstract":"In this work, we propose a current-mode CMOS/memristor hybrid implementation of an extreme learning machine (ELM) architecture. We present novel circuit designs for linear, sigmoid, and threshold neuronal activation functions, as well as memristor-based bipolar synaptic weighting. In addition, this work proposes a stochastic version of the least-mean-squares (LMS) training algorithm for adapting the weights between the ELM's hidden and output layers. We simulated our top-level ELM architecture using Cadence AMS Designer with 45 nm CMOS models and an empirical piecewise linear memristor model based on experimental data from an HfOx device. With 10 hidden node neurons, the ELM was able to learn a 2-input XOR function after 150 training epochs.","PeriodicalId":272619,"journal":{"name":"ACM Great Lakes Symposium on VLSI","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131008868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}