ACM Great Lakes Symposium on VLSI: Latest Publications

High level energy modeling of controller logic in data caches
ACM Great Lakes Symposium on VLSI. Pub Date: 2014-05-20. DOI: 10.1145/2591513.2591590
P. Panda, Sourav Roy, Srikanth Chandrasekaran, Namita Sharma, Jasleen Kaur, Sarath Kumar Kandalam, N. Nagaraj
Abstract: In modern embedded processor caches, a significant amount of energy dissipation occurs in the controller logic part of the cache. Previous power/energy modeling tools have focused on the core memory part of the cache. We propose energy models for two of these controller modules: the Write Buffer and the Replacement logic. Since this hardware is generally synthesized by designers, our power models are also based on empirical data. We found a linear dependence of the per-access write buffer energy on the write buffer depth and write width. We validated our models on several different benchmark examples, using different technology nodes. Our models generate energy estimates that are within 4.2% of those measured by detailed power simulations, making the models valuable mechanisms for rapid energy estimates during architecture exploration.
Citations: 3
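
The abstract reports a linear dependence of per-access write buffer energy on buffer depth and write width. The sketch below shows how such an empirical linear model could be fitted by least squares and then used for rapid estimates; the sample data and resulting coefficients are illustrative assumptions, not values from the paper.

import numpy as np

# Hypothetical (depth, width_bits, energy_pJ) samples, e.g. from gate-level
# power simulation of synthesized write buffers. Values are illustrative only.
samples = np.array([
    [4,  32, 1.9],
    [8,  32, 2.6],
    [16, 32, 4.1],
    [4,  64, 2.8],
    [8,  64, 3.7],
    [16, 64, 5.9],
])

# Fit E = a*depth + b*width + c by least squares (the linear form suggested
# by the abstract; the actual model and coefficients belong to the paper).
X = np.column_stack([samples[:, 0], samples[:, 1], np.ones(len(samples))])
(a, b, c), *_ = np.linalg.lstsq(X, samples[:, 2], rcond=None)

def write_buffer_energy_per_access(depth, width_bits):
    """Estimate per-access write buffer energy (pJ) from the fitted model."""
    return a * depth + b * width_bits + c

print(write_buffer_energy_per_access(8, 48))
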
Horizontal benchmark extension for improved assessment of physical CAD research
ACM Great Lakes Symposium on VLSI. Pub Date: 2014-05-20. DOI: 10.1145/2591513.2591540
A. Kahng, Hyein Lee, Jiajia Li
Abstract: The rapid growth in complexity and diversity of IC designs, design flows and methodologies has resulted in a benchmark-centric culture for evaluation of performance and scalability in physical-design algorithm research. Landmark papers in the literature present vertical benchmarks that can be used across multiple design flow stages; artificial benchmarks with characteristics that mimic those of real designs; artificial benchmarks with known optimal solutions; as well as benchmark suites created by major companies from internal designs and/or open-source RTL. However, to our knowledge, there has been no work on horizontal benchmark creation, i.e., the creation of benchmarks that enable maximal, comprehensive assessments across commercial and academic tools at one or more specific design stages. Typically, the creation of horizontal benchmarks is limited by mismatches in data models, netlist formats, technology files, library granularity, etc. across different tools, technologies, and benchmark suites. In this paper, we describe methodology and robust infrastructure for "horizontal benchmark extension" that permits maximal leverage of benchmark suites and technologies in "apples-to-apples" assessment of both industry and academic optimizers. We demonstrate horizontal benchmark extensions, and the assessments that are thus enabled, in two well-studied domains: place-and-route (four combinations of academic placers/routers, and two commercial P&R tools) and gate sizing (two academic sizers, and three commercial tools). We also point out several issues and precepts for horizontal benchmark enablement.
Citations: 24
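
The abstract attributes the difficulty of horizontal benchmarking to mismatches in data models, netlist formats, and library granularity across tools. As a toy illustration of one such translation step, the sketch below remaps instantiated cell names in a structural netlist from one hypothetical library naming convention to another; it is not the paper's infrastructure, and the cell names and formats are invented for the example.

import re

# Hypothetical mapping between two standard-cell libraries' naming conventions.
CELL_MAP = {
    "NAND2_X1": "nand2a_1x",
    "INV_X2":   "inv_2x",
    "DFF_X1":   "dff_pos_1x",
}

def remap_netlist_cells(netlist_text, cell_map):
    """Rewrite the cell (module) names of instantiations in a structural
    Verilog-style netlist so that another tool's library can resolve them."""
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, cell_map)) + r")\b")
    return pattern.sub(lambda m: cell_map[m.group(1)], netlist_text)

netlist = """
NAND2_X1 u1 (.A(a), .B(b), .ZN(n1));
INV_X2   u2 (.A(n1), .ZN(y));
"""
print(remap_netlist_cells(netlist, CELL_MAP))
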
A performance enhancing hybrid locally mesh globally star NoC topology
ACM Great Lakes Symposium on VLSI. Pub Date: 2014-05-20. DOI: 10.1145/2591513.2591544
T. S. Das, P. Ghosal, S. Mohanty, E. Kougianos
Abstract: With the rapid increase in chip density, the Network-on-Chip (NoC) is becoming the prevalent architecture for today's complex chip multiprocessor (CMP) based systems. One of the major challenges in NoC design is to build a scalable, communication-centric architecture that supports highly parallel on-chip communication. In this paper, a hybrid mesh-based star topology is proposed to provide low latency, high throughput, and more evenly distributed traffic throughout the network. Simulation results show that the proposed topology achieves up to a 62% latency benefit (for size 8x8) and throughput benefits of 55% (for size 8x8) and 42% (for size 12x12) over a mesh, with a small area overhead.
Citations: 3
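
One intuition for why a locally-mesh, globally-star organization can cut latency is that the star links shorten long-distance paths while local mesh links keep neighbor traffic cheap. The sketch below compares average hop counts in a toy graph model (8x8 nodes, 4x4 clusters, one corner node per cluster linked to a single global star hub); the clustering and hub choices are simplifying assumptions for illustration, not the paper's router architecture.

import networkx as nx

def mesh(n):
    """n x n 2D mesh NoC as a graph."""
    return nx.grid_2d_graph(n, n)

def hybrid_mesh_star(n, cluster):
    """Toy locally-mesh, globally-star model: mesh links are kept only inside
    each cluster x cluster tile, one corner node per tile acts as the cluster
    hub, and all hubs connect to a central star hub (assumed organization)."""
    g = nx.Graph()
    g.add_nodes_from((x, y) for x in range(n) for y in range(n))
    for x in range(n):
        for y in range(n):
            for dx, dy in ((1, 0), (0, 1)):
                x2, y2 = x + dx, y + dy
                # keep a mesh link only if both endpoints share a cluster
                if x2 < n and y2 < n and (x // cluster, y // cluster) == (x2 // cluster, y2 // cluster):
                    g.add_edge((x, y), (x2, y2))
    g.add_node("star_hub")
    for cx in range(0, n, cluster):
        for cy in range(0, n, cluster):
            g.add_edge((cx, cy), "star_hub")   # cluster hub to global star hub
    return g

for name, g in [("8x8 mesh", mesh(8)), ("8x8 hybrid", hybrid_mesh_star(8, 4))]:
    print(name, round(nx.average_shortest_path_length(g), 2))
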
A parallel and reconfigurable architecture for efficient OMP compressive sensing reconstruction
ACM Great Lakes Symposium on VLSI. Pub Date: 2014-05-20. DOI: 10.1145/2591513.2591598
A. Kulkarni, H. Homayoun, T. Mohsenin
Abstract: Compressive Sensing (CS) is a novel scheme in which a signal that is sparse in a known transform domain can be reconstructed using fewer samples. However, the signal reconstruction techniques are computationally intensive and power consuming, which makes them impractical for embedded applications. This work presents a parallel and reconfigurable architecture for the Orthogonal Matching Pursuit (OMP) algorithm, one of the most popular CS reconstruction algorithms. In this paper, we propose the first reconfigurable OMP CS reconstruction architecture, which can take different image sizes with sparsity up to 32. The aim is to minimize hardware complexity, area and power consumption, and improve the reconstruction latency while meeting the reconstruction accuracy. First, the accuracy of reconstructed images is analyzed for different sparsity values and fixed-point word length reductions. Next, efficient parallelization techniques are applied to reconstruct signals with varying signal lengths N. The OMP algorithm is divided into three main kernels, where each kernel is parallelized to reduce execution time, and efficient reuse of the matrix operators allows us to reduce area. The proposed architecture can reconstruct images of different sizes and measurements and is implemented on a Xilinx Virtex 7 FPGA. The results indicate that, for a 128x128 image reconstruction, the proposed reconfigurable architecture is 1.8x to 2.67x faster than previous non-reconfigurable work that is less complex and uses much smaller sparsity.
Citations: 27
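
OMP itself is a standard greedy sparse-recovery algorithm: repeatedly pick the dictionary column most correlated with the residual, re-fit the selected columns by least squares, and update the residual. A plain floating-point NumPy reference version is sketched below; it omits the paper's fixed-point word length reduction and three-kernel hardware partitioning.

import numpy as np

def omp(phi, y, sparsity):
    """Orthogonal Matching Pursuit: recover a sparsity-sparse x with y ~= phi @ x."""
    m, n = phi.shape
    residual = y.copy()
    support = []
    x_hat = np.zeros(n)
    for _ in range(sparsity):
        j = int(np.argmax(np.abs(phi.T @ residual)))   # best matching atom
        if j not in support:
            support.append(j)
        # re-fit all chosen atoms by least squares, then update the residual
        coeffs, *_ = np.linalg.lstsq(phi[:, support], y, rcond=None)
        residual = y - phi[:, support] @ coeffs
    x_hat[support] = coeffs
    return x_hat

# Tiny demo: random Gaussian measurements of a 5-sparse length-256 signal.
rng = np.random.default_rng(0)
n, m, k = 256, 96, 5
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
phi = rng.standard_normal((m, n)) / np.sqrt(m)
print(np.linalg.norm(omp(phi, phi @ x, k) - x))   # reconstruction error
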
A comparison of FinFET based FPGA LUT designs
ACM Great Lakes Symposium on VLSI. Pub Date: 2014-05-20. DOI: 10.1145/2591513.2591596
M. Abusultan, S. Khatri
Abstract: The FinFET device has gained much traction in recent VLSI designs. In the FinFET device, the conduction channel is vertical, unlike a traditional bulk MOSFET, in which the conduction channel is planar. This yields several benefits, and as a consequence, it is expected that most VLSI designs will utilize FinFETs from the 20nm node and beyond. Although several research papers have reported FinFET-based circuit and layout realizations of popular circuit blocks, there has been no reported work on the use of FinFETs for Field Programmable Gate Array (FPGA) designs. The key circuit in the FPGA that enables programmability is the n-input Look-Up Table (LUT). An n-input LUT can implement any logic function of up to n inputs. In this paper, we present an evaluation of several FPGA LUT designs. We compare these designs from a performance (delay, power, energy) as well as an area perspective. Comparisons are also conducted with respect to a bulk-based LUT. Our results demonstrate that all the FinFET-based LUTs exhibit better delay and energy than the bulk-based LUT. Based on our comparisons, we have two winning candidate LUTs, one for high performance designs (3x faster than a bulk-based LUT) and another for low-energy, area-constrained designs (83% energy and 58% area compared to a bulk-based LUT).
Citations: 6
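
The programmability claim, that an n-input LUT can implement any n-input logic function, holds because the LUT simply stores the function's full truth table (2^n configuration bits) and uses the inputs as the read address. A short behavioral sketch of that principle, independent of the FinFET-versus-bulk circuit details compared in the paper:

def make_lut(truth_table_bits):
    """Model an n-input LUT: truth_table_bits[i] is the output for the input
    combination whose binary encoding is i (this is the LUT's config memory)."""
    n = len(truth_table_bits).bit_length() - 1
    assert len(truth_table_bits) == 1 << n, "table must have 2**n entries"

    def lut(*inputs):
        assert len(inputs) == n
        addr = 0
        for bit in inputs:               # the inputs form the read address
            addr = (addr << 1) | (bit & 1)
        return truth_table_bits[addr]

    return lut

# Example: a 3-input LUT configured as a full-adder carry-out, cout = ab + a*cin + b*cin.
carry_lut = make_lut([0, 0, 0, 1, 0, 1, 1, 1])
print(carry_lut(1, 0, 1))  # -> 1
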
Neural network-based accelerators for transcendental function approximation
ACM Great Lakes Symposium on VLSI. Pub Date: 2014-05-20. DOI: 10.1145/2591513.2591534
Schuyler Eldridge, F. Raudies, D. Zou, A. Joshi
Abstract: The general-purpose approximate nature of neural network (NN) based accelerators has the potential to sustain the historic energy and performance improvements of computing systems. We propose the use of NN-based accelerators to approximate mathematical functions in the GNU C Library (glibc) that commonly occur in application benchmarks. Using our NN-based approach to approximate cos, exp, log, pow, and sin, we achieve an average energy-delay product (EDP) that is 68x lower than that of traditional glibc execution. In applications, our NN-based approach has an EDP that is 78% of that of traditional execution, at the cost of an average mean squared error (MSE) of 1.56.
Citations: 25
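
As a software-level illustration of the idea (a small fixed-topology network standing in for a math-library call), the sketch below trains a one-hidden-layer network to approximate sin on [-pi, pi] with plain NumPy gradient descent and reports the MSE. The network size, training setup, and the paper's hardware mapping are not reproduced here; all hyperparameters are assumptions.

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-np.pi, np.pi, 512).reshape(-1, 1)
y = np.sin(x)

# One hidden layer of 16 tanh units (size chosen arbitrarily for illustration).
H = 16
w1 = rng.standard_normal((1, H)); b1 = np.zeros(H)
w2 = rng.standard_normal((H, 1)); b2 = np.zeros(1)

lr = 0.05
for _ in range(5000):
    h = np.tanh(x @ w1 + b1)          # forward pass
    pred = h @ w2 + b2
    err = pred - y                    # gradient of 0.5*MSE w.r.t. pred
    grad_w2 = h.T @ err / len(x); grad_b2 = err.mean(0)
    dh = (err @ w2.T) * (1 - h**2)    # backprop through tanh
    grad_w1 = x.T @ dh / len(x); grad_b1 = dh.mean(0)
    w2 -= lr * grad_w2; b2 -= lr * grad_b2
    w1 -= lr * grad_w1; b1 -= lr * grad_b1

def nn_sin(v):
    """Approximate sin(v) with the trained network (stand-in for the accelerator)."""
    vv = np.asarray(v, dtype=float).reshape(-1, 1)
    return (np.tanh(vv @ w1 + b1) @ w2 + b2).ravel()

print(float(np.mean((nn_sin(x.ravel()) - np.sin(x.ravel()))**2)))  # MSE on the grid
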
Reconfigurable STT-NV LUT-based functional units to improve performance in general-purpose processors
ACM Great Lakes Symposium on VLSI. Pub Date: 2014-05-20. DOI: 10.1145/2591513.2591535
Adarsh Reddy Ashammagari, H. Mahmoodi, T. Mohsenin, H. Homayoun
Abstract: Unavailability of functional units is a major performance bottleneck in general-purpose processors (GPPs). In a GPP with a limited number of functional units, one functional unit may be heavily utilized at times, creating a performance bottleneck, while the other functional units remain under-utilized. We propose a novel idea for adapting functional units in the GPP architecture to overcome this challenge. For this purpose, a selected set of complex functional units that might be under-utilized, such as the multiplier and divider, is realized using a programmable look-up-table-based fabric. This allows run-time adaptation of functional units to improve performance. The programmable look-up tables are realized using magnetic tunnel junction (MTJ) based memories that dissipate near-zero leakage and are CMOS compatible. We have applied this idea to a dual-issue architecture. The results show that, compared to a design with all-CMOS functional units, a performance improvement of 18% on average is achieved for standard benchmarks. This comes with a 4.1% power increase in integer benchmarks and a 2.3% power decrease in floating-point benchmarks, compared to the CMOS design.
Citations: 5
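
The core idea, realizing an under-utilized complex unit as a programmable look-up-table fabric so it can be repurposed at run time, can be shown behaviorally: one table fabric is reloaded to serve either as a small multiplier or a small divider. The operand width and table organization below are arbitrary illustrative choices; the MTJ/STT-NV circuit realization is the paper's contribution and is not modeled here.

class LUTFunctionalUnit:
    """Behavioral model of a reconfigurable LUT-based functional unit:
    a single table indexed by the concatenated operands, reloadable at run time."""

    def __init__(self, width):
        self.width = width
        self.table = [0] * (1 << (2 * width))

    def program(self, op):
        """Reload the table with the truth table of a two-operand function."""
        for a in range(1 << self.width):
            for b in range(1 << self.width):
                self.table[(a << self.width) | b] = op(a, b)

    def execute(self, a, b):
        return self.table[(a << self.width) | b]

fu = LUTFunctionalUnit(width=4)                    # 4-bit operands (illustrative)
fu.program(lambda a, b: a * b)                     # act as a multiplier
print(fu.execute(7, 9))                            # -> 63
fu.program(lambda a, b: a // b if b else 0)        # repurpose as a divider
print(fu.execute(14, 3))                           # -> 4
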
Trade-off between energy and quality of service through dynamic operand truncation and fusion
ACM Great Lakes Symposium on VLSI. Pub Date: 2014-05-20. DOI: 10.1145/2591513.2591561
Wenchao Qian, Robert Karam, S. Bhunia
Abstract: Energy efficiency has emerged as a major design concern for embedded and portable electronics. Conventional approaches typically impact performance and often require significant design-time modifications. In this paper, we propose a novel approach for improving energy efficiency through judicious fusion of operations. The proposed approach has two major distinctions: (1) the fusion is enabled by operand truncation, which allows multiple operations to be represented in a reasonably sized lookup table (LUT); and (2) it works for a large variety of functions. Most applications in the domain of digital signal processing (DSP) and graphics can tolerate some computation error without large degradation in output quality. Our approach improves energy efficiency with graceful degradation in quality. The proposed fusion approach can be applied to trade off energy efficiency against quality at run time and requires virtually no circuit- or architecture-level modifications in a processor. Using our software tool for automatic fusion and truncation, the effectiveness of the approach is studied for four common applications. Simulation results show promising improvements (19-90%) in energy-delay product with minimal impact on quality.
Citations: 2
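
A minimal software sketch of the truncate-then-look-up idea: two chained operations are fused into one table addressed by the truncated operands, trading a bounded quality loss for a single lookup. The bit widths, the example fused function, and the error metric are illustrative assumptions rather than the paper's tool flow.

import numpy as np

IN_BITS, KEEP_BITS = 8, 4                  # 8-bit operands truncated to their top 4 bits
SHIFT = IN_BITS - KEEP_BITS

def fused_exact(a, b):
    """The fused operation used for this example: a multiply followed by adds."""
    return a * b + a + b

# Build the fused LUT over the truncated operand space (16 x 16 entries here).
lut = np.zeros((1 << KEEP_BITS, 1 << KEEP_BITS), dtype=np.int64)
for ta in range(1 << KEEP_BITS):
    for tb in range(1 << KEEP_BITS):
        # Represent each truncated code by a midpoint of the operand range it covers.
        a_mid = (ta << SHIFT) + (1 << (SHIFT - 1))
        b_mid = (tb << SHIFT) + (1 << (SHIFT - 1))
        lut[ta, tb] = fused_exact(a_mid, b_mid)

def fused_approx(a, b):
    return lut[a >> SHIFT, b >> SHIFT]     # one lookup replaces the fused operations

# Quality check over the full 8-bit operand space.
a, b = np.meshgrid(np.arange(256), np.arange(256))
exact = fused_exact(a, b)
approx = fused_approx(a, b)
rel_err = np.abs(approx - exact) / np.maximum(exact, 1)
print(f"mean relative error: {rel_err.mean():.3%}")
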
Forward-scaling, serially equivalent parallelism for FPGA placement
ACM Great Lakes Symposium on VLSI. Pub Date: 2014-05-20. DOI: 10.1145/2591513.2591543
C. Fobel, G. Grewal, D. Stacey
Abstract: Placement run-times continue to dominate the FPGA design flow. Previous attempts at parallel placement methods either scale to only a few threads or result in a significant loss in solution quality as the thread count is increased. We propose a novel method for generating large amounts of parallel work for placement, which scales with the size of the target architecture. Our experimental results show that we nearly reach the limit of the number of possible parallel swaps, while improving critical-path delay by 4.7% compared to VPR. While our proposed implementation currently utilizes a single thread, we still achieve speedups of 13.3x over VPR.
Citations: 3
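
The central question for parallel move-based placement is which candidate swaps can be applied concurrently while remaining serially equivalent. One common way to reason about this, sketched below as a toy, is to treat two proposed swaps as conflicting if they touch a common placement site and to greedily keep a set of non-conflicting swaps for parallel evaluation. This illustrates the general concept only; the conflict model and selection rule here are assumptions, not the paper's algorithm.

def independent_swaps(proposed_swaps):
    """Greedily select pairwise site swaps that touch disjoint sites, so applying
    them in parallel is equivalent to applying them in some serial order.
    Each proposed swap exchanges the contents of two placement sites."""
    chosen, used_sites = [], set()
    for site_a, site_b in proposed_swaps:
        if site_a in used_sites or site_b in used_sites:
            continue                      # conflicts with an already-chosen swap
        chosen.append((site_a, site_b))
        used_sites.update((site_a, site_b))
    return chosen

# Hypothetical candidate swaps between placement sites on a grid.
candidates = [
    ((0, 0), (2, 3)),
    ((1, 1), (2, 3)),   # conflicts: site (2, 3) is already part of a chosen swap
    ((4, 4), (5, 5)),
    ((0, 0), (6, 6)),   # conflicts: site (0, 0) is already part of a chosen swap
]
print(independent_swaps(candidates))   # -> the first and third swaps only
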
A current-mode CMOS/memristor hybrid implementation of an extreme learning machine
ACM Great Lakes Symposium on VLSI. Pub Date: 2014-05-20. DOI: 10.1145/2591513.2591572
Cory E. Merkel, D. Kudithipudi
Abstract: In this work, we propose a current-mode CMOS/memristor hybrid implementation of an extreme learning machine (ELM) architecture. We present novel circuit designs for linear, sigmoid, and threshold neuronal activation functions, as well as memristor-based bipolar synaptic weighting. In addition, this work proposes a stochastic version of the least-mean-squares (LMS) training algorithm for adapting the weights between the ELM's hidden and output layers. We simulated our top-level ELM architecture using Cadence AMS Designer with 45 nm CMOS models and an empirical piecewise linear memristor model based on experimental data from an HfOx device. With 10 hidden-node neurons, the ELM was able to learn a 2-input XOR function after 150 training epochs.
Citations: 19
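
A software-level sketch of the learning scheme the abstract describes: an ELM with a fixed random hidden layer whose output weights are adapted by an LMS-style (delta-rule) update, trained here on the 2-input XOR task with 10 hidden neurons and 150 epochs as in the abstract's experiment. The learning rate and activation details are assumptions, and the analog CMOS/memristor circuits are not modeled.

import numpy as np

rng = np.random.default_rng(7)

# 2-input XOR training set (inputs in {0,1}, targets in {0,1}).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 0.0])

# ELM: the input-to-hidden weights are random and stay fixed.
HIDDEN = 10
W_in = rng.standard_normal((2, HIDDEN))
b_in = rng.standard_normal(HIDDEN)

def hidden(x):
    return 1.0 / (1.0 + np.exp(-(x @ W_in + b_in)))   # sigmoid activations

# Only the hidden-to-output weights are trained, with a per-sample LMS update.
w_out = np.zeros(HIDDEN)
lr = 0.1
for epoch in range(150):
    for x, t in zip(X, T):
        h = hidden(x)
        y = h @ w_out
        w_out += lr * (t - y) * h       # LMS: step along the per-sample error gradient

print(np.round(hidden(X) @ w_out, 2))   # outputs should trend toward [0, 1, 1, 0]
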