2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors: Latest Publications

Specialization of the Cell SPE for Media Applications
C. Meenderinck, B. Juurlink
{"title":"Specialization of the Cell SPE for Media Applications","authors":"C. Meenderinck, B. Juurlink","doi":"10.1109/ASAP.2009.10","DOIUrl":"https://doi.org/10.1109/ASAP.2009.10","url":null,"abstract":"There is a clear trend towards multi-cores to meet the performance requirements of emerging and future applications. A different way to scale performance is, however, to specialize the cores for specific application domains. This option is especially attractive for low-cost embedded systems where less silicon area directly translates to less cost. We propose architectural enhancements to specialize the Cell SPE for video decoding. Specifically, based on deficiencies we observed in the H.264 kernels, we propose a handful of application-specific instructions to improve performance. The speedups achieved are between 1.84 and 2.37.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116949485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 7
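The abstract does not list the proposed instructions, so the sketch below is purely illustrative: a saturating pixel clip, a pattern that recurs throughout H.264 kernels and is the kind of multi-operation sequence an application-specific instruction can fuse into one operation. The function names are hypothetical and not taken from the paper.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustration only: a saturating clip to [0, 255], a pattern that recurs in
 * H.264 reconstruction, deblocking, and interpolation kernels. In plain C it
 * costs two compares per sample; a fused application-specific instruction of
 * the kind proposed for the Cell SPE can perform it in a single operation.
 * The function names are hypothetical and not from the paper. */
static inline uint8_t clip_pixel(int32_t x)
{
    if (x < 0)   return 0;
    if (x > 255) return 255;
    return (uint8_t)x;
}

/* Example: adding a decoded residual to a predicted sample. */
static inline uint8_t reconstruct_sample(uint8_t pred, int16_t residual)
{
    return clip_pixel((int32_t)pred + residual);
}

int main(void)
{
    printf("%d %d\n", reconstruct_sample(200, 100), reconstruct_sample(10, -50)); /* 255 0 */
    return 0;
}
```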
A Combined Decimal and Binary Floating-Point Multiplier
C. Tsen, S. González-Navarro, M. Schulte, Brian J. Hickmann, Katherine Compton
{"title":"A Combined Decimal and Binary Floating-Point Multiplier","authors":"C. Tsen, S. González-Navarro, M. Schulte, Brian J. Hickmann, Katherine Compton","doi":"10.1109/ASAP.2009.28","DOIUrl":"https://doi.org/10.1109/ASAP.2009.28","url":null,"abstract":"In this paper, we describe the first hardware design of a combined binary and decimal floating-point multiplier, based on specifications in the IEEE 754-2008 Floating-point Standard. The multiplier design operates on either (1) 64-bit binary encoded decimal floating-point (DFP) numbers or (2) 64-bit binary floating-point (BFP) numbers. It returns properly rounded results for the rounding modes specified in IEEE 754-2008. The design shares the following hardware resources between the two floating-point datatypes: a 54-bit by 54-bit binary multiplier, portions of the operand encoding/decoding, a 54-bit right shifter, exponent calculation logic, and rounding logic. Our synthesis results show that hardware sharing is feasible and has a reasonable impact on area, latency, and delay. The combined BFP and DFP multiplier occupies only 58% of the total area that would be required by separate BFP and DFP units. Furthermore, the critical path delay of a combined multiplier has a negligible increase over a standalone DFP multiplier, without increasing the number of cycles to perform either BFP or DFP multiplication.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"372 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116057850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 30
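The shared-datapath idea in the abstract above can be sketched in software: a BID (binary integer decimal) significand is an ordinary binary integer, so the binary multiplier produces the decimal product, and only the rounding step is decimal. The sketch below is illustrative only; it assumes GCC/Clang's unsigned __int128, uses round-to-nearest-even, and ignores special values and exponent-range handling.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the shared-datapath idea: BID stores the decimal significand as
 * a plain binary integer, so the 54x54 binary multiplier can form the
 * product; only the rounding step is decimal. Assumes GCC/Clang __int128;
 * special values and exponent limits are ignored. */
typedef struct { uint64_t sig; int exp; } dec64_t;   /* value = sig * 10^exp */

static dec64_t bid_mul(dec64_t a, dec64_t b)
{
    const uint64_t TEN16 = 10000000000000000ULL;              /* 10^16 */
    unsigned __int128 p = (unsigned __int128)a.sig * b.sig;   /* binary multiply */
    int exp = a.exp + b.exp;

    /* Decimal rounding: drop trailing decimal digits until at most 16
     * remain, then round to nearest, ties to even (the IEEE 754-2008 default). */
    unsigned __int128 div = 1;
    while (p / div >= TEN16) { div *= 10; exp++; }
    unsigned __int128 q = p / div, r = p % div;
    if (div > 1) {
        unsigned __int128 half = div / 2;
        if (r > half || (r == half && (q & 1)))
            q++;
        if (q == TEN16) { q /= 10; exp++; }   /* rounding carried into a 17th digit */
    }
    return (dec64_t){ (uint64_t)q, exp };
}

int main(void)
{
    dec64_t x = { 1230000000000000ULL, -15 };  /* 1.23 */
    dec64_t y = { 4560000000000000ULL, -15 };  /* 4.56 */
    dec64_t z = bid_mul(x, y);
    printf("%llu e%d\n", (unsigned long long)z.sig, z.exp);   /* 5608800000000000 e-15 */
    return 0;
}
```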
An Input Triggered Polymorphic ASIC for H.264 Decoding
Adarsha Rao, M. Alle, V. Sainath, Reyaz Shaik, Rajashekhar Chowhan, S. Sankaraiah, Sravanthi Mantha, S. Nandy, R. Narayan
{"title":"An Input Triggered Polymorphic ASIC for H.264 Decoding","authors":"Adarsha Rao, M. Alle, V. Sainath, Reyaz Shaik, Rajashekhar Chowhan, S. Sankaraiah, Sravanthi Mantha, S. Nandy, R. Narayan","doi":"10.1109/ASAP.2009.7","DOIUrl":"https://doi.org/10.1109/ASAP.2009.7","url":null,"abstract":"This paper reports the design of an input--triggered polymorphic ASIC for H.264 baseline decoder.Hardware polymorphism is achieved by selectively reusing hardware resources at system and module level. Complete design is done using ESL design tools following a methodology that maintains consistency in testing and verification throughout the design flow. The proposed design can support frame sizes from QCIF to 1080p.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"259 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117097788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
Improving VLIW Processor Performance Using Three-Dimensional (3D) DRAM Stacking
Yangyang Pan, Tong Zhang
{"title":"Improving VLIW Processor Performance Using Three-Dimensional (3D) DRAM Stacking","authors":"Yangyang Pan, Tong Zhang","doi":"10.1109/ASAP.2009.11","DOIUrl":"https://doi.org/10.1109/ASAP.2009.11","url":null,"abstract":"This work studies the potential of using emerging 3D integration to improve embedded VLIW computing system. We focus on the 3D integration of one VLIW processor die with multiple high-capacity DRAM dies. Our proposed memory architecture employs 3D stacking technology to bond one die containing several processing clusters to multiple DRAM dies for a primary memory. The 3D technology also enables wide low-latency buses between clusters and memory and enable the latency of 3D DRAM L2 cache comparable to 2D SRAM L2 cache. These enable it to replace the 2D SRAM L2 cache with 3D DRAM L2 cache. The die area for 2D SRAM L2 cache can be re-allocated to additional clusters that can improve the performance of the system. From the simulation results, we find 3D stacking DRAM main memory can improve the system performance by 10%~80% than 2D off-chip DRAM main memory depending on different benchmarks. Also, for a similar logic die area, a four clusters system with 3D DRAM L2 cache and 3D DRAM main memory outperforms a two clusters system with 2D SRAM L2 cache and 3D DRAM main memory by about 10%.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"445 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127607844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 11
Division Unit for Binary Integer Decimals
T. Lang, A. Nannarelli
{"title":"Division Unit for Binary Integer Decimals","authors":"T. Lang, A. Nannarelli","doi":"10.1109/ASAP.2009.42","DOIUrl":"https://doi.org/10.1109/ASAP.2009.42","url":null,"abstract":"In this work, we present a radix-10 division unit that is based on the digit-recurrence algorithm and implements binary encodings (Binary Integer Decimal or BID) for significands. Recent decimal division designs are all based on the Binary Coded Decimal (BCD) encoding. We adapt the radix-10 digit-recurrence algorithm to BID representation and implement the division unit in standard cell technology. The implementation of the proposed BID division unit is compared to that of a BCD based unit implementing the same algorithm. The comparison shows that for normalized operands the BID unit has the same latency as the BCD unit and reduced area, but the normalization is more expensive when implemented in BID.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126537583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 11
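The digit-recurrence scheme described above can be illustrated functionally: the residual and divisor stay binary (BID), one decimal quotient digit is produced per iteration, and the x10 scaling of the residual is a binary shift-add. The sketch below uses restoring selection by repeated subtraction for clarity; the actual unit uses a redundant digit set and a selection function, and handles rounding, all of which are omitted here.

```c
#include <stdint.h>
#include <stdio.h>

/* Functional sketch of a radix-10 digit recurrence on BID operands: the
 * residual and divisor remain binary integers, one decimal quotient digit
 * is produced per iteration, and the x10 scaling is done in binary as
 * w*10 = (w<<3) + (w<<1). Assumes 0 < x < 10*d (normalized operands, as in
 * the paper); restoring digit selection is used only for readability. */
static uint64_t bid_div_digits(uint64_t x, uint64_t d, int ndigits, int *exp)
{
    uint64_t q = 0, w = x;
    *exp = 0;
    /* Scale up so the first iteration yields a nonzero leading digit. */
    while (w < d) { w = (w << 3) + (w << 1); (*exp)--; }
    for (int i = 0; i < ndigits; i++) {
        unsigned digit = 0;
        while (w >= d) { w -= d; digit++; }   /* digit in 0..9 */
        q = q * 10 + digit;
        w = (w << 3) + (w << 1);              /* residual *= 10, in binary */
    }
    *exp -= ndigits - 1;
    return q;                                 /* q * 10^exp ~= x / d */
}

int main(void)
{
    int exp;
    uint64_t q = bid_div_digits(1ULL, 7ULL, 16, &exp);  /* 1/7 to 16 digits */
    printf("%llu e%d\n", (unsigned long long)q, exp);   /* 1428571428571428 e-16 */
    return 0;
}
```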
Scalar Processing Overhead on SIMD-Only Architectures
A. Azevedo, B. Juurlink
{"title":"Scalar Processing Overhead on SIMD-Only Architectures","authors":"A. Azevedo, B. Juurlink","doi":"10.1109/ASAP.2009.12","DOIUrl":"https://doi.org/10.1109/ASAP.2009.12","url":null,"abstract":"The Cell processor consists of a general-purpose core and eight cores with a complete SIMD instruction set. Although originally designed for multimedia and gaming, it is currently being used for a much broader range of applications.In this paper we evaluate if the Cell SPEs could benefit significantly from a scalar processing unit using two methodologies. In the first methodology the scalar processing overhead is eliminated by replacing all scalar data types by the quadword data type. This methodology is feasible only for relatively small kernels. In the second methodology SPE performance is compared to the performance of a similarly configured PPU, which supports scalar operations. Experimental results show that the scalar processing overhead ranges from 19% to 57% for small kernels and from 12% to 39% for large kernels. Solutions to eliminate this overhead are also discussed.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131757753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 9
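The first methodology above (promoting scalar variables to quadwords) can be illustrated in portable C: on a quadword-only memory interface, a scalar store becomes a read-modify-write of the containing 16-byte quadword plus lane insertion, overhead that disappears once the data is kept in quadword form throughout. The types and functions below are illustrative stand-ins, not SPE intrinsics.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Illustration (not SPE intrinsics) of why scalar code is expensive on a
 * SIMD-only core: memory is accessed in aligned 16-byte quadwords, so a
 * scalar store becomes load-quadword, insert-into-lane, store-quadword. */
typedef struct { int32_t lane[4]; } quadword;

static void scalar_store(quadword *mem, size_t idx, int32_t value)
{
    quadword q = mem[idx / 4];        /* load the whole containing quadword */
    q.lane[idx % 4] = value;          /* insert into the right lane */
    mem[idx / 4] = q;                 /* write the whole quadword back */
}

/* Methodology 1 from the abstract: keep the data as quadwords throughout,
 * so the per-element read-modify-write disappears. */
static void quadword_store(quadword *mem, size_t qidx, quadword value)
{
    mem[qidx] = value;
}

int main(void)
{
    quadword buf[2];
    memset(buf, 0, sizeof buf);
    scalar_store(buf, 5, 42);                        /* touches all of buf[1] */
    quadword_store(buf, 0, (quadword){{1, 2, 3, 4}});
    printf("%d %d\n", (int)buf[0].lane[2], (int)buf[1].lane[1]);  /* 3 42 */
    return 0;
}
```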
NeMo: A Platform for Neural Modelling of Spiking Neurons Using GPUs
A. Fidjeland, E. Roesch, M. Shanahan, W. Luk
{"title":"NeMo: A Platform for Neural Modelling of Spiking Neurons Using GPUs","authors":"A. Fidjeland, E. Roesch, M. Shanahan, W. Luk","doi":"10.1109/ASAP.2009.24","DOIUrl":"https://doi.org/10.1109/ASAP.2009.24","url":null,"abstract":"Simulating spiking neural networks is of great interest to scientists wanting to model the functioning of the brain. However, large-scale models are expensive to simulate due to the number and interconnectedness of neurons in the brain. Furthermore, where such simulations are used in an embodied setting, the simulation must be real-time in order to be useful. In this paper we present NeMo, a platform for such simulations which achieves high performance through the use of highly parallel commodity hardware in the form of graphics processing units (GPUs). NeMo makes use of the Izhikevich neuron model which provides a range of realistic spiking dynamics while being computationally efficient. Our GPU kernel can deliver up to 400 million spikes per second. This corresponds to a real-time simulation of around 40 000 neurons under biologically plausible conditions with 1000 synapses per neuron and a mean firing rate of 10 Hz.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116303386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 105
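NeMo's per-neuron work is the Izhikevich update, which is compact enough to show directly; the GPU kernel applies it to every neuron in parallel at each 1 ms step. The sketch below uses Izhikevich's standard regular-spiking parameters and the customary two 0.5 ms half-steps for the membrane potential; the input current is a constant stand-in for the accumulated synaptic input.

```c
#include <stdio.h>

/* One Izhikevich neuron stepped at 1 ms resolution; a GPU simulator such as
 * NeMo performs this update (plus synaptic current accumulation) for every
 * neuron in parallel. Parameters are the regular-spiking values. */
typedef struct { float v, u, a, b, c, d; } neuron;

static int izhikevich_step(neuron *n, float I)   /* returns 1 if the neuron fired */
{
    /* Two 0.5 ms half-steps for v improve numerical stability,
     * as in Izhikevich's reference implementation. */
    n->v += 0.5f * (0.04f * n->v * n->v + 5.0f * n->v + 140.0f - n->u + I);
    n->v += 0.5f * (0.04f * n->v * n->v + 5.0f * n->v + 140.0f - n->u + I);
    n->u += n->a * (n->b * n->v - n->u);
    if (n->v >= 30.0f) {                 /* spike: reset membrane state */
        n->v = n->c;
        n->u += n->d;
        return 1;
    }
    return 0;
}

int main(void)
{
    neuron n = { -65.0f, -13.0f, 0.02f, 0.2f, -65.0f, 8.0f };
    int spikes = 0;
    for (int t = 0; t < 1000; t++)       /* 1 s with constant input current */
        spikes += izhikevich_step(&n, 10.0f);
    printf("spikes in 1 s: %d\n", spikes);
    return 0;
}
```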
Filtering Global History: Power and Performance Efficient Branch Predictor
R. Ayoub, A. Orailoglu
{"title":"Filtering Global History: Power and Performance Efficient Branch Predictor","authors":"R. Ayoub, A. Orailoglu","doi":"10.1109/ASAP.2009.26","DOIUrl":"https://doi.org/10.1109/ASAP.2009.26","url":null,"abstract":"In this paper we present an Application Customizable Branch Predictor, ACBP, that delivers efficiency in energy savings and performance without compromising prediction accuracy. The idea of our technique is to filter unnecessary global history information within the global history register to minimize the predictor size while maintaining prediction accuracy. We suggest in this work an efficient algorithm to capture the beneficial correlations. A cost-efficient and programmable hardware architecture is presented. Extensive experimental analysis confirms significant improvements in power savings and latency, ranging up to 84% and 30%,respectively.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125405647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
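One way to read the filtering idea is as a programmable mask that keeps only the history bits found to correlate with branch outcomes before indexing a small table of 2-bit counters, letting the predictor shrink without losing accuracy. The sketch below is that reading, an illustrative reconstruction rather than the ACBP hardware described in the paper.

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative model of history filtering: a programmable mask selects the
 * few global-history bits that correlate with branch outcomes; only those
 * compressed bits (XORed with PC bits) index a small table of 2-bit
 * saturating counters. This is a reconstruction of the idea, not the ACBP
 * hardware described in the paper. */
#define TABLE_BITS 10
static uint8_t pht[1 << TABLE_BITS];   /* 2-bit saturating counters */
static uint32_t ghr;                   /* global history register */

static uint32_t filter_history(uint32_t history, uint32_t keep_mask)
{
    uint32_t out = 0;
    int pos = 0;
    for (int i = 0; i < 32; i++)       /* compact the selected history bits */
        if (keep_mask & (1u << i))
            out |= ((history >> i) & 1u) << pos++;
    return out;
}

static int predict(uint32_t pc, uint32_t keep_mask)
{
    uint32_t idx = (filter_history(ghr, keep_mask) ^ pc) & ((1u << TABLE_BITS) - 1);
    return pht[idx] >= 2;              /* predict taken if counter is 2 or 3 */
}

static void update(uint32_t pc, uint32_t keep_mask, int taken)
{
    uint32_t idx = (filter_history(ghr, keep_mask) ^ pc) & ((1u << TABLE_BITS) - 1);
    if (taken && pht[idx] < 3) pht[idx]++;
    if (!taken && pht[idx] > 0) pht[idx]--;
    ghr = (ghr << 1) | (taken & 1u);   /* shift the outcome into the history */
}

int main(void)
{
    uint32_t mask = 0x0000000Fu;       /* keep only the 4 newest history bits */
    for (int i = 0; i < 100; i++) update(0x400123u, mask, i % 2);
    printf("prediction: %d\n", predict(0x400123u, mask));
    return 0;
}
```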
Acceleration of Multiresolution Imaging Algorithms: A Comparative Study
Richard Membarth, Philipp Kutzer, H. Dutta, Frank Hannig, J. Teich
{"title":"Acceleration of Multiresolution Imaging Algorithms: A Comparative Study","authors":"Richard Membarth, Philipp Kutzer, H. Dutta, Frank Hannig, J. Teich","doi":"10.1109/ASAP.2009.8","DOIUrl":"https://doi.org/10.1109/ASAP.2009.8","url":null,"abstract":"In this paper we consider a multiresolution filter and its realization on the Cell BE and GPUs. We not only present common and specific optimization strategies undertaken for obtaining maximum performance on these architectures, but also how to obtain a speedup of 6.57x and 33.24x compared to an optimized OpenMP baseline implementation. Furthermore, we also undertake automated configuration space exploration of different partitioning possibilities for selection of best tiling parameters.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125565911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 3
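The automated configuration-space exploration mentioned above amounts to enumerating candidate tile shapes, scoring each, and keeping the best. The sketch below uses a placeholder cost model in place of a real benchmark run of the multiresolution filter; the candidate sets and the benchmark_tiling function are hypothetical, not the authors' framework.

```c
#include <stdio.h>
#include <float.h>

/* Sketch of tiling-parameter exploration: try each candidate tile shape,
 * score it, and keep the fastest. The cost function is a stand-in for a
 * real benchmark run on the target (Cell BE or GPU). */
typedef struct { int tile_w, tile_h; } tiling;

static double benchmark_tiling(int tile_w, int tile_h)
{
    /* Placeholder cost model: favor wide tiles (contiguous accesses) whose
     * working set still fits in 64 KiB for 4-byte pixels. */
    double footprint = (double)tile_w * tile_h * 4.0;
    double penalty = footprint > 65536.0 ? footprint / 65536.0 : 1.0;
    return penalty * (1.0 + 64.0 / tile_w);
}

int main(void)
{
    static const int widths[]  = { 32, 64, 128, 256, 512 };
    static const int heights[] = { 4, 8, 16, 32, 64 };
    tiling best = { 0, 0 };
    double best_cost = DBL_MAX;

    for (unsigned i = 0; i < sizeof widths / sizeof widths[0]; i++)
        for (unsigned j = 0; j < sizeof heights / sizeof heights[0]; j++) {
            double cost = benchmark_tiling(widths[i], heights[j]);
            if (cost < best_cost) {
                best_cost = cost;
                best = (tiling){ widths[i], heights[j] };
            }
        }
    printf("best tile: %dx%d\n", best.tile_w, best.tile_h);
    return 0;
}
```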
Parallel Prefix Ling Structures for Modulo 2^n-1 Addition
Jun Chen, J. Stine
{"title":"Parallel Prefix Ling Structures for Modulo 2^n-1 Addition","authors":"Jun Chen, J. Stine","doi":"10.1109/ASAP.2009.43","DOIUrl":"https://doi.org/10.1109/ASAP.2009.43","url":null,"abstract":"Parallel-prefix adders draw significant amounts of attention within general-purpose and application-specific architectures because of their logarithmic delay and efficient implementation in VLSI. This paper proposes a scheme to enhance parallel-prefix adders for modulo $2^n - 1$ addition by incorporating Ling equations into parallel-prefix structures. As opposed to previous research, this work clarifies the use of Ling equations for Modulo and provides enhancements to its implementation. Results are given in this work for a placed and routed design within a variation-aware 45nm technology. The implementation results show a significant improvement in delay and even a reduction in power dissipation.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122390631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 15
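Functionally, a modulo 2^n - 1 adder computes an end-around-carry sum; the Ling equations only restructure the carry-prefix network that produces it. The sketch below is that functional model, not the proposed parallel-prefix hardware.

```c
#include <stdint.h>
#include <stdio.h>

/* Functional model of modulo 2^n - 1 addition (n <= 63): add, then fold the
 * carry-out back in ("end-around carry"). A parallel-prefix Ling adder, as
 * proposed in the paper, computes exactly this function; the Ling transform
 * only restructures the carry network to shorten the critical path. */
static uint64_t add_mod_2n_minus_1(uint64_t a, uint64_t b, unsigned n)
{
    uint64_t mask = (1ULL << n) - 1;   /* modulus m = 2^n - 1 */
    uint64_t s = (a & mask) + (b & mask);
    s = (s & mask) + (s >> n);         /* end-around carry */
    if (s == mask)                     /* 2^n - 1 is congruent to 0 mod 2^n - 1 */
        s = 0;
    return s;
}

int main(void)
{
    unsigned n = 8;                    /* modulo 255 */
    printf("%llu\n", (unsigned long long)add_mod_2n_minus_1(200, 100, n)); /* 45 */
    printf("%llu\n", (unsigned long long)add_mod_2n_minus_1(254, 1, n));   /* 0  */
    return 0;
}
```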