{"title":"Specialization of the Cell SPE for Media Applications","authors":"C. Meenderinck, B. Juurlink","doi":"10.1109/ASAP.2009.10","DOIUrl":"https://doi.org/10.1109/ASAP.2009.10","url":null,"abstract":"There is a clear trend towards multi-cores to meet the performance requirements of emerging and future applications. A different way to scale performance is, however, to specialize the cores for specific application domains. This option is especially attractive for low-cost embedded systems where less silicon area directly translates to less cost. We propose architectural enhancements to specialize the Cell SPE for video decoding. Specifically, based on deficiencies we observed in the H.264 kernels, we propose a handful of application-specific instructions to improve performance. The speedups achieved are between 1.84 and 2.37.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116949485","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Combined Decimal and Binary Floating-Point Multiplier","authors":"C. Tsen, S. González-Navarro, M. Schulte, Brian J. Hickmann, Katherine Compton","doi":"10.1109/ASAP.2009.28","DOIUrl":"https://doi.org/10.1109/ASAP.2009.28","url":null,"abstract":"In this paper, we describe the first hardware design of a combined binary and decimal floating-point multiplier, based on specifications in the IEEE 754-2008 Floating-point Standard. The multiplier design operates on either (1) 64-bit binary encoded decimal floating-point (DFP) numbers or (2) 64-bit binary floating-point (BFP) numbers. It returns properly rounded results for the rounding modes specified in IEEE 754-2008. The design shares the following hardware resources between the two floating-point datatypes: a 54-bit by 54-bit binary multiplier, portions of the operand encoding/decoding, a 54-bit right shifter, exponent calculation logic, and rounding logic. Our synthesis results show that hardware sharing is feasible and has a reasonable impact on area, latency, and delay. The combined BFP and DFP multiplier occupies only 58% of the total area that would be required by separate BFP and DFP units. Furthermore, the critical path delay of a combined multiplier has a negligible increase over a standalone DFP multiplier, without increasing the number of cycles to perform either BFP or DFP multiplication.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"372 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116057850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Input Triggered Polymorphic ASIC for H.264 Decoding","authors":"Adarsha Rao, M. Alle, V. Sainath, Reyaz Shaik, Rajashekhar Chowhan, S. Sankaraiah, Sravanthi Mantha, S. Nandy, R. Narayan","doi":"10.1109/ASAP.2009.7","DOIUrl":"https://doi.org/10.1109/ASAP.2009.7","url":null,"abstract":"This paper reports the design of an input-triggered polymorphic ASIC for an H.264 baseline decoder. Hardware polymorphism is achieved by selectively reusing hardware resources at the system and module levels. The complete design is done using ESL design tools, following a methodology that maintains consistency in testing and verification throughout the design flow. The proposed design can support frame sizes from QCIF to 1080p.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"259 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117097788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving VLIW Processor Performance Using Three-Dimensional (3D) DRAM Stacking","authors":"Yangyang Pan, Tong Zhang","doi":"10.1109/ASAP.2009.11","DOIUrl":"https://doi.org/10.1109/ASAP.2009.11","url":null,"abstract":"This work studies the potential of using emerging 3D integration to improve embedded VLIW computing systems. We focus on the 3D integration of one VLIW processor die with multiple high-capacity DRAM dies. Our proposed memory architecture employs 3D stacking technology to bond one die containing several processing clusters to multiple DRAM dies that serve as primary memory. The 3D technology also enables wide low-latency buses between clusters and memory and makes the latency of a 3D DRAM L2 cache comparable to that of a 2D SRAM L2 cache. This makes it possible to replace the 2D SRAM L2 cache with a 3D DRAM L2 cache. The die area of the 2D SRAM L2 cache can be re-allocated to additional clusters that improve the performance of the system. From the simulation results, we find that 3D stacked DRAM main memory can improve system performance by 10% to 80% over 2D off-chip DRAM main memory, depending on the benchmark. Also, for a similar logic die area, a four-cluster system with 3D DRAM L2 cache and 3D DRAM main memory outperforms a two-cluster system with 2D SRAM L2 cache and 3D DRAM main memory by about 10%.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"445 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127607844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Division Unit for Binary Integer Decimals","authors":"T. Lang, A. Nannarelli","doi":"10.1109/ASAP.2009.42","DOIUrl":"https://doi.org/10.1109/ASAP.2009.42","url":null,"abstract":"In this work, we present a radix-10 division unit that is based on the digit-recurrence algorithm and implements binary encodings (Binary Integer Decimal or BID) for significands. Recent decimal division designs are all based on the Binary Coded Decimal (BCD) encoding. We adapt the radix-10 digit-recurrence algorithm to BID representation and implement the division unit in standard cell technology. The implementation of the proposed BID division unit is compared to that of a BCD based unit implementing the same algorithm. The comparison shows that for normalized operands the BID unit has the same latency as the BCD unit and reduced area, but the normalization is more expensive when implemented in BID.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126537583","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalar Processing Overhead on SIMD-Only Architectures","authors":"A. Azevedo, B. Juurlink","doi":"10.1109/ASAP.2009.12","DOIUrl":"https://doi.org/10.1109/ASAP.2009.12","url":null,"abstract":"The Cell processor consists of a general-purpose core and eight cores with a complete SIMD instruction set. Although originally designed for multimedia and gaming, it is currently being used for a much broader range of applications. In this paper we evaluate whether the Cell SPEs could benefit significantly from a scalar processing unit, using two methodologies. In the first methodology the scalar processing overhead is eliminated by replacing all scalar data types by the quadword data type. This methodology is feasible only for relatively small kernels. In the second methodology SPE performance is compared to the performance of a similarly configured PPU, which supports scalar operations. Experimental results show that the scalar processing overhead ranges from 19% to 57% for small kernels and from 12% to 39% for large kernels. Solutions to eliminate this overhead are also discussed.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131757753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NeMo: A Platform for Neural Modelling of Spiking Neurons Using GPUs","authors":"A. Fidjeland, E. Roesch, M. Shanahan, W. Luk","doi":"10.1109/ASAP.2009.24","DOIUrl":"https://doi.org/10.1109/ASAP.2009.24","url":null,"abstract":"Simulating spiking neural networks is of great interest to scientists wanting to model the functioning of the brain. However, large-scale models are expensive to simulate due to the number and interconnectedness of neurons in the brain. Furthermore, where such simulations are used in an embodied setting, the simulation must be real-time in order to be useful. In this paper we present NeMo, a platform for such simulations which achieves high performance through the use of highly parallel commodity hardware in the form of graphics processing units (GPUs). NeMo makes use of the Izhikevich neuron model which provides a range of realistic spiking dynamics while being computationally efficient. Our GPU kernel can deliver up to 400 million spikes per second. This corresponds to a real-time simulation of around 40 000 neurons under biologically plausible conditions with 1000 synapses per neuron and a mean firing rate of 10 Hz.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116303386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Filtering Global History: Power and Performance Efficient Branch Predictor","authors":"R. Ayoub, A. Orailoglu","doi":"10.1109/ASAP.2009.26","DOIUrl":"https://doi.org/10.1109/ASAP.2009.26","url":null,"abstract":"In this paper we present an Application Customizable Branch Predictor, ACBP, that delivers efficiency in energy savings and performance without compromising prediction accuracy. The idea of our technique is to filter unnecessary global history information within the global history register to minimize the predictor size while maintaining prediction accuracy. We suggest in this work an efficient algorithm to capture the beneficial correlations. A cost-efficient and programmable hardware architecture is presented. Extensive experimental analysis confirms significant improvements in power savings and latency, ranging up to 84% and 30%, respectively.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125405647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Acceleration of Multiresolution Imaging Algorithms: A Comparative Study","authors":"Richard Membarth, Philipp Kutzer, H. Dutta, Frank Hannig, J. Teich","doi":"10.1109/ASAP.2009.8","DOIUrl":"https://doi.org/10.1109/ASAP.2009.8","url":null,"abstract":"In this paper we consider a multiresolution filter and its realization on the Cell BE and GPUs. We not only present common and specific optimization strategies undertaken for obtaining maximum performance on these architectures, but also how to obtain a speedup of 6.57x and 33.24x compared to an optimized OpenMP baseline implementation. Furthermore, we also undertake automated configuration space exploration of different partitioning possibilities for selection of best tiling parameters.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125565911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Prefix Ling Structures for Modulo 2^n-1 Addition","authors":"Jun Chen, J. Stine","doi":"10.1109/ASAP.2009.43","DOIUrl":"https://doi.org/10.1109/ASAP.2009.43","url":null,"abstract":"Parallel-prefix adders draw significant amounts of attention within general-purpose and application-specific architectures because of their logarithmic delay and efficient implementation in VLSI. This paper proposes a scheme to enhance parallel-prefix adders for modulo $2^n - 1$ addition by incorporating Ling equations into parallel-prefix structures. As opposed to previous research, this work clarifies the use of Ling equations for Modulo and provides enhancements to its implementation. Results are given in this work for a placed and routed design within a variation-aware 45nm technology. The implementation results show a significant improvement in delay and even a reduction in power dissipation.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122390631","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}