Daniel Llorente, Kimon Karras, Thomas Wild, A. Herkersdorf
{"title":"Buffer allocation for advanced packet segmentation in Network Processors","authors":"Daniel Llorente, Kimon Karras, Thomas Wild, A. Herkersdorf","doi":"10.1109/ASAP.2008.4580182","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580182","url":null,"abstract":"In current network processors, incoming variable-length packets are sliced using only one small segment size and then stored in the buffer. Inconveniently, short data bursts are inadequate for accessing SDRAM, commonly used for packet buffers, due to high activation and pre-charging latencies. Using large segment sizes is not optimal either because though it increases memory bandwidth, the benefit comes at the price of a heavy reduction in storing efficiency. A good solution to achieve simultaneously high performance and memory utilization consists in storing a single packet segmented using multiple segment sizes. In this paper, we study how to allocate memory for these different-sized segments in an efficient way. First we analyze the appropriate segment pool size for a multitude of traffic scenarios. Our experiments show that simple static buffer allocation does not always suffice as different segment pools may be exhausted depending on traffic. Hence we introduce a method for handling multiple segment pools not only in a static but also in a dynamic way, taking advantage of a new set of control structures based on a combination of bitmaps and linked lists. We demonstrate that our method achieves a huge reduction in control buffer size requirements in comparison to state-of-the-art control structures, together with decreasing the average number of accesses to control data.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134252833","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
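The abstract names the ingredients (several fixed-size segment pools, bitmaps for free space, linked lists chaining a packet's segments) but not the allocation policy. The Python sketch below shows one way those pieces could fit together. The segment sizes, the greedy largest-first split, the lowest-free-slot choice, and the names `BitmapPool` and `segment_packet` are all hypothetical and not taken from the paper; a Python list stands in for the linked list of segments.

```python
from typing import List, Optional, Tuple


class BitmapPool:
    """Hypothetical fixed-size segment pool; free slots tracked in a bitmap (bit set = free)."""

    def __init__(self, seg_size: int, num_segs: int):
        self.seg_size = seg_size
        self.free_bits = (1 << num_segs) - 1  # every segment starts out free

    def alloc(self) -> Optional[int]:
        if self.free_bits == 0:
            return None  # pool exhausted
        # Isolate the lowest set bit and turn it into a slot index.
        idx = (self.free_bits & -self.free_bits).bit_length() - 1
        self.free_bits &= ~(1 << idx)
        return idx

    def free(self, idx: int) -> None:
        self.free_bits |= 1 << idx


def segment_packet(length: int, pools: List[BitmapPool]) -> Optional[List[Tuple[int, int]]]:
    """Greedy split (an assumption of this sketch): fill with the largest segments that
    fit completely, then cover the tail with segments of the smallest size.
    `pools` must be sorted by descending seg_size. Returns the ordered chain of
    (pool_index, slot) pairs, or None if a pool runs dry (allocation is rolled back)."""
    chain: List[Tuple[int, int]] = []
    remaining = length
    for p_idx, pool in enumerate(pools):
        is_smallest = p_idx == len(pools) - 1
        while remaining >= pool.seg_size or (is_smallest and remaining > 0):
            slot = pool.alloc()
            if slot is None:
                for pi, s in chain:  # undo partial allocation
                    pools[pi].free(s)
                return None
            chain.append((p_idx, slot))
            remaining -= pool.seg_size
    return chain
```

With pools of 512-byte and 64-byte segments, a 600-byte packet would take one large and two small segments. The paper's dynamic handling of exhausted pools is not modeled here; this sketch simply fails and rolls back.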
N. Brisebarre, S. Chevillard, M. Ercegovac, J. Muller, S. Torres
{"title":"An efficient method for evaluating polynomial and rational function approximations","authors":"N. Brisebarre, S. Chevillard, M. Ercegovac, J. Muller, S. Torres","doi":"10.1109/ASAP.2008.4580185","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580185","url":null,"abstract":"In this paper we extend the domain of applicability of the E-method [7, 8], as a hardware-oriented method for evaluating elementary functions using polynomial and rational function approximations. The polynomials and rational functions are computed by solving a system of linear equations using digit-serial iterations on simple and highly regular hardware. For convergence, these systems must be diagonally dominant. The E-method offers an efficient way for the fixed-point evaluation of polynomials and rational functions if their coefficients conform to the diagonal dominance condition. Until now, there was no systematic approach to obtain good approximations to f over an interval [a, b] by rational functions satisfying the constraints required by the E-method. In this paper, we present such an approach which is based on linear programming and lattice basis reduction. We also discuss a design and performance characteristics of a corresponding implementation.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"881 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132900394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
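For context, the E-method evaluates R(x) = P(x)/Q(x), with Q normalized so that q_0 = 1, by solving a linear system whose first row is y_0 - x·y_1 = p_0, whose middle rows are q_i·y_0 + y_i - x·y_{i+1} = p_i, and whose last row is q_n·y_0 + y_n = p_n; the first component y_0 then equals R(x). The Python sketch below simulates a radix-2 digit-serial iteration on that system. The digit set {-1, 0, 1}, the selection thresholds, and the sufficient conditions assumed here (|p_i| ≤ 3/4, and in every row the off-diagonal magnitudes such as |x| + |q_i| sum to at most 1/4) are simplifications chosen for this sketch, not the bounds from [7, 8] or from the paper, whose contribution is finding approximations that satisfy such constraints. Under these assumptions the residual stays within ±3/4, so after m iterations y_0 is within about 2^-m of R(x).

```python
from typing import List


def e_method(p: List[float], q: List[float], x: float, iterations: int = 32) -> float:
    """Approximate P(x)/Q(x) digit-serially (radix 2, digits in {-1, 0, 1}).

    p = [p0, ..., pn]; q = [q1, ..., qn] (q0 = 1 is implicit).
    Invariant: w == 2**j * (p - A @ y) after j iterations, where A is the E-method matrix.
    """
    n = len(p) - 1
    assert len(q) == n
    w = list(p)              # residual w^(0) = p
    y = [0.0] * (n + 1)      # accumulated solution vector
    for j in range(1, iterations + 1):
        v = [2.0 * wi for wi in w]  # radix-2 shift of the residual
        # Select one digit per component from its own shifted residual.
        d = [1 if vi > 0.5 else (-1 if vi < -0.5 else 0) for vi in v]
        for i in range(n + 1):
            y[i] += d[i] * 2.0 ** -j
        # w^(j) = 2*w^(j-1) - A d, with (A d)_i = d_i + q_i*d_0 - x*d_{i+1}
        new_w = []
        for i in range(n + 1):
            wi = v[i] - d[i]
            if i >= 1:
                wi -= q[i - 1] * d[0]
            if i < n:
                wi += x * d[i + 1]
            new_w.append(wi)
        w = new_w
    return y[0]
```

In hardware each component would be handled by a simple digit-serial unit working in parallel with the others; the loop over i here only simulates that.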
{"title":"FPGA-based hardware accelerator of the heat equation with applications on infrared thermography","authors":"F. Pardo, Paula López Martínez, D. Cabello","doi":"10.1109/ASAP.2008.4580175","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580175","url":null,"abstract":"Modelling of physical phenomena often involves the use of complex systems of equations whose computational solution has demanding requirements in terms of memory and computing power. Among the different techniques proposed, the Finite-Difference Time-Domain (FD-TD) method has the advantage of a feasible hardware implementation that can significantly speed up the computations. This technique is widely used for the solution of partial differential equations in a variety of areas such as antennas design, medical studies, circuit packaging and non-destructive evaluation. In this paper, we present a hardware accelerator of a 3D FD-TD heat equation solver that constitutes the basis of a thermal model of the soil for the non-destructive evaluation of minefields using infrared thermography techniques. In order to be able to work on the field during mine removal activities, a portable and computationally efficient system must be achieved. To this aim, we projected the 3D FD-TD model of the soil onto an FPGA platform using Handel-C and VHDL. A speedup factor of 34 over a single precision PC (C++) is achieved.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"128 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124318388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
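An FD-TD heat solver advances the temperature field with an explicit finite-difference stencil. The Python sketch below performs one such time step, assuming a homogeneous medium (a single diffusivity), uniform grid spacing, and fixed-temperature (Dirichlet) boundaries; the paper's soil model, its boundary conditions, and its Handel-C/VHDL pipeline are not reproduced. The hypothetical parameter `r` stands for alpha·dt/dx², and this explicit 3D scheme is stable for r ≤ 1/6.

```python
from typing import List

Grid = List[List[List[float]]]  # indexed as T[z][y][x]


def heat_step(T: Grid, r: float) -> Grid:
    """One explicit FD-TD step of the 3D heat equation.

    Each interior cell moves toward the average of its six neighbours:
    T'[k][j][i] = T + r * (sum of 6 neighbours - 6*T). Boundary cells stay fixed.
    """
    nz, ny, nx = len(T), len(T[0]), len(T[0][0])
    new = [[row[:] for row in plane] for plane in T]  # copy; boundaries unchanged
    for k in range(1, nz - 1):
        for j in range(1, ny - 1):
            for i in range(1, nx - 1):
                lap = (T[k + 1][j][i] + T[k - 1][j][i]
                       + T[k][j + 1][i] + T[k][j - 1][i]
                       + T[k][j][i + 1] + T[k][j][i - 1]
                       - 6.0 * T[k][j][i])
                new[k][j][i] = T[k][j][i] + r * lap
    return new
```

Each cell reads only its six neighbours from the previous step, so updates within one step are independent of each other; that locality is what makes stencils of this kind attractive for FPGA pipelines.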
Holger Flatt, Steffen Blume, Sebastian Hesselbarth, Torsten Schünemann, P. Pirsch
{"title":"A parallel hardware architecture for connected component labeling based on fast label merging","authors":"Holger Flatt, Steffen Blume, Sebastian Hesselbarth, Torsten Schünemann, P. Pirsch","doi":"10.1109/ASAP.2008.4580169","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580169","url":null,"abstract":"This paper presents a dedicated parallel hardware architecture for fast connected component labeling. Both label generation and merging of equivalent labels are accelerated. Label generation is performed for four pixels in parallel. A special linked list based approach for fast label merging is proposed. This results in a compact implementation and shorter processing times compared to published implementations. For prototyping and evaluation purposes, the hardware architecture was integrated into an FPGA-based modular coprocessor architecture. A binary D1 test image is labeled in 1.74 ms on a Virtex-II Pro FPGA running at 140 MHz. Moreover, the architecture can be easily integrated into embedded image processing systems.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123243655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
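Connected component labeling has two phases: generating provisional labels and merging labels that turn out to be equivalent. The paper merges labels with its own linked-list scheme and generates labels for four pixels in parallel. The Python sketch below instead uses a plain sequential raster scan, 4-connectivity, and a union-find structure (a different, commonly used merging technique) only to show what label merging has to accomplish; the function name `label_components` is hypothetical.

```python
from typing import List


def label_components(img: List[List[int]]) -> List[List[int]]:
    """Two-pass 4-connected labeling of a binary image (rows of 0/1).

    Pass 1 assigns provisional labels and records equivalences via union-find;
    pass 2 replaces each label by its set representative, renumbered 1..K.
    """
    parent = [0]  # parent[0] is the background label

    def find(a: int) -> int:
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)  # keep the smaller label as root

    h, w = len(img), len(img[0])
    labels = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not img[y][x]:
                continue
            up = labels[y - 1][x] if y > 0 else 0
            left = labels[y][x - 1] if x > 0 else 0
            if up == 0 and left == 0:
                parent.append(len(parent))  # new provisional label
                labels[y][x] = len(parent) - 1
            else:
                labels[y][x] = min(l for l in (up, left) if l)
                if up and left:
                    union(up, left)  # the two labels belong to one component
    remap = {}
    for y in range(h):
        for x in range(w):
            if labels[y][x]:
                root = find(labels[y][x])
                labels[y][x] = remap.setdefault(root, len(remap) + 1)
    return labels
```

A U-shaped object illustrates why merging is needed: its two arms receive different provisional labels that only meet at the bottom row.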
{"title":"A subsampling pulsed UWB demodulator based on a flexible complex SVD","authors":"Y. Vanderperren, W. Dehaene","doi":"10.1109/ASAP.2008.4580164","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580164","url":null,"abstract":"A flexible digital architecture for a pulsed ultra-wideband demodulator sampling below Nyquist rate is presented. The system is based on a complex Singular Value Decomposition implemented on a configurable systolic array of simple processors. Automatic code generation is applied to cut design time and rapidly assess the implementation cost of several architectures of the processors.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124472075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Zodiac: System architecture implementation for a high-performance Network Security Processor","authors":"Wang Haixin, Bai Guoqiang, C. Hongyi","doi":"10.1109/ASAP.2008.4580160","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580160","url":null,"abstract":"The last few years have seen many significant progresses in the field of application-specific processors. One exemplar is Network Security Processors (NSPs) that perform various cryptographic operations specified by network security protocols and help to offload the computation intensive burdens from Network Processors (NPs). This paper proposes a high-performance NSP intended for both IPSec and SSL protocols acceleration. With a programmable descriptor-based instruction set architecture, the novel design of system architecture leads to a Gbps rate NSP named Zodiac, which is programmable with domain specific instructions for Gbps throughput IPSec and SSL applications. Synthesized with a 0.18 μm CMOS technology, the peak throughput of IPSec ESP tunnel mode can reach up to 1.651 Gbps and over 1000 full SSL handshakes per second are attainable.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115819329","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Nussinov RNA secondary structure prediction with systolic arrays on FPGAs","authors":"A. Jacob, J. Buhler, R. Chamberlain","doi":"10.1109/ASAP.2008.4580177","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580177","url":null,"abstract":"RNA structure prediction, or folding, is a compute-intensive task that lies at the core of several search applications in bioinformatics. We begin to address the need for high-throughput RNA folding by accelerating the Nussinov folding algorithm using a 2D systolic array architecture. We adapt classic results on parallel string parenthesization to produce efficient systolic arrays for the Nussinov algorithm, elaborating these array designs to produce fully realized FPGA implementations. Our designs achieve estimated speedups up to 39× on a Xilinx Virtex-II 6000 FPGA over a modern x86 CPU.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129419146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
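The Nussinov algorithm is a dynamic program that maximizes the number of non-crossing base pairs in an RNA sequence. The sequential Python sketch below fills the table diagonal by diagonal: every cell on one diagonal depends only on cells for shorter spans, which is the wavefront structure systolic designs can exploit. The allowed pairs (Watson-Crick plus G-U wobble) and the absence of a minimum hairpin-loop length are assumptions of this sketch, not details taken from the paper.

```python
def nussinov(seq: str) -> int:
    """Maximum number of non-crossing base pairs in `seq` (classic Nussinov recurrence)."""
    pairs = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}
    n = len(seq)
    if n == 0:
        return 0
    N = [[0] * n for _ in range(n)]  # N[i][j]: best score for seq[i..j]
    for span in range(1, n):          # process one anti-diagonal at a time
        for i in range(n - span):
            j = i + span
            best = max(N[i + 1][j], N[i][j - 1])   # leave i or j unpaired
            if (seq[i], seq[j]) in pairs:          # pair i with j
                inner = N[i + 1][j - 1] if i + 1 <= j - 1 else 0
                best = max(best, inner + 1)
            for k in range(i + 1, j):              # split into two substructures
                best = max(best, N[i][k] + N[k + 1][j])
            N[i][j] = best
    return N[0][n - 1]
```

The inner split loop makes the algorithm O(n³), which is the part parallel parenthesization-style arrays are designed to speed up.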
A. Amaricai, M. Vladutiu, M. Udrescu, L. Prodan, O. Boncalo
{"title":"Floating point multiplication rounding schemes for interval arithmetic","authors":"A. Amaricai, M. Vladutiu, M. Udrescu, L. Prodan, O. Boncalo","doi":"10.1109/ASAP.2008.4580148","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580148","url":null,"abstract":"Floating point multipliers with two differently rounded results for the same operation can be used for increasing the performance of interval multiplication. The present paper stands by this idea, by investigating the idea of using three existing floating point multiplication rounding algorithms for such multipliers - the Even-Seidel, Quach and Yu-Zyner algorithms. These three rounding schemes are modified for interval arithmetic; furthermore, a new rounding scheme is proposed. The estimates rendered by our analysis show that the proposed scheme has the best performance/area ratio.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129590417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
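Interval multiplication needs each product rounded twice, once toward minus infinity and once toward plus infinity, which is why a multiplier delivering both results for one operation helps. Python floats offer no rounding-mode control, so the sketch below emulates the two directed roundings with exact rational arithmetic (`fractions.Fraction`) and `math.nextafter` (Python 3.9+). It illustrates the required results only; it is not the Even-Seidel, Quach, Yu-Zyner, or proposed hardware scheme, and it ignores overflow and NaN handling. The names `mul_down_up` and `interval_mul` are hypothetical.

```python
import math
from fractions import Fraction
from typing import Tuple


def mul_down_up(a: float, b: float) -> Tuple[float, float]:
    """Return (a*b rounded toward -inf, a*b rounded toward +inf)."""
    exact = Fraction(a) * Fraction(b)  # exact rational product
    nearest = float(exact)             # correctly rounded to nearest
    down = nearest if Fraction(nearest) <= exact else math.nextafter(nearest, -math.inf)
    up = nearest if Fraction(nearest) >= exact else math.nextafter(nearest, math.inf)
    return down, up


def interval_mul(x: Tuple[float, float], y: Tuple[float, float]) -> Tuple[float, float]:
    """[a, b] * [c, d] with outward rounding: min of lower bounds, max of upper bounds."""
    (a, b), (c, d) = x, y
    products = [mul_down_up(p, q) for p in (a, b) for q in (c, d)]
    return min(lo for lo, _ in products), max(hi for _, hi in products)
```

When the product is exactly representable, both roundings coincide; otherwise they are adjacent floats that bracket the exact result.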
{"title":"Architecture and VLSI realization of a high-speed programmable decoder for LDPC convolutional codes","authors":"M. Tavares, S. Kunze, E. Matús, G. Fettweis","doi":"10.1109/ASAP.2008.4580181","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580181","url":null,"abstract":"In this paper, we present a novel high-speed dual-core programmable decoder architecture for LDPC convolutional codes and their tail-biting versions. This architecture uses a modified Min-Sum algorithm and enables the decoding of a multitude of codes with different node degree distributions, rates and block lengths. We show how the parallelization concepts are derived using the properties of the bipartite graphs underlying the codes. Moreover, the hardware elements composing the architecture will be presented and analyzed in detail. The programmability of the decoder is also considered. Finally, we present the synthesis results for a prototype ASIC which is capable of achieving high decoding throughput still with very high flexibility, relatively low power consumption and small area.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"201 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129043040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
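The abstract states only that a modified Min-Sum algorithm is used. For orientation, the core of any Min-Sum decoder is the check-node update: each outgoing message carries the product of the signs and the minimum magnitude of all other incoming messages. The Python sketch below shows that update with the usual two-minimum shortcut. The optional `scale` factor (as in normalized Min-Sum) is an assumption, since the paper's specific modification is not described here, and `check_node_update` is a hypothetical name.

```python
from typing import List


def check_node_update(llrs: List[float], scale: float = 1.0) -> List[float]:
    """Min-Sum check-node update for one check node (needs at least two inputs).

    Output i = scale * (product of signs of the other inputs) * (min |other inputs|).
    Only the smallest and second-smallest magnitudes are needed.
    """
    mags = [abs(v) for v in llrs]
    min1_idx = min(range(len(mags)), key=mags.__getitem__)
    min1 = mags[min1_idx]
    min2 = min(m for i, m in enumerate(mags) if i != min1_idx)
    total_sign = 1
    for v in llrs:  # parity of negative inputs (zero counts as positive)
        if v < 0:
            total_sign = -total_sign
    out = []
    for i, v in enumerate(llrs):
        sign_i = total_sign * (-1 if v < 0 else 1)  # remove this edge's own sign
        mag = min2 if i == min1_idx else min1       # exclude this edge's own magnitude
        out.append(scale * sign_i * mag)
    return out
```

The two-minimum trick means a check node of any degree stores only two magnitudes, one index, and the sign bits, which keeps the per-node hardware small.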
Yuki Kobayashi, M. Jayapala, P. Raghavan, F. Catthoor, M. Imai
{"title":"Operation shuffling over cycle boundaries for low energy L0 clustering","authors":"Yuki Kobayashi, M. Jayapala, P. Raghavan, F. Catthoor, M. Imai","doi":"10.1109/ASAP.2008.4580170","DOIUrl":"https://doi.org/10.1109/ASAP.2008.4580170","url":null,"abstract":"To achieve energy reduction for instruction memory access in VLIW ASIPs, operation shuffling technique has been proposed. The shuffling technique changes assignment of an operation to different slot so that L0 cluster configuration can be improved. The published technique, however, moves operations within a cycle, not between cycles. As a result, the potential gain of energy reduction was limited. This paper proposes a shuffling technique that also moves operations between cycles as well as within a cycle. The experimental results show that the proposed method achieves more efficient energy than the best known shuffling method by up to 15.3% in the best case.","PeriodicalId":246715,"journal":{"name":"2008 International Conference on Application-Specific Systems, Architectures and Processors","volume":"313 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128310939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}