{"title":"Communications infrastructure generation for modular FPGA reconfiguration","authors":"S. Koh, O. Diessel","doi":"10.1109/FPT.2006.270338","DOIUrl":"https://doi.org/10.1109/FPT.2006.270338","url":null,"abstract":"Modules that are swapped dynamically at run-time on an FPGA have varying communication needs over time. In order to support this, we aim to generate a wiring infrastructure that caters for the dynamically-changing module interfaces. This, however, imposes a regular structure for laying out modules on a device, which may result in longer inter-module wiring paths as compared to traditional methods where the netlists are flattened. This paper studies placing modules within a structured layout to compare resulting circuit speeds with those obtained by traditional methods. Our results indicate that the difference in critical path delay is high at very low utilisation, but that the overhead is absorbed as the number of modules and interconnection density increases to realistic levels. The authors conclude that implementing such a wiring infrastructure has manageable overheads while having the added advantage of being amenable to dynamic reconfiguration","PeriodicalId":354940,"journal":{"name":"2006 IEEE International Conference on Field Programmable Technology","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127574698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sigma-delta based clock recovery using on-chip PLL in FPGA","authors":"N. Ge, Yuyu Liu, Huazhong Yang, Hui Wang","doi":"10.1109/FPT.2006.270304","DOIUrl":"https://doi.org/10.1109/FPT.2006.270304","url":null,"abstract":"A clock and data recovery (CDR) circuit is proposed based on the sigma-delta quantization. The phase of the new CDR circuit is adjusted by a sigma-delta modulated reference clock that increases the stability of the system and can easily interface with PLL cores embedded in FPGAs. The approximate linear model of the proposed CDR is analyzed for SONET/SDH applications to evaluate its performance. The measurement shows that the jitter tolerance meets the ITU-T requirement with a high margin of 0.3UI. The commercial equipment has been developed using a single FPGA chip based on the SDM-CDR","PeriodicalId":354940,"journal":{"name":"2006 IEEE International Conference on Field Programmable Technology","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130526033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A hardware cache memcpy accelerator","authors":"Stephan Wong, F. Duarte, S. Vassiliadis","doi":"10.1109/FPT.2006.270305","DOIUrl":"https://doi.org/10.1109/FPT.2006.270305","url":null,"abstract":"In this paper, we present a hardware solution to perform the commonly used memcpy operation with the goal to reduce the time to perform the actual memory copies. This is accomplished by taking advantage of the presence of a cache that is found next to many current-day (embedded) processors. Additionally, the currently presented solution assumes that to be copied data is already in the cache and is aligned by the cache-line size. We present the concept and implementation details of the proposed hardware module and the system used to experiment both our hardware and an optimized software implementation of the memcpy function. Experimental results show that the proposed hardware solution is at least 79% faster than an optimized hand-coded software solution","PeriodicalId":354940,"journal":{"name":"2006 IEEE International Conference on Field Programmable Technology","volume":"222 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133877021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA accelerated Tate pairing based cryptosystems over binary fields","authors":"Chang Shu, Soonhak Kwon, K. Gaj","doi":"10.1109/FPT.2006.270309","DOIUrl":"https://doi.org/10.1109/FPT.2006.270309","url":null,"abstract":"Tate pairing based cryptosystems have recently emerged as an alternative to traditional public key cryptosystems because of their ability to be used in multi-party identity-based key management schemes. Due to the inherent parallelism of the existing pairing algorithms, high performance can be achieved via hardware realizations. Three schemes for Tate pairing computations have been proposed in the literature: cubic elliptic, binary elliptic, and binary hyperelliptic. For our implementation we have chosen the binary elliptic case because of the simple underlying algorithms and efficient binary arithmetic. In this paper, we propose a new FPGA-based architecture of the Tate pairing-based computation over the binary fields F_{2^239} and F_{2^283}. Even though our field sizes are larger than in the architectures based on cubic elliptic curves or binary hyperelliptic curves with the same security strength, fewer multiplications in the underlying field need to be performed. As a result, the computational latency for a pairing computation has been reduced, and our implementation runs 10-to-20 times faster than the equivalent implementations of other pairing-based schemes at the same level of security strength. At the same time, an improvement in the product of latency by area by a factor between 12 and 46 for an equivalent type of implementation has been achieved","PeriodicalId":354940,"journal":{"name":"2006 IEEE International Conference on Field Programmable Technology","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133925490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On-line scheduling of real-time tasks for reconfigurable computing system","authors":"Xuegong Zhou, Ying Wang, XunZhang Huang, Chenglian Peng","doi":"10.1109/FPT.2006.270295","DOIUrl":"https://doi.org/10.1109/FPT.2006.270295","url":null,"abstract":"Efficient task scheduling is very important for obtaining high performance in reconfigurable computing system. Previous researches mostly concentrate on the spatial placement of tasks, and did not pay enough attention to temporal factors. This paper focuses on the on-line scheduling of real-time tasks with known executing time, and introduces the notion of recognition-complete for scheduling algorithms, that is the algorithm can arrange the start time of a newly arrived task as early as possible. A new on-line scheduling algorithm is proposed, which achieves recognition-complete by using the technique of time window. The simulation results show that the proposed algorithm gains a prominent improvement in scheduling performance over previous algorithms","PeriodicalId":354940,"journal":{"name":"2006 IEEE International Conference on Field Programmable Technology","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132803101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Activity-based power estimation and characterization of DSP and multiplier blocks in FPGAs","authors":"Nathalie Chan King Choy, S. Wilton","doi":"10.1109/FPT.2006.270321","DOIUrl":"https://doi.org/10.1109/FPT.2006.270321","url":null,"abstract":"This paper describes an activity-based strategy for estimating the average power dissipation of hard DSP and multiplier blocks embedded in FPGAs. We identified two technical challenges in creating a tool flow to do this: (1) estimating the activity of all nodes in designs containing DSP blocks, and (2) estimating the average power dissipated within the DSP block quickly and accurately. In this paper, we compare several methods to address each of these two challenges. We conclude with a description of our complete power estimation flow","PeriodicalId":354940,"journal":{"name":"2006 IEEE International Conference on Field Programmable Technology","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116434439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The cost of data dependence in motion vector estimation for reconfigurable platforms","authors":"Su-Shin Ang, G. Constantinides, W. Luk, P. Cheung","doi":"10.1109/FPT.2006.270341","DOIUrl":"https://doi.org/10.1109/FPT.2006.270341","url":null,"abstract":"Motion vector estimation is frequently performed as a prelude to the exploitation of temporal redundancies in video applications. As a result, a large volume of work has been done to develop techniques to avoid the heavy memory access requirements of full search motion vector estimation. Often, these approaches introduce data dependence to the algorithm, leading to memory accesses which cannot be determined at design time. Consequently, this complicates the exploitation of data reuse in hardware. In this work, the cost of data dependence is quantified. Experiments indicate that a data dependent fast motion vector estimation approach is faster than full search by up to 47% in the absence of data re-use optimisation. However, full search is approximately 16 times faster than the 'fast' motion vector estimation algorithm when a static line buffering scheme and a parallel caching scheme are used respectively to exploit data re-use. Therefore, it is established that data dependence in motion vector estimation is very expensive in terms of hardware performance","PeriodicalId":354940,"journal":{"name":"2006 IEEE International Conference on Field Programmable Technology","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125267530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Within-die delay variability in 90nm FPGAs and beyond","authors":"P. Sedcole, P. Cheung","doi":"10.1109/FPT.2006.270300","DOIUrl":"https://doi.org/10.1109/FPT.2006.270300","url":null,"abstract":"Semiconductor scaling causes increasing and unavoidable within-die parametric variability. This paper describes accurate measurement techniques for characterising both systematic and stochastic delay variability in FPGAs. Results and analysis are presented from measurements made on a sample of 90nm devices, showing that delay per logic element varies stochastically by ±3.54% on average over the set. The delay also varies by up to 3.66% across a single die from correlated sources of variability. The results are extrapolated to determine the impact at future technology nodes. The predicted significant performance degradation that variability will cause demonstrates the importance of new circuit or system design techniques to cope with variations in future FPGAs","PeriodicalId":354940,"journal":{"name":"2006 IEEE International Conference on Field Programmable Technology","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125721504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient management of custom instructions for run-time reconfigurable instruction set processors","authors":"S. Lam, Bharathi N. Krishnan, T. Srikanthan","doi":"10.1109/FPT.2006.270323","DOIUrl":"https://doi.org/10.1109/FPT.2006.270323","url":null,"abstract":"The instruction set extension capability of RISPs (reconfigurable instruction set processors) provides an attractive means to meet the flexibility, performance, and cost demands of ubiquitous computing devices. Run-time reconfiguration can further increase the cost efficiency and hardware specialization of these processors by dynamically changing the configuration of the reconfigurable logic to the required functionality. In this paper, we propose the use of a heuristic that leads to the selection of large custom instructions for increased performance gain. Result analysis of six applications from the MiBench embedded benchmark suite show that efficient data-path merging can be applied to the custom instructions to reduce the average number of configurations to less than 8 in a run-time RISP. In addition, there is only a small difference in the average number of configurations when compared to a custom instruction selection strategy that results in lower performance","PeriodicalId":354940,"journal":{"name":"2006 IEEE International Conference on Field Programmable Technology","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126771676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A comparison of 2-D discrete wavelet transform computation schedules on FPGAs","authors":"Maria E. Angelopoulou, K. Masselos, P. Cheung, Y. Andreopoulos","doi":"10.1109/FPT.2006.270310","DOIUrl":"https://doi.org/10.1109/FPT.2006.270310","url":null,"abstract":"When it comes to the computation of the 2D discrete wavelet transform (DWT), three major computation schedules have been proposed, namely the row-column, the line-based and the block-based. In this work, the lifting-based designs of these schedules are implemented on FPGA-based platforms to execute the forward 2D DWT, and their comparison is presented. Our implementations are optimized in terms of throughput and memory requirements, in accordance with the specifications of each one of the three computation schedules and the lifting decomposition. All implementations are parameterized with respect to the image size and the number of decomposition levels. Experimental results prove that the suitability of each implementation for a particular application depends on the given specifications, concerning the throughput and the hardware cost","PeriodicalId":354940,"journal":{"name":"2006 IEEE International Conference on Field Programmable Technology","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126791295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}