{"title":"Implementation of IEEE single precision floating point addition and multiplication on FPGAs","authors":"L. Louca, T. A. Cook, W. H. Johnson","doi":"10.1109/FPGA.1996.564761","DOIUrl":"https://doi.org/10.1109/FPGA.1996.564761","url":null,"abstract":"Floating point operations are hard to implement on FPGAs because of the complexity of their algorithms. On the other hand, many scientific problems require floating point arithmetic with high levels of accuracy in their calculations. Therefore, we have explored FPGA implementations of addition and multiplication for IEEE single precision floating-point numbers. Customizations were performed where this was possible in order to save chip area, or get the most out of our prototype board. The implementations tradeoff area and speed for accuracy. The adder is a bit-parallel adder, and the multiplier is a digit-serial multiplier. Prototypes have been implemented on Altera FLEX8000s, and peak rates of 7 MFlops for 32-bit addition and 2.3 MFlops for 32-bit multiplication have been obtained.","PeriodicalId":244873,"journal":{"name":"1996 Proceedings IEEE Symposium on FPGAs for Custom Computing Machines","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127169818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bit-serial pipeline synthesis for multi-FPGA systems with C++ design capture","authors":"T. Isshiki, W. Dai","doi":"10.1109/FPGA.1996.564741","DOIUrl":"https://doi.org/10.1109/FPGA.1996.564741","url":null,"abstract":"Developing applications for a large-scale configurable system composed of state-of-the-art FPGA technology is a grand challenge. FPGAs are inherently resource limited devices in terms of logic, routing, and IO. Without a careful circuit implementation strategy, one would waste a large portion of the potential capacity of the configurable hardware. Also, high-level design entry support is essential for such large-scale hardware. A C++ design tool has been implemented which maps the computational algorithms onto bit-serial pipeline networks which exhibit high performance and maximize the device utilization of each FPGA. With this tool, the designer is able to develop applications in a very short time, and also is able to try out different algorithm implementations easily to see the trade-offs in terms of performance and hardware size instantaneously. Based on this C++ design tool, a number of DSP applications such as 1D and 2D filters, adaptive filters, Inverse Discrete Cosine Transform, and digital neural networks were designed.","PeriodicalId":244873,"journal":{"name":"1996 Proceedings IEEE Symposium on FPGAs for Custom Computing Machines","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127734842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SOP: a reconfigurable massively parallel system and its control-data-flow based compiling method","authors":"Tsukasa Yamauchi, S. Nakaya, N. Kajihara","doi":"10.1109/FPGA.1996.564793","DOIUrl":"https://doi.org/10.1109/FPGA.1996.564793","url":null,"abstract":"This paper describes reconfigurable massively parallel computer system called SOP (Sea Of Processors) that has ability to change its structure and achieves high performance by mapping the control flow and data flow of target algorithms directly on the reconfigurable hardware. SOP system consists of huge number of programmable logic, memory and switch elements. Each logic element is mainly used to map logic/arithmetic operations and control circuits. SOP memory element has ability to process global search, global sorting, heap tree and min/max operations quickly. SOP compiler extracts high degree of parallelism from application programs written in C-language by exploiting operation and function level parallelism using control-data-flow based mapping technique.","PeriodicalId":244873,"journal":{"name":"1996 Proceedings IEEE Symposium on FPGAs for Custom Computing Machines","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130450830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OneChip: an FPGA processor with reconfigurable logic","authors":"Ralph Wittig, P. Chow","doi":"10.1109/FPGA.1996.564773","DOIUrl":"https://doi.org/10.1109/FPGA.1996.564773","url":null,"abstract":"This paper describes a processor architecture called OneChip, which combines a fixed-logic processor core with reconfigurable logic resources. Using the programmable components of this the performance of speed-critical can be improved by customizing OneChip's execution units, or flexibility can be added to the glue logic interfaces of embedded controller applications. OneChip eliminates the shortcomings of other custom compute machines by tightly integrating its reconfigurable resources into a MIPS-like processor. Speedups of close to 50 over strict software implementations on a MIPS R4400 are achievable for computing the DCT.","PeriodicalId":244873,"journal":{"name":"1996 Proceedings IEEE Symposium on FPGAs for Custom Computing Machines","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132788469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Genetic algorithms in software and in hardware-a performance analysis of workstation and custom computing machine implementations","authors":"P. Graham, B. Nelson","doi":"10.1109/FPGA.1996.564847","DOIUrl":"https://doi.org/10.1109/FPGA.1996.564847","url":null,"abstract":"The paper analyzes the performance differences found between the hardware and software versions of a genetic algorithm used to solve the travelling salesman problem. The hardware implementation requires 4 FPGA's on a Splash 2 board and runs at 11 MHz. The software implementation was written in C++ and executed on a 125 MHz HP PA-RISC workstation. The software run time was more than four times that of the hardware (up to 50 times as many cycles). The paper analyses the contribution made to this performance difference by the following hardware features: hard-wired control, custom address generation logic, memory hierarchy efficiency, and both fine- and course-grained parallelism. The results indicate that the major contributor to the hardware performance advantage is fine-grained parallelism-RTL-level parallelism due to operator pipelining. This alone accounts for as much as a 38X cycle-count reduction over the software in one section of the algorithm. The next major contributors include hard-wired control and custom address generation which account for as much as a 3X speedup in other sections of the algorithm. Finally, memory hierarchy inefficiencies in the software (cache misses and paging) and coarse-grained parallelism in the hardware are each shown to have lesser effect on the performance difference between the implementations.","PeriodicalId":244873,"journal":{"name":"1996 Proceedings IEEE Symposium on FPGAs for Custom Computing Machines","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116742377","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A quantitative analysis of processor-programmable logic interface","authors":"S. Rajamani, P. Viswanath","doi":"10.1109/FPGA.1996.564852","DOIUrl":"https://doi.org/10.1109/FPGA.1996.564852","url":null,"abstract":"The addition of programmable logic to RISC machines has the potential of exploiting the inherent parallelism of hardware to speedup an application. The authors study the effect of adding a programmable accelerator to DLX, a RISC prototype. They build this model and parameterize the communication overhead between the processor and programmable unit and logic/routing delays inside the programmable unit. They use simulation to evaluate the performance of this model, parameterized by communication overhead and logic delays, by comparing it with the baseline DLX architecture on some sample problems. The methodology is useful in studying the relative importance of the parameters and in projecting the performance of the system, if the programmable logic were to be implemented inside the processor.","PeriodicalId":244873,"journal":{"name":"1996 Proceedings IEEE Symposium on FPGAs for Custom Computing Machines","volume":"290 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133769647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Revisiting Smalltalk-80 blocks: a logic generator for FPGAs","authors":"B. Pottier, J. Llopis","doi":"10.1109/FPGA.1996.564744","DOIUrl":"https://doi.org/10.1109/FPGA.1996.564744","url":null,"abstract":"A Smalltalk-80 block is an encapsulation of code the evaluation of which is delayed, either for casual or repetitive execution (message value), or for process creation (message fork). Execution is driven following the object oriented paradigm with late binding of messages to actual functions. A logic generator is described, which is based on the compilation of blocks to logic specifications. The translation process applies the block to a collection of objects representing a definition set. Resulting and original object collections are then translated into binary logic depending on observed classes. Block reference to remote variables allows sequential circuit implementations. The logic generator operates in an interactive environment supporting BLIF and XNF logic representations. Logic optimization and partitioning is achieved using the SIS package. The testbed is the ArMen parallel computer which includes an FPGA ring.","PeriodicalId":244873,"journal":{"name":"1996 Proceedings IEEE Symposium on FPGAs for Custom Computing Machines","volume":"497 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134159478","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploring architectures for volume visualization on the Teramac custom computer","authors":"G. Snider, P. Kuekes, W. Culbertson, R. Carter, Arnold S. Berger, R. Amerson","doi":"10.1109/FPGA.1996.564750","DOIUrl":"https://doi.org/10.1109/FPGA.1996.564750","url":null,"abstract":"In the past year we have gained experience in custom computing by porting a number of large applications to Teramac. Teramac is a custom computer capable of executing million-gate user designs at speeds approaching one megahertz. Teramac includes software that fully automates the conversion of high-level user designs to configurations that are ready to run on Teramac. Two applications, both in excess of a quarter million gates, are described here: an artery-extraction filter that locates and highlights arteries in medical MRI datasets; and Cube, a volume-rendering engine. We have discovered the importance of scalable, parameterized designs. We have been successful in parameterizing some aspects of our designs and we have identified improvements to our tools which would facilitate further parameterization. Our users have benefitted from the speed at which their applications run on Teramac and have gained confidence in their designs after seeing them run on actual hardware.","PeriodicalId":244873,"journal":{"name":"1996 Proceedings IEEE Symposium on FPGAs for Custom Computing Machines","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133792931","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Configurable computing solutions for automatic target recognition","authors":"J. Villasenor, B. Schoner, Kang-Ngee Chia, C. Zapata, Hea Joung Kim, Christopher R. Jones, Shane Lansing, B. Mangione-Smith","doi":"10.1109/FPGA.1996.564749","DOIUrl":"https://doi.org/10.1109/FPGA.1996.564749","url":null,"abstract":"FPGAs can be used to build systems for automatic target recognition (ATR) that achieve an order of magnitude increase in performance over systems built using general purpose processors. This improvement is possible because the bit-level operations that comprise much of the ATR computational burden map extremely efficiently into FPGAs, and because the specificity of ATR target templates can be leveraged via fast reconfiguration. We describe here algorithms, design tools, and implementation strategies that are being used in a configurable computing system for ATR.","PeriodicalId":244873,"journal":{"name":"1996 Proceedings IEEE Symposium on FPGAs for Custom Computing Machines","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122439311","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VLSI architectures for field programmable gate arrays: a case study","authors":"Roger Francis Woods, A. Cassidy, J. Gray","doi":"10.1109/FPGA.1996.564736","DOIUrl":"https://doi.org/10.1109/FPGA.1996.564736","url":null,"abstract":"The ability to achieve highly efficient hardware implementations of algorithms will form a key aspect in the success of custom computing, Developments in VLSI architectures where regularity, simple design and locality of connections appears to be an ideal approach in the development of efficient field programmable gate array (FPGA) designs. The authors present a case study namely the implementation of the majority of a two dimensional (2D) discrete cosine transform (DCT) in a XilinK XC6216 device. The design has a 70% hardware utilisation figure and operates at 25 Mega pixels per second or over 30 frames per second for standard NTSC. The paper clearly demonstrates the hardware efficiency of such an approach. The paper also describes work on architectural synthesis framework to automatically generate highly regular designs for a wider range of complex algorithms.","PeriodicalId":244873,"journal":{"name":"1996 Proceedings IEEE Symposium on FPGAs for Custom Computing Machines","volume":"54 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1996-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124180006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}