{"title":"Evaluating Various Branch-Prediction Schemes for Biomedical-Implant Processors","authors":"C. Strydis, G. Gaydadjiev","doi":"10.1109/ASAP.2009.37","DOIUrl":"https://doi.org/10.1109/ASAP.2009.37","url":null,"abstract":"This paper evaluates various branch-prediction schemes under different cache configurations in terms of performance, power, energy and area on suitably selected biomedical workloads. The benchmark suite used consists of compression, encryption and data-integrity algorithms as well as real implant applications, all executed on realistic biomedical input datasets. Results are used to drive the (micro)architectural design of a novel microprocessor targeting microelectronic implants. Our profiling study has revealed that, under strict or relaxed area constraints and regardless of cache size, the ALWAYS TAKEN and ALWAYS NOT-TAKEN static prediction schemes are, in almost all cases, the most suitable choices for the envisioned implant processor. It is further shown that bimodal predictors with small Branch-Target-Buffer (BTB) tables are suboptimal yet also attractive solutions when processor I/D-cache sizes are up to 1024KB/512KB, respectively.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114475871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Novel Processor Architecture for McEliece Cryptosystem and FPGA Platforms","authors":"A. Shoufan, Thorsten Wink, H. G. Molter, S. Huss, Falko Strenzke","doi":"10.1109/ASAP.2009.29","DOIUrl":"https://doi.org/10.1109/ASAP.2009.29","url":null,"abstract":"McEliece scheme represents a code-based public-key cryptosystem. So far, this cryptosystem was not employed because of efficiency questions regarding performance and communication overhead. This paper presents a novel processor architecture as a high-performance platform to execute key generation, encryption and decryption according to this cryptosystem. A prototype of this processor is realized on Virtex-5 FPGA and tested via a software API. A comparison with a similar software solution highlights the performance advantage of the proposed hardware solution.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129742520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Run-Time Detection of Malwares via Dynamic Control-Flow Inspection","authors":"Yong-Joon Park, Zhao Zhang, Songqing Chen","doi":"10.1109/ASAP.2009.30","DOIUrl":"https://doi.org/10.1109/ASAP.2009.30","url":null,"abstract":"Conventional approach of detecting malwares relies on static scanning of malware signature. However, it may not work on the malwares that use software protection methods such as encryption and packing with run-time decryption and unpacking. We propose a hardware-assisted malware detection system that detects malwares during program run time to complement the conventional approach. It searches for control flow-based signature of malware during program execution, therefore bypassing the protection method used by those malwares. A new hardware design is used to assist the collection of control flow information. We have implemented and evaluated a prototype system on top of a full-system simulator based on the Intel x86 architecture. The experimental results show that the system can successfully distinguish all 30 malware variants and other benign programs that we have randomly collected, and that the overall run-time performance overhead is negligible. In short, the study demonstrates that it is a viable approach to detect malware in run time using control flow-based signature.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117295045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implementing a Highly Parameterized Digital PIV System on Reconfigurable Hardware","authors":"A. Bennis, M. Leeser, G. Tadmor","doi":"10.1109/ASAP.2009.20","DOIUrl":"https://doi.org/10.1109/ASAP.2009.20","url":null,"abstract":"This paper presents PARPIV, the design and prototyping of a highly parameterized digital Particle Image Velocimetry (PIV) system implemented on reconfigurable hardware. Despite many improvements to PIV methods over the last twenty years, PIV post-processing remains a computationally intensive task. It becomes a serious bottleneck as camera acquisition rates reach 1000 frames per second. In this research, we aim to substantially speed up PIV processing by implementing it in reconfigurable hardware. Furthermore, this implementation is highly parameterized, supporting adaptation to a variety of setups and application domains. The circuit is parameterized by the dimensions of the captured images as well as the dimensions of the interrogation windows and sub-areas, pixel representation, board memory width, displacement and overlap. Through this work a parameterized library of different VHDL components was built. To the best of the authors’ knowledge, this is the first highly parameterized PIV system implemented on reconfigurable hardware reported in the literature. For a typical PIV configuration with images of 512×512 pixels, 40×40 pixel interrogation windows and 32×32 pixel sub-areas, we achieved about 65 times speedup in hardware over a standard software implementation.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130426700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Implementation of Carry-Save Adders in FPGAs","authors":"J. Hormigo, M. Ortiz, F. Quiles, Francisco J. Jaime, J. Villalba, E. Zapata","doi":"10.1109/ASAP.2009.22","DOIUrl":"https://doi.org/10.1109/ASAP.2009.22","url":null,"abstract":"Most Field Programmable Gate Array (FPGA) devices have a special fast carry propagation logic intended to optimize addition operations. The redundant adders do not easily fit into this specialized carry-logic and, consequently, they require double hardware resources than carry propagate adders, while showing a similar delay for small size operands. Therefore, carry-save adders are not usually implemented on FPGA devices, although they are very useful in ASIC implementations. In this paper we study efficient implementations of carry-save adders on FPGA devices, taking advantage of the specialized carry-logic. We show that it is possible to implement redundant adders with a hardware cost close to that of a carry propagate adder. Specifically, for 16 bits and bigger wordlengths, redundant adders are clearly faster and have an area requirement similar to carry propagate adders. Among all the redundant adders studied, the 4:2 compressor is the fastest one, presents the best exploitation of the logic resources within FPGA slices and the easiest way to adapt classical algorithms to efficiently fit FPGA resources.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113975657","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MSA-CUDA: Multiple Sequence Alignment on Graphics Processing Units with CUDA","authors":"Yongchao Liu, B. Schmidt, D. Maskell","doi":"10.1109/ASAP.2009.14","DOIUrl":"https://doi.org/10.1109/ASAP.2009.14","url":null,"abstract":"Progressive alignment is a widely used approach for computing multiple sequence alignments (MSAs). However, aligning several hundred or thousand sequences with popular progressive alignment tools such as ClustalW requires hours or even days on state-of-the-art workstations. This paper presents MSA-CUDA, a parallel MSA program, which parallelizes all three stages of the ClustalW processing pipeline using CUDA and achieves significant speedups compared to the sequential ClustalW for a variety of large protein sequence datasets. Our tests on a GeForce GTX 280 GPU demonstrate average speedups of 36.91 (for long protein sequences), 18.74 (for average-length protein sequences), and 11.27 (for short protein sequences) compared to the sequential ClustalW running on a Pentium 4 3.0 GHz processor. Our MSA-CUDA outperforms ClustalW-MPI running on 32 cores of a high performance workstation cluster.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130589597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}