{"title":"Acceleration of the long read mapping on a PC-FPGA architecture (abstract only)","authors":"Peng Chen, Chao Wang, Xi Li, Xuehai Zhou","doi":"10.1145/2435264.2435329","DOIUrl":"https://doi.org/10.1145/2435264.2435329","url":null,"abstract":"Genome sequence alignment, in which a very large number of sequence reads must be compared against an enormously long reference, has long been a central challenge for biologists. In recent years, new sequencing technology has made it possible to generate longer reads (sequences of genome fragments), which seem more valuable for life science research. It has been foreseen that long genome reads (longer than 200 base pairs) will dominate the field in the near future. Unfortunately, most state-of-the-art aligners are optimized for, and only applicable to, short read mapping, while present long read aligners are still unsatisfactory in terms of speed. In this paper, we propose a novel PC-FPGA hybrid system to improve the performance of long read mapping. The BWA-SW algorithm is chosen as the alignment approach, and by accelerating the bottleneck of the algorithm, our solution achieves a significant improvement in speed. Experiments demonstrate that the described system is as accurate as the BWA-SW aligner and about 1.41-2.73 times faster for reads with lengths ranging from 500bp to 2000bp.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"21 1","pages":"271"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75668908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A novel multithread routing method for FPGAs (abstract only)","authors":"Chun Zhu, Qiuli Li, Jian Wang, Jinmei Lai","doi":"10.1145/2435264.2435322","DOIUrl":"https://doi.org/10.1145/2435264.2435322","url":null,"abstract":"We propose a platform-independent multithread routing method for FPGAs with two aspects: a single high fanout net is routed in parallel within itself, and several low fanout nets are routed in parallel with one another. Routing high fanout nets usually takes considerable time because of the large physical area enclosed by their bounding boxes and the tens of terminals to connect. Therefore, a high fanout net is partitioned into several subnets with fewer terminals and smaller bounding boxes, which are routed in parallel. However, low fanout nets, with intrinsically small bounding boxes and few terminals, can hardly be divided. Instead, low fanout nets whose bounding boxes do not overlap are routed concurrently. A new graph, named the bounding box graph, is used to facilitate selecting nets to be routed concurrently. In this graph, each vertex stands for a net, and an edge between two vertices means that the bounding boxes of the two corresponding nets overlap. Several strategies are introduced to balance the load among threads and ensure deterministic results. Routing time decreases as the number of threads increases. On a 4-core processor, this technique improves the run-time by ~1.9× with routing quality degrading by no more than 2.3%.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"30 1","pages":"269"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81246277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating subsequence similarity search based on dynamic time warping distance with FPGA","authors":"Zilong Wang, Sitao Huang, Lanjun Wang, Hao Li, Yu Wang, Huazhong Yang","doi":"10.1145/2435264.2435277","DOIUrl":"https://doi.org/10.1145/2435264.2435277","url":null,"abstract":"Subsequence search, especially subsequence similarity search, is one of the most important subroutines in time series data mining algorithms, and there is increasing evidence that Dynamic Time Warping (DTW) is the best distance metric. However, in spite of great effort on software speedup techniques, including early abandoning strategies, lower bounds, indexing, and computation reuse, DTW still costs too much time for many applications, e.g. 80% of the total time. Since DTW is a two-dimensional sequential dynamic search with high data dependency, it is hard to accelerate with parallel hardware. In this work, we propose a novel framework for FPGA-based subsequence similarity search and a novel PE-ring structure for DTW calculation. This framework utilizes the data reusability of continuous DTW calculations to reduce bandwidth and exploit coarse-grained parallelism, while guaranteeing accuracy with a two-phase precision reduction. The PE-ring supports on-line updating of patterns of arbitrary lengths, and utilizes the hard-wired synchronization of the FPGA to realize fine-grained parallelism. It also offers a flexible degree of parallelism to trade off performance against cost. The experimental results show that we can achieve several orders of magnitude speedup in accelerating subsequence similarity search compared with the best software and current GPU/FPGA implementations on different datasets.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"141 1","pages":"53-62"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77301047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Location, location, location: the role of spatial locality in asymptotic energy minimization","authors":"A. DeHon","doi":"10.1145/2435264.2435291","DOIUrl":"https://doi.org/10.1145/2435264.2435291","url":null,"abstract":"Locality exploitation is essential to asymptotic energy minimization for gate array netlist evaluation. Naive implementations that ignore locality, including flat crossbars and simple processors based on monolithic memories, can require O(N^2) energy for an N node graph. Specifically, it is important to exploit locality (1) to reduce the size of the description of the graph, (2) to reduce data movement, and (3) to reduce instruction movement. FPGAs exploit all three. FPGAs with a Rent Exponent p<0.5 running designs with p<0.5 achieve asymptotically optimal Theta(N) energy. FPGA designs with p>0.5 and implementations with metal layers that grow as O(N^{p-0.5}) require only O(N^{p+0.5}) energy; this bound can be achieved with O(1) metal layers with a novel multicontext design that has heterogeneous context depth. In contrast, a p>0.5 FPGA design on an implementation technology with O(1) metal layers requires O(N^{2p}) energy.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"74 12 1","pages":"137-146"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79990777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA bitstream compression and decompression using LZ and golomb coding (abstract only)","authors":"Jinsong Mao, Hao Zhou, Haijiang Ye, Jinmei Lai","doi":"10.1145/2435264.2435309","DOIUrl":"https://doi.org/10.1145/2435264.2435309","url":null,"abstract":"In this paper we propose an optimized bitstream compression algorithm based on LZ and a novel decompressor architecture. The proposed algorithm improves the compression ratio by fully utilizing the regularity of the configuration bits of the CLB (Configurable Logic Block) in the FPGA and by using variable-length Golomb coding. The experimental results show that the optimized method improves the compression ratio of LZSS by 32.3% for bitstreams with high regularity and 10.3% for bitstreams with low regularity, and our approach shows higher flexibility than the BMC+RLE algorithm when compressing bitstreams with high regularity for various FPGAs. Moreover, we design a two-buffer-window decompressor to download the compressed bitstreams. To increase the throughput of the proposed decompressor, we design a multi-stage data selector in it. Post-simulation of the decompressor shows that its throughput is up to 9280 Mbps in a 65nm CMOS process, and 4352 Mbps when verified on a Virtex-5 FPGA.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"99 1","pages":"265"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80274715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An FPGA-based transient error simulator for evaluating resilient system designs (abstract only)","authors":"Chia-Hsiang Chen, Shiming Song, Zhengya Zhang","doi":"10.1145/2435264.2435328","DOIUrl":"https://doi.org/10.1145/2435264.2435328","url":null,"abstract":"Error-resilient designs have become more important with the continued device scaling. One critical challenge of designing error-resilient systems is the lack of tools to quickly and accurately evaluate the effectiveness and performance of such systems. We propose an FPGA-based transient error simulator to accelerate transient error simulations incorporating accurate datapath delay models and realistic error models. Compared to conventional digital error simulators, the FPGA-based transient error simulator operates at a finer time step and captures intricate interactions between errors and datapath under different circuit-level error detection and correction techniques. The error simulator is constructed using configurable datapath delay model and error model, making it general-purpose and widely applicable. We demonstrate the capability of this simulator in the evaluation of two popular error-resilient design techniques, pre-edge and post-edge detection and correction, using a synthesized CORDIC processor and an Alpha processor that operate under soft error, coupling noise and voltage droop models. The proposed error simulator uncovers insights to guide practical designs, including the choice of checking window in pre-edge designs and the optimal operating frequency in post-edge designs. The FPGA-based transient simulation will complement circuit simulation and system emulation for resilient system designs.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"26 1","pages":"271"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85318388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards simulator-like observability for FPGAs: a virtual overlay network for trace-buffers","authors":"Eddie Hung, S. Wilton","doi":"10.1145/2435264.2435272","DOIUrl":"https://doi.org/10.1145/2435264.2435272","url":null,"abstract":"The rising complexity of verification has led to an increase in the use of FPGA prototyping, which can run at significantly higher operating frequencies and achieve much higher coverage than logic simulations. However, a key challenge is observability into these devices, which can be solved by embedding trace-buffers to record on-chip signal values. Rather than connecting a predetermined subset of circuits signals to dedicated trace-buffer inputs at compile-time, in this work we propose that a virtual overlay network is built to multiplex all on-chip signals to all on-chip trace-buffers. Subsequently, at debug-time, the designer can choose a signal subset for observation. To minimize its overhead, we build this network out of unused routing multiplexers, and by using optimal bipartite graph matching techniques, we show that any subset of on-chip signals can be connected to 80-90% of the maximum trace-buffer capacity in less than 50 seconds.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"18 1","pages":"19-28"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88913013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating ncRNA homology search with FPGAs","authors":"Nathaniel McVicar, W. L. Ruzzo, S. Hauck","doi":"10.1145/2435264.2435276","DOIUrl":"https://doi.org/10.1145/2435264.2435276","url":null,"abstract":"Over the last decade, the number of known biologically important non-coding RNAs (ncRNAs) has increased by orders of magnitude. The function performed by a specific ncRNA is partially determined by its structure, defined by which nucleotides of the molecule form pairs. These correlations may span large and variable distances in the linear RNA molecule. Because of these characteristics, algorithms that search for ncRNAs belonging to known families are computationally expensive, often taking many CPU weeks to run. To improve the speed of this search, multiple search algorithms arranged into a series of progressively more stringent filters can be used. In this paper, we present an FPGA based implementation of some of these algorithms. This is the first FPGA based approach to attempt to accelerate multiple filters used in ncRNA search. The FPGA is reconfigured for each filter, resulting in a total system speedup of 25x when compared with a single CPU.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"38 1 1","pages":"43-52"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88205453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An FPGA based parallel architecture for music melody matching","authors":"Hao Wang, Jyh-Charn S. Liu","doi":"10.1145/2435264.2435305","DOIUrl":"https://doi.org/10.1145/2435264.2435305","url":null,"abstract":"We propose an FPGA-based high performance parallel architecture for music retrieval through singing. The database consists of monophonic MIDI files which are modeled into strings, and the user sung query is modeled as a set of regular expressions (regexp), with consideration of possible key transpositions and tempo variations to tolerate imperfectly sung queries. An approximate regexp matching algorithm is developed to calculate the similarity between a regexp and a string, using edit distance as the metric. The algorithm supports user sung queries starting anywhere in the database song, not necessarily from the beginning. Using the proposed formal models and algorithms, the similarity between the user sung query and each song in the database can be evaluated and the top-10 most similar results will be reported. We designed the approximate regexp matching algorithm in such a way that all terms of the regexp can execute concurrently, which perfectly fits the massive parallelism provided by FPGA. The FPGA implemented melody matching engine (MME) is a parameterized modular architecture that can be reconfigured to implement different regexps by simply updating their parameter registers, and can therefore avoid the time-consuming code re-synthesis. MME also includes an on-board DDR2 memory to store the database, so that they can be read in to calculate edit distances locally on the board. This way, each MME forms a self-contained system and multiple MMEs can be clustered to increase parallel processing power, with virtually no overhead. MME is evaluated using the query corpus of ThinkIT with 355 sung files and database of 5563 MIDI files. It achieves a top-10 hit rate of 90.7% and a runtime of 19.4 seconds, averaging 54.6 milliseconds for a single query. MME achieves significant speedup over software-based systems while providing the same level of flexibility.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"4 5 1","pages":"235-244"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78847876","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware acceleration of TEA and XTEA algorithms on FPGA, GPU and multi-core processors (abstract only)","authors":"V. Venugopal, D. Shila","doi":"10.1145/2435264.2435326","DOIUrl":"https://doi.org/10.1145/2435264.2435326","url":null,"abstract":"Field programmable gate arrays (FPGA) are extensively used for rapid prototyping in embedded system applications. While hardware acceleration can be done via specialized processors like a Graphics Processing Unit (GPU), it can also be accomplished with FPGAs for more specialized scenarios. GPUs essentially consist of massively parallel cores and have high memory bandwidth; FPGAs, on the other hand, provide flexibility in terms of customizable I/O and computational resources. In this paper, we explore the usage of GPUs and FPGAs as cryptographic co-processors in streaming dataflow systems with a very high data ingestion rate. Two classic lightweight encryption algorithms, Tiny Encryption Algorithm (TEA) and Extended Tiny Encryption Algorithm (XTEA), are targeted for implementation on GPUs and FPGAs. The GPU implementations of TEA and XTEA in this study show a maximum speedup of 13x over CPU based implementation. The pipelined FPGA implementation is able to realize a throughput of 6-9x more than the GPU for small plaintext sizes.","PeriodicalId":87257,"journal":{"name":"FPGA. ACM International Symposium on Field-Programmable Gate Arrays","volume":"107 1","pages":"270"},"PeriodicalIF":0.0,"publicationDate":"2013-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79347328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}