{"title":"A Coarse Grain Reconfigurable Architecture for sequence alignment problems in bio-informatics","authors":"Pei Liu, A. Hemani","doi":"10.1109/SASP.2010.5521146","DOIUrl":"https://doi.org/10.1109/SASP.2010.5521146","url":null,"abstract":"A Coarse Grain Reconfigurable Architecture (CGRA) tailored for accelerating bio-informatics algorithms is proposed. The key innovation is a lightweight bio-informatics processor that can be reconfigured to perform different Add Compare and Select operations of the popular sequencing algorithms. A programmable and scalable architectural platform instantiates an array of such processing elements, allows arbitrary partitioning and scheduling schemes, and is capable of solving complete sequencing algorithms, including the sequential phases, and of dealing with arbitrarily large sequences. The key difference of the proposed CGRA based solution compared to FPGA and GPU based solutions is a much better match of the architecture and algorithm for the core computational need as well as the system level architectural need. This claim is quantified for three popular sequencing algorithms: the Needleman-Wunsch, Smith-Waterman and HMMER. For the same degree of parallelism, we provide 5X and 15X speed-ups compared to FPGA and GPU respectively. For the same size of silicon, the advantage grows by another factor of 10X.","PeriodicalId":119893,"journal":{"name":"2010 IEEE 8th Symposium on Application Specific Processors (SASP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2010-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121296176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Customized architectures for faster route finding in GPS-based navigation systems","authors":"Jason Loew, D. Ponomarev, P. Madden","doi":"10.1109/SASP.2010.5521148","DOIUrl":"https://doi.org/10.1109/SASP.2010.5521148","url":null,"abstract":"GPS based navigation systems became popular in dedicated handheld devices, and are now also found in modern cell phones, and other small personal devices. A key element of any navigation system is fast and effective route finding, and this depends heavily on Dijkstra's shortest path algorithm. Dijkstra's algorithm is serial in nature; prior efforts to accelerate it through parallel processing have had almost no success. In this paper, we present a practical approach to extract small-scale parallelism by shifting priority queue operations to a secondary tightly-coupled processor. We obtain a substantial speedup on real-world graphs (in particular, road maps), allowing the development of navigation systems that are more responsive, and also lower in total power consumption.","PeriodicalId":119893,"journal":{"name":"2010 IEEE 8th Symposium on Application Specific Processors (SASP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2010-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125081140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A hardware pipeline for accelerating ray traversal algorithms on streaming processors","authors":"Michael Steffen, Joseph Zambreno","doi":"10.1109/SASP.2010.5521150","DOIUrl":"https://doi.org/10.1109/SASP.2010.5521150","url":null,"abstract":"Ray Tracing is a graphics rendering method that uses rays to trace the path of light in a computer model. To accelerate the processing of rays, scenes are typically compiled into smaller spatial boxes using a tree structure and rays then traverse the tree structure to determine relevant spatial boxes. This allows computations involving rays and scene objects to be limited to only objects close to the ray and does not require processing all elements in the computer model. We present a ray traversal pipeline designed to accelerate ray tracing traversal algorithms using a combination of currently used programmable graphics processors and a new fixed hardware pipeline. Our fixed hardware pipeline performs an initial traversal operation that quickly identifies a smaller sized, fixed granularity spatial bounding box from the original scene. This spatial box can then be traversed further to identify subsequently smaller spatial bounding boxes using any user-defined acceleration algorithm. We show that our pipeline allows for an expected level of user programmability, including development of custom data structures, and can support a wide range of processor architectures. The performance of our pipeline is evaluated for ray traversal and intersection stages using a kd-tree ray tracing algorithm and a custom simulator modeling a generic streaming processor architecture. Experimental results show that our pipeline reduces the number of executed instructions on a graphics processor for the traversal operation by 2.15X for visible rays. The memory bandwidth required for traversal is also reduced by a factor of 1.3X for visible rays.","PeriodicalId":119893,"journal":{"name":"2010 IEEE 8th Symposium on Application Specific Processors (SASP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2010-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116997763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CMA: Chip multi-accelerator","authors":"Dominik Auras, Sylvain Girbal, H. Berry, O. Temam, S. Yehia","doi":"10.1109/SASP.2010.5521152","DOIUrl":"https://doi.org/10.1109/SASP.2010.5521152","url":null,"abstract":"Custom acceleration has been a standard choice in embedded systems thanks to the power density and performance efficiency it provides. Parallelism is another orthogonal scalability path that efficiently overcomes the increasing limitation of frequency scaling in current general-purpose architectures. In this paper we propose a multi-accelerator architecture that combines the best of both worlds, parallelism and custom acceleration, while addressing the programmability inconvenience of heterogeneous multiprocessing systems. A Chip Multi-Accelerator (CMA) is a regular parallel architecture where each core is complemented with a custom accelerator to speed up specific functions. Furthermore, by using techniques to efficiently merge more than one custom accelerator together, we are able to cram as many accelerators as needed by the application or a domain of applications. We demonstrate our approach on a Software Defined Radio (SDR) case study. We show that starting from a baseline description of several SDR waveforms and candidate tasks for acceleration, we are able to map the different waveforms on the heterogeneous multi-accelerator architecture while keeping a logical view of a regular multi-core architecture, thus simplifying the mapping of the waveforms onto the multi-accelerator.","PeriodicalId":119893,"journal":{"name":"2010 IEEE 8th Symposium on Application Specific Processors (SASP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2010-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114304128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating DNA analysis applications on GPU clusters","authors":"Antonino Tumeo, Oreste Villa","doi":"10.1109/SASP.2010.5521145","DOIUrl":"https://doi.org/10.1109/SASP.2010.5521145","url":null,"abstract":"DNA analysis is an emerging application of high performance bioinformatics. Modern sequencing machinery is able to provide, in a few hours, large input streams of data which need to be matched against exponentially growing databases of known fragments. The ability to recognize these patterns effectively and quickly may allow extending the scale and the reach of the investigations performed by biology scientists. Aho-Corasick is an exact, multiple pattern matching algorithm often at the base of this application. In this paper we present an efficient implementation of the Aho-Corasick algorithm for high performance clusters accelerated with Graphics Processing Units (GPUs). We discuss how we partitioned and adapted the algorithm to fit the Tesla C1060 GPU and then present an MPI based implementation for a heterogeneous high performance cluster. We compare this implementation to MPI and MPI with pthreads based implementations for a homogeneous cluster of x86 processors, discussing the stability vs. the performance and the scaling of the solutions, taking into consideration aspects such as the bandwidth among the different nodes.","PeriodicalId":119893,"journal":{"name":"2010 IEEE 8th Symposium on Application Specific Processors (SASP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2010-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129492264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design of a custom VEE core in a chip multiprocessor","authors":"Dan Upton, K. Hazelwood","doi":"10.1109/SASP.2010.5521138","DOIUrl":"https://doi.org/10.1109/SASP.2010.5521138","url":null,"abstract":"Chip multiprocessors provide an opportunity for continuing performance growth in the face of limited single-thread parallelism. Although the best design path for such chips remains open, application-specific core designs have shown promise. This work considers the design of an application-specific core for a virtual execution environment. We use Pin, a widely-used dynamic binary instrumentation system, as a representative process-level VEE. Through a combination of microarchitectural simulation and hardware performance counters, we profile the VEE in terms of cache behavior, functional unit usage, and branch predictor behavior, and compare its performance to the performance of benchmark applications. We then show that running the VEE on our specialized core uses up to 15% less power per cycle and up to 5% less energy overall than running the same VEE on a general-purpose core.","PeriodicalId":119893,"journal":{"name":"2010 IEEE 8th Symposium on Application Specific Processors (SASP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2010-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133865970","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient design and generation of a multi-facet arbiter","authors":"J. Jou, Yun-Lung Lee, Sih-Sian Wu","doi":"10.1109/SASP.2010.5521137","DOIUrl":"https://doi.org/10.1109/SASP.2010.5521137","url":null,"abstract":"Based on the arbiter template developed in [1], we present an efficient, modular, and scalable decentralized parallel design of a new multi-facet arbiter. Moreover, with this modular and reusable hardware design, we have implemented a parametric arbiter generator that automatically generates various multi-facet arbiters. With the decentralized parallel design and the generator, not only the fastest and smallest round-robin arbiter but also other types of arbiters were designed and generated on the fly. Experimental results are given to show the designs' excellent performance.","PeriodicalId":119893,"journal":{"name":"2010 IEEE 8th Symposium on Application Specific Processors (SASP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2010-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133097080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"I-cache configurability for temperature reduction through replicated cache partitioning","authors":"M. Paul, Peter Petrov","doi":"10.1109/SASP.2010.5521143","DOIUrl":"https://doi.org/10.1109/SASP.2010.5521143","url":null,"abstract":"On-chip caches have been known to be a major contributor to leakage power as they occupy a sizable fraction of the chip's real estate and as such have been the target of power optimization techniques. However, many of these techniques do not consider the effects of temperature on leakage power and can hence be suboptimal since leakage power rises rapidly with temperature. When large fractions of the cache are disabled and only a small partition is used, the power density increases significantly which leads to increased temperature and leakage. We propose a temperature reduction methodology that leverages recently introduced configurable caches, in order to not only assign to the task a cache partition commensurate to its current demand but also to minimize the associated power density and temperature. In order to counteract the effect of elevated power density and achieve temperature reductions, in the proposed technique each such cache partition is replicated and only one of the replicas is active at any time. The inactive partition replicas are placed into a low-power drowsy mode while the primary partition services the task's instruction requests. By periodically switching the task's association between replica cache partitions, the power density and hence the temperature are reduced.","PeriodicalId":119893,"journal":{"name":"2010 IEEE 8th Symposium on Application Specific Processors (SASP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2010-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134622081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A dynamically reconfigurable asynchronous processor","authors":"Khodor Ahmad Fawaz, T. Arslan, S. Khawam, M. Muir, I. Nousias, Iain A. B. Lindsay, A. Erdogan","doi":"10.1109/SASP.2010.5521141","DOIUrl":"https://doi.org/10.1109/SASP.2010.5521141","url":null,"abstract":"The main design requirements for high-throughput mobile applications are energy efficiency and programmability. This paper presents a novel dynamically reconfigurable processor that targets these requirements. Our processor consists of a heterogeneous array of coarse grain asynchronous cells. The architecture maintains most of the benefits of custom asynchronous design, while also providing programmability via conventional high-level languages. Results show that our processor delivers considerably lower power consumption when compared to a market leading VLIW and a low-power ARM processor, while maintaining their throughput performance. For example, our processor resulted in a reduction in power consumption over the ARM7 processor of over 9 times when running the bilinear demosaicing algorithm at the same throughput. Our processor was also compared to an equivalent synchronous design, resulting in a power reduction of up to 15%.","PeriodicalId":119893,"journal":{"name":"2010 IEEE 8th Symposium on Application Specific Processors (SASP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2010-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128338758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA and GPU implementation of large scale SpMV","authors":"Yi Shan, Tianji Wu, Yu Wang, Bo Wang, Zilong Wang, Ningyi Xu, Huazhong Yang","doi":"10.1109/SASP.2010.5521144","DOIUrl":"https://doi.org/10.1109/SASP.2010.5521144","url":null,"abstract":"Sparse matrix-vector multiplication (SpMV) is a fundamental operation for many applications. Many studies have been done to implement the SpMV on different platforms, while few works have focused on very large scale datasets with millions of dimensions. This paper addresses the challenges of implementing large scale SpMV with FPGA and GPU in the application of web link graph analysis. In the FPGA implementation, we designed the task partition and memory hierarchy according to an analysis of the datasets' scale and their access patterns. In the GPU implementation, we designed a fast and scalable SpMV routine with three passes, using a modified Compressed Sparse Row format. Results show that the FPGA and GPU implementations achieve about 29x and 30x speedup on a StratixII EP2S180 FPGA and a Radeon 5870 graphics card, respectively, compared with a Phenom 9550 CPU.","PeriodicalId":119893,"journal":{"name":"2010 IEEE 8th Symposium on Application Specific Processors (SASP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2010-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130114735","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}