Rafael J. N. Silva, Brunno F. Goldstein, Leandro Santiago, A. Sena, L. A. J. Marzulo, Tiago A. O. Alves, F. França
{"title":"Task Scheduling in Sucuri Dataflow Library","authors":"Rafael J. N. Silva, Brunno F. Goldstein, Leandro Santiago, A. Sena, L. A. J. Marzulo, Tiago A. O. Alves, F. França","doi":"10.1109/SBAC-PADW.2016.15","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2016.15","url":null,"abstract":"Sucuri is a minimalistic Python library that provides dataflow programming through a reasonably simple syntax. It allows transparent execution on computer clusters and natural exploitation of parallelism. In Sucuri, programmers instantiate a dataflow graph, where each node is assigned to a function and edges represent data dependencies between nodes. The original implementation of Sucuri adopts a centralized scheduler, which incurs high communication overheads, especially in clusters with a large number of machines. In this paper we modify Sucuri so that each machine in a cluster has its own scheduler. Before execution, the dataflow graph is partitioned so that nodes can be distributed among the machines of the cluster. At runtime, idle workers grab tasks from a ready queue in their local scheduler. Experimental results confirm that the solution can reduce communication overheads, improving performance in larger clusters.","PeriodicalId":186179,"journal":{"name":"2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127015762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Optimization for SpMV on Multi-GPU Systems Using Threads and Multiple Streams","authors":"Ping Guo, Changjiang Zhang","doi":"10.1109/SBAC-PADW.2016.20","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2016.20","url":null,"abstract":"Sparse matrix-vector multiplication (SpMV) is a key operation in scientific computing and engineering applications. This paper presents an optimization strategy to improve SpMV performance on multi-GPU systems by adopting OpenMP threads and multiple CUDA streams. We propose an efficient scheme that uses OpenMP threads to control multiple GPUs so that they jointly complete SpMV computations. Moreover, we adopt a streamed approach to increase concurrency and further improve SpMV performance. In our paper, we use HYB (Hybrid ELL/COO), a hybrid sparse storage format, to demonstrate the effectiveness of our proposed approach. Our experimental results show that our approach achieves an average speedup of 3.80 over the existing SpMV implementation on a single GPU.","PeriodicalId":186179,"journal":{"name":"2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125174005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards a GPU Abstraction for Lua","authors":"Raphael Ribeiro, Paulo Motta","doi":"10.1109/SBAC-PADW.2016.11","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2016.11","url":null,"abstract":"The use of GPUs for accelerating parallel applications is a consolidated approach. However, it is still difficult to write applications for this type of hardware, which is mostly done in compiled languages like C. Some effort has been employed to provide developers with libraries and frameworks for interpreted languages to be able to take advantage of the computing capabilities of GPUs. In this context we created a hardware abstraction for the Lua programming language that uses the facilities of the LLVM project to compile part of the application to the GPU native format, while the rest of the application remains in interpreted Lua. This main application controls the GPU device through a library and loads the compiled function for execution; in the end, it may retrieve the results from the device. This compilation of the Lua function code into a GPU kernel is done in a transparent fashion, allowing the user to access the underlying hardware without the complexities of traditional GPU programming.","PeriodicalId":186179,"journal":{"name":"2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122559831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Synchronization-Free Automatic Parallelization for Arbitrarily Nested Affine Loops","authors":"T. Klimek, M. Pałkowski, W. Bielecki","doi":"10.1109/SBAC-PADW.2016.16","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2016.16","url":null,"abstract":"This paper presents a new approach for extracting synchronization-free parallelism available in program loop nests. The approach allows for extracting parallelism for arbitrarily nested parametric loop nests, where the loop bounds and data accesses are affine functions of loop indices and symbolic parameters. Parallelization is realized using the transitive closure of a dependence graph. Speed-up of parallel code produced by means of the approach is studied using the NAS benchmark suite. Parallelism of loop nests is obtained by creating a kernel of computations represented in the OpenMP standard to be executed independently on multi-core computers. Results of an experimental study carried out by means of the many integrated core architecture Intel Xeon Phi are discussed.","PeriodicalId":186179,"journal":{"name":"2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126356204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Efficient 2D Router Architecture for Extending the Performance of Inhomogeneous 3D NoC-Based Multi-Core Architectures","authors":"Michael Opoku Agyeman, W. Zong","doi":"10.1109/SBAC-PADW.2016.22","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2016.22","url":null,"abstract":"To meet the performance and scalability demands of the fast-paced technological growth towards exascale and Big-Data processing, given the performance bottleneck of conventional metal-based interconnects, alternative interconnect fabrics such as inhomogeneous three-dimensional integrated Network-on-Chip (3D NoC) have emerged as a cost-effective solution for emerging multi-core design. However, these interconnects trade off optimized performance for cost by restricting the number of area- and power-hungry 3D routers. Consequently, in this paper, we propose a low-latency adaptive router with a low-complexity single-cycle bypassing mechanism to alleviate the performance degradation due to the slow 2D routers in inhomogeneous 3D NoCs. By combining the low-complexity bypassing technique with adaptive routing, the proposed router is able to balance the traffic in the network to reduce the average packet latency under various traffic loads. Simulation shows that the proposed router can reduce the average packet delay by an average of 45% in 3D NoCs.","PeriodicalId":186179,"journal":{"name":"2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131463621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dataflow to Hardware Synthesis Framework on FPGAs","authors":"Youngsoo Kim, Shrikant S. Jadhav, C. Gloster","doi":"10.1109/SBAC-PADW.2016.24","DOIUrl":"https://doi.org/10.1109/SBAC-PADW.2016.24","url":null,"abstract":"We present a dataflow based performance estimation and synthesis framework that will help hardware designers quantify the algorithm performance and synthesize their HW designs onto Field Programmable Gate Arrays (FPGAs). Typically, Digital Signal Processing (DSP) systems are designed by making gradual architectural choices in HW refinement steps. These decisions are based on performance quantification by high level DSP algorithm developers and HW implementation engineers. The main obstacle to this refinement is the provision of reasonably correct performance estimations to guide HW designers in Design Space Exploration (DSE) at an early stage. HW designers face challenges when they need to quantify the performance of their designs, especially when resources are limited. We use dataflow models by describing their hardware detail only as necessary. Dataflow based performance estimation achieves the efficient generation of qualitative and quantitative parameters for the assessment of HW candidates. Reconfigurable logic can be used to off-load the primary computational kernel onto a custom computing machine in order to reduce execution time by an order of magnitude as compared to kernel execution on a general purpose processor. Specifically, FPGAs can be used to accelerate these kernels using hardware-based custom logic implementations. In this paper, we demonstrate a framework for algorithm acceleration from the dataflow to synthesized HDL design. Experimental results show a linear speedup by adding reasonably small processing elements in FPGA as opposed to using a software implementation running on a typical general purpose processor.","PeriodicalId":186179,"journal":{"name":"2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116655096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}