{"title":"MKPipe: a compiler framework for optimizing multi-kernel workloads in OpenCL for FPGA","authors":"Ji Liu, A. Kafi, Xipeng Shen, Huiyang Zhou","doi":"10.1145/3392717.3392757","DOIUrl":"https://doi.org/10.1145/3392717.3392757","url":null,"abstract":"OpenCL for FPGA enables developers to design FPGAs using a programming model similar for processors. Recent works have shown that code optimization at the OpenCL level is important to achieve high computational efficiency. However, existing works either focus primarily on optimizing single kernels or solely depend on channels to design multi-kernel pipelines. In this paper, we propose a source-to-source compiler framework, MKPipe, for optimizing multi-kernel workloads in OpenCL for FPGA. Besides channels, we propose new schemes to enable multi-kernel pipelines. Our optimizing compiler employs a systematic approach to explore the tradeoffs of these optimizations methods. To enable more efficient overlapping between kernel execution, we also propose a novel workitem/workgroup-id remapping technique. Furthermore, we propose new algorithms for throughput balancing and resource balancing to tune the optimizations upon individual kernels in the multi-kernel workloads. Our results show that our compiler-optimized multi-kernels achieve up to 3.6x (1.4x on average) speedup over the baseline, in which the kernels have already been optimized individually.","PeriodicalId":346687,"journal":{"name":"Proceedings of the 34th ACM International Conference on Supercomputing","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125703610","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A scalable framework for solving fractional diffusion equations","authors":"Max Carlson, R. Kirby, H. Sundar","doi":"10.1145/3392717.3392769","DOIUrl":"https://doi.org/10.1145/3392717.3392769","url":null,"abstract":"The study of fractional order differential operators (involving non-integer derivative terms) is receiving renewed attention in many scientific fields from photonics to speech modeling. While numerous scalable codes exist for solving integer-order partial differential equations (PDEs), the same is not true for fractional order PDEs. Therefore, there is a need for highly scalable numerical methods and codes for solving fractional order PDEs on complex geometries. The key challenge is that most approaches for fractional PDEs have at least quadratic complexity in both storage and compute, and are challenging to scale. We present a scalable framework for solving fractional diffusion equations using the method of eigen-function expansion. This includes a scalable parallel algorithm to efficiently compute the full set of eigenvalues and eigenvectors for a discretized Laplace eigenvalue problem and apply them to construct approximate solutions to fractional order model problems. We demonstrate the efficacy of our methods by performing strong and weak scalability tests using complex geometries on TACC's Frontera compute cluster. We also show that our approach compares favorably against existing dense and sparse solvers. In our largest solve, we estimated half a million eigenpairs using 28,672 cores.","PeriodicalId":346687,"journal":{"name":"Proceedings of the 34th ACM International Conference on Supercomputing","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131485433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AMOEBA: a coarse grained reconfigurable architecture for dynamic GPU scaling","authors":"Xianwei Cheng, Hui Zhao, M. Kandemir, Beilei Jiang, Gayatri Mehta","doi":"10.1145/3392717.3392738","DOIUrl":"https://doi.org/10.1145/3392717.3392738","url":null,"abstract":"Different GPU applications exhibit varying scalability patterns with network-on-chip (NoC), coalescing, memory and control divergence, and L1 cache behavior. A GPU consists of several Streaming Multi-processors (SMs) that collectively determine how shared resources are partitioned and accessed. Recent years have seen divergent paths in SM scaling towards scale-up (fewer, larger SMs) vs. scale-out (more, smaller SMs). However, neither scaling up nor scaling out can meet the scalability requirement of all applications running on a given GPU system, which inevitably results in performance degradation and resource under-utilization for some applications. In this work, we investigate major design parameters that influence GPU scaling. We then propose AMOEBA, a solution to GPU scaling through reconfigurable SM cores. AMOEBA monitors and predicts application scalability at run-time and adjusts the SM configuration to meet program requirements. AMOEBA also enables dynamic creation of heterogeneous SMs through independent fusing or splitting. AMOEBA is a microarchitecture-based solution and requires no additional programming effort or custom compiler support. Our experimental evaluations with application programs from various benchmark suites indicate that AMOEBA is able to achieve a maximum performance gain of 4.3x, and generates an average performance improvement of 47% when considering all benchmarks tested.","PeriodicalId":346687,"journal":{"name":"Proceedings of the 34th ACM International Conference on Supercomputing","volume":"218 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132725685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing supercompilers for supercomputers","authors":"M. Wolfe","doi":"10.1145/3392717.3400034","DOIUrl":"https://doi.org/10.1145/3392717.3400034","url":null,"abstract":"Between a problem statement and its solution as a computer simulation are several steps, from choosing a method, writing a program, compiling to machine code, making runtime decisions, and hardware execution. Here we will look at the middle three decision points. What decisions should be and must be left to the programmer? What decisions should be and must be relegated to a compiler? What decisions should be and must be left until runtime? Given my background, I will focus a great deal on the importance of compilers in supercomputing, and compare and contrast the advantages and impacts of compiler solutions to the \"Performance + Portability + Productivity\" problem with language and runtime solutions.","PeriodicalId":346687,"journal":{"name":"Proceedings of the 34th ACM International Conference on Supercomputing","volume":"133 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1989-03-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131899995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}