Yuan Tang, R. You, Haibin Kan, Jesmin Jahan Tithi, P. Ganapathi, R. Chowdhury
{"title":"Improving Parallelism of Recursive Stencil Computations without Sacrificing Cache Performance","authors":"Yuan Tang, R. You, Haibin Kan, Jesmin Jahan Tithi, P. Ganapathi, R. Chowdhury","doi":"10.1145/2686745.2686752","DOIUrl":"https://doi.org/10.1145/2686745.2686752","url":null,"abstract":"The state-of-the-art \"trapezoidal decomposition algorithm\" for stencil computations on modern multicore machines uses recursive divide-and-conquer (DAC) to achieve asymptotically optimal cache complexity cache-obliviously. But the same DAC approach restricts parallelism by introducing artificial dependencies among subtasks in addition to those arising from the defining stencil equations. As a result, the trapezoidal decomposition algorithm has suboptimal parallelism. In this paper we present a variant of the parallel trapezoidal decomposition algorithm called \"cache-oblivious wavefront\" (COW) that starts execution of recursive subtasks earlier than the start time prescribed by the original algorithm without violating any real dependencies implied by the underlying recurrences, thus reducing serialization due to artificial dependencies. The reduction in serialization leads to an improvement in parallelism. Moreover, since we do not change the DAC-based decomposition of tasks used in the original algorithm, cache performance does not suffer. 
We provide experimental measurements of absolute running times, burdened span by Cilkview, and L1/L2 cache misses by PAPI to validate our claims.","PeriodicalId":367066,"journal":{"name":"Proceedings of the Second Workshop on Optimizing Stencil Computations","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127548024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
P. Basu, Samuel Williams, Brian Van Straalen, L. Oliker, Mary W. Hall
{"title":"Converting Stencils to Accumulations for Communication-Avoiding Optimization in Geometric Multigrid","authors":"P. Basu, Samuel Williams, Brian Van Straalen, L. Oliker, Mary W. Hall","doi":"10.1145/2686745.2686749","DOIUrl":"https://doi.org/10.1145/2686745.2686749","url":null,"abstract":"This paper describes a compiler transformation on stencil operators that automatically converts a standard stencil representation into an accumulation. We use this as an enabling transformation to optimize the stencil operators in the context of Geometric Multigrid (GMG), a widely used method to solve partial differential equations. GMG has four stencil operators, the smoother, residual, restriction, and interpolation, some of which require inter-process and inter-thread communication. This new optimization allows us, at each level of a GMG V-Cycle, to fuse all operators when recursing down the V-Cycle, and all smooth operations when returning up the V-Cycle. In turn, this fusion allows us to create a parallel wavefront across the fused operators that reduces communication. Thus, these combined optimizations reduce vertical (through the memory hierarchy) data movement and horizontal (inter-thread and inter-process) messages and synchronization.","PeriodicalId":367066,"journal":{"name":"Proceedings of the Second Workshop on Optimizing Stencil Computations","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129019059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stencils in Scientific Computations","authors":"A. Dubey","doi":"10.1145/2686745.2686756","DOIUrl":"https://doi.org/10.1145/2686745.2686756","url":null,"abstract":"Stencils occur in many areas, but they are ubiquitous in scientific computing. They range from the simple Jacobi iterations to the extremely complex ones used in the solution of highly nonlinear partial differential equations (PDE). High level programming languages typically used in implementation of scientific software, by not providing explicit support for stencils, force each implementation to make choices about expressing its specifics such as dimensionality, data layout, order of access and order of operations. These choices often hide the opportunity for optimizations from the compilers. There have been attempts to provide abstractions for simpler stencils, and they have met with success in some areas, but multiphysics scientific applications present challenges that cannot be met by simple stencil abstractions. The applications may have hierarchy, or non-uniformity, or both in their discretizations which cannot be expressed by stencils describing uniform discretizations. The physics operators being applied may be non-linear, which would demand composability of stencils. As the order of the solution method increases, the size and the reach of the stencil also increase, and there may be conditions that imply the application of the stencil to an arbitrary subset of the discretized points. And finally, if there are multiple steps involved in an update, intermediate results need to be managed. AMR Shift Calculus (Phil Colella and Brian Van Straalen 2014) provides a generalized abstraction that addresses many of these concerns. It provides a means of expressing stencil computations in the form of a collection of shift operations combined with associated weights, that can be applied to a specified collection of discretized points. 
The shift calculus also addresses the hierarchy in the discretization, and defines operators on stencils that allow more complex stencils to be composed from simpler ones. Because the shift calculus makes it possible to express the computation concisely and precisely, it gets around the problem of false dependencies. Additionally, the composability of the stencil operators exposes possibilities of loop or even function fusion, and the granularity for holding intermediate values to the compiler for better optimization opportunities. The included slide presentation is organized in five sections. The first section gives examples of discretization from simple Poisson to complex compressible Navier-Stokes (CNS) equations and addresses the level of abstraction needed to express the computations on these discretizations. The second section outlines several challenges that are unique to scientific applications, and the ways in which many abstractions that have proved useful elsewhere fail to work with scientific computing. The third section goes on to describe the AMR shift calculus with emphasis on features that are typically not found in other approaches to stencil-based abstractions, but are necessary for the solving complex PDE'","PeriodicalId":367066,"journal":{"name":"Proceedings of the Second Workshop on Optimizing Stencil Computations","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133881372","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"StenSAL: A Single Assignment Language for Relentlessly Executing Explicit Stencil Algorithms","authors":"Lucas A. Wilson, Jeffery von Ronne","doi":"10.1145/2686745.2686747","DOIUrl":"https://doi.org/10.1145/2686745.2686747","url":null,"abstract":"Many different scientific domains make use of stencil-based algorithms to solve mathematical equations for computational modeling and simulation. Existing imperative languages map well onto physical hardware, but can be difficult for domain scientists to map to mathematical stencil algorithms. StenSAL is a domain specific language which is tailored to the expression of explicit stencil algorithms through deterministic tasks chained together through single assignment data dependencies, and generates programs that map to the relentless execution model of computation. We provide a description of the StenSAL language and grammar, some of the sanity checks that can be performed on StenSAL programs before code generation, and how the compiler translates StenSAL into Python.","PeriodicalId":367066,"journal":{"name":"Proceedings of the Second Workshop on Optimizing Stencil Computations","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116542731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fabian Dütsch, K. Djelassi, Michael Haidl, S. Gorlatch
{"title":"HLSF: A High-Level; C++-Based Framework for Stencil Computations on Accelerators","authors":"Fabian Dütsch, K. Djelassi, Michael Haidl, S. Gorlatch","doi":"10.1145/2686745.2686751","DOIUrl":"https://doi.org/10.1145/2686745.2686751","url":null,"abstract":"The development of programs for modern systems with GPUs and other accelerators is a complex and error-prone task. The popular GPU programming approaches like CUDA and OpenCL require a deep knowledge of the underlying architecture to achieve good performance. We present HLSF -- a high-level framework that greatly simplifies the development of stencil-based applications on systems with accelerators. The main novel features of HLSF are as follows: 1) it provides a high-level interface for stencils that hides from the programmer the low-level management of the parallelism and memory on accelerators; 2) it allows the developer to write programs in the pure C++ style, using all convenient features of the most recent C++14 standard. Our experimental evaluation shows that the framework significantly reduces the programming effort for stencil-based applications, while delivering performance competitive to CUDA and OpenCL.","PeriodicalId":367066,"journal":{"name":"Proceedings of the Second Workshop on Optimizing Stencil Computations","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132396439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Extensible Framework for Composing Stencils with Common Scientific Computing Patterns","authors":"L. Truong, Chick Markley, A. Fox","doi":"10.1145/2686745.2686750","DOIUrl":"https://doi.org/10.1145/2686745.2686750","url":null,"abstract":"The SEJITS framework supports creating embedded domain-specific languages (DSELs) and code generators, a pair of which is called a specializer, with much less effort than creating a full DSL compiler---typically just a few hundred lines of code. SEJITS' main benefit is allowing application writers to stay entirely in high-level languages such as Python by using specialized Python functions (that is, functions written in one of the Python-embedded DSELs) to generate code that runs at native speed. One existing SEJITS DSEL is Sepya [10], a Python DSEL for stencil computations that generates OpenMP and Cilk+ code competitive with existing DSL compilers such as Pochoir and Halide. We extend Sepya to generate OpenCL code for targeting GPUs, and in the process, extend SEJITS with support for meta-specializers, whose job is to enable and optimize the composition of existing specializers written by third parties. In this work, we demonstrate meta-specialization by detecting and removing extraneous data copies to and from the GPU to compose multiple specializer calls (stencil and non-stencil). We also explore the variants of loop fusion to further improve performance of composing these operations. The performance of the generated stencil code is 20x faster than SciPy and competitive with existing stencil DSELs on realistic code excerpts. Since meta-specializers must compose and optimize specializers created by third parties, we extend SEJITS with support for meta-specializer hooks, allowing existing specializers to be incrementally enabled for meta-specialization without breaking backwards compatibility. 
The Sepya and SEJITS extensions together extend the range of platforms for which highly optimized code can be generated and open new possibilities for optimizing the composition of existing specializers.","PeriodicalId":367066,"journal":{"name":"Proceedings of the Second Workshop on Optimizing Stencil Computations","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127265872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the Second Workshop on Optimizing Stencil Computations","authors":"Saman P. Amarasinghe, S. Kamil, P. Sadayappan","doi":"10.1145/2686745","DOIUrl":"https://doi.org/10.1145/2686745","url":null,"abstract":"It is our great pleasure to welcome you to the second Workshop on Optimizing Stencil Computations (WOSC). We are happy to report that the overall quality of this year's submissions was high, and that the resulting papers span a spectrum of issues and ideas for stencil optimization. In addition to the paper submissions, this year we have four invited talks from the perspective of applications that use stencil computations, spanning image processing, scientific computing, and physical simulation. These talks, along with the panel discussion, help bridge the gap between those optimizing stencils and those using them. Above and beyond the formal program, we hope that this year's workshop will further serve as a venue to exchange ideas, to start collaborations, and to bring the various communities working to optimize stencil computations together for interesting discussions and new ideas.","PeriodicalId":367066,"journal":{"name":"Proceedings of the Second Workshop on Optimizing Stencil Computations","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130064667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Eunjung Park, Christos Kartsaklis, T. Janjusic, John Cavazos
{"title":"Trace-Driven Memory Access Pattern Recognition in Computational Kernels","authors":"Eunjung Park, Christos Kartsaklis, T. Janjusic, John Cavazos","doi":"10.1145/2686745.2686748","DOIUrl":"https://doi.org/10.1145/2686745.2686748","url":null,"abstract":"Classifying memory access patterns is paramount to the selection of the right set of optimizations and determination of the parallelization strategy. Static analyses suffer from ambiguities present in source code, which modern compilation techniques, such as profile-guided optimization, alleviate by observing runtime behavior and feeding back into the compilation flow. This paper discusses a dynamic analysis technique for recognizing memory access patterns, with application to the stencils domain, and presents our design and C++ implementation using the memory-tracing tool Gleipnir. Finally, we evaluate and discuss the performance and matching capability of our classifiers in the context of the Polybench scientific benchmark suite, which includes both stencil and matrix computations.","PeriodicalId":367066,"journal":{"name":"Proceedings of the Second Workshop on Optimizing Stencil Computations","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115188615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Nanoblock Unroll: Towards the Automatic Generation of Stencil Codes with the Optimal Performance","authors":"T. Muranushi, Keigo Nitadori, J. Makino","doi":"10.1145/2686745.2686746","DOIUrl":"https://doi.org/10.1145/2686745.2686746","url":null,"abstract":"A number of automatic code generation systems have been proposed for stencil computations on modern parallel computers. However, the codes they generate are rather inefficient. Typically they achieve < 10% of the peak performance of the platforms. The primary cause for this inefficiency is that the generated codes contain several layers of array indices for array accesses. These layers of indices prevent the compiler from generating efficient assembly codes. In this paper we propose a new approach for the automatic code generation in which the generated code is \"compiler-friendly\", in the sense that the compilers can generate more highly optimized assembly codes than for typical automatically generated codes. We demonstrate the effectiveness of our approach with a simple example of a diffusion equation on a small grid. The measured efficiency can reach 85% of the theoretical peak.","PeriodicalId":367066,"journal":{"name":"Proceedings of the Second Workshop on Optimizing Stencil Computations","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131538487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}