{"title":"SIMD code generation for stencils on brick decompositions","authors":"Tuowen Zhao, Mary W. Hall, P. Basu, Samuel Williams, H. Johansen","doi":"10.1145/3178487.3178537","DOIUrl":"https://doi.org/10.1145/3178487.3178537","url":null,"abstract":"We present a stencil library and associated compiler code generation framework designed to maximize performance on higher-order stencil computations through the use of two main technologies: a fine-grained brick data layout designed to exploit the inherent multidimensional spatial locality endemic to stencil computations, and a vector scatter associative reordering transformation that reduces vector loads and alignment operations and exposes opportunities for the backend compiler to reduce computation. For a range of stencil computations, we compare the generated code expressed in the brick library to the standard tiled code. We attain up to a 7.2X speedup on the most complex stencils when running on an Intel Knights Landing (Xeon Phi) processor.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128122242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Designing scalable FPGA architectures using high-level synthesis","authors":"J. D. F. Licht, Michaela Blott, T. Hoefler","doi":"10.1145/3178487.3178527","DOIUrl":"https://doi.org/10.1145/3178487.3178527","url":null,"abstract":"Massive spatial parallelism at low energy gives FPGAs the potential to be core components in large scale high performance computing (HPC) systems. In this paper we present four major design steps that harness high-level synthesis (HLS) to implement scalable spatial FPGA algorithms. To aid productivity, we introduce the open source library hlslib to complement HLS. We evaluate kernels designed with our approach on an FPGA accelerator board, demonstrating high performance and board utilization with enhanced programmer productivity. By following our guidelines, programmers can use HLS to develop efficient parallel algorithms for FPGA, scaling their implementations with increased resources on future hardware.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124827319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transparent GPU memory management for DNNs","authors":"Jungho Park, Hyungmin Cho, Wookeun Jung, Jaejin Lee","doi":"10.1145/3178487.3178531","DOIUrl":"https://doi.org/10.1145/3178487.3178531","url":null,"abstract":"Modern DNN frameworks exploit GPU acceleration by default to achieve high performance. The limitation of GPU memory capacity becomes a serious problem because DNNs are becoming deeper and larger. This paper proposes a purely software-based transparent solution, called tvDNN, to the GPU memory capacity problem. It is based on GPU memory swapping and memory object sectioning techniques. It also provides an efficient memory-object swapping schedule based on ILP (optimal) and heuristics (suboptimal). The experimental results show that tvDNN enables Caffe to build VGG-16 with a large batch size, such as 256 or 512, using a few GB of GPU memory without significant performance degradation.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128944092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cache-tries: concurrent lock-free hash tries with constant-time operations","authors":"Aleksandar Prokopec","doi":"10.1145/3178487.3178498","DOIUrl":"https://doi.org/10.1145/3178487.3178498","url":null,"abstract":"Concurrent non-blocking hash tries have good cache locality and horizontally scalable operations. However, operations on most existing concurrent hash tries run in O(log n) time. In this paper, we show that concurrent hash trie operations can run in expected constant time. We present a novel lock-free concurrent hash trie design that exerts less pressure on the memory allocator. This hash trie is augmented with a quiescently consistent cache, which permits the basic operations to run in expected O(1) time. We show a statistical analysis for the constant-time bound, which, to the best of our knowledge, is the first such proof for hash tries. We also prove the safety, lock-freedom and linearizability properties. On typical workloads, our implementation demonstrates up to 5X performance improvements with respect to the previous hash trie variants.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125505964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Stamp-it, amortized constant-time memory reclamation in comparison to five other schemes","authors":"Manuel Pöter, J. Träff","doi":"10.1145/3178487.3178532","DOIUrl":"https://doi.org/10.1145/3178487.3178532","url":null,"abstract":"The memory reclamation problem is to determine, for any given allocated memory node, when there are no more references to the node, allowing it to be safely returned to the memory management system. In a concurrent context, the memory reclamation problem is highly non-trivial, since there may be more than one thread referencing an allocated node unbeknownst to the other threads.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125883638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FlashR: parallelize and scale R for machine learning using SSDs","authors":"Da Zheng, Disa Mhembere, J. Vogelstein, C. Priebe, R. Burns","doi":"10.1145/3178487.3178501","DOIUrl":"https://doi.org/10.1145/3178487.3178501","url":null,"abstract":"R is one of the most popular programming languages for statistics and machine learning, but it is slow and unable to scale to large datasets. The general approach for having an efficient algorithm in R is to implement it in C or FORTRAN and provide an R wrapper. FlashR accelerates and scales existing R code by parallelizing a large number of matrix functions in the R base package and scaling them beyond memory capacity with solid-state drives (SSDs). FlashR performs memory hierarchy aware execution to speed up parallelized R code by (i) evaluating matrix operations lazily, (ii) performing all operations in a DAG in a single execution and with only one pass over data to increase the ratio of computation to I/O, and (iii) performing two levels of matrix partitioning and reordering computation on matrix partitions to reduce data movement in the memory hierarchy. We evaluate FlashR on various machine learning and statistics algorithms on inputs of up to four billion data points. Despite the huge performance gap between SSDs and RAM, FlashR on SSDs closely tracks the performance of FlashR in memory for many algorithms. The R implementations in FlashR outperform H2O and Spark MLlib by a factor of 3 -- 20.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132907451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance challenges in modular parallel programs","authors":"Umut A. Acar, V. Aksenov, A. Charguéraud, Mike Rainey","doi":"10.1145/3178487.3178516","DOIUrl":"https://doi.org/10.1145/3178487.3178516","url":null,"abstract":"Over the past decade, many programming languages and systems for parallel computing have been developed, including Cilk, Fork/Join Java, Habanero Java, Parallel Haskell, Parallel ML, and X10. Although these systems raise the level of abstraction at which parallel code is written, performance continues to require the programmer to perform extensive optimizations and tuning, often by taking various architectural details into account. One such key optimization is granularity control, which requires the programmer to determine when and how parallel tasks should be sequentialized. In this paper, we briefly describe some of the challenges associated with automatic granularity control when trying to achieve portable performance for parallel programs with arbitrary nesting of parallel constructs. We consider a result from the functional-programming community, whose starting point is to consider an \"oracle\" that can predict the work of parallel codes, and thereby control granularity. We discuss the challenges in implementing such an oracle and proving that it has the desired theoretical properties under the nested-parallel programming model.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133279984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Juggler: a dependence-aware task-based execution framework for GPUs","authors":"M. Belviranli, Seyong Lee, J. Vetter, L. Bhuyan","doi":"10.1145/3178487.3178492","DOIUrl":"https://doi.org/10.1145/3178487.3178492","url":null,"abstract":"Scientific applications with single instruction, multiple data (SIMD) computations show considerable performance improvements when run on today's graphics processing units (GPUs). However, the existence of data dependences across thread blocks may significantly impact the speedup by requiring global synchronization across multiprocessors (SMs) inside the GPU. To efficiently run applications with interblock data dependences, we need fine-granular task-based execution models that will treat SMs inside a GPU as stand-alone parallel processing units. Such a scheme will enable faster execution by utilizing all internal computation elements inside the GPU and eliminating unnecessary waits during device-wide global barriers. In this paper, we propose Juggler, a task-based execution scheme for GPU workloads with data dependences. The Juggler framework takes applications embedding OpenMP 4.5 tasks as input and executes them on the GPU via an efficient in-device runtime, hence eliminating the need for kernel-wide global synchronization. Juggler requires little or no modification to the source code, and once launched, the runtime runs entirely on the GPU without relying on the host through the entire execution. We have evaluated Juggler on an NVIDIA Tesla P100 GPU and obtained up to 31% performance improvement over a global-barrier-based implementation, with minimal runtime overhead.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121204093","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interval-based memory reclamation","authors":"Haosen Wen, Joseph Izraelevitz, Wentao Cai, H. A. Beadle, M. Scott","doi":"10.1145/3178487.3178488","DOIUrl":"https://doi.org/10.1145/3178487.3178488","url":null,"abstract":"In this paper we present interval-based reclamation (IBR), a new approach to safe reclamation of disconnected memory blocks in nonblocking concurrent data structures. Safe reclamation is a difficult problem: a thread, before freeing a block, must ensure that no other threads are accessing that block; the required synchronization tends to be expensive. In contrast with epoch-based reclamation, in which threads reserve all blocks created after a certain time, or pointer-based reclamation (e.g., hazard pointers), in which threads reserve individual blocks, IBR allows a thread to reserve all blocks known to have existed in a bounded interval of time. By comparing a thread's reserved interval with the lifetime of a detached but not yet reclaimed block, the system can determine if the block is safe to free. Like hazard pointers, IBR avoids the possibility that a single stalled thread may reserve an unbounded number of blocks; unlike hazard pointers, it avoids a memory fence on most pointer-following operations. It also avoids the need to explicitly \"unreserve\" a no-longer-needed pointer. We describe three specific IBR schemes (one with several variants) that trade off performance, applicability, and space requirements. IBR requires no special hardware or OS support. In experiments with data structure microbenchmarks, it also compares favorably (in both time and space) to other state-of-the-art approaches, making it an attractive alternative for libraries of concurrent data structures.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124986435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An effective fusion and tile size model for optimizing image processing pipelines","authors":"Abhinav Jangda, Uday Bondhugula","doi":"10.1145/3178487.3178507","DOIUrl":"https://doi.org/10.1145/3178487.3178507","url":null,"abstract":"Effective models for fusion of loop nests continue to remain a challenge in both general-purpose and domain-specific language (DSL) compilers. The difficulty often arises from the combinatorial explosion of grouping choices and their interaction with parallelism and locality. This paper presents a new fusion algorithm for high-performance domain-specific compilers for image processing pipelines. The fusion algorithm is driven by dynamic programming, explores spaces of fusion possibilities not covered by previous approaches, and uses a cost function that captures optimization criteria more concretely and precisely than prior approaches. The fusion model is particularly tailored to the transformation and optimization sequence applied by PolyMage and Halide, two recent DSLs for image processing pipelines. Our model-driven technique, when implemented in PolyMage, provides significant improvements (up to 4.32X) over PolyMage's approach (which uses auto-tuning to aid its model), and over Halide's automatic approach (by up to 2.46X) on two state-of-the-art shared-memory multicore architectures.","PeriodicalId":193776,"journal":{"name":"Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming","volume":"234 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132858632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}