2016 International Conference on Parallel Architecture and Compilation Techniques (PACT): Latest Publications

MicroSpec: Speculation-centric fine-grained parallelization for FSM computations
Junqiao Qiu, Zhijia Zhao, Bin Ren
{"title":"MicroSpec: Speculation-centric fine-grained parallelization for FSM computations","authors":"Junqiao Qiu, Zhijia Zhao, Bin Ren","doi":"10.1145/2967938.2967965","DOIUrl":"https://doi.org/10.1145/2967938.2967965","url":null,"abstract":"Finite state machines (FSMs) are basic computation models that play essential roles in many applications. Enabling efficient parallel FSM execution is critical to the performance of these applications. However, they are very challenging to parallelize due to their inherent data dependencies that occur at each step of computations. Existing efforts on FSM parallelization either explore coarse-grained speculative parallelism or leverage parallel prefixsum. The former ignores prevalent fine-grained hardware parallelism on modern processors (such as ILP or SIMD parallelism) while the latter limits the benefits of fine-grained parallelism mainly to state enumeration. This work presents MicroSpec, a set of parallelization techniques that, for the first time, expose fine-grained speculative parallelism to FSM computations. Based on a rigorous analysis of three types of parallelism at fine-grained level, MicroSpec consists of a list of four fine-grained speculative parallelization approaches along with a speculation-oriented data transformation. Experiments on a large set of realworld FSM benchmarks show that MicroSpec achieves substantial performance improvement over the state-of-the-art.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125790439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 23
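To make the speculation idea concrete, below is a minimal C++ sketch of chunked speculative FSM execution — the coarse-grained baseline that MicroSpec refines down to ILP/SIMD granularity. Chunks run in parallel from guessed start states, and a sequential validation pass re-executes only the chunks whose guesses were wrong. The transition table, the state count, and the all-zero state guess are illustrative assumptions, not the paper's actual design.

```cpp
#include <cstdint>
#include <string>
#include <thread>
#include <vector>

constexpr int kStates = 4;
// Hypothetical transition table: next = kTransition[state][symbol].
uint8_t kTransition[kStates][256];  // assumed to be filled in elsewhere

uint8_t runChunk(const char* in, size_t len, uint8_t state) {
  for (size_t i = 0; i < len; ++i)
    state = kTransition[state][static_cast<unsigned char>(in[i])];
  return state;
}

uint8_t runSpeculative(const std::string& input, int chunks, uint8_t start) {
  size_t sz = input.size() / chunks;
  std::vector<uint8_t> guess(chunks, 0), out(chunks);  // naive guess: state 0
  guess[0] = start;                      // chunk 0 needs no speculation
  std::vector<std::thread> pool;
  for (int c = 0; c < chunks; ++c)
    pool.emplace_back([&, c] {
      size_t len = (c == chunks - 1) ? input.size() - c * sz : sz;
      out[c] = runChunk(input.data() + c * sz, len, guess[c]);
    });
  for (auto& t : pool) t.join();
  // Sequential validation: re-run any chunk whose guessed start was wrong.
  for (int c = 1; c < chunks; ++c)
    if (out[c - 1] != guess[c]) {
      size_t len = (c == chunks - 1) ? input.size() - c * sz : sz;
      out[c] = runChunk(input.data() + c * sz, len, out[c - 1]);
    }
  return out[chunks - 1];
}
```

The scheme pays off when the FSM converges quickly, so most guesses validate; MicroSpec's contribution is exposing the same speculation at a granularity that ILP and SIMD units can exploit.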
Speculatively exploiting cross-invocation parallelism
Jialu Huang, Prakash Prabhu, T. Jablin, Soumyadeep Ghosh, Sotiris Apostolakis, Jae W. Lee, David I. August
{"title":"Speculatively exploiting cross-invocation parallelism","authors":"Jialu Huang, Prakash Prabhu, T. Jablin, Soumyadeep Ghosh, Sotiris Apostolakis, Jae W. Lee, David I. August","doi":"10.1145/2967938.2967959","DOIUrl":"https://doi.org/10.1145/2967938.2967959","url":null,"abstract":"Automatic parallelization has shown promise in producing scalable multi-threaded programs for multi-core architectures. Most existing automatic techniques parallelize independent loops and insert global synchronization between loop invocations. For programs with many loop invocations, frequent synchronization often becomes the performance bottleneck. Some techniques exploit cross-invocation parallelism to overcome this problem. Using static analysis, they partition iterations among threads to avoid cross-thread dependences. However, this approach may fail if dependence pattern information is not available at compile time. To address this limitation, this work proposes SPECCROSS-the first automatic parallelization technique to exploit cross-invocation parallelism using speculation. With speculation, iterations from different loop invocations can execute concurrently, and the program synchronizes only on misspeculation. This allows SPECCROSS to adapt to dependence patterns that only manifest on particular inputs at runtime. Evaluation on eight programs shows that SPECCROSS achieves a geomean speedup of 3.43× over parallel execution without cross-invocation parallelization.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131199419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
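As a toy illustration of the barrier-elision idea (not SPECCROSS's actual runtime, which uses checkpointing and partial re-execution), the sketch below lets each thread run its slices of two loop invocations back-to-back with no global barrier, logs the indices invocation 2 reads, and falls back to a serial re-execution of invocation 2 only if a read crossed into another thread's invocation-1 slice. The access function g is a hypothetical stand-in for an input-dependent pattern and is chosen so the example deliberately misspeculates.

```cpp
#include <atomic>
#include <thread>
#include <unordered_set>
#include <vector>

// Hypothetical pattern: invocation 1 writes A[i]; invocation 2 reads A[g(i)]
// and writes B[i]. g crosses thread slices, forcing a misspeculation.
int g(int i) { return (i * 7) % (1 << 16); }

int main() {
  constexpr int N = 1 << 16, T = 4;
  std::vector<std::atomic<double>> A(N);  // atomic: speculative reads race
  std::vector<double> B(N, 0.0);          // with remote writes by design
  std::vector<std::unordered_set<int>> reads(T);
  std::vector<std::thread> pool;
  for (int t = 0; t < T; ++t)
    pool.emplace_back([&, t] {
      int lo = t * N / T, hi = (t + 1) * N / T;
      for (int i = lo; i < hi; ++i)                 // invocation 1
        A[i].store(A[i].load() + 1.0);
      for (int i = lo; i < hi; ++i) {               // invocation 2 proceeds
        B[i] = 2.0 * A[g(i)].load();                // without a barrier
        reads[t].insert(g(i));
      }
    });
  for (auto& th : pool) th.join();
  // Synchronize only on misspeculation: a read that crossed into another
  // thread's invocation-1 slice may have seen the pre-update value.
  for (int t = 0; t < T; ++t)
    for (int idx : reads[t])
      if (idx / (N / T) != t) {
        for (int i = 0; i < N; ++i)                 // serial re-execution of
          B[i] = 2.0 * A[g(i)].load();              // invocation 2
        return 0;
      }
}
```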
Scalable task parallelism for NUMA: A uniform abstraction for coordinated scheduling and memory management
Andi Drebes, Antoniu Pop, K. Heydemann, Albert Cohen, Nathalie Drach-Temam
{"title":"Scalable task parallelism for NUMA: A uniform abstraction for coordinated scheduling and memory management","authors":"Andi Drebes, Antoniu Pop, K. Heydemann, Albert Cohen, Nathalie Drach-Temam","doi":"10.1145/2967938.2967946","DOIUrl":"https://doi.org/10.1145/2967938.2967946","url":null,"abstract":"Dynamic task-parallel programming models are popular on shared-memory systems, promising enhanced scalability, load balancing and locality. Yet these promises are undermined by non-uniform memory access (NUMA). We show that using NUMA-aware task and data placement, it is possible to preserve the uniform abstraction of both computing and memory resources for task-parallel programming models while achieving high data locality. Our data placement scheme guarantees that all accesses to task output data target the local memory of the accessing core. The complementary task placement heuristic improves the locality of task input data on a best effort basis. Our algorithms take advantage of data-flow style task parallelism, where the privatization of task data enhances scalability by eliminating false dependences and enabling fine-grained dynamic control over data placement. The algorithms are fully automatic, application-independent, performance-portable across NUMA machines, and adapt to dynamic changes. Placement decisions use information about inter-task data dependences readily available in the run-time system and placement information from the operating system. We achieve 94% of local memory accesses on a 192-core system with 24 NUMA nodes, up to 5× higher performance than NUMA-aware hierarchical work-stealing, and even 5.6× compared to static interleaved allocation. Finally, we show that state-of-the-art dynamic page migration by the operating system cannot catch up with frequent affinity changes between cores and data and thus fails to accelerate task-parallel applications.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114880533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 39
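The guarantee that "all accesses to task output data target the local memory of the accessing core" can be sketched directly with libnuma. The paper's runtime applies this placement automatically using inter-task dependence information; the snippet below hard-codes it per task to show just the placement invariant (compile with -lnuma; the task body is a placeholder).

```cpp
#include <numa.h>
#include <cstdio>

void run_task_on_node(int node, size_t out_bytes) {
  if (numa_run_on_node(node) != 0) {    // pin the task to the target node
    perror("numa_run_on_node");
    return;
  }
  // Output buffer placed on the executing node: every output write is local.
  double* out = static_cast<double*>(numa_alloc_onnode(out_bytes, node));
  if (!out) return;
  for (size_t i = 0; i < out_bytes / sizeof(double); ++i)
    out[i] = 1.0;                       // hypothetical task body
  numa_free(out, out_bytes);
}

int main() {
  if (numa_available() < 0) { std::fprintf(stderr, "no NUMA support\n"); return 1; }
  for (int node = 0; node <= numa_max_node(); ++node)
    run_task_on_node(node, 1 << 20);    // one dummy task per NUMA node
}
```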
POSTER - hVISC: A portable abstraction for heterogeneous parallel systems
Prakalp Srivastava, Maria Kotsifakou, Matthew D. Sinclair, Rakesh Komuravelli, Vikram S. Adve, S. Adve
{"title":"POSTER - hVISC: A portable abstraction for heterogeneous parallel systems","authors":"Prakalp Srivastava, Maria Kotsifakou, Matthew D. Sinclair, Rakesh Komuravelli, Vikram S. Adve, S. Adve","doi":"10.1145/2967938.2976039","DOIUrl":"https://doi.org/10.1145/2967938.2976039","url":null,"abstract":"Programming heterogeneous parallel systems can be extremely complex because a single system may include multiple different parallelism models, instruction sets, and memory hierarchies, and different systems use different combinations of these features. We propose a carefully designed parallel abstraction of heterogeneous hardware - a hierarchical dataflow graph with shared memory and vector instructions - that is able to capture the parallelism in a wide range of popular parallel hardware. We use this abstraction, which we call hVISC, to define a Virtual Instruction Set Architecture (ISA) that aims to address both functional portability and performance portability across heterogeneous systems. hVISC is more general than existing virtual instruction sets such as PTX, HSAIL and SPIR, e.g., it can capture both streaming parallelism and general dataflow parallelism.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116752543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
POSTER: Pagoda: A runtime system to maximize GPU utilization in data parallel tasks with limited parallelism
T. Yeh, Amit Sabne, Putt Sakdhnagool, R. Eigenmann, Timothy G. Rogers
{"title":"POSTER: Pagoda: A runtime system to maximize GPU utilization in data parallel tasks with limited parallelism","authors":"T. Yeh, Amit Sabne, Putt Sakdhnagool, R. Eigenmann, Timothy G. Rogers","doi":"10.1145/2967938.2974055","DOIUrl":"https://doi.org/10.1145/2967938.2974055","url":null,"abstract":"Massively multithreaded GPUs achieve high throughput by running thousands of threads in parallel. To fully utilize the hardware, contemporary workloads spawn work to the GPU in bulk by launching large tasks, where each task is a kernel that contains thousands of threads that occupy the entire GPU. GPUs face severe underutilization and their performance benefits vanish if the tasks are narrow, i.e., they contain less than 512 threads. Latency-sensitive applications in network, signal, and image processing that generate a large number of tasks with relatively small inputs are examples of such limited parallelism. Recognizing the issue, CUDA now allows 32 simultaneous tasks on GPUs; however, that still leaves significant room for underutilization. This paper presents Pagoda, a runtime system that virtualizes GPU resources, using an OS-like daemon kernel called MasterKernel. Tasks are spawned from the CPU onto Pagoda as they become available, and are scheduled by the MasterKernel at the warp granularity. This level of control enables the GPU to keep scheduling and executing tasks as long as free warps are found, dramatically reducing underutilization. Experimental results on real hardware demonstrate that Pagoda achieves a geometric mean speedup of 2.44x over PThreads running on a 20-core CPU, 1.43x over CUDA-HyperQ, and 1.33x over GeMTC, the state-of-the-art runtime GPU task scheduling system.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125032235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
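A back-of-the-envelope host-side model of why warp-granularity scheduling helps narrow tasks is sketched below. The SM count, warp-slot count, and task shapes are illustrative assumptions, and the real Pagoda MasterKernel runs on the GPU itself; the point is only that a 4-warp task occupies 4 warp slots rather than a whole GPU.

```cpp
#include <cstdio>
#include <queue>
#include <vector>

struct Task { int warps, ticksLeft; };

int main() {
  constexpr int kSMs = 16, kSlotsPerSM = 48;            // illustrative GPU
  std::vector<int> freeSlots(kSMs, kSlotsPerSM);
  std::vector<std::vector<Task>> running(kSMs);
  std::queue<Task> pending;
  for (int i = 0; i < 4096; ++i) pending.push({4, 10}); // narrow 4-warp tasks
  long busyWarpTicks = 0, ticks = 0, inFlight = 0;
  while (!pending.empty() || inFlight > 0) {
    // MasterKernel-style step: fill any SM that has free warp slots.
    for (int sm = 0; sm < kSMs; ++sm)
      while (!pending.empty() && freeSlots[sm] >= pending.front().warps) {
        running[sm].push_back(pending.front()); pending.pop();
        freeSlots[sm] -= running[sm].back().warps;
        ++inFlight;
      }
    // Advance one tick; retire finished tasks and free their warp slots.
    for (int sm = 0; sm < kSMs; ++sm)
      for (size_t j = 0; j < running[sm].size();) {
        busyWarpTicks += running[sm][j].warps;
        if (--running[sm][j].ticksLeft == 0) {
          freeSlots[sm] += running[sm][j].warps;
          running[sm][j] = running[sm].back();
          running[sm].pop_back();
          --inFlight;
        } else ++j;
      }
    ++ticks;
  }
  std::printf("average warp occupancy: %.1f%%\n",
              100.0 * busyWarpTicks / (double)(ticks * kSMs * kSlotsPerSM));
}
```

Under the same assumptions, a HyperQ-style cap of 32 concurrent tasks could keep at most 32 × 4 = 128 of the 768 warp slots busy (about 17% occupancy), whereas the warp-granularity scheduler above keeps essentially all slots full until the tail of the task queue.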
Student research poster: A low complexity cache sharing mechanism to address system fairness
Vicent Selfa, J. Sahuquillo, S. Petit, M. E. Gómez
{"title":"Student research poster: A low complexity cache sharing mechanism to address system fairness","authors":"Vicent Selfa, J. Sahuquillo, S. Petit, M. E. Gómez","doi":"10.1145/2967938.2971464","DOIUrl":"https://doi.org/10.1145/2967938.2971464","url":null,"abstract":"Shared caches have become, de facto, the common design choice in current multi-cores, ranging from embedded devices to high-performance processors. In these systems, requests from multiple applications compete for the cache resources, degrading to different extents their progress, quantified as the performance of individual applications compared to isolated execution. The difference between the progresses of the running applications yields the system to unpredictable behavior and causes a fairness problem. This problem can be addressed by carefully partitioning cache resources among the contending applications, but to be effective, a partitioning approach needs to estimate the per-application progress.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123143418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
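A minimal sketch of the progress-driven partitioning loop the abstract alludes to is below. The IPC numbers, the one-way-at-a-time policy, and the offline IPC-alone values are all illustrative assumptions; real proposals estimate isolated performance online rather than measuring it ahead of time.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
  // Hypothetical measurements for four co-running applications.
  std::vector<double> ipcShared = {1.6, 0.9, 2.0, 0.5};  // under sharing
  std::vector<double> ipcAlone  = {2.0, 1.5, 2.2, 1.4};  // run in isolation
  std::vector<int>    ways      = {4, 4, 4, 4};          // 16-way shared LLC
  std::vector<double> progress(4);
  for (int i = 0; i < 4; ++i)
    progress[i] = ipcShared[i] / ipcAlone[i];            // per-app progress
  // Rebalance: move one cache way from the most- to the least-progressing app.
  int fast = std::max_element(progress.begin(), progress.end()) - progress.begin();
  int slow = std::min_element(progress.begin(), progress.end()) - progress.begin();
  if (ways[fast] > 1) { --ways[fast]; ++ways[slow]; }
  for (int i = 0; i < 4; ++i)
    std::printf("app %d: progress %.2f, ways %d\n", i, progress[i], ways[i]);
}
```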
Optimizing indirect memory references with milk
Vladimir Kiriansky, Yunming Zhang, Saman P. Amarasinghe
{"title":"Optimizing indirect memory references with milk","authors":"Vladimir Kiriansky, Yunming Zhang, Saman P. Amarasinghe","doi":"10.1145/2967938.2967948","DOIUrl":"https://doi.org/10.1145/2967938.2967948","url":null,"abstract":"Modern applications such as graph and data analytics, when operating on real world data, have working sets much larger than cache capacity and are bottlenecked by DRAM. To make matters worse, DRAM bandwidth is increasing much slower than per CPU core count, while DRAM latency has been virtually stagnant. Parallel applications that are bound by memory bandwidth fail to scale, while applications bound by memory latency draw a small fraction of much-needed bandwidth. While expert programmers may be able to tune important applications by hand through heroic effort, traditional compiler cache optimizations have not been sufficiently aggressive to overcome the growing DRAM gap. In this paper, we introduce milk - a C/C++ language extension that allows programmers to annotate memory-bound loops concisely. Using optimized intermediate data structures, random indirect memory references are transformed into batches of efficient sequential DRAM accesses. A simple semantic model enhances programmer productivity for efficient parallelization with OpenMP. We evaluate the Milk compiler on parallel implementations of traditional graph applications, demonstrating performance gains of up to 3×.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"234 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116464719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 38
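The core transformation, batching random indirect accesses into partition-local passes, can be sketched in plain C++. This shows the underlying idea only, not Milk's actual annotation syntax or runtime data structures; the partition size is an illustrative assumption, and indices are assumed to be in range.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Scatter-add out[idx[i]] += val[i], restructured milk-style: bin updates by
// target partition first, then drain one partition at a time.
void scatterAdd(std::vector<double>& out, const std::vector<uint32_t>& idx,
                const std::vector<double>& val) {
  constexpr uint32_t kPartBits = 12;             // 4K-element partitions
  size_t parts = (out.size() >> kPartBits) + 1;
  std::vector<std::vector<std::pair<uint32_t, double>>> bins(parts);
  for (size_t i = 0; i < idx.size(); ++i)        // phase 1: sequential binning
    bins[idx[i] >> kPartBits].push_back({idx[i], val[i]});
  for (auto& bin : bins)                         // phase 2: drain bin by bin,
    for (auto& [j, v] : bin)                     // touching one cache-sized
      out[j] += v;                               // region of out at a time
}
```

Phase 1 writes the (index, value) pairs sequentially, and phase 2 confines each pass to a region of out small enough to stay cache-resident, trading random scatter over the whole array for two streams of near-sequential DRAM traffic.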
Student research poster: Slack-aware shared bandwidth management in GPUs
Saumay Dublish
{"title":"Student research poster: Slack-aware shared bandwidth management in GPUs","authors":"Saumay Dublish","doi":"10.1145/2967938.2971470","DOIUrl":"https://doi.org/10.1145/2967938.2971470","url":null,"abstract":"Due to lack of sufficient compute threads in memory-intensive applications, GPUs often exhaust all the active warps and therefore, the memory latencies get exposed and appear in the critical path. In such a scenario, the shared on-chip and off-chip memory bandwidth appear more performance critical to cores with few or no active warps, in contrast to cores with sufficient active warps. In this work, we use the slack of memory responses as a metric to identify the criticality of shared bandwidth to different cores. Consequently, we propose a slack-aware DRAM scheduling policy to prioritize requests from cores with negative slack, ahead of row-buffer hits. We also propose a request throttling mechanism to reduce the shared bandwidth demand of cores that have enough active warps to sustain execution. The above techniques help in reducing the memory latencies that appear in the critical path by increasing the memory latencies that can be hidden by multithreading.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126987823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
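The proposed priority order can be expressed as a simple comparator, sketched below with hypothetical simulator state. Slack estimation itself, the hard part of the proposal, is elided, and the caller must guarantee a non-empty queue.

```cpp
#include <algorithm>
#include <vector>

struct Request {
  double slack;    // estimated hideable latency remaining; < 0 => critical
  bool rowHit;     // would hit the currently open DRAM row
  long arrival;    // arrival time, for oldest-first tie-breaking
};

// Returns the highest-priority request: negative-slack requests first,
// then row-buffer hits, then the oldest request. Queue must be non-empty.
Request* pickNext(std::vector<Request>& queue) {
  return &*std::min_element(queue.begin(), queue.end(),
      [](const Request& a, const Request& b) {
        bool ac = a.slack < 0, bc = b.slack < 0;
        if (ac != bc) return ac;                    // critical cores first
        if (a.rowHit != b.rowHit) return a.rowHit;  // then row-buffer hits
        return a.arrival < b.arrival;               // then FCFS
      });
}
```

Placing the negative-slack test above the row-hit test is exactly what distinguishes this policy from conventional FR-FCFS, which would always service the open-row hit first.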
Automatically exploiting implicit Pipeline Parallelism from multiple dependent kernels for GPUs
Gwangsun Kim, Jiyun Jeong, John Kim, M. Stephenson
{"title":"Automatically exploiting implicit Pipeline Parallelism from multiple dependent kernels for GPUs","authors":"Gwangsun Kim, Jiyun Jeong, John Kim, M. Stephenson","doi":"10.1145/2967938.2967952","DOIUrl":"https://doi.org/10.1145/2967938.2967952","url":null,"abstract":"Execution of GPGPU workloads consists of different stages including data I/O on the CPU, memory copy between the CPU and GPU, and kernel execution. While GPU can remain idle during I/O and memory copy, prior work has shown that overlapping data movement (I/O and memory copies) with kernel execution can improve performance. However, when there are multiple dependent kernels, the execution of the kernels is serialized and the benefit of overlapping data movement can be limited. In order to improve the performance of workloads that have multiple dependent kernels, we propose to automatically overlap the execution of kernels by exploiting implicit pipeline parallelism. We first propose Coarse-grained Reference Counting-based Scoreboarding (CRCS) to guarantee correctness during overlapped execution of multiple kernels. However, CRCS alone does not necessarily improve overall performance if the thread blocks (or CTAs) are scheduled sequentially. Thus, we propose an alternative CTA scheduler - Pipeline Parallelism-aware CTA Scheduler (PPCS) that takes available pipeline parallelism into account in CTA scheduling to maximize pipeline parallelism and improve overall performance. Our evaluation results show that the proposed mechanisms can improve performance by up to 67% (33% on average). To the best of our knowledge, this is one of the first work that enables overlapped execution of multiple dependent kernels without any kernel modification or explicitly expressing dependency by the programmer.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131552209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 18
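The scoreboarding idea can be sketched with host threads standing in for two dependent kernels: per-chunk counters let the consumer start on chunk i as soon as the producer's outstanding writes to chunk i drain to zero, instead of waiting for the entire producer kernel. The chunk sizes, counts, and work functions are illustrative, and CRCS itself tracks references in the GPU runtime rather than with std::thread.

```cpp
#include <array>
#include <atomic>
#include <thread>
#include <vector>

constexpr int kChunks = 8, kChunkSize = 1 << 12;
std::array<std::atomic<int>, kChunks> pendingWrites;  // producer refs per chunk

void kernelA(std::vector<float>& a, int c) {          // producer "kernel"
  for (int i = c * kChunkSize; i < (c + 1) * kChunkSize; ++i)
    a[i] = i * 0.5f;
  pendingWrites[c].store(0, std::memory_order_release);  // chunk c complete
}

void kernelB(const std::vector<float>& a, std::vector<float>& b, int c) {
  // Scoreboard check: wait only for this chunk's producer references.
  while (pendingWrites[c].load(std::memory_order_acquire) != 0) {}
  for (int i = c * kChunkSize; i < (c + 1) * kChunkSize; ++i)
    b[i] = a[i] + 1.0f;
}

int main() {
  std::vector<float> a(kChunks * kChunkSize), b(a.size());
  for (auto& p : pendingWrites) p.store(1);  // one outstanding write per chunk
  std::thread producer([&] { for (int c = 0; c < kChunks; ++c) kernelA(a, c); });
  std::thread consumer([&] { for (int c = 0; c < kChunks; ++c) kernelB(a, b, c); });
  producer.join(); consumer.join();
}
```

Because the consumer waits per chunk rather than per kernel, kernelB's work on chunk 0 overlaps with kernelA's work on later chunks, which is the pipeline parallelism PPCS then tries to maximize in CTA scheduling.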
Student research poster - from Processing-in-Memory to Processing-in-Storage
R. Kaplan
{"title":"Student research poster - from processing-in-Memory to Processing-in-Storage","authors":"R. Kaplan","doi":"10.1145/2967938.2971463","DOIUrl":"https://doi.org/10.1145/2967938.2971463","url":null,"abstract":"This paper studied a new technology and approach - Processing-in-Storage (PrinS). Resistive technologies have shown high-density, high-endurance and low energy. Such features might provide the conditions for a petascale storage device, where each cell has both storage and processing capabilities. There are numerous research directions in studying the approach, to list a few: building peripheral circuitry to support multiple types of computations, maintaining low latency for data access, providing high-throughput processing, ease of programmability, etc.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128993734","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0