2016 International Conference on Parallel Architecture and Compilation Techniques (PACT): Latest Publications

MicroSpec: Speculation-centric fine-grained parallelization for FSM computations
Junqiao Qiu, Zhijia Zhao, Bin Ren
{"title":"MicroSpec: Speculation-centric fine-grained parallelization for FSM computations","authors":"Junqiao Qiu, Zhijia Zhao, Bin Ren","doi":"10.1145/2967938.2967965","DOIUrl":"https://doi.org/10.1145/2967938.2967965","url":null,"abstract":"Finite state machines (FSMs) are basic computation models that play essential roles in many applications. Enabling efficient parallel FSM execution is critical to the performance of these applications. However, they are very challenging to parallelize due to their inherent data dependencies that occur at each step of computations. Existing efforts on FSM parallelization either explore coarse-grained speculative parallelism or leverage parallel prefixsum. The former ignores prevalent fine-grained hardware parallelism on modern processors (such as ILP or SIMD parallelism) while the latter limits the benefits of fine-grained parallelism mainly to state enumeration. This work presents MicroSpec, a set of parallelization techniques that, for the first time, expose fine-grained speculative parallelism to FSM computations. Based on a rigorous analysis of three types of parallelism at fine-grained level, MicroSpec consists of a list of four fine-grained speculative parallelization approaches along with a speculation-oriented data transformation. Experiments on a large set of realworld FSM benchmarks show that MicroSpec achieves substantial performance improvement over the state-of-the-art.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125790439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 23
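To make the speculation idea concrete, below is a minimal C++ sketch of chunked speculative FSM execution — the coarse-grained baseline that MicroSpec refines down to ILP/SIMD granularity. Chunks run in parallel from guessed start states, and a sequential validation pass re-executes only the chunks whose guesses were wrong. The transition table, the state count, and the all-zero state guess are illustrative assumptions, not the paper's actual design.

```cpp
#include <cstdint>
#include <string>
#include <thread>
#include <vector>

constexpr int kStates = 4;
// Hypothetical transition table: next = kTransition[state][symbol].
uint8_t kTransition[kStates][256];  // assumed to be filled in elsewhere

uint8_t runChunk(const char* in, size_t len, uint8_t state) {
  for (size_t i = 0; i < len; ++i)
    state = kTransition[state][static_cast<unsigned char>(in[i])];
  return state;
}

uint8_t runSpeculative(const std::string& input, int chunks, uint8_t start) {
  size_t sz = input.size() / chunks;
  std::vector<uint8_t> guess(chunks, 0), out(chunks);  // naive guess: state 0
  guess[0] = start;                      // chunk 0 needs no speculation
  std::vector<std::thread> pool;
  for (int c = 0; c < chunks; ++c)
    pool.emplace_back([&, c] {
      size_t len = (c == chunks - 1) ? input.size() - c * sz : sz;
      out[c] = runChunk(input.data() + c * sz, len, guess[c]);
    });
  for (auto& t : pool) t.join();
  // Sequential validation: re-run any chunk whose guessed start was wrong.
  for (int c = 1; c < chunks; ++c)
    if (out[c - 1] != guess[c]) {
      size_t len = (c == chunks - 1) ? input.size() - c * sz : sz;
      out[c] = runChunk(input.data() + c * sz, len, out[c - 1]);
    }
  return out[chunks - 1];
}
```

The scheme pays off when the FSM converges quickly, so most guesses validate; MicroSpec's contribution is exposing the same speculation at a granularity that ILP and SIMD units can exploit.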
Speculatively exploiting cross-invocation parallelism
Jialu Huang, Prakash Prabhu, T. Jablin, Soumyadeep Ghosh, Sotiris Apostolakis, Jae W. Lee, David I. August
{"title":"Speculatively exploiting cross-invocation parallelism","authors":"Jialu Huang, Prakash Prabhu, T. Jablin, Soumyadeep Ghosh, Sotiris Apostolakis, Jae W. Lee, David I. August","doi":"10.1145/2967938.2967959","DOIUrl":"https://doi.org/10.1145/2967938.2967959","url":null,"abstract":"Automatic parallelization has shown promise in producing scalable multi-threaded programs for multi-core architectures. Most existing automatic techniques parallelize independent loops and insert global synchronization between loop invocations. For programs with many loop invocations, frequent synchronization often becomes the performance bottleneck. Some techniques exploit cross-invocation parallelism to overcome this problem. Using static analysis, they partition iterations among threads to avoid cross-thread dependences. However, this approach may fail if dependence pattern information is not available at compile time. To address this limitation, this work proposes SPECCROSS-the first automatic parallelization technique to exploit cross-invocation parallelism using speculation. With speculation, iterations from different loop invocations can execute concurrently, and the program synchronizes only on misspeculation. This allows SPECCROSS to adapt to dependence patterns that only manifest on particular inputs at runtime. Evaluation on eight programs shows that SPECCROSS achieves a geomean speedup of 3.43× over parallel execution without cross-invocation parallelization.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131199419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
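As a toy illustration of the barrier-elision idea (not SPECCROSS's actual runtime, which uses checkpointing and partial re-execution), the sketch below lets each thread run its slices of two loop invocations back-to-back with no global barrier, logs the indices invocation 2 reads, and falls back to a serial re-execution of invocation 2 only if a read crossed into another thread's invocation-1 slice. The access function g is a hypothetical stand-in for an input-dependent pattern and is chosen so the example deliberately misspeculates.

```cpp
#include <atomic>
#include <thread>
#include <unordered_set>
#include <vector>

// Hypothetical pattern: invocation 1 writes A[i]; invocation 2 reads A[g(i)]
// and writes B[i]. g crosses thread slices, forcing a misspeculation.
int g(int i) { return (i * 7) % (1 << 16); }

int main() {
  constexpr int N = 1 << 16, T = 4;
  std::vector<std::atomic<double>> A(N);  // atomic: speculative reads race
  std::vector<double> B(N, 0.0);          // with remote writes by design
  std::vector<std::unordered_set<int>> reads(T);
  std::vector<std::thread> pool;
  for (int t = 0; t < T; ++t)
    pool.emplace_back([&, t] {
      int lo = t * N / T, hi = (t + 1) * N / T;
      for (int i = lo; i < hi; ++i)                 // invocation 1
        A[i].store(A[i].load() + 1.0);
      for (int i = lo; i < hi; ++i) {               // invocation 2 proceeds
        B[i] = 2.0 * A[g(i)].load();                // without a barrier
        reads[t].insert(g(i));
      }
    });
  for (auto& th : pool) th.join();
  // Synchronize only on misspeculation: a read that crossed into another
  // thread's invocation-1 slice may have seen the pre-update value.
  for (int t = 0; t < T; ++t)
    for (int idx : reads[t])
      if (idx / (N / T) != t) {
        for (int i = 0; i < N; ++i)                 // serial re-execution of
          B[i] = 2.0 * A[g(i)].load();              // invocation 2
        return 0;
      }
}
```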
Scalable task parallelism for NUMA: A uniform abstraction for coordinated scheduling and memory management
Andi Drebes, Antoniu Pop, K. Heydemann, Albert Cohen, Nathalie Drach-Temam
{"title":"Scalable task parallelism for NUMA: A uniform abstraction for coordinated scheduling and memory management","authors":"Andi Drebes, Antoniu Pop, K. Heydemann, Albert Cohen, Nathalie Drach-Temam","doi":"10.1145/2967938.2967946","DOIUrl":"https://doi.org/10.1145/2967938.2967946","url":null,"abstract":"Dynamic task-parallel programming models are popular on shared-memory systems, promising enhanced scalability, load balancing and locality. Yet these promises are undermined by non-uniform memory access (NUMA). We show that using NUMA-aware task and data placement, it is possible to preserve the uniform abstraction of both computing and memory resources for task-parallel programming models while achieving high data locality. Our data placement scheme guarantees that all accesses to task output data target the local memory of the accessing core. The complementary task placement heuristic improves the locality of task input data on a best effort basis. Our algorithms take advantage of data-flow style task parallelism, where the privatization of task data enhances scalability by eliminating false dependences and enabling fine-grained dynamic control over data placement. The algorithms are fully automatic, application-independent, performance-portable across NUMA machines, and adapt to dynamic changes. Placement decisions use information about inter-task data dependences readily available in the run-time system and placement information from the operating system. We achieve 94% of local memory accesses on a 192-core system with 24 NUMA nodes, up to 5× higher performance than NUMA-aware hierarchical work-stealing, and even 5.6× compared to static interleaved allocation. Finally, we show that state-of-the-art dynamic page migration by the operating system cannot catch up with frequent affinity changes between cores and data and thus fails to accelerate task-parallel applications.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114880533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 39
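The guarantee that "all accesses to task output data target the local memory of the accessing core" can be sketched directly with libnuma. The paper's runtime applies this placement automatically using inter-task dependence information; the snippet below hard-codes it per task to show just the placement invariant (compile with -lnuma; the task body is a placeholder).

```cpp
#include <numa.h>
#include <cstdio>

void run_task_on_node(int node, size_t out_bytes) {
  if (numa_run_on_node(node) != 0) {    // pin the task to the target node
    perror("numa_run_on_node");
    return;
  }
  // Output buffer placed on the executing node: every output write is local.
  double* out = static_cast<double*>(numa_alloc_onnode(out_bytes, node));
  if (!out) return;
  for (size_t i = 0; i < out_bytes / sizeof(double); ++i)
    out[i] = 1.0;                       // hypothetical task body
  numa_free(out, out_bytes);
}

int main() {
  if (numa_available() < 0) { std::fprintf(stderr, "no NUMA support\n"); return 1; }
  for (int node = 0; node <= numa_max_node(); ++node)
    run_task_on_node(node, 1 << 20);    // one dummy task per NUMA node
}
```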
POSTER - hVISC: A portable abstraction for heterogeneous parallel systems
Prakalp Srivastava, Maria Kotsifakou, Matthew D. Sinclair, Rakesh Komuravelli, Vikram S. Adve, S. Adve
{"title":"POSTER - hVISC: A portable abstraction for heterogeneous parallel systems","authors":"Prakalp Srivastava, Maria Kotsifakou, Matthew D. Sinclair, Rakesh Komuravelli, Vikram S. Adve, S. Adve","doi":"10.1145/2967938.2976039","DOIUrl":"https://doi.org/10.1145/2967938.2976039","url":null,"abstract":"Programming heterogeneous parallel systems can be extremely complex because a single system may include multiple different parallelism models, instruction sets, and memory hierarchies, and different systems use different combinations of these features. We propose a carefully designed parallel abstraction of heterogeneous hardware - a hierarchical dataflow graph with shared memory and vector instructions - that is able to capture the parallelism in a wide range of popular parallel hardware. We use this abstraction, which we call hVISC, to define a Virtual Instruction Set Architecture (ISA) that aims to address both functional portability and performance portability across heterogeneous systems. hVISC is more general than existing virtual instruction sets such as PTX, HSAIL and SPIR, e.g., it can capture both streaming parallelism and general dataflow parallelism.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116752543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
POSTER: Pagoda: A runtime system to maximize GPU utilization in data parallel tasks with limited parallelism
T. Yeh, Amit Sabne, Putt Sakdhnagool, R. Eigenmann, Timothy G. Rogers
{"title":"POSTER: Pagoda: A runtime system to maximize GPU utilization in data parallel tasks with limited parallelism","authors":"T. Yeh, Amit Sabne, Putt Sakdhnagool, R. Eigenmann, Timothy G. Rogers","doi":"10.1145/2967938.2974055","DOIUrl":"https://doi.org/10.1145/2967938.2974055","url":null,"abstract":"Massively multithreaded GPUs achieve high throughput by running thousands of threads in parallel. To fully utilize the hardware, contemporary workloads spawn work to the GPU in bulk by launching large tasks, where each task is a kernel that contains thousands of threads that occupy the entire GPU. GPUs face severe underutilization and their performance benefits vanish if the tasks are narrow, i.e., they contain less than 512 threads. Latency-sensitive applications in network, signal, and image processing that generate a large number of tasks with relatively small inputs are examples of such limited parallelism. Recognizing the issue, CUDA now allows 32 simultaneous tasks on GPUs; however, that still leaves significant room for underutilization. This paper presents Pagoda, a runtime system that virtualizes GPU resources, using an OS-like daemon kernel called MasterKernel. Tasks are spawned from the CPU onto Pagoda as they become available, and are scheduled by the MasterKernel at the warp granularity. This level of control enables the GPU to keep scheduling and executing tasks as long as free warps are found, dramatically reducing underutilization. Experimental results on real hardware demonstrate that Pagoda achieves a geometric mean speedup of 2.44x over PThreads running on a 20-core CPU, 1.43x over CUDA-HyperQ, and 1.33x over GeMTC, the state-of-the-art runtime GPU task scheduling system.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125032235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
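A back-of-the-envelope host-side model of why warp-granularity scheduling helps narrow tasks is sketched below. The SM count, warp-slot count, and task shapes are illustrative assumptions, and the real Pagoda MasterKernel runs on the GPU itself; the point is only that a 4-warp task occupies 4 warp slots rather than a whole GPU.

```cpp
#include <cstdio>
#include <queue>
#include <vector>

struct Task { int warps, ticksLeft; };

int main() {
  constexpr int kSMs = 16, kSlotsPerSM = 48;            // illustrative GPU
  std::vector<int> freeSlots(kSMs, kSlotsPerSM);
  std::vector<std::vector<Task>> running(kSMs);
  std::queue<Task> pending;
  for (int i = 0; i < 4096; ++i) pending.push({4, 10}); // narrow 4-warp tasks
  long busyWarpTicks = 0, ticks = 0, inFlight = 0;
  while (!pending.empty() || inFlight > 0) {
    // MasterKernel-style step: fill any SM that has free warp slots.
    for (int sm = 0; sm < kSMs; ++sm)
      while (!pending.empty() && freeSlots[sm] >= pending.front().warps) {
        running[sm].push_back(pending.front()); pending.pop();
        freeSlots[sm] -= running[sm].back().warps;
        ++inFlight;
      }
    // Advance one tick; retire finished tasks and free their warp slots.
    for (int sm = 0; sm < kSMs; ++sm)
      for (size_t j = 0; j < running[sm].size();) {
        busyWarpTicks += running[sm][j].warps;
        if (--running[sm][j].ticksLeft == 0) {
          freeSlots[sm] += running[sm][j].warps;
          running[sm][j] = running[sm].back();
          running[sm].pop_back();
          --inFlight;
        } else ++j;
      }
    ++ticks;
  }
  std::printf("average warp occupancy: %.1f%%\n",
              100.0 * busyWarpTicks / (double)(ticks * kSMs * kSlotsPerSM));
}
```

Under the same assumptions, a HyperQ-style cap of 32 concurrent tasks could keep at most 32 × 4 = 128 of the 768 warp slots busy (about 17% occupancy), whereas the warp-granularity scheduler above keeps essentially all slots full until the tail of the task queue.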
Student research poster: A low complexity cache sharing mechanism to address system fairness
Vicent Selfa, J. Sahuquillo, S. Petit, M. E. Gómez
{"title":"Student research poster: A low complexity cache sharing mechanism to address system fairness","authors":"Vicent Selfa, J. Sahuquillo, S. Petit, M. E. Gómez","doi":"10.1145/2967938.2971464","DOIUrl":"https://doi.org/10.1145/2967938.2971464","url":null,"abstract":"Shared caches have become, de facto, the common design choice in current multi-cores, ranging from embedded devices to high-performance processors. In these systems, requests from multiple applications compete for the cache resources, degrading to different extents their progress, quantified as the performance of individual applications compared to isolated execution. The difference between the progresses of the running applications yields the system to unpredictable behavior and causes a fairness problem. This problem can be addressed by carefully partitioning cache resources among the contending applications, but to be effective, a partitioning approach needs to estimate the per-application progress.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123143418","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
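A minimal sketch of the progress-driven partitioning loop the abstract alludes to is below. The IPC numbers, the one-way-at-a-time policy, and the offline IPC-alone values are all illustrative assumptions; real proposals estimate isolated performance online rather than measuring it ahead of time.

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
  // Hypothetical measurements for four co-running applications.
  std::vector<double> ipcShared = {1.6, 0.9, 2.0, 0.5};  // under sharing
  std::vector<double> ipcAlone  = {2.0, 1.5, 2.2, 1.4};  // run in isolation
  std::vector<int>    ways      = {4, 4, 4, 4};          // 16-way shared LLC
  std::vector<double> progress(4);
  for (int i = 0; i < 4; ++i)
    progress[i] = ipcShared[i] / ipcAlone[i];            // per-app progress
  // Rebalance: move one cache way from the most- to the least-progressing app.
  int fast = std::max_element(progress.begin(), progress.end()) - progress.begin();
  int slow = std::min_element(progress.begin(), progress.end()) - progress.begin();
  if (ways[fast] > 1) { --ways[fast]; ++ways[slow]; }
  for (int i = 0; i < 4; ++i)
    std::printf("app %d: progress %.2f, ways %d\n", i, progress[i], ways[i]);
}
```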
Optimizing indirect memory references with milk
Vladimir Kiriansky, Yunming Zhang, Saman P. Amarasinghe
{"title":"Optimizing indirect memory references with milk","authors":"Vladimir Kiriansky, Yunming Zhang, Saman P. Amarasinghe","doi":"10.1145/2967938.2967948","DOIUrl":"https://doi.org/10.1145/2967938.2967948","url":null,"abstract":"Modern applications such as graph and data analytics, when operating on real world data, have working sets much larger than cache capacity and are bottlenecked by DRAM. To make matters worse, DRAM bandwidth is increasing much slower than per CPU core count, while DRAM latency has been virtually stagnant. Parallel applications that are bound by memory bandwidth fail to scale, while applications bound by memory latency draw a small fraction of much-needed bandwidth. While expert programmers may be able to tune important applications by hand through heroic effort, traditional compiler cache optimizations have not been sufficiently aggressive to overcome the growing DRAM gap. In this paper, we introduce milk - a C/C++ language extension that allows programmers to annotate memory-bound loops concisely. Using optimized intermediate data structures, random indirect memory references are transformed into batches of efficient sequential DRAM accesses. A simple semantic model enhances programmer productivity for efficient parallelization with OpenMP. We evaluate the Milk compiler on parallel implementations of traditional graph applications, demonstrating performance gains of up to 3×.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"234 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116464719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 38
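The core transformation, batching random indirect accesses into partition-local passes, can be sketched in plain C++. This shows the underlying idea only, not Milk's actual annotation syntax or runtime data structures; the partition size is an illustrative assumption, and indices are assumed to be in range.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Scatter-add out[idx[i]] += val[i], restructured milk-style: bin updates by
// target partition first, then drain one partition at a time.
void scatterAdd(std::vector<double>& out, const std::vector<uint32_t>& idx,
                const std::vector<double>& val) {
  constexpr uint32_t kPartBits = 12;             // 4K-element partitions
  size_t parts = (out.size() >> kPartBits) + 1;
  std::vector<std::vector<std::pair<uint32_t, double>>> bins(parts);
  for (size_t i = 0; i < idx.size(); ++i)        // phase 1: sequential binning
    bins[idx[i] >> kPartBits].push_back({idx[i], val[i]});
  for (auto& bin : bins)                         // phase 2: drain bin by bin,
    for (auto& [j, v] : bin)                     // touching one cache-sized
      out[j] += v;                               // region of out at a time
}
```

Phase 1 writes the (index, value) pairs sequentially, and phase 2 confines each pass to a region of out small enough to stay cache-resident, trading random scatter over the whole array for two streams of near-sequential DRAM traffic.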
Student research poster: Slack-aware shared bandwidth management in GPUs
Saumay Dublish
{"title":"Student research poster: Slack-aware shared bandwidth management in GPUs","authors":"Saumay Dublish","doi":"10.1145/2967938.2971470","DOIUrl":"https://doi.org/10.1145/2967938.2971470","url":null,"abstract":"Due to lack of sufficient compute threads in memory-intensive applications, GPUs often exhaust all the active warps and therefore, the memory latencies get exposed and appear in the critical path. In such a scenario, the shared on-chip and off-chip memory bandwidth appear more performance critical to cores with few or no active warps, in contrast to cores with sufficient active warps. In this work, we use the slack of memory responses as a metric to identify the criticality of shared bandwidth to different cores. Consequently, we propose a slack-aware DRAM scheduling policy to prioritize requests from cores with negative slack, ahead of row-buffer hits. We also propose a request throttling mechanism to reduce the shared bandwidth demand of cores that have enough active warps to sustain execution. The above techniques help in reducing the memory latencies that appear in the critical path by increasing the memory latencies that can be hidden by multithreading.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126987823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
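The proposed priority order can be expressed as a simple comparator, sketched below with hypothetical simulator state. Slack estimation itself, the hard part of the proposal, is elided, and the caller must guarantee a non-empty queue.

```cpp
#include <algorithm>
#include <vector>

struct Request {
  double slack;    // estimated hideable latency remaining; < 0 => critical
  bool rowHit;     // would hit the currently open DRAM row
  long arrival;    // arrival time, for oldest-first tie-breaking
};

// Returns the highest-priority request: negative-slack requests first,
// then row-buffer hits, then the oldest request. Queue must be non-empty.
Request* pickNext(std::vector<Request>& queue) {
  return &*std::min_element(queue.begin(), queue.end(),
      [](const Request& a, const Request& b) {
        bool ac = a.slack < 0, bc = b.slack < 0;
        if (ac != bc) return ac;                    // critical cores first
        if (a.rowHit != b.rowHit) return a.rowHit;  // then row-buffer hits
        return a.arrival < b.arrival;               // then FCFS
      });
}
```

Placing the negative-slack test above the row-hit test is exactly what distinguishes this policy from conventional FR-FCFS, which would always service the open-row hit first.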
Automatically exploiting implicit Pipeline Parallelism from multiple dependent kernels for GPUs
Gwangsun Kim, Jiyun Jeong, John Kim, M. Stephenson
{"title":"Automatically exploiting implicit Pipeline Parallelism from multiple dependent kernels for GPUs","authors":"Gwangsun Kim, Jiyun Jeong, John Kim, M. Stephenson","doi":"10.1145/2967938.2967952","DOIUrl":"https://doi.org/10.1145/2967938.2967952","url":null,"abstract":"Execution of GPGPU workloads consists of different stages including data I/O on the CPU, memory copy between the CPU and GPU, and kernel execution. While GPU can remain idle during I/O and memory copy, prior work has shown that overlapping data movement (I/O and memory copies) with kernel execution can improve performance. However, when there are multiple dependent kernels, the execution of the kernels is serialized and the benefit of overlapping data movement can be limited. In order to improve the performance of workloads that have multiple dependent kernels, we propose to automatically overlap the execution of kernels by exploiting implicit pipeline parallelism. We first propose Coarse-grained Reference Counting-based Scoreboarding (CRCS) to guarantee correctness during overlapped execution of multiple kernels. However, CRCS alone does not necessarily improve overall performance if the thread blocks (or CTAs) are scheduled sequentially. Thus, we propose an alternative CTA scheduler - Pipeline Parallelism-aware CTA Scheduler (PPCS) that takes available pipeline parallelism into account in CTA scheduling to maximize pipeline parallelism and improve overall performance. Our evaluation results show that the proposed mechanisms can improve performance by up to 67% (33% on average). To the best of our knowledge, this is one of the first work that enables overlapped execution of multiple dependent kernels without any kernel modification or explicitly expressing dependency by the programmer.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131552209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 18
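The scoreboarding idea can be sketched with host threads standing in for two dependent kernels: per-chunk counters let the consumer start on chunk i as soon as the producer's outstanding writes to chunk i drain to zero, instead of waiting for the entire producer kernel. The chunk sizes, counts, and work functions are illustrative, and CRCS itself tracks references in the GPU runtime rather than with std::thread.

```cpp
#include <array>
#include <atomic>
#include <thread>
#include <vector>

constexpr int kChunks = 8, kChunkSize = 1 << 12;
std::array<std::atomic<int>, kChunks> pendingWrites;  // producer refs per chunk

void kernelA(std::vector<float>& a, int c) {          // producer "kernel"
  for (int i = c * kChunkSize; i < (c + 1) * kChunkSize; ++i)
    a[i] = i * 0.5f;
  pendingWrites[c].store(0, std::memory_order_release);  // chunk c complete
}

void kernelB(const std::vector<float>& a, std::vector<float>& b, int c) {
  // Scoreboard check: wait only for this chunk's producer references.
  while (pendingWrites[c].load(std::memory_order_acquire) != 0) {}
  for (int i = c * kChunkSize; i < (c + 1) * kChunkSize; ++i)
    b[i] = a[i] + 1.0f;
}

int main() {
  std::vector<float> a(kChunks * kChunkSize), b(a.size());
  for (auto& p : pendingWrites) p.store(1);  // one outstanding write per chunk
  std::thread producer([&] { for (int c = 0; c < kChunks; ++c) kernelA(a, c); });
  std::thread consumer([&] { for (int c = 0; c < kChunks; ++c) kernelB(a, b, c); });
  producer.join(); consumer.join();
}
```

Because the consumer waits per chunk rather than per kernel, kernelB's work on chunk 0 overlaps with kernelA's work on later chunks, which is the pipeline parallelism PPCS then tries to maximize in CTA scheduling.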
Student research poster - from Processing-in-Memory to Processing-in-Storage
R. Kaplan
{"title":"Student research poster - from processing-in-Memory to Processing-in-Storage","authors":"R. Kaplan","doi":"10.1145/2967938.2971463","DOIUrl":"https://doi.org/10.1145/2967938.2971463","url":null,"abstract":"This paper studied a new technology and approach - Processing-in-Storage (PrinS). Resistive technologies have shown high-density, high-endurance and low energy. Such features might provide the conditions for a petascale storage device, where each cell has both storage and processing capabilities. There are numerous research directions in studying the approach, to list a few: building peripheral circuitry to support multiple types of computations, maintaining low latency for data access, providing high-throughput processing, ease of programmability, etc.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128993734","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0