2014 23rd International Conference on Parallel Architecture and Compilation (PACT): latest papers

SpongeDirectory: Flexible sparse directories utilizing multi-level memristors
Lunkai Zhang, D. Strukov, Heba Saadeldeen, Dongrui Fan, Mingzhe Zhang, D. Franklin
{"title":"SpongeDirectory: Flexible sparse directories utilizing multi-level memristors","authors":"Lunkai Zhang, D. Strukov, Heba Saadeldeen, Dongrui Fan, Mingzhe Zhang, D. Franklin","doi":"10.1145/2628071.2628081","DOIUrl":"https://doi.org/10.1145/2628071.2628081","url":null,"abstract":"Cache-coherent shared memory is critical for programmability in many-core systems. Several directory-based schemes have been proposed, but dynamic, non-uniform sharing makes efficient directory storage challenging, with each giving up storage space, performance or energy. We introduce SpongeDirectory, a sparse directory structure that exploits multi-level memristor technology. SpongeDirectory expands directory storage in-place when needed by increasing the number of bits stored on a single memristor device, trading latency and energy for storage. We explore several SpongeDirectory configurations, finding that a provisioning rate of 0.5× with memristors optimized for low energy consumption is the most competitive. This optimal SpongeDirectory configuration has performance comparable to a conventional sparse directory, requires 18× less storage space, and consumes 8× less energy.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"32 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115691967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 30
Automatic execution of single-GPU computations across multiple GPUs
Javier Cabezas, L. Vilanova, Isaac Gelado, T. Jablin, N. Navarro, W. Hwu
{"title":"Automatic execution of single-GPU computations across multiple GPUs","authors":"Javier Cabezas, L. Vilanova, Isaac Gelado, T. Jablin, N. Navarro, W. Hwu","doi":"10.1145/2628071.2628109","DOIUrl":"https://doi.org/10.1145/2628071.2628109","url":null,"abstract":"We present AMGE, a programming framework and runtime system to decompose data and GPU kernels and execute them on multiple GPUs concurrently. AMGE exploits the remote memory access capability of recent GPUs to guarantee data accessibility regardless of its physical location, thus allowing AMGE to safely decompose and distribute arrays across GPU memories. AMGE also includes a compiler analysis to detect array access patterns in GPU kernels. The runtime uses this information to automatically choose the best computation and data distribution configuration. Through effective use of GPU caches, AMGE achieves good scalability in spite of the limited interconnect bandwidth between GPUs. Results show 1.95× and 3.73× execution speedups for 2 and 4 GPUs for a wide range of dense computations compared to the original versions on a single GPU.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130630081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
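The scalability figures quoted in the AMGE abstract (1.95× on 2 GPUs, 3.73× on 4 GPUs) can be turned into parallel efficiency, that is, speedup divided by device count. The snippet below is only a worked calculation on those two published numbers; the function name `parallel_efficiency` is introduced here for illustration and is not part of AMGE.

```python
def parallel_efficiency(speedup: float, num_gpus: int) -> float:
    """Fraction of ideal linear scaling achieved: speedup / device count."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    return speedup / num_gpus


# Speedups quoted in the abstract, keyed by GPU count.
reported = {2: 1.95, 4: 3.73}

# Efficiency per configuration (1.0 would be perfect linear scaling).
efficiencies = {n: parallel_efficiency(s, n) for n, s in reported.items()}

for n, eff in efficiencies.items():
    print(f"{n} GPUs: {eff:.1%} of linear scaling")
```

Under this definition both configurations stay above 90% of linear scaling, which is consistent with the abstract's description of "good scalability".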
VAST: The illusion of a large memory space for GPUs
Janghaeng Lee, M. Samadi, S. Mahlke
{"title":"VAST: The illusion of a large memory space for GPUs","authors":"Janghaeng Lee, M. Samadi, S. Mahlke","doi":"10.1145/2628071.2628075","DOIUrl":"https://doi.org/10.1145/2628071.2628075","url":null,"abstract":"Heterogeneous systems equipped with traditional processors (CPUs) and graphics processing units (GPUs) have enabled processing large data sets. With new programming models, such as OpenCL and CUDA, programmers are encouraged to offload data parallel workloads to GPUs as much as possible in order to fully utilize the available resources. Unfortunately, offloading work is strictly limited by the size of the physical memory on a specific GPU. In this paper, we present Virtual Address Space for Throughput processors (VAST), an automatic GPU memory management system that provides an OpenCL program with the illusion of a virtual memory space. Based on the available physical memory on the target GPU, VAST does the following: automatically partitions the data parallel workload into chunks; efficiently extracts the precise working set required for the divided workload; rearranges the working set in contiguous memory space; and, transforms the kernel to operate on the reorganized working set. With VAST, the programmer is responsible for developing a data parallel kernel in OpenCL without concern for physical memory space limitations of individual GPUs. VAST transparently handles code generation dealing with the constraints of the actual physical memory and improves the re-targetability of the OpenCL with moderate overhead. Experiments demonstrate that a real GPU, NVIDIA GTX 760 with 2 GB of memory, can compute any size of data without program changes achieving 2.6× speedup over CPU execution, which is a realistic alternative for large data computation.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":" 25","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120827855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 38
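The four steps listed in the VAST abstract (partition the workload, extract each chunk's working set, pack it contiguously, and run a kernel that indexes the packed copy) can be sketched in plain Python. This is a toy model under stated assumptions, not VAST's implementation: the kernel is assumed to compute `out[i] = f(data[index[i]])`, the "device" is just a Python list, and the names `run_in_chunks` and `max_elems` are hypothetical.

```python
def run_in_chunks(data, index, kernel, max_elems):
    """Compute [kernel(data[j]) for j in index] while never holding more
    than `max_elems` data elements in the (simulated) device buffer."""
    if max_elems <= 0:
        raise ValueError("max_elems must be positive")
    out = []
    # Step 1: partition the data-parallel workload into chunks.
    for start in range(0, len(index), max_elems):
        chunk = index[start:start + max_elems]
        # Step 2: extract the precise working set this chunk touches.
        working_set = sorted(set(chunk))
        # Step 3: pack the working set into a contiguous buffer
        # (stands in for the host-to-device copy).
        buffer = [data[j] for j in working_set]
        remap = {orig: pos for pos, orig in enumerate(working_set)}
        assert len(buffer) <= max_elems
        # Step 4: the "transformed kernel" reads through the remapped index.
        out.extend(kernel(buffer[remap[j]]) for j in chunk)
    return out


data = [10, 20, 30, 40, 50]
index = [4, 0, 4, 2, 1]          # indirect access pattern
result = run_in_chunks(data, index, lambda x: x + 1, max_elems=2)
print(result)
```

Because each work-item in this toy kernel touches one element, a chunk of `max_elems` work-items can never need more than `max_elems` buffer slots; a real kernel with wider access patterns would need a more careful chunk-size choice.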
Heterogeneous microarchitectures trump voltage scaling for low-power cores
Andrew Lukefahr, Shruti Padmanabha, R. Das, R. Dreslinski, T. Wenisch, S. Mahlke
{"title":"Heterogeneous microarchitectures trump voltage scaling for low-power cores","authors":"Andrew Lukefahr, Shruti Padmanabha, R. Das, R. Dreslinski, T. Wenisch, S. Mahlke","doi":"10.1145/2628071.2628078","DOIUrl":"https://doi.org/10.1145/2628071.2628078","url":null,"abstract":"Heterogeneous architectures offer many potential avenues for improving energy efficiency in today's low-power cores. Two common approaches are dynamic voltage/frequency scaling (DVFS) and heterogeneous microarchitectures (HMs). Traditionally both approaches have incurred large switching overheads, which limit their applicability to coarse-grain program phases. However, recent research has demonstrated low-overhead mechanisms that enable switching at granularities as low as 1K instructions. The question remains, in this fine-grained switching regime, which form of heterogeneity offers better energy efficiency for a given level of performance? The effectiveness of these techniques depends critically on both efficient architectural implementation and accurate scheduling to maximize energy efficiency for a given level of performance. Therefore, we develop PaTH, an offline analysis tool, to compute (near-)optimal schedules, allowing us to determine Pareto-optimal energy savings for a given architecture. We leverage PaTH to study the potential energy efficiency of fine-grained DVFS and HMs, as well as a hybrid approach. We show that HMs achieve higher energy savings than DVFS for a given level of performance. While at a coarse granularity the combination of DVFS and HMs still proves beneficial, for fine-grained scheduling their combination makes little sense as HMs alone provide the bulk of the energy efficiency.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"237 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114752280","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 44
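The abstract above speaks of Pareto-optimal energy savings "for a given level of performance". As a reminder of what that means, the sketch below filters a set of (performance, energy) operating points down to its Pareto front: the points for which no other point is at least as fast and at least as frugal. The sample points are invented for illustration and are not results from the paper, and `pareto_front` is a hypothetical helper, not part of PaTH.

```python
def pareto_front(points):
    """Return the non-dominated (performance, energy) points, sorted.

    Higher performance is better; lower energy is better. A point p is
    dominated if some other distinct point q has q.perf >= p.perf and
    q.energy <= p.energy.
    """
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] >= p[0] and q[1] <= p[1]
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# Made-up schedules as (relative performance, relative energy).
schedules = [(1.0, 5.0), (0.9, 3.0), (0.8, 4.0), (0.7, 2.0)]
front = pareto_front(schedules)
print(front)
```

Here `(0.8, 4.0)` drops out because `(0.9, 3.0)` is both faster and cheaper; the remaining points each represent a different performance/energy trade-off.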
Virtues and limitations of commodity hardware transactional memory
Nuno Diegues, P. Romano, L. Rodrigues
{"title":"Virtues and limitations of commodity hardware transactional memory","authors":"Nuno Diegues, P. Romano, L. Rodrigues","doi":"10.1145/2628071.2628080","DOIUrl":"https://doi.org/10.1145/2628071.2628080","url":null,"abstract":"Over the last years Transactional Memory (TM) gained growing popularity as a simpler, attractive alternative to classic lock-based synchronization schemes. Recently, the TM landscape has been profoundly changed by the integration of Hardware TM (HTM) in Intel commodity processors, raising a number of questions on the future of TM. We seek answers to these questions by conducting the largest study on TM to date, comparing different locking techniques, hardware and software TMs, as well as different combinations of these mechanisms, from the dual perspective of performance and power consumption. Our study sheds a mix of light and shadows on currently available commodity HTM: on one hand, we identify workloads in which HTM clearly outperforms any alternative synchronization mechanism; on the other hand, we show that current HTM implementations suffer from restrictions that narrow the scope in which these can be more effective than state of the art software solutions. Thanks to the results of our study, we identify a number of compelling research problems in the areas of TM design, compilers and self-tuning.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127639098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 70
Warp-aware trace scheduling for GPUs
James A. Jablin, T. Jablin, O. Mutlu, M. Herlihy
{"title":"Warp-aware trace scheduling for GPUs","authors":"James A. Jablin, T. Jablin, O. Mutlu, M. Herlihy","doi":"10.1145/2628071.2628101","DOIUrl":"https://doi.org/10.1145/2628071.2628101","url":null,"abstract":"GPU performance depends not only on thread/warp level parallelism (TLP) but also on instruction-level parallelism (ILP). It is not enough to schedule instructions within basic blocks; it is also necessary to exploit opportunities for ILP optimization beyond branch boundaries. Unfortunately, modern GPUs cannot dynamically carry out such optimizations because they lack hardware branch prediction and cannot speculatively execute instructions beyond a branch. We propose to circumvent these limitations by adapting Trace Scheduling, a technique originally developed for microcode optimization. Trace Scheduling divides code into traces (or paths), and optimizes each trace in a context-independent way. Adapting Trace Scheduling to GPU code requires revisiting and revising each step of microcode Trace Scheduling to attend to branch and warp behavior, identifying instructions on the critical path, avoiding warp divergence, and reducing divergence time. Here, we propose \"Warp-Aware Trace Scheduling\" for GPUs. As evaluated on the Rodinia Benchmark Suite using dynamic profiling, our fully-automatic optimization achieves a geometric mean speedup of 1.10× on a real system by increasing instructions executed per cycle (IPC) by a harmonic mean of 1.12× and reducing instruction serialization and total instructions executed.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134405081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 24
Bounded memory scheduling of dynamic task graphs
Dragos Sbirlea, Zoran Budimlic, Vivek Sarkar
{"title":"Bounded memory scheduling of dynamic task graphs","authors":"Dragos Sbirlea, Zoran Budimlic, Vivek Sarkar","doi":"10.1145/2628071.2628090","DOIUrl":"https://doi.org/10.1145/2628071.2628090","url":null,"abstract":"It is now widely recognized that increased levels of parallelism are a necessary condition for improved application performance on multicore computers. However, as the number of cores increases, the memory-per-core ratio is expected to further decrease, making per-core memory efficiency of parallel programs an even more important concern in future systems. For many parallel applications, the memory requirements can be significantly larger than for their sequential counterparts and, more importantly, their memory utilization depends critically on the schedule used when running them. To address this problem we propose bounded memory scheduling (BMS) for parallel programs expressed as dynamic task graphs, in which an upper bound is imposed on the program's peak memory. Using the inspector/executor model, BMS tailors the set of allowable schedules to either guarantee that the program can be executed within the given memory bound, or throw an error during the inspector phase without running the computation if no feasible schedule can be found. Since solving BMS is NP-hard, we propose an approach in which we first use our heuristic algorithm, and if it fails we fall back on a more expensive optimal approach which is sped up by the best-effort result of the heuristic. Through evaluation on seven benchmarks, we show that BMS gracefully spans the spectrum between fully parallel and serial execution with decreasing memory bounds. Comparison with OpenMP shows that BMS-CnC can execute in 53% of the memory required by OpenMP while running at 90% (or more) of OpenMP's performance.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133461259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 17
Locality-aware memory association for multi-target worksharing in OpenMP
T. Scogland, W. Feng
{"title":"Locality-aware memory association for multi-target worksharing in OpenMP","authors":"T. Scogland, W. Feng","doi":"10.1145/2628071.2671428","DOIUrl":"https://doi.org/10.1145/2628071.2671428","url":null,"abstract":"Heterogeneity is an ever-growing challenge in computing. The clearest example is the increasing popularity of GPUs, and purpose-designed coprocessors such as Intel Xeon Phi. Even disregarding coprocessors, heterogeneity continues to increase with the rise in CPU core counts, adaptive per-core frequencies, and increasingly hierarchical and complex memory systems. Take a system with four memory nodes, associated with four cores each, and four GPUs, each with a distinct address space and tens to hundreds of cores programmed like a bulk-synchronous parallel cluster. In this case, we are effectively programming clusters of miniature constellations in every node.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132101146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
ILP and TLP in shared memory applications: A limit study
Ehsan Fatehi, Paul V. Gratz
{"title":"ILP and TLP in shared memory applications: A limit study","authors":"Ehsan Fatehi, Paul V. Gratz","doi":"10.1145/2628071.2628093","DOIUrl":"https://doi.org/10.1145/2628071.2628093","url":null,"abstract":"With the breakdown of Dennard scaling, future processor designs will be at the mercy of power limits as Chip MultiProcessor (CMP) designs scale out to many-cores. It is critical, therefore, that future CMPs be optimally designed in terms of performance efficiency with respect to power. A characterization analysis of future workloads is imperative to ensure maximum returns of performance per Watt consumed. Hence, a detailed analysis of emerging workloads is necessary to understand their characteristics with respect to hardware in terms of power and performance tradeoffs. In this paper, we conduct a limit study simultaneously analyzing the two dominant forms of parallelism exploited by modern computer architectures: Instruction Level Parallelism (ILP) and Thread Level Parallelism (TLP). This study gives insights into the upper bounds of performance that future architectures can achieve. Furthermore it identifies the bottlenecks of emerging workloads. To the best of our knowledge, our work is the first study that combines the two forms of parallelism into one study with modern applications. We evaluate the PARSEC multithreaded benchmark suite using a specialized trace-driven simulator. We make several contributions describing the high-level behavior of next-generation applications. For example, we show these applications contain up to a factor of 929× more ILP than what is currently being extracted from real machines. We then show the effects of breaking the application into increasing numbers of threads (exploiting TLP), instruction window size, realistic branch prediction, realistic memory latency, and thread dependencies on exploitable ILP. Our examination shows that these benchmarks differed vastly from one another. As a result, we expect no single, homogeneous, micro-architecture will work optimally for all, arguing for reconfigurable, heterogeneous designs.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"5 3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132443350","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
Automatic optimization of thread-coarsening for graphics processors
A. Magni, Christophe Dubach, M. O’Boyle
{"title":"Automatic optimization of thread-coarsening for graphics processors","authors":"A. Magni, Christophe Dubach, M. O’Boyle","doi":"10.1145/2628071.2628087","DOIUrl":"https://doi.org/10.1145/2628071.2628087","url":null,"abstract":"OpenCL has been designed to achieve functional portability across multi-core devices from different vendors. However, the lack of a single cross-target optimizing compiler severely limits performance portability of OpenCL programs. Programmers need to manually tune applications for each specific device, preventing effective portability. We target a compiler transformation specific for data-parallel languages: thread-coarsening and show it can improve performance across different GPU devices. We then address the problem of selecting the best value for the coarsening factor parameter, i.e., deciding how many threads to merge together. We experimentally show that this is a hard problem to solve: good configurations are difficult to find and naive coarsening in fact leads to substantial slowdowns. We propose a solution based on a machine-learning model that predicts the best coarsening factor using kernel-function static features. The model automatically specializes to the different architectures considered. We evaluate our approach on 17 benchmarks on four devices: two Nvidia GPUs and two different generations of AMD GPUs. Using our technique, we achieve speedups between 1.11× and 1.33× on average.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117048992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 71
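Thread coarsening, as described in the abstract above, merges the work of several threads into one, with the coarsening factor saying how many. The following Python model shows the transformation on a vector addition; it is a sequential simulation for illustration only (real coarsening is a compiler pass over OpenCL kernels), and the helper names `launch`, `vector_add_thread` and `vector_add_coarsened` are hypothetical. The contiguous index mapping used here is one possible choice; a strided mapping is another.

```python
import math


def vector_add_thread(tid, a, b, out):
    """Original kernel body: one thread produces one output element."""
    out[tid] = a[tid] + b[tid]


def vector_add_coarsened(tid, a, b, out, factor):
    """Coarsened body: one thread does the work of `factor` original threads."""
    n = len(out)
    for k in range(factor):
        i = tid * factor + k
        if i < n:  # guard the tail when n is not a multiple of factor
            out[i] = a[i] + b[i]


def launch(kernel, num_threads, *args):
    """Sequential stand-in for a GPU launch: run the body once per thread id."""
    for tid in range(num_threads):
        kernel(tid, *args)


a = list(range(10))
b = [2 * x for x in a]

baseline = [0] * len(a)
launch(vector_add_thread, len(a), a, b, baseline)

factor = 4
coarsened = [0] * len(a)
num_coarse_threads = math.ceil(len(a) / factor)
launch(vector_add_coarsened, num_coarse_threads, a, b, coarsened, factor)

print(num_coarse_threads, coarsened == baseline)
```

The results are identical while the thread count drops from 10 to 3; whether that trade pays off on real hardware depends on the device and the kernel, which is the selection problem the paper addresses with a learned model.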