Proceedings of the 2016 International Conference on Supercomputing: Latest Publications

AEQUITAS: Coordinated Energy Management Across Parallel Applications
Proceedings of the 2016 International Conference on Supercomputing | Pub Date: 2016-06-01 | DOI: 10.1145/2925426.2926260
Haris Ribic, Yu David Liu
{"title":"AEQUITAS: Coordinated Energy Management Across Parallel Applications","authors":"Haris Ribic, Yu David Liu","doi":"10.1145/2925426.2926260","DOIUrl":"https://doi.org/10.1145/2925426.2926260","url":null,"abstract":"A growing number of energy optimization solutions operate at the application runtime level. Despite delivering promising results, these application-scoped optimizations are fundamentally greedy: They assume to have an exclusive access to power management and often perform poorly when multiple power-managing applications co-exist, or different threads of the same application share power management hardware. In this paper, we introduce AEQUITAS, a first step to address this critical yet largely overlooked problem. The insight behind AEQUITAS is that co-existing applications may view power-managing hardware as a shared resource and coordinate power management decisions. As a concrete instance of this philosophy, we evaluated our ideas on top of a state-of-the-art energy-efficient work-stealing runtime. Experiments show that without AEQUITAS, multiple co-existing power-managing application runtimes suffer up to 32% performance loss and negate all power savings. With AEQUITAS, the beneficial energy-performance tradeoff reported in the single-application setting (12.9% energy savings and 2.5% performance loss) can be retained, but in a much more challenging setting where multiple power-managing runtimes co-exist on parallel architectures and multiple CPU cores share the same power domain.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126835914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
Prefetching Techniques for Near-memory Throughput Processors
Proceedings of the 2016 International Conference on Supercomputing | Pub Date: 2016-06-01 | DOI: 10.1145/2925426.2926282
Reena Panda, Yasuko Eckert, N. Jayasena, Onur Kayiran, Michael Boyer, L. John
{"title":"Prefetching Techniques for Near-memory Throughput Processors","authors":"Reena Panda, Yasuko Eckert, N. Jayasena, Onur Kayiran, Michael Boyer, L. John","doi":"10.1145/2925426.2926282","DOIUrl":"https://doi.org/10.1145/2925426.2926282","url":null,"abstract":"Near-memory processing or processing-in-memory (PIM) is regaining a lot of interest recently as a viable solution to overcome the challenges imposed by memory wall. This trend has been mainly fueled by the emergence of 3D-stacked memories. GPUs are touted as great candidates for in-memory processors due to their superior bandwidth utilization capabilities. Although putting a GPU core beneath memory exposes it to unprecedented memory bandwidth, in this paper, we demonstrate that significant opportunities still exist to improve the performance of the simpler, in-memory GPU processors (GPU-PIM) by improving their memory performance. Thus, we propose three light-weight, practical memory-side prefetchers to improve the performance of GPU-PIM systems. The proposed prefetchers exploit the patterns in individual memory accesses and synergy in the wavefront-localized memory streams, combined with a better understanding of the memory-system state, to prefetch from DRAM row buffers into on-chip prefetch buffers, thereby achieving over 75% prefetcher accuracy and 40% improvement in row buffer locality. In order to maximize utilization of prefetched data and minimize thrashing, the prefetchers also use a novel prefetch buffer management policy based on a unique dead-row prediction mechanism together with an eviction-based prefetch-trigger policy to control their aggressiveness. The proposed prefetchers improve performance by over 60% (max) and 9% on average as compared to the baseline, while achieving over 33% of the performance benefits of perfect-L2 using less than 5.6KB of additional hardware. The proposed prefetchers also outperform the state-of-the-art memory-side prefetcher, OWL by more than 20%.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115114294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
HOPE: Enabling Efficient Service Orchestration in Software-Defined Data Centers
Proceedings of the 2016 International Conference on Supercomputing | Pub Date: 2016-06-01 | DOI: 10.1145/2925426.2926257
Yang Hu, Chao Li, Longjun Liu, Tao Li
{"title":"HOPE: Enabling Efficient Service Orchestration in Software-Defined Data Centers","authors":"Yang Hu, Chao Li, Longjun Liu, Tao Li","doi":"10.1145/2925426.2926257","DOIUrl":"https://doi.org/10.1145/2925426.2926257","url":null,"abstract":"The functional scope of today's software-defined data centers (SDDC) has expanded to such an extent that servers face a growing amount of critical background operational tasks like load monitoring, logging, migration, and duplication, etc. These ancillary operations, which we refer to as management operations, often nibble the stringent data center power envelope and exert a tremendous amount of pressure on front-end user tasks. However, existing power capping, peak shaving, and time shifting mechanisms mainly focus on managing data center power demand at the \"macro level\" -- they do not distinguish ancillary background services from user tasks, and therefore often incur significant performance degradation and energy overhead. In this study we explore \"micro-level\" power management in SDDC: tuning a specific set of critical loads for the sake of overall system efficiency and performance. Specifically, we look at management operations that can often lead to resource contention and energy overhead in an IaaS SDDC. We assess the feasibility of this new power management paradigm by characterizing the resource and power impact of various management operations. We propose HOPE, a new system optimization framework for eliminating the potential efficiency bottleneck caused by the management operations in the SDDC. HOPE is implemented on a customized OpenStack cloud environment with heavily instrumented power infrastructure. We thoroughly validate HOPE models and optimization efficacy under various user workload scenarios. Our deployment experiences show that the proposed technique allows SDDC to reduce energy consumption by 19%, reduce management operation execution time by 25.4%, and in the meantime improve workload performance by 30%.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125382621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
Write-Aware Management of NVM-based Memory Extensions
Proceedings of the 2016 International Conference on Supercomputing | Pub Date: 2016-06-01 | DOI: 10.1145/2925426.2926284
Amro Awad, S. Blagodurov, Yan Solihin
{"title":"Write-Aware Management of NVM-based Memory Extensions","authors":"Amro Awad, S. Blagodurov, Yan Solihin","doi":"10.1145/2925426.2926284","DOIUrl":"https://doi.org/10.1145/2925426.2926284","url":null,"abstract":"Emerging Non-Volatile Memory (NVM) technologies, such as 3D XPoint, are expected to be in production as early as 2016. Emerging NVMs are very attractive for several reasons. First, they are non-volatile and hence incur no refresh power. Second, they are dense and promising for scaling down further. Finally, they are fast and have latencies comparable to DRAM. On the other side, using emerging NVMs as direct replacement for DRAM as the main memory is challenging. Compared to DRAM, emerging NVMs can endure a very limited number of writes per cell. Furthermore, their write latency is typically much slower and more energy consuming than DRAM, e.g., Phase Change Memory (PCM) writes are multiple of times slower than that of DRAM. An important use case for emerging NVMs is using them as fast memory extensions. Memory extensions are hidden from programmers and managed by the Operating System (OS). Any access to pages held in the memory extension will cause a page fault. Later, the memory manager moves the faulting page to DRAM and maps the page. While similar in concept to the swap file, memory extensions bypass the file system. Furthermore, memory extensions are dedicated for being used as memory and hence avoid contention with the file system. In this paper, we emulate an NVM-based memory extension and study its impact on performance on a real system. We also study how to improve its performance using OS-level prefetching. We show the importance of having the system software and the NVM controller work in concert for reducing the number of writes. Our best scheme where the system software and the NVM controller work in concert could reduce the number of writes to only 5% of the original baseline (increasing its lifetime by 20x).","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131991526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 24
High Performance Design for HDFS with Byte-Addressability of NVM and RDMA
Proceedings of the 2016 International Conference on Supercomputing | Pub Date: 2016-06-01 | DOI: 10.1145/2925426.2926290
Nusrat S. Islam, Md. Wasi-ur-Rahman, Xiaoyi Lu, D. Panda
{"title":"High Performance Design for HDFS with Byte-Addressability of NVM and RDMA","authors":"Nusrat S. Islam, Md. Wasi-ur-Rahman, Xiaoyi Lu, D. Panda","doi":"10.1145/2925426.2926290","DOIUrl":"https://doi.org/10.1145/2925426.2926290","url":null,"abstract":"Non-Volatile Memory (NVM) offers byte-addressability with DRAM like performance along with persistence. Thus, NVMs provide the opportunity to build high-throughput storage systems for data-intensive applications. HDFS (Hadoop Distributed File System) is the primary storage engine for MapReduce, Spark, and HBase. Even though HDFS was initially designed for commodity hardware, it is increasingly being used on HPC (High Performance Computing) clusters. The outstanding performance requirements of HPC systems make the I/O bottlenecks of HDFS a critical issue to rethink its storage architecture over NVMs. In this paper, we present a novel design for HDFS to leverage the byte-addressability of NVM for RDMA (Remote Direct Memory Access)-based communication. We analyze the performance potential of using NVM for HDFS and re-design HDFS I/O with memory semantics to exploit the byte-addressability fully. We call this design NVFS (NVM- and RDMA-aware HDFS). We also present cost-effective acceleration techniques for HBase and Spark to utilize the NVM-based design of HDFS by storing only the HBase Write Ahead Logs and Spark job outputs to NVM, respectively. We also propose enhancements to use the NVFS design as a burst buffer for running Spark jobs on top of parallel file systems like Lustre. Performance evaluations show that our design can improve the write and read throughputs of HDFS by up to 4x and 2x, respectively. The execution times of data generation benchmarks are reduced by up to 45%. The proposed design also reduces the overall execution time of the SWIM workload by up to 18% over HDFS with a maximum benefit of 37% for job-38. For Spark TeraSort, our proposed scheme yields a performance gain of up to 11%. The performances of HBase insert, update, and read operations are improved by 21%, 16%, and 26%, respectively. Our NVM-based burst buffer can improve the I/O performance of Spark PageRank by up to 24% over Lustre. To the best of our knowledge, this paper is the first attempt to incorporate NVM with RDMA for HDFS.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127307989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 68
Balanced Hashing and Efficient GPU Sparse General Matrix-Matrix Multiplication
Proceedings of the 2016 International Conference on Supercomputing | Pub Date: 2016-06-01 | DOI: 10.1145/2925426.2926273
Pham Nguyen Quang Anh, Rui Fan, Yonggang Wen
{"title":"Balanced Hashing and Efficient GPU Sparse General Matrix-Matrix Multiplication","authors":"Pham Nguyen Quang Anh, Rui Fan, Yonggang Wen","doi":"10.1145/2925426.2926273","DOIUrl":"https://doi.org/10.1145/2925426.2926273","url":null,"abstract":"General sparse matrix-matrix multiplication (SpGEMM) is a core component of many algorithms. A number of recent works have used high throughput graphics processing units (GPUs) to accelerate SpGEMM. However, exploiting the power of GPUs for SpGEMM requires addressing a number of challenges, including highly imbalanced workloads and large numbers of inefficient random global memory accesses. This paper presents a SpGEMM algorithm which uses several novel techniques to overcome these problems. We first propose two low cost methods to achieve perfect load balancing during the most expensive step in SpGEMM. Next, we show how to eliminate nearly all random global memory accesses using shared memory based hash tables. To optimize the performance of the hash tables, we propose a lightweight method to estimate the number of nonzeros in the output matrix. We compared our algorithm to the CUSP, CUSPARSE and the state-of-the-art BHSPARSE GPU SpGEMM algorithms, and show that it performs 5.6x, 2.4x and 1.5x better on average, and up to 11.8x, 9.5x and 2.5x better in the best case, respectively. Furthermore, we show that our algorithm performs especially well on highly imbalanced and unstructured matrices.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130700390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 33
Coherence-Free Multiview: Enabling Reference-Discerning Data Placement on GPU
Proceedings of the 2016 International Conference on Supercomputing | Pub Date: 2016-06-01 | DOI: 10.1145/2925426.2926277
Guoyang Chen, Xipeng Shen
{"title":"Coherence-Free Multiview: Enabling Reference-Discerning Data Placement on GPU","authors":"Guoyang Chen, Xipeng Shen","doi":"10.1145/2925426.2926277","DOIUrl":"https://doi.org/10.1145/2925426.2926277","url":null,"abstract":"A Graphic Processing Unit (GPU) system is typically equipped with many types of memory (e.g., global, constant, texture, shared, cache). Data placement determines what data are placed on which type of memory, essential for GPU memory performance. Prior optimizations of data placement always require a single view of a data object on memory, which limits the optimization effectiveness. In this work, we propose coherence-free multiview, an approach that allows multiple views of a single data object to co-exist on GPU memory during a GPU kernel execution. We demonstrate that under certain conditions, the multiple views can remain incoherent while facilitating enhanced data placement. We present a theorem and some compiler support to ensure the soundness of the usage of coherence-free multiview. We further develop reference-discerning data placement, a new way to enhance data placements on GPU. It enables more flexible data placements by using coherence-free multiview to leverage the slack in coherence requirement of some GPU programs. Experiments on three types of GPU systems show that, with less than 200KB space cost, the new data placement technique can provide a 1.6X average (up to 4.27X) speedup.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133406942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Barrier-Aware Warp Scheduling for Throughput Processors
Proceedings of the 2016 International Conference on Supercomputing | Pub Date: 2016-06-01 | DOI: 10.1145/2925426.2926267
Yuxi Liu, Zhibin Yu, L. Eeckhout, V. Reddi, Yingwei Luo, Xiaolin Wang, Zhenlin Wang, Chengzhong Xu
{"title":"Barrier-Aware Warp Scheduling for Throughput Processors","authors":"Yuxi Liu, Zhibin Yu, L. Eeckhout, V. Reddi, Yingwei Luo, Xiaolin Wang, Zhenlin Wang, Chengzhong Xu","doi":"10.1145/2925426.2926267","DOIUrl":"https://doi.org/10.1145/2925426.2926267","url":null,"abstract":"Parallel GPGPU applications rely on barrier synchronization to align thread block activity. Few prior work has studied and characterized barrier synchronization within a thread block and its impact on performance. In this paper, we find that barriers cause substantial stall cycles in barrier-intensive GPGPU applications although GPGPUs employ lightweight hardware-support barriers. To help investigate the reasons, we define the execution between two adjacent barriers of a thread block as a warp-phase. We find that the execution progress within a warp-phase varies dramatically across warps, which we call warp-phase-divergence. While warp-phase-divergence may result from execution time disparity among warps due to differences in application code or input, and/or shared resource contention, we also pinpoint that warp-phase-divergence may result from warp scheduling. To mitigate barrier induced stall cycle inefficiency, we propose barrier-aware warp scheduling (BAWS). It combines two techniques to improve the performance of barrier-intensive GPGPU applications. The first technique, most-waiting-first (MWF), assigns a higher scheduling priority to the warps of a thread block that has a larger number of warps waiting at a barrier. The second technique, critical-fetch-first (CFF), fetches instructions from the warp to be issued by MWF in the next cycle. To evaluate the efficiency of BAWS, we consider 13 barrier-intensive GPGPU applications, and we report that BAWS speeds up performance by 17% and 9% on average (and up to 35% and 30%) over loosely-round-robin (LRR) and greedy-then-oldest (GTO) warp scheduling, respectively. We compare BAWS against recent concurrent work SAWS, finding that BAWS outperforms SAWS by 7% on average and up to 27%. For non-barrier-intensive workloads, we demonstrate that BAWS is performance-neutral compared to GTO and SAWS, while improving performance by 5.7% on average (and up to 22%) compared to LRR. BAWS' hardware cost is limited to 6 bytes per streaming multiprocessor (SM).","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130037055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 22
CuMAS: Data Transfer Aware Multi-Application Scheduling for Shared GPUs
Proceedings of the 2016 International Conference on Supercomputing | Pub Date: 2016-06-01 | DOI: 10.1145/2925426.2926271
M. Belviranli, Farzad Khorasani, L. Bhuyan, Rajiv Gupta
{"title":"CuMAS: Data Transfer Aware Multi-Application Scheduling for Shared GPUs","authors":"M. Belviranli, Farzad Khorasani, L. Bhuyan, Rajiv Gupta","doi":"10.1145/2925426.2926271","DOIUrl":"https://doi.org/10.1145/2925426.2926271","url":null,"abstract":"Recent generations of GPUs and their corresponding APIs provide means for sharing compute resources among multiple applications with greater efficiency than ever. This advance has enabled the GPUs to act as shared computation resources in multi-user environments, like supercomputers and cloud computing. Recent research has focused on maximizing the utilization of GPU computing resources by simultaneously executing multiple GPU applications (i.e., concurrent kernels) via temporal or spatial partitioning. However, they have not considered maximizing the utilization of the PCI-e bus which is equally important as applications spend a considerable amount of time on data transfers. In this paper, we present a complete execution framework, CuMAS, to enable `data-transfer aware' sharing of GPUs across multiple CUDA applications. We develop a novel host-side CUDA task scheduler and a corresponding runtime, to capture multiple CUDA calls and re-order them for improved overall system utilization. Different from the preceding studies, CuMAS scheduler treats PCI-e up-link & down-link buses and the GPU itself as separate resources. It schedules corresponding phases of CUDA applications so that the total resource utilization is maximized. We demonstrate that the data-transfer aware nature of CuMAS framework improves the throughput of simultaneously executed CUDA applications by up to 44% when run on NVIDIA K40c GPU using applications from CUDA SDK and Rodinia benchmark suite.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127167444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 23
Hybrid CPU-GPU scheduling and execution of tree traversals
Proceedings of the 2016 International Conference on Supercomputing | Pub Date: 2016-06-01 | DOI: 10.1145/2925426.2926261
Jianqiao Liu, Nikhil Hegde, Milind Kulkarni
{"title":"Hybrid CPU-GPU scheduling and execution of tree traversals","authors":"Jianqiao Liu, Nikhil Hegde, Milind Kulkarni","doi":"10.1145/2925426.2926261","DOIUrl":"https://doi.org/10.1145/2925426.2926261","url":null,"abstract":"GPUs offer the promise of massive, power-efficient parallelism. However, exploiting this parallelism requires code to be carefully structured to deal with the limitations of the SIMT execution model. In recent years, there has been much interest in mapping irregular applications to GPUs: applications with unpredictable, data-dependent behaviors. While most of the work in this space has focused on ad hoc implementations of specific algorithms, recent work has looked at generic techniques for mapping a large class of tree traversal algorithms to GPUs, through careful restructuring of the tree traversal algorithms to make them behave more regularly. Unfortunately, even this general approach for GPU execution of tree traversal algorithms is reliant on ad hoc, hand-written, algorithm-specific scheduling (i.e., assignment of threads to warps) to achieve high performance. The key challenge of scheduling is that it is a highly irregular process, that requires the inspection of thread behavior and then careful sorting of those threads into warps. In this paper, we present a novel scheduling and execution technique for tree traversal algorithms that is both general and automatic. The key novelty is a hybrid, inspector-executor approach: the GPU partially executes tasks to inspect thread behavior and transmits information back to the CPU, which uses that information to perform the scheduling itself, before executing the remaining, carefully scheduled, portion of the traversals on the GPU. We applied this framework to six tree traversal algorithms, achieving significant speedups over optimized GPU code that does not perform application-specific scheduling. Further, we show that in many cases, our hybrid approach is able to deliver better performance even than GPU code that uses handtuned, application-specific scheduling.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130276569","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 22