{"title":"AEQUITAS: Coordinated Energy Management Across Parallel Applications","authors":"Haris Ribic, Yu David Liu","doi":"10.1145/2925426.2926260","DOIUrl":"https://doi.org/10.1145/2925426.2926260","url":null,"abstract":"A growing number of energy optimization solutions operate at the application runtime level. Despite delivering promising results, these application-scoped optimizations are fundamentally greedy: They assume to have an exclusive access to power management and often perform poorly when multiple power-managing applications co-exist, or different threads of the same application share power management hardware. In this paper, we introduce AEQUITAS, a first step to address this critical yet largely overlooked problem. The insight behind AEQUITAS is that co-existing applications may view power-managing hardware as a shared resource and coordinate power management decisions. As a concrete instance of this philosophy, we evaluated our ideas on top of a state-of-the-art energy-efficient work-stealing runtime. Experiments show that without AEQUITAS, multiple co-existing power-managing application runtimes suffer up to 32% performance loss and negate all power savings. With AEQUITAS, the beneficial energy-performance tradeoff reported in the single-application setting (12.9% energy savings and 2.5% performance loss) can be retained, but in a much more challenging setting where multiple power-managing runtimes co-exist on parallel architectures and multiple CPU cores share the same power domain.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126835914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prefetching Techniques for Near-memory Throughput Processors","authors":"Reena Panda, Yasuko Eckert, N. Jayasena, Onur Kayiran, Michael Boyer, L. John","doi":"10.1145/2925426.2926282","DOIUrl":"https://doi.org/10.1145/2925426.2926282","url":null,"abstract":"Near-memory processing or processing-in-memory (PIM) is regaining a lot of interest recently as a viable solution to overcome the challenges imposed by memory wall. This trend has been mainly fueled by the emergence of 3D-stacked memories. GPUs are touted as great candidates for in-memory processors due to their superior bandwidth utilization capabilities. Although putting a GPU core beneath memory exposes it to unprecedented memory bandwidth, in this paper, we demonstrate that significant opportunities still exist to improve the performance of the simpler, in-memory GPU processors (GPU-PIM) by improving their memory performance. Thus, we propose three light-weight, practical memory-side prefetchers to improve the performance of GPU-PIM systems. The proposed prefetchers exploit the patterns in individual memory accesses and synergy in the wavefront-localized memory streams, combined with a better understanding of the memory-system state, to prefetch from DRAM row buffers into on-chip prefetch buffers, thereby achieving over 75% prefetcher accuracy and 40% improvement in row buffer locality. In order to maximize utilization of prefetched data and minimize thrashing, the prefetchers also use a novel prefetch buffer management policy based on a unique dead-row prediction mechanism together with an eviction-based prefetch-trigger policy to control their aggressiveness. The proposed prefetchers improve performance by over 60% (max) and 9% on average as compared to the baseline, while achieving over 33% of the performance benefits of perfect-L2 using less than 5.6KB of additional hardware. The proposed prefetchers also outperform the state-of-the-art memory-side prefetcher, OWL by more than 20%.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115114294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HOPE: Enabling Efficient Service Orchestration in Software-Defined Data Centers","authors":"Yang Hu, Chao Li, Longjun Liu, Tao Li","doi":"10.1145/2925426.2926257","DOIUrl":"https://doi.org/10.1145/2925426.2926257","url":null,"abstract":"The functional scope of today's software-defined data centers (SDDC) has expanded to such an extent that servers face a growing amount of critical background operational tasks like load monitoring, logging, migration, and duplication, etc. These ancillary operations, which we refer to as management operations, often nibble the stringent data center power envelope and exert a tremendous amount of pressure on front-end user tasks. However, existing power capping, peak shaving, and time shifting mechanisms mainly focus on managing data center power demand at the \"macro level\" -- they do not distinguish ancillary background services from user tasks, and therefore often incur significant performance degradation and energy overhead. In this study we explore \"micro-level\" power management in SDDC: tuning a specific set of critical loads for the sake of overall system efficiency and performance. Specifically, we look at management operations that can often lead to resource contention and energy overhead in an IaaS SDDC. We assess the feasibility of this new power management paradigm by characterizing the resource and power impact of various management operations. We propose HOPE, a new system optimization framework for eliminating the potential efficiency bottleneck caused by the management operations in the SDDC. HOPE is implemented on a customized OpenStack cloud environment with heavily instrumented power infrastructure. We thoroughly validate HOPE models and optimization efficacy under various user workload scenarios. Our deployment experiences show that the proposed technique allows SDDC to reduce energy consumption by 19%, reduce management operation execution time by 25.4%, and in the meantime improve workload performance by 30%.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125382621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Write-Aware Management of NVM-based Memory Extensions","authors":"Amro Awad, S. Blagodurov, Yan Solihin","doi":"10.1145/2925426.2926284","DOIUrl":"https://doi.org/10.1145/2925426.2926284","url":null,"abstract":"Emerging Non-Volatile Memory (NVM) technologies, such as 3D XPoint, are expected to be in production as early as 2016. Emerging NVMs are very attractive for several reasons. First, they are non-volatile and hence incur no refresh power. Second, they are dense and promising for scaling down further. Finally, they are fast and have latencies comparable to DRAM. On the other side, using emerging NVMs as direct replacement for DRAM as the main memory is challenging. Compared to DRAM, emerging NVMs can endure a very limited number of writes per cell. Furthermore, their write latency is typically much slower and more energy consuming than DRAM, e.g., Phase Change Memory (PCM) writes are multiple of times slower than that of DRAM. An important use case for emerging NVMs is using them as fast memory extensions. Memory extensions are hidden from programmers and managed by the Operating System (OS). Any access to pages held in the memory extension will cause a page fault. Later, the memory manager moves the faulting page to DRAM and maps the page. While similar in concept to the swap file, memory extensions bypass the file system. Furthermore, memory extensions are dedicated for being used as memory and hence avoid contention with the file system. In this paper, we emulate an NVM-based memory extension and study its impact on performance on a real system. We also study how to improve its performance using OS-level prefetching. We show the importance of having the system software and the NVM controller work in concert for reducing the number of writes. Our best scheme where the system software and the NVM controller work in concert could reduce the number of writes to only 5% of the original baseline (increasing its lifetime by 20x).","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131991526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High Performance Design for HDFS with Byte-Addressability of NVM and RDMA","authors":"Nusrat S. Islam, Md. Wasi-ur-Rahman, Xiaoyi Lu, D. Panda","doi":"10.1145/2925426.2926290","DOIUrl":"https://doi.org/10.1145/2925426.2926290","url":null,"abstract":"Non-Volatile Memory (NVM) offers byte-addressability with DRAM like performance along with persistence. Thus, NVMs provide the opportunity to build high-throughput storage systems for data-intensive applications. HDFS (Hadoop Distributed File System) is the primary storage engine for MapReduce, Spark, and HBase. Even though HDFS was initially designed for commodity hardware, it is increasingly being used on HPC (High Performance Computing) clusters. The outstanding performance requirements of HPC systems make the I/O bottlenecks of HDFS a critical issue to rethink its storage architecture over NVMs. In this paper, we present a novel design for HDFS to leverage the byte-addressability of NVM for RDMA (Remote Direct Memory Access)-based communication. We analyze the performance potential of using NVM for HDFS and re-design HDFS I/O with memory semantics to exploit the byte-addressability fully. We call this design NVFS (NVM- and RDMA-aware HDFS). We also present cost-effective acceleration techniques for HBase and Spark to utilize the NVM-based design of HDFS by storing only the HBase Write Ahead Logs and Spark job outputs to NVM, respectively. We also propose enhancements to use the NVFS design as a burst buffer for running Spark jobs on top of parallel file systems like Lustre. Performance evaluations show that our design can improve the write and read throughputs of HDFS by up to 4x and 2x, respectively. The execution times of data generation benchmarks are reduced by up to 45%. The proposed design also reduces the overall execution time of the SWIM workload by up to 18% over HDFS with a maximum benefit of 37% for job-38. For Spark TeraSort, our proposed scheme yields a performance gain of up to 11%. The performances of HBase insert, update, and read operations are improved by 21%, 16%, and 26%, respectively. Our NVM-based burst buffer can improve the I/O performance of Spark PageRank by up to 24% over Lustre. To the best of our knowledge, this paper is the first attempt to incorporate NVM with RDMA for HDFS.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127307989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Balanced Hashing and Efficient GPU Sparse General Matrix-Matrix Multiplication","authors":"Pham Nguyen Quang Anh, Rui Fan, Yonggang Wen","doi":"10.1145/2925426.2926273","DOIUrl":"https://doi.org/10.1145/2925426.2926273","url":null,"abstract":"General sparse matrix-matrix multiplication (SpGEMM) is a core component of many algorithms. A number of recent works have used high throughput graphics processing units (GPUs) to accelerate SpGEMM. However, exploiting the power of GPUs for SpGEMM requires addressing a number of challenges, including highly imbalanced workloads and large numbers of inefficient random global memory accesses. This paper presents a SpGEMM algorithm which uses several novel techniques to overcome these problems. We first propose two low cost methods to achieve perfect load balancing during the most expensive step in SpGEMM. Next, we show how to eliminate nearly all random global memory accesses using shared memory based hash tables. To optimize the performance of the hash tables, we propose a lightweight method to estimate the number of nonzeros in the output matrix. We compared our algorithm to the CUSP, CUSPARSE and the state-of-the-art BHSPARSE GPU SpGEMM algorithms, and show that it performs 5.6x, 2.4x and 1.5x better on average, and up to 11.8x, 9.5x and 2.5x better in the best case, respectively. Furthermore, we show that our algorithm performs especially well on highly imbalanced and unstructured matrices.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130700390","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Coherence-Free Multiview: Enabling Reference-Discerning Data Placement on GPU","authors":"Guoyang Chen, Xipeng Shen","doi":"10.1145/2925426.2926277","DOIUrl":"https://doi.org/10.1145/2925426.2926277","url":null,"abstract":"A Graphic Processing Unit (GPU) system is typically equipped with many types of memory (e.g., global, constant, texture, shared, cache). Data placement determines what data are placed on which type of memory, essential for GPU memory performance. Prior optimizations of data placement always require a single view of a data object on memory, which limits the optimization effectiveness. In this work, we propose coherence-free multiview, an approach that allows multiple views of a single data object to co-exist on GPU memory during a GPU kernel execution. We demonstrate that under certain conditions, the multiple views can remain incoherent while facilitating enhanced data placement. We present a theorem and some compiler support to ensure the soundness of the usage of coherence-free multiview. We further develop reference-discerning data placement, a new way to enhance data placements on GPU. It enables more flexible data placements by using coherence-free multiview to leverage the slack in coherence requirement of some GPU programs. Experiments on three types of GPU systems show that, with less than 200KB space cost, the new data placement technique can provide a 1.6X average (up to 4.27X) speedup.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133406942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Barrier-Aware Warp Scheduling for Throughput Processors","authors":"Yuxi Liu, Zhibin Yu, L. Eeckhout, V. Reddi, Yingwei Luo, Xiaolin Wang, Zhenlin Wang, Chengzhong Xu","doi":"10.1145/2925426.2926267","DOIUrl":"https://doi.org/10.1145/2925426.2926267","url":null,"abstract":"Parallel GPGPU applications rely on barrier synchronization to align thread block activity. Few prior work has studied and characterized barrier synchronization within a thread block and its impact on performance. In this paper, we find that barriers cause substantial stall cycles in barrier-intensive GPGPU applications although GPGPUs employ lightweight hardware-support barriers. To help investigate the reasons, we define the execution between two adjacent barriers of a thread block as a warp-phase. We find that the execution progress within a warp-phase varies dramatically across warps, which we call warp-phase-divergence. While warp-phase-divergence may result from execution time disparity among warps due to differences in application code or input, and/or shared resource contention, we also pinpoint that warp-phase-divergence may result from warp scheduling. To mitigate barrier induced stall cycle inefficiency, we propose barrier-aware warp scheduling (BAWS). It combines two techniques to improve the performance of barrier-intensive GPGPU applications. The first technique, most-waiting-first (MWF), assigns a higher scheduling priority to the warps of a thread block that has a larger number of warps waiting at a barrier. The second technique, critical-fetch-first (CFF), fetches instructions from the warp to be issued by MWF in the next cycle. To evaluate the efficiency of BAWS, we consider 13 barrier-intensive GPGPU applications, and we report that BAWS speeds up performance by 17% and 9% on average (and up to 35% and 30%) over loosely-round-robin (LRR) and greedy-then-oldest (GTO) warp scheduling, respectively. We compare BAWS against recent concurrent work SAWS, finding that BAWS outperforms SAWS by 7% on average and up to 27%. For non-barrier-intensive workloads, we demonstrate that BAWS is performance-neutral compared to GTO and SAWS, while improving performance by 5.7% on average (and up to 22%) compared to LRR. BAWS' hardware cost is limited to 6 bytes per streaming multiprocessor (SM).","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130037055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Replichard: Towards Tradeoff between Consistency and Performance for Metadata","authors":"Zhiying Li, Ruini Xue, Lixiang Ao","doi":"10.1145/2925426.2926292","DOIUrl":"https://doi.org/10.1145/2925426.2926292","url":null,"abstract":"Metadata scalability is critical for distributed systems as the storage scale is growing rapidly. Because of the strict consistency requirement of metadata, many existing metadata services utilize a fundamentally unscalable design for the sake of easy management, while others provide improved scalability but lead to unacceptable latency and management complexity. Without delivering scalable performance, metadata will be the bottleneck of the entire system. Based on the observation that real file dependencies are few, and there are usually more idempotent than non-idempotent operations, we propose a practical strategy, Replichard, allowing a tradeoff between metadata consistency and scalable performance. Replichard provides metadata services through a cluster of metadata servers, in which a flexible consistency scheme is adopted: strict consistency for non-idempotent operations with dynamic write-lock sharding, and relaxed consistency with accuracy estimations of return values where consistency for idempotent requests is relaxed to achieve high throughput. Write-locks are dynamically created at subtree-level and designated to independent metadata servers in an application-oriented manner. A subtree metadata update that occurs on a particular server is replicated to all metadata servers conforming to the application \"start-end\" semantics, resulting in an eventually consistent namespace. An asynchronous notification mechanism is also devised to enable users to deal with potential stale reads from operations of relaxed consistency. A prototype was implemented based on HDFS, and the experimental results show promising scalability and performance for both micro benchmarks and various real-world applications written in Pig, Hive and MapReduce.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126520678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CuMAS: Data Transfer Aware Multi-Application Scheduling for Shared GPUs","authors":"M. Belviranli, Farzad Khorasani, L. Bhuyan, Rajiv Gupta","doi":"10.1145/2925426.2926271","DOIUrl":"https://doi.org/10.1145/2925426.2926271","url":null,"abstract":"Recent generations of GPUs and their corresponding APIs provide means for sharing compute resources among multiple applications with greater efficiency than ever. This advance has enabled the GPUs to act as shared computation resources in multi-user environments, like supercomputers and cloud computing. Recent research has focused on maximizing the utilization of GPU computing resources by simultaneously executing multiple GPU applications (i.e., concurrent kernels) via temporal or spatial partitioning. However, they have not considered maximizing the utilization of the PCI-e bus which is equally important as applications spend a considerable amount of time on data transfers. In this paper, we present a complete execution framework, CuMAS, to enable `data-transfer aware' sharing of GPUs across multiple CUDA applications. We develop a novel host-side CUDA task scheduler and a corresponding runtime, to capture multiple CUDA calls and re-order them for improved overall system utilization. Different from the preceding studies, CuMAS scheduler treats PCI-e up-link & down-link buses and the GPU itself as separate resources. It schedules corresponding phases of CUDA applications so that the total resource utilization is maximized. We demonstrate that the data-transfer aware nature of CuMAS framework improves the throughput of simultaneously executed CUDA applications by up to 44% when run on NVIDIA K40c GPU using applications from CUDA SDK and Rodinia benchmark suite.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127167444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}