{"title":"Ouroboros: virtualized queues for dynamic memory management on GPUs","authors":"Martin Winter, Daniel Mlakar, Mathias Parger, M. Steinberger","doi":"10.1145/3392717.3392742","DOIUrl":"https://doi.org/10.1145/3392717.3392742","url":null,"abstract":"Dynamic memory allocation on a single instruction, multiple threads architecture, like the Graphics Processing Unit (GPU), is challenging and implementation guidelines caution against it. Data structures must rise to the challenge of thousands of concurrently active threads trying to allocate memory. Efficient queueing structures have been used in the past to allow for simple allocation and reuse of memory directly on the GPU but do not scale well to different allocation sizes, as each requires its own queue. In this work, we propose Ouroboros, a virtualized queueing structure, managing dynamically allocatable data chunks, whilst being built on top of these same chunks. Data chunks are interpreted on-the-fly either as building blocks for the virtualized queues or as paged user data. Re-usable user memory is managed in one of two ways, either as individual pages or as chunks containing pages. The queueing structures grow and shrink dynamically, only currently needed queue chunks are held in memory and freed up queue chunks can be reused within the system. Thus, we retain the performance benefits of an efficient, static queue design while keeping the memory requirements low. Performance evaluation on an NVIDIA TITAN V with the native device memory allocator in CUDA 10.1 shows speed-ups between 11X and 412X, with an average of 118X. For real-world testing, we integrate our allocator into faimGraph, a dynamic graph framework with proprietary memory management. Throughout all memory-intensive operations, such as graph initialization and edge updates, our allocator shows similar to improved performance. Additionally, we show improved algorithmic performance on PageRank and Static Triangle Counting. Overall, our memory allocator can be efficiently initialized, allows for high-throughput allocation and offers, with its per-thread allocation model, a drop-in replacement for comparable dynamic memory allocators.","PeriodicalId":346687,"journal":{"name":"Proceedings of the 34th ACM International Conference on Supercomputing","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122797295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Chunking loops with non-uniform workloads","authors":"Indu K. Prabhu, V. K. Nandivada","doi":"10.1145/3392717.3392763","DOIUrl":"https://doi.org/10.1145/3392717.3392763","url":null,"abstract":"Task-parallel languages such as X10 implement dynamic lightweight task-parallel execution model, where programmers are encouraged to express the ideal parallelism in the program. Prior work has used loop chunking to extract useful parallelism from ideal. Traditional loop chunking techniques assume that iterations in the loop are of similar workload, or the behavior of the first few iterations can be used to predict the load in later iterations. However, in loops with non-uniform work distribution, such assumptions do not hold. This problem becomes more complicated in the presence of atomic blocks (critical sections). In this paper, we propose a new optimization called deep-chunking that uses a mixed compile-time and runtime technique to chunk the iterations of the parallel-for-loops, based on the runtime workload of each iteration. We propose a parallel algorithm that is executed by individual threads to efficiently compute their respective chunks so that the overall execution time gets reduced. We prove that the algorithm is correct and is a 2-factor approximation. In addition to simple parallel-for-loops, the proposed deep-chunking can also handle loops with atomic blocks, which lead to exciting challenges. We have implemented deep-chunking in the X10 compiler and studied its performance on the benchmarks taken from IMSuite. We show that on an average, deep-chunking achieves 50.48%, 21.49%, 26.72%, 32.41%, and 28.84% better performance than un-chunked (same as work-stealing), cyclic-, block-, dynamic-, and guided-chunking versions of the code, respectively.","PeriodicalId":346687,"journal":{"name":"Proceedings of the 34th ACM International Conference on Supercomputing","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127307881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling and optimizing NUMA effects and prefetching with machine learning","authors":"Isaac Sánchez Barrera, D. Black-Schaffer, Marc Casas, Miquel Moretó, Anastasiia Stupnikova, Mihail Popov","doi":"10.1145/3392717.3392765","DOIUrl":"https://doi.org/10.1145/3392717.3392765","url":null,"abstract":"Both NUMA thread/data placement and hardware prefetcher configuration have significant impacts on HPC performance. Optimizing both together leads to a large and complex design space that has previously been impractical to explore at runtime. In this work we deliver the performance benefits of optimizing both NUMA thread/data placement and prefetcher configuration at runtime through careful modeling and online profiling. To address the large design space, we propose a prediction model that reduces the amount of input information needed and the complexity of the prediction required. We do so by selecting a subset of performance counters and application configurations that provide the richest profile information as inputs, and by limiting the output predictions to a subset of configurations that cover most of the performance. Our model is robust and can choose near-optimal NUMA+Pre-fetcher configurations for applications from only two profile runs. We further demonstrate how to profile online with low overhead, resulting in a technique that delivers an average of 1.68X performance improvement over a locality-optimized NUMA baseline with all prefetchers enabled.","PeriodicalId":346687,"journal":{"name":"Proceedings of the 34th ACM International Conference on Supercomputing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125295959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterization and identification of HPC applications at leadership computing facility","authors":"Zhengchun Liu, Ryan Lewis, R. Kettimuthu, K. Harms, P. Carns, N. Rao, Ian T Foster, M. Papka","doi":"10.1145/3392717.3392774","DOIUrl":"https://doi.org/10.1145/3392717.3392774","url":null,"abstract":"High Performance Computing (HPC) is an important method for scientific discovery via large-scale simulation, data analysis, or artificial intelligence. Leadership-class supercomputers are expensive, but essential to run large HPC applications. The Petascale era of supercomputers began in 2008, with the first machines achieving performance in excess of one petaflops, and with the advent of new supercomputers in 2021 (e.g., Aurora, Frontier), the Exascale era will soon begin. However, the high theoretical computing capability (i.e., peak FLOPS) of a machine is not the only meaningful target when designing a supercomputer, as the resources demand of applications varies. A deep understanding of the characterization of applications that run on a leadership supercomputer is one of the most important ways for planning its design, development and operation. In order to improve our understanding of HPC applications, user demands and resource usage characteristics, we perform correlative analysis of various logs for different subsystems of a leadership supercomputer. This analysis reveals surprising, sometimes counter-intuitive patterns, which, in some cases, conflicts with existing assumptions, and have important implications for future system designs as well as supercomputer operations. For example, our analysis shows that while the applications spend significant time on MPI, most applications spend very little time on file I/O. Combined analysis of hardware event logs and task failure logs show that the probability of a hardware FATAL event causing task failure is low. Combined analysis of control system logs and file I/O logs reveals that pure POSIX I/O is used more widely than higher level parallel I/O. Based on holistic insights of the application gained through combined and co-analysis of multiple logs from different perspectives and general intuition, we engineer features to \"fingerprint\" HPC applications. We use t-SNE (a machine learning technique for dimensionality reduction) to validate the explainability of our features and finally train machine learning models to identify HPC applications or group those with similar characteristic. To the best of our knowledge, this is the first work that combines logs on file I/O, computing, and inter-node communication for insightful analysis of HPC applications in production.","PeriodicalId":346687,"journal":{"name":"Proceedings of the 34th ACM International Conference on Supercomputing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129911505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compiler aided checkpointing using crash-consistent data structures in NVMM systems","authors":"Tyler Coy, Shuibing He, Bin Ren, Xuechen Zhang","doi":"10.1145/3392717.3392755","DOIUrl":"https://doi.org/10.1145/3392717.3392755","url":null,"abstract":"Scientific applications use checkpointing for failure recovery. The existing checkpointing approaches were proposed for storing persistent states of applications as checkpoints in disk-based file systems via the block interface. As non-volatile main memory (NVMM) will be included in high-performance computing systems, storing the checkpoints in NVMM-based file systems can significantly waste the performance benefits of NVMM. This is because it under-utilizes memory resources and it does not take advantage of the byte-addressability of NVMM. In this paper, we propose an NVMM-aware checkpointing approach, named NV-Checkpoint. It uses a compiler-aided technique to automatically generate multi-version data structures, which consist of both the persistent version of data stored in NVMM for failure recovery and the ephemeral version of data placed across DRAM and NVMM. Because of the byte-addressability of NVMM, any versions can be accessed via the memory interface. The multiple versions may share data that are not mutated during the program's execution to reduce data redundancy. NV-Checkpoint provides the same level of guarantee of failure recovery compared to the conventional checkpointing approaches proposed for file systems. Furthermore, its runtime system manages the layout of the data structures to reduce the number of writes to NVMM. It also manages the checkpointing frequency to reduce persistence overhead using machine learning models. Our experimental results with real-world scientific applications show that the performance of annotated programs with NV-Checkpoint using a hybrid of DRAM and NVMM matches the performance of best-effort hand-written versions. It achieves similar scalability as those with ephemeral data structures using only DRAM. It offers up to 121X speedup of execution time compared to the conventional checkpointing approaches using the Atlas parallel file system on the Titan supercomputer.","PeriodicalId":346687,"journal":{"name":"Proceedings of the 34th ACM International Conference on Supercomputing","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127018515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bundlefly","authors":"Fei Lei, Dezun Dong, Xiangke Liao, J. Duato","doi":"10.1145/3392717.3392747","DOIUrl":"https://doi.org/10.1145/3392717.3392747","url":null,"abstract":"High-performance computing (HPC) systems keep increasing in size and bandwidth, thus requiring larger and higher-bandwidth interconnection networks. The race to exascale just exacerbated this trend. The resulting longer average distance and more links between modules makes the use of optical fiber mandatory. However, the system meets the challenge of cable packaging complexity, cable tolerance, and cable maintainability. Splitter cable, like multi-core fiber (MCF), is a new and cost-effective approach that has the potential to replace a bundle of fibers between any pairs of modules with a single cable, thus lowering the packaging complexity and enhancing the maintainability. To the best of our knowledge, we are the first to formally study the problem of building a cost-effective HPC network topology using multicore fiber. In this paper, a new diameter-3 topology is proposed, namely Bundlefly. It achieves a flexible tradeoff between intra-module radixes and inter-module radixes of routers with merely moderate radix to build a diameter-3 exascale interconnection network. It is suitable for the use of multi-core fiber for the requirement of inter-module bandwidth and cable packaging complexity. We analyze the properties of Bundlefly and present effective routing algorithms. We simulate and analyze the performance of Bundlefly against state-of-the-art topologies. The results show that Bundlefly with flexible configurations can achieve better performance than most existing topologies.","PeriodicalId":346687,"journal":{"name":"Proceedings of the 34th ACM International Conference on Supercomputing","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126670234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RICH: implementing reductions in the cache hierarchy","authors":"Vladimir Dimic, Miquel Moretó, Marc Casas, Jan Ciesko, M. Valero","doi":"10.1145/3392717.3392736","DOIUrl":"https://doi.org/10.1145/3392717.3392736","url":null,"abstract":"Reductions constitute a frequent algorithmic pattern in high-performance and scientific computing. Sophisticated techniques are needed to ensure their correct and scalable concurrent execution on modern processors. Reductions on large arrays represent the most demanding case where traditional approaches are not always applicable due to low performance scalability. To address these challenges, we propose RICH, a runtime-assisted solution that relies on architectural and parallel programming model extensions. RICH updates the reduction variable directly in the cache hierarchy with the help of added in-cache functional units. Our programming model extensions fit with the most relevant parallel programming solutions for shared memory environments like OpenMP. RICH does not modify the ISA, which allows the use of algorithms with reductions from pre-compiled external libraries. Experiments show that our solution achieves the performance improvements of 11.2% on average, compared to the state-of-the-art hardware-based approaches, while it introduces 2.4% area and 3.8% power overhead.","PeriodicalId":346687,"journal":{"name":"Proceedings of the 34th ACM International Conference on Supercomputing","volume":"297 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116455345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Post-moore server architecture","authors":"B. Falsafi","doi":"10.1145/3392717.3400033","DOIUrl":"https://doi.org/10.1145/3392717.3400033","url":null,"abstract":"Cloud providers are building infrastructure at unprecedented speeds building the foundation for global IT services and cost-effective containerized apps. Unfortunately the silicon technologies we have relied on for the past several decades leading to the exponential growth in IT have slowed down in scaling and will soon come to a halt, resulting in diminishing returns in digital platform scalability in the post-Moore era of computing. Meanwhile the basic architecture of a modern server blade still dates back to the CPU-centric desktop PC of the 80âĂŹs managing memory at hardware speeds but accessing the network, storage and now discrete accelerators through the OS, legacy software stacks and peripheral interfaces. This talk will make the case for a clean slate co-design of server software and hardware for the post-Moore era.","PeriodicalId":346687,"journal":{"name":"Proceedings of the 34th ACM International Conference on Supercomputing","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121263516","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"cuRipples","authors":"Marco Minutoli, M. Drocco, M. Halappanavar, Antonino Tumeo, A. Kalyanaraman","doi":"10.1145/3392717.3392750","DOIUrl":"https://doi.org/10.1145/3392717.3392750","url":null,"abstract":"Influence maximization is an advanced graph-theoretic operation that aims to identify a set of k most influential nodes in a network. The problem is of immense interest in many network applications (e.g., information spread in a social network, or contagion spread in an infectious disease network). The problem is however computationally expensive, needing several hours of compute time on even modest sized networks. There are numerous challenges to parallelizing influence maximization including its mixed workloads of latency- and throughput-bound steps, frequent and irregular access to graph data, large memory footprint, and potential load imbalanced workloads. In this work, we present the design and development of a new hybrid CPU+GPU parallel influence maximization algorithm (CuRipples) that is also capable of running on multi-GPU systems. Our approach uses techniques for efficiently sharing and scheduling of work between CPU and GPU, and data access and synchronization schemes to efficiently map the different steps of sampling and seed selection on a heterogeneous system. Our experiments on state-of-the-art multi-GPU systems show that our implementation is able to achieve drastic reductions in the time to solution, from hours to under a minute, while also significantly enhancing the approximation quality.","PeriodicalId":346687,"journal":{"name":"Proceedings of the 34th ACM International Conference on Supercomputing","volume":"65 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117162864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Snug: architectural support for relaxed concurrent priority queueing in chip multiprocessors","authors":"Azin Heidarshenas, Tanmay Gangwani, Serif Yesil, Adam Morrison, J. Torrellas","doi":"10.1145/3392717.3392740","DOIUrl":"https://doi.org/10.1145/3392717.3392740","url":null,"abstract":"Many parallel algorithms in domains such as graph analytics and simulations rely on priority-based task scheduling. In such environments, the data structure of choice is a concurrent priority queue (PQ). Unfortunately, PQ algorithms exhibit an undesirable tradeoff. On one hand, strict PQs always dequeue the highest-priority task, and thus fail to scale because of contention at the head of the queue. On the other hand, relaxed PQs avoid contention by dequeuing tasks that are sometimes so far from the head that the resulting schedule misses the benefit of priority-based scheduling. We propose a novel architecture for relaxing PQs without straying far from the priority-based schedule. Our chip-level architecture, called Snug, distributes the PQ into subqueues, and maintains a set of Work registers that point to the highest-priority task in each sub-queue. Snug provides an instruction that picks a high-quality task to execute. The instruction periodically switches between obtaining an accurate global snapshot, and visiting only local subqueues to reduce traffic. Overall, Snug dequeues high-quality tasks while avoiding both hotspots and excessive network traffic. We evaluate Snug on graph analytics and event simulation programs. On a simulated 64-core chip, Snug reduces the average execution time of the programs by 1.4X, 2.4X and 3.6X compared to state-of-the-art concurrent skip list, SprayList, and software-distributed PQs, respectively.","PeriodicalId":346687,"journal":{"name":"Proceedings of the 34th ACM International Conference on Supercomputing","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114936559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}