{"title":"DStore: A Fast, Tailless, and Quiescent-Free Object Store for PMEM","authors":"Shashank Gugnani, Xiaoyi Lu","doi":"10.1145/3431379.3460649","DOIUrl":"https://doi.org/10.1145/3431379.3460649","url":null,"abstract":"The advent of fast, byte-addressable persistent memory (PMEM) has fueled a renaissance in re-evaluating storage system design. Unfortunately, prior work has been unable to provide both consistent and fast performance because they rely on traditional cached or uncached approaches to system design, compromising at least one of the requirements. This paper presents DStore, a fast, tailless, and quiescent-free object store for non-volatile memory. To fulfill all three requirements, we propose a novel two-level approach, called DIPPER, which fully decouples the volatile frontend and persistent backend by leveraging the byte addressability and performance of PMEM. The novelty of our approach is in allowing the frontend and backend to operate independently and in parallel without affecting crash consistency. This not only avoids the need to quiesce the system but also allows for increased concurrency in the frontend through the use of observational equivalency. Using this approach, DStore achieves optimal scalability and low latency without compromising on crash consistency. Evaluation on Intel's Optane DC Persistent Memory Module (DCPMM) demonstrates that DStore can simultaneously provide fast performance, uninterrupted service, and low tail latency. Moreover, DStore can deliver up to 6x lower tail latency service level objectives (SLO) and up to 5x higher throughput SLO compared to state-of-the-art PMEM optimized systems.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125376643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Jigsaw: A High-Utilization, Interference-Free Job Scheduler for Fat-Tree Clusters","authors":"Staci A. Smith, D. Lowenthal","doi":"10.1145/3431379.3460635","DOIUrl":"https://doi.org/10.1145/3431379.3460635","url":null,"abstract":"Jobs on HPC clusters can suffer significant performance degradation due to inter-job network interference. Approaches to mitigating this interference primarily focus on reactive routing schemes. A better approach---in that it completely eliminates inter-job interference---is to implement scheduling policies that proactively enforce network isolation for every job. However, existing schedulers that allocate isolated partitions lead to lowered system utilization, which creates a barrier to adoption. Accordingly, we design and implement Jigsaw, a new job-isolating scheduling approach for three-level fat-trees that overcomes this barrier. Jigsaw typically achieves system utilization of 95-96%, while guaranteeing dedicated network links to jobs. In scenarios where jobs experience even modest performance improvements from interference-freedom, Jigsaw typically leads to lower job turnaround times and higher throughput than traditional job scheduling. To the best of our knowledge, Jigsaw is the first scheduler to eliminate inter-job network interference while maintaining high system utilization, leading to improved job and system performance.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114463188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: ARC: An Automated Approach to Resiliency for Lossy Compressed Data via Error Correcting Codes
Authors: Dakota Fulp, Alexandra Poulos, Robert Underwood, Jon C. Calhoun
DOI: https://doi.org/10.1145/3431379.3460638
Published in: Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing, June 21, 2021
Abstract: Progress in high-performance computing (HPC) systems has led to complex applications that stress the I/O subsystem by creating vast amounts of data. Lossy compression reduces data size considerably, but a single error renders lossy compressed data unusable. This sensitivity stems from the high information content per bit in compressed data and is a critical issue as soft errors that cause bit-flips have become increasingly commonplace in HPC systems. While many works have improved lossy compressor performance, few have sought to address this critical weakness. This paper presents ARC: Automated Resiliency for Compression. Given user-defined constraints on storage, throughput, and resiliency, ARC automatically determines the optimal error-correcting code (ECC) configuration before encoding data. We conduct an extensive fault injection study to fully understand the effects of soft errors on lossy compressed data and how best to protect it. We evaluate ARC's scalability, performance, resiliency, and ease of use. On a 40-core node, we find that encoding and decoding achieve throughputs of up to 3730 MB/s and 3602 MB/s, respectively. ARC also detects and corrects multi-bit errors with a tunable overhead in terms of storage and throughput. Finally, we demonstrate the ease of using ARC and how to account for a system's failure rate when determining the constraints.
{"title":"Computing Challenges for High Energy Physics","authors":"M. Girone","doi":"10.1145/3431379.3466719","DOIUrl":"https://doi.org/10.1145/3431379.3466719","url":null,"abstract":"High-energy physics faces unprecedented computing challenges in preparation for the 'high-luminosity' phase of the Large Hadron Collider, which will be known as the HL-LHC. The complexity of particle-collision events will increase, together with the data collection rate, substantially outstripping the gains expected from technology evolution. The LHC experiments, through the Worldwide LHC Computing Grid (WLCG), operate a distributed computing infrastructure at about 170 sites over more than 40 countries. This infrastructure has successfully exploited the exabyte of data collected and processed during the first 10 years of the program. During the HL-LHC regime, each experiment will collect an exabyte of data annually and additional computing resources will be needed. The efficient use of HPC facilities may be an important opportunity to address the anticipated resource gap. In this talk, I will discuss the future computing needs in high-energy physics and how these can be met combining our dedicated distributed computing infrastructure with large-scale HPC sites. As a community, we have identified common challenges for integrating these large facilities into our computing ecosystem. I will also discuss the current progress in addressing those challenges, focusing on software development for heterogeneous architectures, data management at scale, supporting services and opportunities for collaboration.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129450330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Superscalar Programming Models: A Perspective from Barcelona","authors":"Rosa M. Badia","doi":"10.1145/3431379.3466720","DOIUrl":"https://doi.org/10.1145/3431379.3466720","url":null,"abstract":"The importance of the programming model in the development of applications has been increasingly more important with the evolution of computing architectures and infrastructures. Aspects such as the number of cores and heterogeneity in the computing nodes, the increase in scale, and new highly distributed environments (the so-called computing continuum) make it even more critical. Superscalar programming models have been proposed as an alternative for the development of parallel and distributed applications. They are a family of task-based programming models that aim at offering a sequential programming interface while enabling a parallel execution in distributed programming environments. Generic aspects supported by the model are: task dependency analysis, parallelism exploitation, data renaming, and data management. Over the years, BSC has developed multiple instances of this family, each of them with some specific aspects depending on the needs and possibilities of the existing computing infrastructure. The talk will present a historical perspective of the superscalar programming models for distributed computing and the challenges that we foresee for the near future.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123459346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Q-adaptive: A Multi-Agent Reinforcement Learning Based Routing on Dragonfly Network","authors":"Yao Kang, Xin Wang, Z. Lan","doi":"10.1145/3431379.3460650","DOIUrl":"https://doi.org/10.1145/3431379.3460650","url":null,"abstract":"High-radix interconnects such as Dragonfly and its variants rely on adaptive routing to balance network traffic for optimum performance. Ideally, adaptive routing attempts to forward packets between minimal and non-minimal paths with the least congestion. In practice, current adaptive routing algorithms estimate routing path congestion based on local information such as output queue occupancy. Using local information to estimate global path congestion is inevitably inaccurate because a router has no precise knowledge of link states a few hops away. This inaccuracy could lead to interconnect congestion. In this study, we present Q-adaptive routing, a multi-agent reinforcement learning routing scheme for Dragonfly systems. Q-adaptive routing enables routers to learn to route autonomously by leveraging advanced reinforcement learning technology. The proposed Q-adaptive routing is highly scalable thanks to its fully distributed nature without using any shared information between routers. Furthermore, a new two-level Q-table is designed for Q-adaptive to make it computational lightly and saves 50% of router memory usage compared with the previous Q-routing. We implement the proposed Q-adaptive routing in SST/Merlin simulator. Our evaluation results show that Q-adaptive routing achieves up to 10.5% system throughput improvement and 5.2x average packet latency reduction compared with adaptive routing algorithms. Remarkably, Q-adaptive can even outperform the optimal VALn non-minimal routing under the ADV+1 adversarial traffic pattern with up to 3% system throughput improvement and 75% average packet latency reduction.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123741249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Achieving Scalable Consensus by Being Less Writey","authors":"M. Davis, H. Vandierendonck","doi":"10.1145/3431379.3464452","DOIUrl":"https://doi.org/10.1145/3431379.3464452","url":null,"abstract":"Modern consensus algorithms are required to service a large number of client requests quickly. Responding to these requirements, several algorithms have sought to reduce bottlenecks to consensus performance, such as network usage and reliance on a single leader process. While the use of leaderless algorithms resolves process and network imbalance, one resource has seen increased use - stable storage. Leaderless consensus algorithms require at best 3nover4 of n acceptors to write to stable storage, limiting the benefit of these algorithms in larger systems. Meanwhile, the use of a single leader incurs only ƒ + 1 writes per proposal, where ƒ is the desired number of tolerated liveness failures. Here, a leaderless consensus algorithm that requires only ƒ + 1 writes per proposal is described. It is shown also to improve throughput of commands executed as system size increases without a corresponding degradation to latency.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124741194","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: A Serverless Framework for Distributed Bulk Metadata Extraction
Authors: Tyler J. Skluzacek, Ryan Wong, Zhuozhao Li, Ryan Chard, K. Chard, Ian T Foster
DOI: https://doi.org/10.1145/3431379.3460636
Published in: Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing, June 21, 2021
Abstract: We introduce Xtract, an automated and scalable system for bulk metadata extraction from large, distributed research data repositories. Xtract orchestrates the application of metadata extractors to groups of files, determining which extractors to apply to each file and, for each extractor and file, where to execute. A hybrid computing model, built on the funcX federated FaaS platform, enables Xtract to balance tradeoffs between extraction time and data transfer costs by dispatching each extraction task to the most appropriate location. Experiments on a range of clouds and supercomputers show that Xtract can efficiently process multi-million-file repositories by orchestrating the concurrent execution of container-based extractors on thousands of nodes. We highlight the flexibility of Xtract by applying it to a large, semi-curated scientific data repository and to an uncurated scientific Google Drive repository. We show that by remotely orchestrating metadata extraction across decentralized storage and compute nodes, Xtract can process large repositories in 50% of the time it takes just to transfer the same data to a machine within the same computing facility. We also show that when transferring data is necessary (e.g., no local compute is available), Xtract can scale to process files as fast as they are received, even over a multi-GB/s network.
{"title":"Hardware Specialization for Distributed Computing","authors":"G. Alonso","doi":"10.1145/3431379.3460654","DOIUrl":"https://doi.org/10.1145/3431379.3460654","url":null,"abstract":"Several trends in the IT industry are driving an increasing specialization of the hardware layers. On the one hand, demanding workloads, large data volumes, diversity in data types, etc. are all factors contributing to make general purpose computing too inefficient. On the other hand, cloud computing and its economies of scale allow vendors to invest on specialized hardware for particular tasks that otherwise would be too expensive or consume resources needed elsewhere. In this talk I will discuss the shift towards hardware acceleration and show with several examples why specialized systems are here to stay and are likely to dominate the computer landscape for years to come. I will also discuss Enzian, an open research platform developed at ETH to enable the exploration of hardware acceleration and present some preliminary results achieved with it. Biography: Gustavo Alonso is a professor in the Department of Computer Science of ETH Zurich where he is a member of the Systems Group. His research interests include data management, databases, distributed systems, cloud computing, and hardware acceleration. Gustavo is an ACM Fellow and an IEEE Fellow as well as a Distinguished Alumnus of the Department of Computer Science of UC Santa Barbara.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132049334","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"File System Semantics Requirements of HPC Applications","authors":"Chen Wang, K. Mohror, M. Snir","doi":"10.1145/3431379.3460637","DOIUrl":"https://doi.org/10.1145/3431379.3460637","url":null,"abstract":"Most widely-deployed parallel file systems (PFSs) implement POSIX semantics, which implies sequential consistency for reads and writes. Strict adherence to POSIX semantics is known to impede performance and thus several new PFSs with relaxed consistency semantics and better performance have been introduced. Such PFSs are useful provided that applications can run correctly on a PFS with weaker semantics. While it is widely assumed that HPC applications do not require strict POSIX semantics, to our knowledge there has not been systematic work to support this assumption. In this paper, we address this gap with a categorization of the consistency semantics guarantees of PFSs and develop an algorithm to determine the consistency semantics requirements of a variety of HPC applications. We captured the I/O activity of 17 representative HPC applications and benchmarks as they performed I/O through POSIX or I/O libraries and examined the metadata operations used and their file access patterns. From this analysis, we find that 16 of the 17 applications can utilize PFSs with weaker semantics.","PeriodicalId":343991,"journal":{"name":"Proceedings of the 30th International Symposium on High-Performance Parallel and Distributed Computing","volume":"91-93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128486507","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}