{"title":"Jitter-Trace: a low-overhead OS noise tracing tool based on Linux Perf","authors":"N. Gonzalez, Alessandro Morari, Fabio Checconi","doi":"10.1145/3095770.3095772","DOIUrl":"https://doi.org/10.1145/3095770.3095772","url":null,"abstract":"Operating System (OS) noise is a well-known phenomenon in which OS activities interfere with the execution of large-scale parallel applications. Due to OS noise, feature-rich software environments such as Linux can seriously affect scalability. Kernel tracing can be used to identify OS noise sources, but until recently it required substantial OS modifications. This paper presents Jitter-Trace, a low-overhead tool that identifies and quantifies jitter sources. Jitter-Trace calculates the jitter generated by each OS activity, providing a complete set of task profiles and histograms of OS noise. This data is essential to implement OS noise mitigation strategies and reduce its impact on scalability. Jitter-Trace leverages the tracing and profiling capabilities of Linux Perf, which is widely available in current Linux distributions. Perf is tightly integrated in the Linux kernel and features a lightweight implementation.","PeriodicalId":205790,"journal":{"name":"Proceedings of the 7th International Workshop on Runtime and Operating Systems for Supercomputers ROSS 2017","volume":"2673 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127033604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"UNITY: Unified Memory and File Space","authors":"T. Jones, Michael J. Brim, Geoffroy R. Vallée, B. Mayer, A. Welch, Tonglin Li, M. Lang, Latchesar Ionkov, Douglas Otstott, Ada Gavrilovska, G. Eisenhauer, Thaleia Dimitra Doudali, Pradeep R. Fernando","doi":"10.1145/3095770.3095776","DOIUrl":"https://doi.org/10.1145/3095770.3095776","url":null,"abstract":"This paper describes the vision for UNITY, a new high-performance computing focused data storage abstraction that places the entire memory hierarchy, including both traditionally separated memory-and file-based data storage, into one storage continuum. Through the use of a novel API and a set of services centered around a smart runtime system, UNITY is able to provide a number of valuable and interesting benefits. The unified storage space provides a scalable and resilient data environment that dynamically manages the mapping of data onto available resources based on multiple factors, including desired persistence and energy budget considerations. By eliminating the need for high-performance computing domain scientists to develop architecture-dependent optimizations for rapidly evolving data storage technologies, UNITY addresses both ease-of-use and performance.","PeriodicalId":205790,"journal":{"name":"Proceedings of the 7th International Workshop on Runtime and Operating Systems for Supercomputers ROSS 2017","volume":"90 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124162972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scheduling Chapel Tasks with Qthreads on Manycore: A Tale of Two Schedulers","authors":"N. Evans, Stephen L. Olivier, R. Barrett, George Stelle","doi":"10.1145/3095770.3095774","DOIUrl":"https://doi.org/10.1145/3095770.3095774","url":null,"abstract":"This paper describes improvements in task scheduling for the Chapel parallel programming language provided in its default on-node tasking runtime, the Qthreads library. We describe a new scheduler distrib which builds on the approaches of two previous Qthreads schedulers, Sherwood and Nemesis, and combines the best aspects of both --work stealing and load balancing from Sherwood and a lock free queue access from Nemesis-- to make task queuing better suited for the use of Chapel in the manycore era. We demonstrate the efficacy of this new scheduler by showing improvements in various individual benchmarks of the Chapel test suite on the Intel Knights Landing architecture.","PeriodicalId":205790,"journal":{"name":"Proceedings of the 7th International Workshop on Runtime and Operating Systems for Supercomputers ROSS 2017","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129785903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Toward Full Specialization of the HPC Software Stack: Reconciling Application Containers and Lightweight Multi-kernels","authors":"Balazs Gerofi, R. Riesen, R. Wisniewski, Y. Ishikawa","doi":"10.1145/3095770.3095777","DOIUrl":"https://doi.org/10.1145/3095770.3095777","url":null,"abstract":"Application containers enable users to have greater control of their user-space execution environment by bundling application code with all the necessary libraries in a single software package. Lightweight multi-kernels leverage multi-core CPUs to run separate operating system (OS) kernels on different CPU cores, usually a lightweight kernel (LWK) and Linux. A multi-kernel's primary goal is attaining LWK scalability and performance in combination with support for the Linux APIs and environment. Both of these technologies are designed to address the increasing hardware complexity and the growing software diversity of High Performance Computing (HPC) systems. While containers enable specialization of user-space components, the LWK part of a multi-kernel system is also a form of software specialization, but targeting kernel space. This paper proposes a framework for combining application containers with multi-kernel operating systems thereby enabling specialization across the software stack. We provide an overview of the Linux container technologies and the challenges we faced to bring these two technologies together. Results from previous work show that multi-kernels can achieve better isolation than Linux. In this work, we deployed our framework on 1,024 Intel Xeon Phi Knights Landing nodes. We highlight two important results obtained from running at a larger scale. First, we show that containers impose zero runtime overhead even at scale. Second, by taking advantage of our integrated framework, we demonstrate that users can transparently benefit from lightweight multi-kernels, attaining identical speedups to the native multi-kernel execution.","PeriodicalId":205790,"journal":{"name":"Proceedings of the 7th International Workshop on Runtime and Operating Systems for Supercomputers ROSS 2017","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116721842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Seastar: A Comprehensive Framework for Telemetry Data in HPC Environments","authors":"Ole Weidner, A. Barker, M. Atkinson","doi":"10.1145/3095770.3095775","DOIUrl":"https://doi.org/10.1145/3095770.3095775","url":null,"abstract":"A large number of 2nd generation high-performance computing applications and services rely on adaptive and dynamic architectures and execution strategies to run efficiently, resiliently, and at scale on today's HPC infrastructures. They require information about applications and their environment to steer and optimize execution. We define this information as telemetry data. Current HPC platforms do not provide the infrastructure, interfaces and conceptual models to collect, store, analyze, and access such data. Today, applications depend on application and platform specific techniques for collecting telemetry data; introducing significant development overheads that inhibit portability and mobility. The development and adoption of adaptive, context-aware strategies is thereby impaired. To facilitate 2nd generation applications, more efficient application development, and swift adoption of adaptive applications in production, a comprehensive framework for telemetry data management must be provided by future HPC systems and services. We introduce Seastar, a conceptual model and a software framework to collect, store, analyze, and exploit streams of telemetry data generated by HPC systems and their applications. We show how Seastar can be integrated with HPC platform architectures and how it enables common application execution strategies.","PeriodicalId":205790,"journal":{"name":"Proceedings of the 7th International Workshop on Runtime and Operating Systems for Supercomputers ROSS 2017","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126781709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Operating and Runtime Systems Challenges for HPC Systems","authors":"A. Maccabe","doi":"10.1145/3095770.3095771","DOIUrl":"https://doi.org/10.1145/3095770.3095771","url":null,"abstract":"Future HPC systems will be characterized by extreme heterogeneity. We will see increasing heterogeneity in virtually every aspect of node architecture from computational engines to memory systems. We will see increasing heterogeneity in applications, including heterogeneity within applications (as previously independent applications are composed to build new applications). We will see increasing heterogeneity in system usage models; in some cases, the HPC system is not the most precious resource being managed. We will also see increasing heterogeneity in the shared services (e.g., storage and visualization systems) that are connected to HPC systems. All of this increasing heterogeneity is certain to create new challenges in the design and implementation of operating and runtime systems. There will be new kinds of resources to manage and many resource management tactics will be invented (and some re-discovered and adapted) to address the new heterogeneity. In essence, we will tacitly agree that the operating and runtime systems need to adapt to enable the inevitable integration of new technologies, applications, usage models, and shared services. While this agreement is critical for our ability to make incremental progress, we, as a community, must step back and ask the relevant question: Does the OS or runtime system bear the brunt of the adaptation, or will we be able to insist on changes in the technologies, applications, and environment? In the past decade, we have seen a similar tradeoff play out between the application teams and the architects of computational engines: how much floating point precision is required and how is this precision implemented? How can we define similar tradeoffs that are important in the design and implementation of operating and runtime systems?","PeriodicalId":205790,"journal":{"name":"Proceedings of the 7th International Workshop on Runtime and Operating Systems for Supercomputers ROSS 2017","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121808004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Effect of Asymmetric Performance on Asynchronous Task Based Runtimes","authors":"D. Ganguly, J. Lange","doi":"10.1145/3095770.3095778","DOIUrl":"https://doi.org/10.1145/3095770.3095778","url":null,"abstract":"It is generally accepted that future supercomputing workloads will consist of application compositions made up of coupled simulations as well as in-situ analytics. While these components have commonly been deployed using a space-shared configuration to minimize cross-workload interference, it is likely that not all the workload components will require the full processing capacity of the CPU cores they are running on. For instance, an analytics workload often does not need to run continuously and is not generally considered to have the same priority as simulation codes. In a space-shared configuration, this arrangement would lead to wasted resources due to periodically idle CPUs, which are generally unusable by traditional bulk synchronous parallel (BSP) applications. As a result, many have started to reconsider task based runtimes owing to their ability to dynamically utilize available CPU resources. While the dynamic behavior of task-based runtimes had historically been targeted at application induced load imbalances, the same basic situation arises due to the asymmetric performance resulting from time sharing a CPU with other workloads. Many have assumed that task based runtimes would be able to adapt easily to these new environments without significant modifications. In this paper, we present a preliminary set of experiments that measured how well asynchronous task-based runtimes are able to respond to load imbalances caused by the asymmetric performance of time shared CPUs. Our work focuses on a set of experiments using benchmarks running on both Charm++ and HPX-5 in the presence of a competing workload. The results show that while these runtimes are better suited at handling the scenarios than traditional runtimes, they are not yet capable of effectively addressing anything other than a fairly minimal level of CPU contention.","PeriodicalId":205790,"journal":{"name":"Proceedings of the 7th International Workshop on Runtime and Operating Systems for Supercomputers ROSS 2017","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117287802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Quantitative Evaluation of Intel PEBS Overhead for Online System-Noise Analysis","authors":"Soramichi Akiyama, Takahiro Hirofuchi","doi":"10.1145/3095770.3095773","DOIUrl":"https://doi.org/10.1145/3095770.3095773","url":null,"abstract":"Analyzing system-noise incurred to high-throughput systems (e.g., Spark, RDBMS) from the underlying machines must be in the granularity of the message- or request-level to find the root causes of performance anomalies, because messages are passed through many components in very short periods. To this end, we consider using Precise Event Based Sampling (PEBS) equipped in Intel CPUs at higher sampling rates than used normally is promising. It saves context information (e.g., the general purpose registers) at occurrences of various hardware events such as cache misses. The information can be used to associate performance anomalies caused by system noise with specific messages. One challenge is that quantitative analysis of PEBS overhead with high sampling rates has not yet been studied. This is critical because high sampling rates can cause severe overhead but performance problems are often reproducible only in real environments. In this paper, we evaluate the overhead of PEBS and show: (1) every time PEBS saves context information, the target workload slows down by 200-300 ns due to the CPU overhead of PEBS, (2) the CPU overhead can be used to predict actual overhead incurred with complex workloads including multi-threaded ones with high accuracy, and (3) PEBS incurs cache pollution and extra memory IO since PEBS writes data into the CPU cache, and the severity of cache pollution is affected both by the sampling rate and the buffer size allocated for PEBS. To the best of our knowledge, we are the first to quantitatively analyze the overhead of PEBS.","PeriodicalId":205790,"journal":{"name":"Proceedings of the 7th International Workshop on Runtime and Operating Systems for Supercomputers ROSS 2017","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130540289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the 7th International Workshop on Runtime and Operating Systems for Supercomputers ROSS 2017","authors":"","doi":"10.1145/3095770","DOIUrl":"https://doi.org/10.1145/3095770","url":null,"abstract":"","PeriodicalId":205790,"journal":{"name":"Proceedings of the 7th International Workshop on Runtime and Operating Systems for Supercomputers ROSS 2017","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114966610","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}