{"title":"Design and implementation of a customizable work stealing scheduler","authors":"Jun Nakashima, Sho Nakatani, K. Taura","doi":"10.1145/2491661.2481433","DOIUrl":"https://doi.org/10.1145/2491661.2481433","url":null,"abstract":"An efficient scheduler is important for task parallelism. It should provide scalable dynamic load-balancing mechanism among CPU cores. To meet this requirement, most runtime systems for task parallelism use work stealing as scheduling strategy. Work stealing schedulers typically steal work randomly. This strategy does not consider hardware specific knowledge such as memory hierarchy or application specific knowledge such as cache usage. In order to execute tasks more efficiently, work stealing schedulers should take such knowledge into account. To this end, we propose an API that can customize scheduling strategies and take hardware and application specific knowledge into account while preserving the desirable properties of work stealing.\u0000 This paper describes the design of our proposed API. Specifically, it provides mechanisms to give scheduling hints for tasks and to implement user-defined work stealing functions. They enable programmers to implement a work stealing strategy optimized for their applications. This paper also presents preliminary evaluation results of the proposed API. A kernel of STREAM microbenchmark improved by 58.8% with a work stealing strategy utilizing data cached by the previous iteration. Performance of matrix multiply improved by 18.2% on 32 AMD cores by a work stealing strategy that tries to steal as a coarse grained task as possible.","PeriodicalId":335825,"journal":{"name":"International Workshop on Runtime and Operating Systems for Supercomputers","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125611543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enabling accurate power profiling of HPC applications on exascale systems","authors":"Gokcen Kestor, R. Gioiosa, D. Kerbyson, A. Hoisie","doi":"10.1145/2491661.2481429","DOIUrl":"https://doi.org/10.1145/2491661.2481429","url":null,"abstract":"Despite being one of the most important limiting factors on the road to exascale computing, power is not yet considered a \"first-class citizen\" among the system resources. As a result, there is no clear OS interface that exposes accurate resource power consumption to user-level runtimes that implement power-aware software algorithms.\u0000 In this work we propose a System Monitor Interface (SMI) between the OS and the user runtime that exposes accurate, per-core power consumption. To make up for the lack of reliable per-core power sensors, we implement a proxy power sensor, based on a regression analysis of core activity, that provides per-core information. SMI effectively hides the implementation details from the user, who has the perception of reading power information from a real sensor. This allows us these proxy sensors to be replaced with real hardware sensors when the latter becomes available, without the need to modify user-level software.\u0000 Using SMI and the proxy power sensors, we implement a power profiling runtime library and analyzed applications from the NPB benchmark suite and the Exascale Co-Design Centers. Our results show that accurate, per-core power information is necessary for the development of exascale system software and for comprehensively understanding the power characteristics of parallel scientific applications.","PeriodicalId":335825,"journal":{"name":"International Workshop on Runtime and Operating Systems for Supercomputers","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129487121","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hobbes: composition and virtualization as the foundations of an extreme-scale OS/R","authors":"R. Brightwell, R. Oldfield, A. Maccabe, D. Bernholdt","doi":"10.1145/2491661.2481427","DOIUrl":"https://doi.org/10.1145/2491661.2481427","url":null,"abstract":"This paper describes our vision for Hobbes, an operating system and runtime (OS/R) framework for extreme-scale systems. The Hobbes design explicitly supports application composition, which is emerging as a key approach for applications to address scalability and power concerns anticipated with coming extreme-scale architectures. We make use of virtualization technologies to provide the flexibility to support requirements of application components for different node-level operating systems and runtimes, as well as different mappings of the components onto the hardware. We describe the architecture of the Hobbes OS/R, how we will address the cross-cutting concerns of power/energy, scheduling of massive levels of parallelism, and resilience. We also outline how the \"users\" of the OS/R (programming models, applications, and tools) influence the design.","PeriodicalId":335825,"journal":{"name":"International Workshop on Runtime and Operating Systems for Supercomputers","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116404879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A gossip-based approach to exascale system services","authors":"Philip Soltero, P. Bridges, D. Arnold, M. Lang","doi":"10.1145/2491661.2481428","DOIUrl":"https://doi.org/10.1145/2491661.2481428","url":null,"abstract":"Large-scale server deployments in the commercial internet space have been using group based protocols such as peer-to-peer and gossip to allow coordination of services and data across global distributed data centers. Here we look at applying these methods, which are themselves derived from early work in distributed systems, to large-scale, tightly-coupled systems used in high performance computing.\u0000 In this paper, we study Gossip protocols and their ability to aggregate data across large-scale systems in support of system services. We report accuracy and performance of these estimated results and then focus on a simulated power-capping service to show the tradeoffs of this approach in practice.","PeriodicalId":335825,"journal":{"name":"International Workshop on Runtime and Operating Systems for Supercomputers","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130862741","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data deduplication in a hybrid architecture for improving write performance","authors":"Chao Chen, Jonathan Bastnagel, Yong Chen","doi":"10.1145/2491661.2481435","DOIUrl":"https://doi.org/10.1145/2491661.2481435","url":null,"abstract":"Big Data computing provides a promising new opportunity for scientific discoveries and innovations. However, it also poses a significant challenge to the high-end computing community. An effective I/O solution is urgently required to support big data applications run on high-end computing systems. In this study, we propose a new approach namely DDiHA, Data Deduplication in Hybrid Architecture, to improve the write performance for write-intensive big data applications. The DDiHA approach utilizes data deduplications to reduce the size of data volumes before they are transfered and written to the storage. A hybrid architecture is introduced to facilitate data deduplications. Both theoretical study and prototyping verification were conducted to evaluate the DDiHA approach. The initial results have shown that, given the same compute resources, the DDiHA system outperformed the conventional architecture, even though it introduces additional computation workload from data deduplications. The DDiHA approach reduces the data size transferred across the network and improves the I/O system performance. It has a promising potential for write-intensive big data applications.","PeriodicalId":335825,"journal":{"name":"International Workshop on Runtime and Operating Systems for Supercomputers","volume":"96 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122541145","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characteristics of adaptive runtime systems in HPC","authors":"L. Kalé","doi":"10.1145/2481425.2481426","DOIUrl":"https://doi.org/10.1145/2481425.2481426","url":null,"abstract":"The phrase \"Runtime System\" is somewhat broad and is used with differing meanings in differing contexts. The Java runtime and most of the MPI runtimes are focused on providing mechanisms. In contrast, adaptive runtime systems emphasize strategies, in addition to providing mechanisms. This talk will look at some characteristics that make HPC RTSs adaptive. These include dynamic load balancing, exploitation of the \"principle of persistence\" to learn from recent data, automatic allocation to heterogeneous processors, automatic optimization of communication, application reconfiguration via control-points, automated control and optimization of temperature/power/energy/execution-time, automated tolerance of component failures so as to maintain the rate of computational progress in presence of such failures, and adapting to memory availability. The talk will examine these characteristics, and what features are necessary and/or desirable to empower the runtime system. I will illustrate it using examples from the runtime system underlying Charm++ and Adaptive MPI.","PeriodicalId":335825,"journal":{"name":"International Workshop on Runtime and Operating Systems for Supercomputers","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116583890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating the feasibility of using memory content similarity to improve system resilience","authors":"Scott Levy, P. Bridges, Kurt B. Ferreira, A. Thompson, C. Trott","doi":"10.1145/2491661.2481432","DOIUrl":"https://doi.org/10.1145/2491661.2481432","url":null,"abstract":"Building the next-generation of extreme-scale distributed systems will require overcoming several challenges related to system resilience. As the number of processors in these systems grows, the failure rate increases proportionally. One of the most common sources of failure in large-scale systems is memory errors. In this paper, we propose a novel run-time for transparently exploiting memory content similarity to improve system resilience by reducing the rate at which memory errors lead to node failure. We evaluate the feasibility of this approach by examining memory snapshots collected from eight HPC applications. Based on the characteristics of the similarity that we uncover in these applications, we conclude that our proposed approach shows promise for addressing system resilience in large-scale systems.","PeriodicalId":335825,"journal":{"name":"International Workshop on Runtime and Operating Systems for Supercomputers","volume":"34 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123147497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transparently consistent asynchronous shared memory","authors":"Hakan Akkan, Latchesar Ionkov, M. Lang","doi":"10.1145/2491661.2481431","DOIUrl":"https://doi.org/10.1145/2491661.2481431","url":null,"abstract":"The advent of many-core processors is imposing many changes on the operating system. The resources that are under contention have changed; previously, CPU cycles were the resource in demand and required fair and precise sharing. Now compute cycles are plentiful, but the memory per core is decreasing. In the past, scientific applications used all the CPU cores to finish as fast as possible, with visualization and analysis of the data performed after the simulation finished. With decreasing memory available per core, as well as the higher price (in power and time) for storing data on disk or sending it over the network, it now makes sense to run visualization and analytics applications in-situ, while the application is running. Visualization and analytics applications then need to sample the simulation memory with as little interference and as little changes in the simulation code as possible.\u0000 We propose an asynchronous memory sharing facility that allows consistent states of the memory to be shared between processes without any implicit or explicit synchronization. We distinguish two types of processes; a single producer and one or more observers. The producer modifies the state of the data, making available consistent versions of the state to any observer. The observers, working at different sampling rates, can access the latest available consistent state.\u0000 Some applications that would benefit from this type of facility include check-pointing applications, processes monitoring, unobtrusive process debugging, and the sharing of data for visualization or analytics. To evaluate our ideas we have developed two kernel-level implementations for sharing data asynchronously and we compared these implementations to a traditional user-space synchronized multi-buffer method.\u0000 We have seen improvements of up to 3.5x in our tests over the traditional multi-buffer method with 20% of the data pages touched.","PeriodicalId":335825,"journal":{"name":"International Workshop on Runtime and Operating Systems for Supercomputers","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128574827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An early prototype of an autonomic performance environment for exascale","authors":"K. Huck, S. Shende, A. Malony, Hartmut Kaiser, Allan Porterfield, R. Fowler, R. Brightwell","doi":"10.1145/2491661.2481434","DOIUrl":"https://doi.org/10.1145/2491661.2481434","url":null,"abstract":"Extreme-scale computing requires a new perspective on the role of performance observation in the Exascale system software stack. Because of the anticipated high concurrency and dynamic operation in these systems, it is no longer reasonable to expect that a post-mortem performance measurement and analysis methodology will suffice. Rather, there is a strong need for performance observation that merges first-and third-person observation, in situ analysis, and introspection across stack layers that serves online dynamic feedback and adaptation. In this paper we describe the DOE-funded XPRESS project and the role of autonomic performance support in Exascale systems. XPRESS will build an integrated Exascale software stack (called OpenX) that supports the ParalleX execution model and is targeted towards future Exascale platforms. An initial version of an autonomic performance environment called APEX has been developed for OpenX using the current TAU performance technology and results are presented that highlight the challenges of highly integrative observation and runtime analysis.","PeriodicalId":335825,"journal":{"name":"International Workshop on Runtime and Operating Systems for Supercomputers","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128737256","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A file I/O system for many-core based clusters","authors":"Yuki Matsuo, Taku Shimosawa, Y. Ishikawa","doi":"10.1145/2318916.2318920","DOIUrl":"https://doi.org/10.1145/2318916.2318920","url":null,"abstract":"A many-core based co-processor, such as the Intel Many Integrated Core (MIC) Architecture, connected to a server-level multi-core host processor via a PCI Express bus, has recently been the subject of a great deal of attention. In such a machine, because the many-core is separated from the host processor with disk I/O and it also has limited cache and memory bandwidth, performance degradation can results from cache pollution and data transfer latency caused by processing file operations.\u0000 Three types of file I/O mechanisms for the many-core in such a system are designed, implemented, and evaluated in this paper. One mechanism involves the file I/O system calls being performed by the kernel running on the same core that the application program is running on. Another is a mechanism whereby those system calls are offloaded to the kernel running on a dedicated core of the many-core that handles file I/O operations. In either case, the kernel requests file data transfer to the file system on the host processor and file data is cached on the many-core. The third mechanism involves the system calls being offloaded to the kernel running on the host processor so that the host kernel transfers data directly to the user buffer in the many-core.\u0000 The experimental results show that the first two mechanisms, performing in the many-core, are superior to offloading them to the host when the data size is relatively small because they are designed to conduct file I/O operations through a file cache and fewer of communications occur between the many-core and the host. With larger data sizes, however, file I/O system calls offloaded to the host, which transfer data directly to/from the user buffer, are better than those performed inside the many-core. In view of cache awareness, it is shown that the user code and part of the file I/O system calls can be performed efficiently when the user buffer data is small enough to be on the cache.","PeriodicalId":335825,"journal":{"name":"International Workshop on Runtime and Operating Systems for Supercomputers","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115247650","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}