ROSS@ICS Latest Publications

Reduction of operating system jitter caused by page reclaim
ROSS@ICS Pub Date: 2014-06-10 DOI: 10.1145/2612262.2612270
Y. Oyama, Shun Ishiguro, J. Murakami, Shin Sasaki, R. Matsumiya, O. Tatebe
Operating system jitter is one of the major causes of runtime overhead in high-performance computing applications. Jitter results from the execution of services by the operating system kernel, such as interrupt handling and tasklets, or from daemon processes that provide operating system services, such as memory management daemons. This execution interrupts application computations and increases their execution time. Jitter significantly affects applications in which many processes or threads frequently synchronize with each other. In this paper, we investigate the impact of jitter caused by reclaiming memory pages and propose a method for reducing that impact. The target operating system is Linux. When the Linux kernel runs out of memory, it awakens a special kernel thread to reclaim memory pages that are unlikely to be used in the near future. If the kernel thread is awakened frequently, application performance degrades because of its resource consumption. The proposed method reclaims memory pages in advance of the kernel thread. It reclaims more pages at one time than the kernel thread does, thus reducing the frequency of page reclaim and the impact of jitter. We implement a system based on the proposed method and conduct an experiment using practical weather forecast software. Results of the experiment show that the proposed method minimizes the performance degradation caused by jitter.
Citations: 5
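The trade-off at the heart of this paper, fewer but larger reclaim passes, can be illustrated with a toy cost model. All constants and batch sizes below are hypothetical, chosen only to show why batching wakeups reduces total interruption time; they are not measurements from the paper.

```python
# Toy model: each kswapd-style wakeup carries a fixed interruption cost, so
# reclaiming more pages per wakeup means fewer wakeups for the same total
# number of reclaimed pages, and therefore less jitter.

def jitter_cost(total_pages, batch_size, wakeup_cost_us=50.0, per_page_us=0.2):
    """Total time (microseconds) an application is interrupted while
    `total_pages` are reclaimed in batches of `batch_size` pages."""
    wakeups = -(-total_pages // batch_size)  # ceiling division
    return wakeups * wakeup_cost_us + total_pages * per_page_us

frequent = jitter_cost(100_000, batch_size=32)     # kernel-thread style: small batches
batched = jitter_cost(100_000, batch_size=8_192)   # proposed style: large batches
print(f"small batches: {frequent:.0f} us, large batches: {batched:.0f} us")
```

Under this model the per-page work is identical in both cases; only the fixed wakeup overhead, which is what interrupts tightly synchronized processes, shrinks.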
Overhead of a decentralized gossip algorithm on the performance of HPC applications
ROSS@ICS Pub Date: 2014-06-10 DOI: 10.1145/2612262.2612271
Ely Levy, A. Barak, A. Shiloh, Matthias Lieber, C. Weinhold, Hermann Härtig
Gossip algorithms can provide online information about the availability and state of resources in supercomputers. These algorithms require minimal computing and storage capabilities at each node, and when properly tuned they are not expected to overload the nodes or the network that connects them. These properties make gossip interesting for future exascale systems. This paper examines the overhead of a decentralized gossip algorithm on the performance of parallel MPI applications running on up to 8192 nodes of an IBM Blue Gene/Q supercomputer. The applications used in the experiments include PTRANS and MPI-FFT from the HPCC benchmark suite, as well as the coupled weather and cloud simulation model COSMO-SPECS+FD4. In most cases, no gossip overhead was observed when gossip messages were sent at intervals of 256 ms or more. As expected, the overhead observed at higher rates is sensitive to the communication pattern of the application and the amount of gossip information being circulated.
Citations: 6
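To see why gossip traffic can stay cheap even on thousands of nodes, here is a minimal push-gossip simulation. This is the generic textbook variant, not the paper's algorithm: in each round, every node that already holds a piece of information forwards it to one uniformly random node, so the information reaches all N nodes in O(log N) rounds.

```python
import random

def rounds_to_disseminate(n_nodes, seed=0):
    """Count push-gossip rounds until one node's state reaches all nodes."""
    rng = random.Random(seed)
    informed = {0}          # node 0 starts with the information
    rounds = 0
    while len(informed) < n_nodes:
        # Every informed node pushes to one random target this round.
        informed |= {rng.randrange(n_nodes) for _ in informed}
        rounds += 1
    return rounds

print(rounds_to_disseminate(8192))
```

Since the informed set can at most double per round, 8192 nodes need at least 13 rounds, and in practice a few more; at one message per node per interval (e.g. every 256 ms, the threshold below which the paper starts to see overhead), the per-node load is constant regardless of machine size.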
VMM emulation of Intel hardware transactional memory
ROSS@ICS Pub Date: 2014-06-10 DOI: 10.1145/2612262.2612265
Maciej Swiech, Kyle C. Hale, P. Dinda
We describe the design, implementation, and evaluation of emulated hardware transactional memory, specifically the Intel Haswell Restricted Transactional Memory (RTM) architectural extensions for x86/64, within a virtual machine monitor (VMM). Our system allows users to investigate RTM on hardware that does not provide it, debug their RTM-based transactional software, and stress-test it on diverse emulated hardware configurations, including potential future configurations that might support arbitrary-length transactions. Initial performance results suggest that we are able to accomplish this approximately 60 times faster than under a full emulator. A noteworthy aspect of our system is a novel page-flipping technique that allows us to avoid instruction emulation entirely and to limit instruction decoding to only what is necessary to determine instruction length. This makes it possible to implement RTM emulation, and potentially other techniques, far more compactly than would otherwise be possible. We have implemented our system in the context of the Palacios VMM. Our techniques are not specific to Palacios and could be implemented in other VMMs.
Citations: 2
Hybrid MPI: a case study on the Xeon Phi platform
ROSS@ICS Pub Date: 2014-06-10 DOI: 10.1145/2612262.2612267
U. Wickramasinghe, G. Bronevetsky, A. Lumsdaine, A. Friedley
New many-core architectures such as the Intel Xeon Phi offer applications significantly higher power efficiency than conventional multi-core processors. However, while this processor's compute and communication performance is an excellent match for MPI applications, leveraging its potential in practice has proven difficult because of the mismatch between the MPI distributed-memory model and the processor's shared-memory communication hardware. Hybrid MPI is a high-performance, portable implementation of MPI designed for communication over shared-memory hardware. It shares the heaps of all the MPI processes that run on the same node, enabling them to communicate directly without unnecessary copies. This paper describes our work to port Hybrid MPI to the Xeon Phi platform, demonstrating that Hybrid MPI offers better performance than the native Intel MPI implementation in terms of memory bandwidth, latency, and benchmark performance.
Citations: 5
Automatic SMT threading for OpenMP applications on the Intel Xeon Phi co-processor
ROSS@ICS Pub Date: 2014-06-10 DOI: 10.1145/2612262.2612268
W. Heirman, Trevor E. Carlson, K. V. Craeynest, I. Hur, A. Jaleel, L. Eeckhout
Simultaneous multithreading is a technique that can improve performance when running parallel applications on the Intel Xeon Phi co-processor. Selecting the most efficient thread count is, however, non-trivial, as the potential increase in efficiency has to be balanced against other, potentially negative factors such as inter-thread competition for cache capacity and increased synchronization overheads.
In this paper, we extend CRUST (ClusteR-aware Undersubscribed Scheduling of Threads), a technique for finding the optimum thread count of OpenMP applications running on clustered cache architectures, to take the behavior of simultaneous multithreading on the Xeon Phi into account. CRUST can automatically find the optimum thread count at sub-application granularity by exploiting application phase behavior at OpenMP parallel-section boundaries, and it uses hardware performance counter information to gain insight into the application's behavior. We implement a CRUST prototype inside the Intel OpenMP runtime library and show its efficiency running on real Xeon Phi hardware.
Citations: 10
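The thread-count selection problem can be sketched with an invented cost model: speedup from more threads traded against a penalty that grows with thread count (standing in for cache competition and synchronization overhead). The model and its constants are illustrative only; CRUST itself measures real hardware performance counters rather than predicting from a formula. The candidate counts correspond to 1 to 4 SMT threads per core on a 60-core Phi.

```python
# Amdahl-style runtime estimate plus a linear contention penalty per thread.
# All constants are made up for illustration.
def predicted_runtime(threads, serial=1.0, parallel=240.0, contention=0.02):
    return serial + parallel / threads + contention * threads

def best_thread_count(candidates=(60, 120, 180, 240)):
    """Pick the candidate thread count with the lowest modeled runtime."""
    return min(candidates, key=predicted_runtime)

print(best_thread_count())
```

Under this model, full SMT subscription (240 threads) loses to 120 threads because the contention term outgrows the shrinking parallel term, which is exactly the kind of non-obvious optimum an automatic search is meant to find per parallel section.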
Revisiting virtual memory for high performance computing on manycore architectures: a hybrid segmentation kernel approach
ROSS@ICS Pub Date: 2014-06-10 DOI: 10.1145/2612262.2612264
Yuki Soma, Balazs Gerofi, Y. Ishikawa
Page-based memory management (paging) is used by most current operating systems (OSs) because of its rich features, such as prevention of memory fragmentation and fine-grained access control. Page-based virtual memory, however, stores virtual-to-physical mappings in page tables that themselves reside in main memory. Because translating virtual to physical addresses requires walking the page tables, which in turn implies additional memory accesses, modern CPUs employ translation lookaside buffers (TLBs) to cache the mappings. Nevertheless, TLBs are limited in size, and applications that consume a large amount of memory and exhibit little or no locality in their memory access patterns, such as graph algorithms, suffer from the high overhead of TLB misses.
This paper proposes a new hybrid kernel design targeting many-core CPUs, which manages the application's memory space by segmentation and offloads kernel services to dedicated CPU cores where paging is used. The method enables applications to run on top of low-cost segmented memory management while allowing the kernel to use the rich features of paging. We present the design and implementation of our kernel and demonstrate that segmentation can provide superior performance compared to both regular- and large-page-based virtual memory. For example, running Graph500 on top of our segmentation design on Intel's Xeon Phi chip yields up to 81% and 9% improvement compared to using 4 kB and 2 MB pages, respectively, in MPSS Linux.
Citations: 11
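The TLB-reach arithmetic behind the paper's motivation is easy to work out: with a fixed number of TLB entries, the address range covered without a miss is entries × page size, so a random-access working set larger than that keeps missing. The entry counts below are illustrative, not the Xeon Phi's exact TLB geometry.

```python
def tlb_reach_bytes(entries, page_size):
    # Total memory addressable through the TLB without a page-table walk.
    return entries * page_size

KB, MB = 1024, 1024 ** 2
small = tlb_reach_bytes(64, 4 * KB)   # 64 entries of 4 kB pages -> 256 kB
large = tlb_reach_bytes(8, 2 * MB)    # 8 entries of 2 MB pages  -> 16 MB
print(small // KB, "kB vs", large // MB, "MB")
```

Even the larger reach is tiny next to a multi-gigabyte graph workload, which is why both page sizes miss heavily. A segment, by contrast, maps an arbitrarily large contiguous range with a single base/limit pair, so there is no reach to exhaust, the property the hybrid kernel exploits.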
mOS: an architecture for extreme-scale operating systems
ROSS@ICS Pub Date: 2014-06-10 DOI: 10.1145/2612262.2612263
R. Wisniewski, T. Inglett, Pardo Keppel, Ravi Murty, R. Riesen
Linux®, or more specifically the Linux API, plays a key role in HPC. Even for extreme-scale computing, a known and familiar API is required for production machines. An off-the-shelf Linux distribution, however, faces challenges at extreme scale. To date, two approaches have been used to address the challenges of providing an operating system (OS) at extreme scale. In the Full-Weight Kernel (FWK) approach, an OS, typically Linux, forms the starting point, and work is undertaken to remove features from the environment so that it scales up across more cores and out across a large cluster. A Light-Weight Kernel (LWK) approach often starts with a new kernel, and work is undertaken to add functionality to provide a familiar API, typically Linux. Either approach, however, results in an execution environment that is not fully Linux compatible.
mOS (multi Operating System) runs both an FWK (Linux) and an LWK simultaneously as kernels on the same compute node. mOS thereby achieves the scalability and reliability of LWKs while providing the full Linux functionality of an FWK. Further, mOS works in concert with Operating System Nodes (OSNs) to offload system calls, e.g., I/O, that are too invasive to run on the compute nodes at extreme scale. Beyond providing full Linux capability with LWK performance, other advantages of mOS include the ability to effectively manage different types of compute and memory resources, to interface easily with proposed asynchronous and fine-grained runtimes, and to nimbly manage new technologies.
This paper is an architectural description of mOS. As a prototype is not yet finished, the contributions of this work are a description of mOS's architecture, an exploration of the trade-offs and value of this approach for the purposes listed above, and a detailed description of each of the six components of mOS, including the trade-offs we considered. The uptick in OS research indicates that many view this as an important area for getting to extreme scale. Most importantly, then, the goal of this paper is to generate discussion in this area at the workshop.
Citations: 69
An evaluation of BitTorrent's performance in HPC environments
ROSS@ICS Pub Date: 2014-06-10 DOI: 10.1145/2612262.2612269
Matthew G. F. Dosanjh, P. Bridges, S. M. Kelly, J. Laros, C. Vaughan
A number of novel decentralized systems have recently been developed to address challenges of scale in large distributed systems. The suitability of such systems for meeting the challenges of scale in high-performance computing (HPC) systems is unclear, however. In this paper, we begin to answer this question by examining the suitability of the popular BitTorrent protocol for dynamic shared-library distribution in HPC systems. To that end, we describe the architecture and implementation of a system that uses BitTorrent to distribute shared libraries in HPC systems, evaluate and optimize BitTorrent protocol usage for the HPC environment, and measure the performance of the resulting system. Our results demonstrate the potential viability of BitTorrent-style protocols in HPC systems, but also highlight their challenges. In particular, our results show that the protocol mechanisms meant to enforce fairness in a distributed computing environment can have a significant impact on system performance if not properly taken into account in system design and implementation.
Citations: 2
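One core BitTorrent mechanism relevant to distributing a library to many nodes is rarest-first piece selection: among the pieces a peer still needs, it requests the one held by the fewest other peers, so scarce pieces replicate early. This is a simplified sketch of the standard heuristic, not the tuned protocol usage the paper evaluates.

```python
from collections import Counter

def rarest_first(needed, peer_bitfields):
    """Pick the needed piece held by the fewest peers (ties broken by index).

    `peer_bitfields` is a list of sets, one per peer, of piece indices the
    peer holds."""
    availability = Counter(p for bf in peer_bitfields for p in bf)
    return min(needed, key=lambda p: (availability[p], p))

peers = [{0, 1, 2}, {1, 2}, {2}]
print(rarest_first({0, 1, 2}, peers))  # piece 0 is held by only one peer
```

In an HPC launch, where thousands of nodes want the same library simultaneously, spreading the scarce pieces first prevents the initial seeder from becoming the bottleneck.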
Building blocks for an exa-scale operating system
ROSS@ICS Pub Date: 2014-06-10 DOI: 10.1145/2612262.2627355
Hermann Härtig
Currently, high-performance systems are mostly used by splitting them into fixed-size partitions that are completely owned and operated by applications. Hardware architecture designs strive to remove the operating system from the critical path, for example through techniques such as RDMA and busy waiting for synchronization. Operating system functionality is restricted to batch schedulers that load and start applications, and to I/O. Applications take over traditional operating system functionality such as balancing load over resources.
In exascale computing, new challenges and opportunities may put an end to that mode of operation. These developments include applications too complex and too dynamic for application-level balancing, and hardware too diverse to maintain an application-level view of a fixed number of reliable and predictable resources. The talk will discuss examples of operating system building blocks at various system levels that may receive new appreciation in exascale supercomputing. These building blocks include schedulers, microkernels, library OSes, virtualization, execution-time predictors, and gossip algorithms, which need to be combined into a coherent architecture.
Citations: 0
PICS: a performance-analysis-based introspective control system to steer parallel applications
ROSS@ICS Pub Date: 2014-06-10 DOI: 10.1145/2612262.2612266
Yanhua Sun, J. Lifflander, L. Kalé
Parallel programming has always been difficult due to the complexity of hardware and the diversity of applications. Although significant progress has been achieved through the remarkable efforts of researchers in academia and industry, attaining high parallel efficiency on large supercomputers with millions of cores remains challenging for many applications. Performance tuning has therefore become more important and challenging than ever before. In this paper, we describe the design and implementation of PICS, a Performance-analysis-based Introspective Control System used to tune parallel programs. PICS provides a generic set of abstractions that let applications expose application-specific knowledge to the runtime system. The abstractions are called control points: tunable parameters that affect application performance. Application behavior is observed, measured, and automatically analyzed by PICS. Based on the analysis results and expert knowledge rules, program characteristics are extracted to assist the search for optimal configurations of the control points. We have implemented the PICS control system in Charm++, an asynchronous message-driven parallel programming model. We demonstrate the utility of PICS with several benchmarks and a real-world application, and we show its effectiveness.
Citations: 14
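The measure-analyze-adjust loop around a control point can be sketched as follows. The quadratic cost function standing in for an instrumented runtime measurement, and the "grain size" control point with its scaling candidates, are hypothetical; PICS drives the loop with real measurements and expert knowledge rules inside the Charm++ runtime.

```python
def measure(grain_size):
    # Stand-in for an instrumented runtime measurement; in this invented
    # landscape the application runs fastest near grain_size == 32.
    return (grain_size - 32) ** 2 + 100

def steer(grain_size, candidates=(0.5, 1.0, 2.0)):
    """One control-system step: try scaled settings of the control point
    and keep whichever measures best."""
    trials = {max(1, int(grain_size * s)) for s in candidates}
    return min(trials, key=measure)

setting = 8
for _ in range(6):          # iterate: measure -> analyze -> adjust
    setting = steer(setting)
print(setting)
```

Starting from a poor setting, repeated halve/keep/double probing converges on the optimum and then stays there, illustrating how a runtime system can tune a control point without any offline search.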