Title: CMCP: a novel page replacement policy for system level hierarchical memory management on many-cores
Authors: Balazs Gerofi, A. Shimada, A. Hori, Masamichi Takagi, Y. Ishikawa
DOI: 10.1145/2600212.2600231 (https://doi.org/10.1145/2600212.2600231)
Venue: IEEE International Symposium on High-Performance Parallel Distributed Computing
Publication date: 2014-06-23

Abstract: The increasing prevalence of co-processors such as the Intel Xeon Phi has been reshaping the high performance computing (HPC) landscape. The Xeon Phi comes with a large number of power-efficient CPU cores, but at the same time it is a highly memory-constrained environment, leaving the task of memory management entirely up to application developers. To reduce programming complexity, we focus on application-transparent, operating system (OS) level hierarchical memory management.

In particular, we first show that state-of-the-art page replacement policies, such as approximations of the least recently used (LRU) policy, are not good candidates for massive many-cores due to the inherent cost of the remote translation lookaside buffer (TLB) invalidations they require for collecting page usage statistics. The price of concurrent remote TLB invalidations grows rapidly with the number of CPU cores in many-core systems and outpaces the benefit of the page replacement algorithm itself. Building upon our previous proposal, per-core Partially Separated Page Tables (PSPT), in this paper we propose the Core-Map Count based Priority (CMCP) page replacement policy, which exploits auxiliary knowledge of how many CPU cores map each page and prioritizes pages accordingly. In turn, it can avoid TLB invalidations for collecting page usage statistics altogether. Additionally, we describe and provide an implementation of the experimental 64kB page support of the Intel Xeon Phi and reveal some intriguing insights regarding its performance. We evaluate our proposal on various applications and find that CMCP can outperform state-of-the-art page replacement policies by up to 38%. We also show that the choice of appropriate page size depends primarily on the degree of memory constraint in the system.
{"title":"Transparent checkpoint-restart over infiniband","authors":"Jiajun Cao, Gregory Kerr, K. Arya, G. Cooperman","doi":"10.1145/2600212.2600219","DOIUrl":"https://doi.org/10.1145/2600212.2600219","url":null,"abstract":"Transparently saving the state of the InfiniBand network as part of distributed checkpointing has been a long-standing challenge for researchers. The lack of a solution has forced typical MPI implementations to include custom checkpoint-restart services that \"tear down\" the network, checkpoint each node in isolation, and then re-connect the network again. This work presents the first example of transparent, system-initiated checkpoint-restart that directly supports InfiniBand. The new approach simplifies current practice by avoiding the need for a privileged kernel module. The generality of this approach is demonstrated by applying it both to MPI and to Berkeley UPC (Unified Parallel C), in its native mode (without MPI). Scalability is shown by checkpointing 2,048 MPI processes across 128 nodes (with 16 cores per node). The run-time overhead varies between 0.8% and 1.7%. While checkpoint times dominate, the network-only portion of the implementation is shown to require less than 100 milliseconds (not including the time to locally write application memory to stable storage).","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117042084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: ISOBAR hybrid compression-I/O interleaving for large-scale parallel I/O optimization
Authors: Eric R. Schendel, Saurabh V. Pendse, John Jenkins, David A. Boyuka, Zhenhuan Gong, Sriram Lakshminarasimhan, Qing Liu, H. Kolla, Jackie H. Chen, S. Klasky, R. Ross, N. Samatova
DOI: 10.1145/2287076.2287086 (https://doi.org/10.1145/2287076.2287086)
Venue: IEEE International Symposium on High-Performance Parallel Distributed Computing
Publication date: 2012-06-18

Abstract: Current peta-scale data analytics frameworks suffer from a significant performance bottleneck due to an imbalance between their enormous computational power and limited I/O bandwidth. Using data compression schemes to reduce the amount of I/O activity is a promising approach to addressing this problem. In this paper, we propose a hybrid framework for interleaving I/O with data compression to achieve improved I/O throughput alongside a reduced dataset size. We evaluate several interleaving strategies, present theoretical models, and evaluate the efficiency and scalability of our approach through comparative analysis. With our theoretical model, considering 19 real-world scientific datasets from both the public domain and peta-scale simulations, we estimate that the hybrid method can yield a 12% to 46% increase in throughput on hard-to-compress scientific datasets. At the reported peak bandwidth of 60 GB/s of uncompressed data for a current, leadership-class parallel I/O system, this translates into an effective gain of 7 to 28 GB/s in aggregate throughput.

Title: Interference-driven resource management for GPU-based heterogeneous clusters
Authors: R. Phull, Cheng-Hong Li, Kunal Rao, S. Cadambi, S. Chakradhar
DOI: 10.1145/2287076.2287091 (https://doi.org/10.1145/2287076.2287091)
Venue: IEEE International Symposium on High-Performance Parallel Distributed Computing
Publication date: 2012-06-18

Abstract: GPU-based clusters are increasingly being deployed in HPC environments to accelerate a variety of scientific applications. Despite their growing popularity, the GPU devices themselves are under-utilized even for many computationally intensive jobs. This stems from the fact that the typical GPU usage model is one in which a host processor periodically offloads computationally intensive portions of an application to the coprocessor. Since some portions of code cannot be offloaded to the GPU (for example, code performing network communication in MPI applications), this usage model results in periods of time when the GPU is idle. GPUs could be time-shared across jobs to "fill" these idle periods, but unlike CPU resources such as the cache, the effects of sharing the GPU are not well understood. Specifically, two jobs that time-share a single GPU will experience resource contention and interfere with each other. The resulting slow-down could lead to missed job deadlines. Current cluster managers do not support GPU sharing, but instead dedicate GPUs to a job for the job's lifetime.

In this paper, we present a framework to predict and handle interference when two or more jobs time-share GPUs in HPC clusters. Our framework consists of an analysis model, and a dynamic interference detection and response mechanism to detect excessive interference and restart the interfering jobs on different nodes. We implement our framework in Torque, an open-source cluster manager, and using real workloads on an HPC cluster, show that interference-aware two-job colocation (although our method is applicable to colocating more than two jobs) improves GPU utilization by 25%, reduces a job's waiting time in the queue by 39%, and improves job latencies by around 20%.
{"title":"A hybrid local storage transfer scheme for live migration of I/O intensive workloads","authors":"Bogdan Nicolae, F. Cappello","doi":"10.1145/2287076.2287088","DOIUrl":"https://doi.org/10.1145/2287076.2287088","url":null,"abstract":"Live migration of virtual machines (VMs) is key feature of virtualization that is extensively leveraged in IaaS cloud environments: it is the basic building block of several important features, such as load balancing, pro-active fault tolerance, power management, online maintenance, etc. While most live migration efforts concentrate on how to transfer the memory from source to destination during the migration process, comparatively little attention has been devoted to the transfer of storage. This problem is gaining increasing importance: due to performance reasons, virtual machines that run large-scale, data-intensive applications tend to rely on local storage, which poses a difficult challenge on live migration: it needs to handle storage transfer in addition to memory transfer. This paper proposes a memory migration independent approach that addresses this challenge. It relies on a hybrid active push / prioritized prefetch strategy, which makes it highly resilient to rapid changes of disk state exhibited by I/O intensive workloads. At the same time, it is minimally intrusive in order to ensure a maximum of portability with a wide range of hypervisors. Large scale experiments that involve multiple simultaneous migrations of both synthetic benchmarks and a real scientific application show improvements of up to 10x faster migration time, 10x less bandwidth consumption and 8x less performance degradation over state-of-art.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131861370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: VNET/P: bridging the cloud and high performance computing through fast overlay networking
Authors: Lei Xia, Zheng Cui, J. Lange, Yuan Tang, P. Dinda, P. Bridges
DOI: 10.1145/2287076.2287116 (https://doi.org/10.1145/2287076.2287116)
Venue: IEEE International Symposium on High-Performance Parallel Distributed Computing
Publication date: 2012-06-18

Abstract: It is now possible to allow VMs hosting HPC applications to seamlessly bridge distributed cloud resources and tightly-coupled supercomputing and cluster resources. However, to achieve the application performance that the tightly-coupled resources are capable of, it is important that the overlay network not introduce significant overhead relative to the native hardware, which is not the case for current user-level tools, including our own existing VNET/U system. In response, we describe the design, implementation, and evaluation of a layer 2 virtual networking system that has negligible latency and bandwidth overheads in 1-10 Gbps networks. Our system, VNET/P, is directly embedded into our publicly available Palacios virtual machine monitor (VMM). VNET/P achieves native performance on 1 Gbps Ethernet networks and very high performance on 10 Gbps Ethernet networks and InfiniBand. The NAS benchmarks generally achieve over 95% of their native performance on both 1 and 10 Gbps. These results suggest it is feasible to extend a software-based overlay network designed for computing at wide-area scales into tightly-coupled environments.
{"title":"Massively-parallel stream processing under QoS constraints with Nephele","authors":"Björn Lohrmann, Daniel Warneke, O. Kao","doi":"10.1145/2287076.2287117","DOIUrl":"https://doi.org/10.1145/2287076.2287117","url":null,"abstract":"Today, a growing number of commodity devices, like mobile phones or smart meters, is equipped with rich sensors and capable of producing continuous data streams. The sheer amount of these devices and the resulting overall data volumes of the streams raise new challenges with respect to the scalability of existing stream processing systems.\u0000 At the same time, massively-parallel data processing systems like MapReduce have proven that they scale to large numbers of nodes and efficiently organize data transfers between them. Many of these systems also provide streaming capabilities. However, unlike traditional stream processors, these systems have disregarded QoS requirements of prospective stream processing applications so far.\u0000 In this paper we address this gap. First, we analyze common design principles of today's parallel data processing frameworks and identify those principles that provide degrees of freedom in trading off the QoS goals latency and throughput. Second, we propose a scheme which allows these frameworks to detect violations of user-defined latency constraints and optimize the job execution without manual interaction in order to meet these constraints while keeping the throughput as high as possible. As a proof of concept, we implemented our approach for our parallel data processing framework Nephele and evaluated its effectiveness through a comparison with Hadoop Online.\u0000 For a multimedia streaming application we can demonstrate an improved processing latency by factor of at least 15 while preserving high data throughput when needed.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116873677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed approximate spectral clustering for large-scale datasets","authors":"M. Hefeeda, Fei Gao, W. Abd-Almageed","doi":"10.1145/2287076.2287111","DOIUrl":"https://doi.org/10.1145/2287076.2287111","url":null,"abstract":"Data-intensive applications are becoming important in many science and engineering fields, because of the high rates in which data are being generated and the numerous opportunities offered by the sheer amount of these data. Large-scale datasets, however, are challenging to process using many of the current machine learning algorithms due to their high time and space complexities. In this paper, we propose a novel approximation algorithm that enables kernel-based machine learning algorithms to efficiently process very large-scale datasets. While important in many applications, current kernel-based algorithms suffer from a scalability problem as they require computing a kernel matrix which takes O(N2) in time and space to compute and store. The proposed algorithm yields substantial reduction in computation and memory overhead required to compute the kernel matrix, and it does not significantly impact the accuracy of the results. In addition, the level of approximation can be controlled to tradeoff some accuracy of the results with the required computing resources. The algorithm is designed such that it is independent of the subsequently used kernel-based machine learning algorithm, and thus can be used with many of them. To illustrate the effect of the approximation algorithm, we developed a variant of the spectral clustering algorithm on top of it. Furthermore, we present the design of a MapReduce-based implementation of the proposed algorithm. We have implemented this design and run it on our own Hadoop cluster as well as on the Amazon Elastic MapReduce service. Experimental results on synthetic and real datasets demonstrate that significant time and memory savings can be achieved using our algorithm.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128583633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Dynamic binary rewriting and migration for shared-ISA asymmetric, multicore processors: summary
Authors: G. Georgakoudis, S. Lalis, Dimitrios S. Nikolopoulos
DOI: 10.1145/2287076.2287096 (https://doi.org/10.1145/2287076.2287096)
Venue: IEEE International Symposium on High-Performance Parallel Distributed Computing
Publication date: 2012-06-18
{"title":"Highly scalable graph search for the Graph500 benchmark","authors":"Koji Ueno, T. Suzumura","doi":"10.1145/2287076.2287104","DOIUrl":"https://doi.org/10.1145/2287076.2287104","url":null,"abstract":"Graph500 is a new benchmark to rank supercomputers with a large-scale graph search problem. We found that the provided reference implementations are not scalable in a large distributed environment. We devised an optimized method based on 2D partitioning and other methods such as communication compression and vertex sorting. Our optimized implementation can handle BFS (Breadth First Search) of a large graph with 236 (68.7 billion vertices) and 240 (1.1 trillion) edges in 10.58 seconds while using 1366 nodes and 16,392 CPU cores. This performance corresponds to 103.9 GE/s. We also studied the performance characteristics of our optimized implementation and reference implementations on a large distributed memory supercomputer with a Fat-Tree-based Infiniband network.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123834162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}