Title: CMCP: a novel page replacement policy for system level hierarchical memory management on many-cores
Authors: Balazs Gerofi, A. Shimada, A. Hori, Masamichi Takagi, Y. Ishikawa
DOI: 10.1145/2600212.2600231 (https://doi.org/10.1145/2600212.2600231)
Venue: IEEE International Symposium on High-Performance Parallel Distributed Computing
Publication date: 2014-06-23

Abstract: The increasing prevalence of co-processors such as the Intel Xeon Phi has been reshaping the high performance computing (HPC) landscape. The Xeon Phi comes with a large number of power-efficient CPU cores, but at the same time it is a highly memory-constrained environment, leaving the task of memory management entirely up to application developers. To reduce programming complexity, we focus on application-transparent, operating system (OS) level hierarchical memory management.

In particular, we first show that state-of-the-art page replacement policies, such as approximations of the least recently used (LRU) policy, are not good candidates for massive many-cores due to the inherent cost of the remote translation lookaside buffer (TLB) invalidations they require for collecting page usage statistics. The price of concurrent remote TLB invalidations grows rapidly with the number of CPU cores in many-core systems and outpaces the benefit of the page replacement algorithm itself. Building upon our previous proposal, per-core Partially Separated Page Tables (PSPT), in this paper we propose the Core-Map Count based Priority (CMCP) page replacement policy, which exploits auxiliary knowledge of how many CPU cores map each page and prioritizes pages accordingly. In turn, it can avoid TLB invalidations for collecting page usage statistics altogether. Additionally, we describe and provide an implementation of the experimental 64kB page support of the Intel Xeon Phi and reveal some intriguing insights regarding its performance. We evaluate our proposal on various applications and find that CMCP can outperform state-of-the-art page replacement policies by up to 38%. We also show that the choice of appropriate page size depends primarily on the degree of memory constraint in the system.
{"title":"Transparent checkpoint-restart over infiniband","authors":"Jiajun Cao, Gregory Kerr, K. Arya, G. Cooperman","doi":"10.1145/2600212.2600219","DOIUrl":"https://doi.org/10.1145/2600212.2600219","url":null,"abstract":"Transparently saving the state of the InfiniBand network as part of distributed checkpointing has been a long-standing challenge for researchers. The lack of a solution has forced typical MPI implementations to include custom checkpoint-restart services that \"tear down\" the network, checkpoint each node in isolation, and then re-connect the network again. This work presents the first example of transparent, system-initiated checkpoint-restart that directly supports InfiniBand. The new approach simplifies current practice by avoiding the need for a privileged kernel module. The generality of this approach is demonstrated by applying it both to MPI and to Berkeley UPC (Unified Parallel C), in its native mode (without MPI). Scalability is shown by checkpointing 2,048 MPI processes across 128 nodes (with 16 cores per node). The run-time overhead varies between 0.8% and 1.7%. While checkpoint times dominate, the network-only portion of the implementation is shown to require less than 100 milliseconds (not including the time to locally write application memory to stable storage).","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117042084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: ISOBAR hybrid compression-I/O interleaving for large-scale parallel I/O optimization
Authors: Eric R. Schendel, Saurabh V. Pendse, John Jenkins, David A. Boyuka, Zhenhuan Gong, Sriram Lakshminarasimhan, Qing Liu, H. Kolla, Jackie H. Chen, S. Klasky, R. Ross, N. Samatova
DOI: 10.1145/2287076.2287086 (https://doi.org/10.1145/2287076.2287086)
Venue: IEEE International Symposium on High-Performance Parallel Distributed Computing
Publication date: 2012-06-18

Abstract: Current peta-scale data analytics frameworks suffer from a significant performance bottleneck due to an imbalance between their enormous computational power and limited I/O bandwidth. Using data compression schemes to reduce the amount of I/O activity is a promising approach to addressing this problem. In this paper, we propose a hybrid framework for interleaving I/O with data compression to achieve improved I/O throughput alongside a reduced dataset size. We evaluate several interleaving strategies, present theoretical models, and evaluate the efficiency and scalability of our approach through comparative analysis. With our theoretical model, considering 19 real-world scientific datasets from both the public domain and peta-scale simulations, we estimate that the hybrid method can yield a 12% to 46% increase in throughput on hard-to-compress scientific datasets. At the reported peak bandwidth of 60 GB/s of uncompressed data for a current, leadership-class parallel I/O system, this translates into an effective gain of 7 to 28 GB/s in aggregate throughput.

Title: Interference-driven resource management for GPU-based heterogeneous clusters
Authors: R. Phull, Cheng-Hong Li, Kunal Rao, S. Cadambi, S. Chakradhar
DOI: 10.1145/2287076.2287091 (https://doi.org/10.1145/2287076.2287091)
Venue: IEEE International Symposium on High-Performance Parallel Distributed Computing
Publication date: 2012-06-18

Abstract: GPU-based clusters are increasingly being deployed in HPC environments to accelerate a variety of scientific applications. Despite their growing popularity, the GPU devices themselves are under-utilized even for many computationally intensive jobs. This stems from the fact that the typical GPU usage model is one in which a host processor periodically offloads computationally intensive portions of an application to the coprocessor. Since some portions of code cannot be offloaded to the GPU (for example, code performing network communication in MPI applications), this usage model results in periods of time when the GPU is idle. GPUs could be time-shared across jobs to "fill" these idle periods, but unlike CPU resources such as the cache, the effects of sharing the GPU are not well understood. Specifically, two jobs that time-share a single GPU will experience resource contention and interfere with each other. The resulting slow-down could lead to missed job deadlines. Current cluster managers do not support GPU sharing, but instead dedicate GPUs to a job for the job's lifetime.

In this paper, we present a framework to predict and handle interference when two or more jobs time-share GPUs in HPC clusters. Our framework consists of an analysis model, and a dynamic interference detection and response mechanism to detect excessive interference and restart the interfering jobs on different nodes. We implement our framework in Torque, an open-source cluster manager, and using real workloads on an HPC cluster, show that interference-aware two-job colocation (although our method is applicable to colocating more than two jobs) improves GPU utilization by 25%, reduces a job's waiting time in the queue by 39%, and improves job latencies by around 20%.
{"title":"A hybrid local storage transfer scheme for live migration of I/O intensive workloads","authors":"Bogdan Nicolae, F. Cappello","doi":"10.1145/2287076.2287088","DOIUrl":"https://doi.org/10.1145/2287076.2287088","url":null,"abstract":"Live migration of virtual machines (VMs) is key feature of virtualization that is extensively leveraged in IaaS cloud environments: it is the basic building block of several important features, such as load balancing, pro-active fault tolerance, power management, online maintenance, etc. While most live migration efforts concentrate on how to transfer the memory from source to destination during the migration process, comparatively little attention has been devoted to the transfer of storage. This problem is gaining increasing importance: due to performance reasons, virtual machines that run large-scale, data-intensive applications tend to rely on local storage, which poses a difficult challenge on live migration: it needs to handle storage transfer in addition to memory transfer. This paper proposes a memory migration independent approach that addresses this challenge. It relies on a hybrid active push / prioritized prefetch strategy, which makes it highly resilient to rapid changes of disk state exhibited by I/O intensive workloads. At the same time, it is minimally intrusive in order to ensure a maximum of portability with a wide range of hypervisors. Large scale experiments that involve multiple simultaneous migrations of both synthetic benchmarks and a real scientific application show improvements of up to 10x faster migration time, 10x less bandwidth consumption and 8x less performance degradation over state-of-art.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131861370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: VNET/P: bridging the cloud and high performance computing through fast overlay networking
Authors: Lei Xia, Zheng Cui, J. Lange, Yuan Tang, P. Dinda, P. Bridges
DOI: 10.1145/2287076.2287116 (https://doi.org/10.1145/2287076.2287116)
Venue: IEEE International Symposium on High-Performance Parallel Distributed Computing
Publication date: 2012-06-18

Abstract: It is now possible to allow VMs hosting HPC applications to seamlessly bridge distributed cloud resources and tightly-coupled supercomputing and cluster resources. However, to achieve the application performance that the tightly-coupled resources are capable of, it is important that the overlay network not introduce significant overhead relative to the native hardware, which is not the case for current user-level tools, including our own existing VNET/U system. In response, we describe the design, implementation, and evaluation of a layer 2 virtual networking system that has negligible latency and bandwidth overheads in 1-10 Gbps networks. Our system, VNET/P, is directly embedded into our publicly available Palacios virtual machine monitor (VMM). VNET/P achieves native performance on 1 Gbps Ethernet networks and very high performance on 10 Gbps Ethernet networks and InfiniBand. The NAS benchmarks generally achieve over 95% of their native performance on both 1 and 10 Gbps. These results suggest it is feasible to extend a software-based overlay network designed for computing at wide-area scales into tightly-coupled environments.
{"title":"Massively-parallel stream processing under QoS constraints with Nephele","authors":"Björn Lohrmann, Daniel Warneke, O. Kao","doi":"10.1145/2287076.2287117","DOIUrl":"https://doi.org/10.1145/2287076.2287117","url":null,"abstract":"Today, a growing number of commodity devices, like mobile phones or smart meters, is equipped with rich sensors and capable of producing continuous data streams. The sheer amount of these devices and the resulting overall data volumes of the streams raise new challenges with respect to the scalability of existing stream processing systems.\u0000 At the same time, massively-parallel data processing systems like MapReduce have proven that they scale to large numbers of nodes and efficiently organize data transfers between them. Many of these systems also provide streaming capabilities. However, unlike traditional stream processors, these systems have disregarded QoS requirements of prospective stream processing applications so far.\u0000 In this paper we address this gap. First, we analyze common design principles of today's parallel data processing frameworks and identify those principles that provide degrees of freedom in trading off the QoS goals latency and throughput. Second, we propose a scheme which allows these frameworks to detect violations of user-defined latency constraints and optimize the job execution without manual interaction in order to meet these constraints while keeping the throughput as high as possible. As a proof of concept, we implemented our approach for our parallel data processing framework Nephele and evaluated its effectiveness through a comparison with Hadoop Online.\u0000 For a multimedia streaming application we can demonstrate an improved processing latency by factor of at least 15 while preserving high data throughput when needed.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116873677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed approximate spectral clustering for large-scale datasets","authors":"M. Hefeeda, Fei Gao, W. Abd-Almageed","doi":"10.1145/2287076.2287111","DOIUrl":"https://doi.org/10.1145/2287076.2287111","url":null,"abstract":"Data-intensive applications are becoming important in many science and engineering fields, because of the high rates in which data are being generated and the numerous opportunities offered by the sheer amount of these data. Large-scale datasets, however, are challenging to process using many of the current machine learning algorithms due to their high time and space complexities. In this paper, we propose a novel approximation algorithm that enables kernel-based machine learning algorithms to efficiently process very large-scale datasets. While important in many applications, current kernel-based algorithms suffer from a scalability problem as they require computing a kernel matrix which takes O(N2) in time and space to compute and store. The proposed algorithm yields substantial reduction in computation and memory overhead required to compute the kernel matrix, and it does not significantly impact the accuracy of the results. In addition, the level of approximation can be controlled to tradeoff some accuracy of the results with the required computing resources. The algorithm is designed such that it is independent of the subsequently used kernel-based machine learning algorithm, and thus can be used with many of them. To illustrate the effect of the approximation algorithm, we developed a variant of the spectral clustering algorithm on top of it. Furthermore, we present the design of a MapReduce-based implementation of the proposed algorithm. We have implemented this design and run it on our own Hadoop cluster as well as on the Amazon Elastic MapReduce service. Experimental results on synthetic and real datasets demonstrate that significant time and memory savings can be achieved using our algorithm.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128583633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Title: Dynamic binary rewriting and migration for shared-ISA asymmetric, multicore processors: summary
Authors: G. Georgakoudis, S. Lalis, Dimitrios S. Nikolopoulos
DOI: 10.1145/2287076.2287096 (https://doi.org/10.1145/2287076.2287096)
Venue: IEEE International Symposium on High-Performance Parallel Distributed Computing
Publication date: 2012-06-18
{"title":"Highly scalable graph search for the Graph500 benchmark","authors":"Koji Ueno, T. Suzumura","doi":"10.1145/2287076.2287104","DOIUrl":"https://doi.org/10.1145/2287076.2287104","url":null,"abstract":"Graph500 is a new benchmark to rank supercomputers with a large-scale graph search problem. We found that the provided reference implementations are not scalable in a large distributed environment. We devised an optimized method based on 2D partitioning and other methods such as communication compression and vertex sorting. Our optimized implementation can handle BFS (Breadth First Search) of a large graph with 236 (68.7 billion vertices) and 240 (1.1 trillion) edges in 10.58 seconds while using 1366 nodes and 16,392 CPU cores. This performance corresponds to 103.9 GE/s. We also studied the performance characteristics of our optimized implementation and reference implementations on a large distributed memory supercomputer with a Fat-Tree-based Infiniband network.","PeriodicalId":330072,"journal":{"name":"IEEE International Symposium on High-Performance Parallel Distributed Computing","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123834162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}