2011 18th International Conference on High Performance Computing — Latest Publications

Adaptive memory power management techniques for HPC workloads
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152740
Karthik Elangovan, I. Rodero, M. Parashar, F. Guim, I. Hernandez
Abstract: The memory subsystem is responsible for a large fraction of the energy consumed by compute nodes in High Performance Computing (HPC) systems. The rapid increase in the number of cores has been accompanied by a corresponding increase in DRAM capacity and bandwidth, and as a result the memory system consumes a significant share of a compute node's power budget. Consequently, a broad research effort has focused on power management techniques that exploit DRAM low-power modes. However, memory power management continues to present many challenges. In this paper, we study the potential of Dynamic Voltage and Frequency Scaling (DVFS) of the memory subsystem, including the ability to select different frequencies for different memory channels. Our approach tunes voltage and frequency dynamically to maximize energy savings while keeping performance degradation within tolerable limits. We assume that HPC applications do not demand maximum bandwidth throughout their entire execution; these low-demand intervals allow the frequency to be tuned down, so applications can tolerate a reduction in bandwidth to save energy. We study application channel access patterns, and use these patterns to determine the additional energy savings that can be achieved by controlling the channels independently. We then evaluate the proposed DVFS algorithm using a novel hybrid evaluation methodology that combines simulation with executions on real hardware. Our results demonstrate the large potential of adaptive, DVFS-based memory power management techniques for HPC workloads.
Citations: 8
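The per-channel DVFS idea in this abstract — lowering a memory channel's frequency during low-bandwidth phases while bounding slowdown — can be sketched as a simple control loop. Everything below (the frequency steps, the bandwidth-per-MHz constant, the margin, and the sample demands) is hypothetical; the paper's actual algorithm and hardware interface are not reproduced here.

```python
# Hedged sketch: pick the lowest DRAM frequency per channel that still
# covers recent bandwidth demand plus a safety margin (hypothetical values).

FREQ_STEPS_MHZ = [800, 1066, 1333]   # assumed available DDR3 frequency steps
PEAK_BW_PER_MHZ = 0.016              # GB/s of deliverable bandwidth per MHz (assumed)
MARGIN = 1.25                        # headroom to keep slowdown tolerable

def pick_channel_freq(measured_bw_gbs):
    """Return the lowest frequency whose bandwidth covers demand * MARGIN."""
    needed = measured_bw_gbs * MARGIN
    for f in FREQ_STEPS_MHZ:
        if f * PEAK_BW_PER_MHZ >= needed:
            return f
    return FREQ_STEPS_MHZ[-1]        # saturate at the maximum frequency

# Channels are tuned independently, as the paper proposes.
demands = [2.0, 9.5, 14.0, 21.0]     # GB/s observed on four channels (made up)
settings = [pick_channel_freq(bw) for bw in demands]
print(settings)
```

A real implementation would sample memory-controller counters per epoch and program the frequency through the platform's power-management interface; the point of the sketch is only the demand-driven selection.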
STEAMEngine: Driving MapReduce provisioning in the cloud
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152649
Michael Cardosa, Piyush Narang, A. Chandra, Himabindu Pucha, Aameek Singh
Abstract: MapReduce has gained popularity as a distributed data-analysis paradigm, particularly in the cloud, where MapReduce jobs are run on virtual clusters. Provisioning MapReduce jobs in the cloud is an important problem for optimizing several user- and provider-side metrics, such as runtime, cost, throughput, energy, and load. In this paper, we present an intelligent provisioning framework called STEAMEngine, which consists of provisioning algorithms that optimize these metrics through a set of common building blocks. These building blocks enable spatio-temporal tradeoffs unique to MapReduce provisioning: along with its resource requirements (the spatial component), a MapReduce job's runtime (the temporal component) is a critical element for any provisioning algorithm. We also describe two novel provisioning algorithms, a user-driven performance optimization and a provider-driven energy optimization, that leverage these building blocks. Our experimental results on an Amazon EC2 cluster and a local Xen/Hadoop cluster show the benefits of STEAMEngine through improvements in performance and energy via the use of these algorithms and building blocks.
Citations: 31
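The spatio-temporal tradeoff this abstract describes — cluster size (spatial) against job runtime (temporal) — can be illustrated with a toy deadline-constrained provisioner. The runtime model, prices, and billing rule below are invented for illustration; this is not STEAMEngine's actual algorithm.

```python
# Hedged sketch of deadline-constrained MapReduce provisioning: choose the
# cluster size that meets a job deadline at minimum cost, under an assumed
# serial-plus-parallel runtime model and a simplified billing model.

SERIAL_S = 120.0        # non-parallelizable part of the job, seconds (assumed)
PARALLEL_S = 3600.0     # perfectly parallelizable work, seconds (assumed)
PRICE_PER_VM_HR = 0.10  # hypothetical on-demand VM price

def runtime_s(n_vms):
    return SERIAL_S + PARALLEL_S / n_vms

def cost_usd(n_vms):
    # Each VM is billed for the full job runtime (simplification).
    return n_vms * (runtime_s(n_vms) / 3600.0) * PRICE_PER_VM_HR

def provision(deadline_s, max_vms=64):
    """Cheapest cluster size meeting the deadline, or None if infeasible."""
    feasible = [n for n in range(1, max_vms + 1) if runtime_s(n) <= deadline_s]
    return min(feasible, key=cost_usd) if feasible else None

print(provision(600.0))   # a 10-minute deadline
```

Under this model, larger clusters shrink runtime but raise cost, so the optimizer picks the smallest feasible size; a provider-side energy objective would simply swap the cost function.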
Comparing archival policies for Blue Waters
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152428
Franck Cappello, Mathias Jacquelin, L. Marchal, Yves Robert, Marc Snir
Abstract: This paper introduces two new tape archival policies that can improve tape archive performance in certain regimes compared to the classical RAIT (Redundant Array of Independent Tapes) policy. The first policy, PARALLEL, still requires as many parallel tape drives as RAIT but pre-computes large data stripes that are written contiguously on tapes to increase write/read performance. The second policy, VERTICAL, writes contiguous data to a single tape, updating error-correcting information on the fly and delaying its archival until enough data has been archived; this reduces the number of tape drives used per user request to one. The performance of the three policies (RAIT, PARALLEL, and VERTICAL) is assessed through extensive simulations, using a hardware configuration and a distribution of I/O requests similar to those expected on the Blue Waters system. These simulations show that VERTICAL is the most suitable policy for small files, whereas PARALLEL should be used for files larger than 1 GB. We also demonstrate that RAIT never outperforms both proposed policies, and that a heterogeneous policy mixing VERTICAL and PARALLEL performs 10 times better than any other policy.
Citations: 1
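The VERTICAL policy's core mechanism — streaming data blocks to a single tape while folding each block into running error-correcting information that is archived only once a group is full — can be sketched with XOR parity. The group size and block layout are made up; the paper's actual coding scheme is not specified in the abstract.

```python
# Hedged sketch of the VERTICAL idea: one output stream per request, with
# parity updated on the fly and emitted only after a full group of blocks.

GROUP = 4  # blocks per parity group (hypothetical)

def archive_vertical(blocks, block_size):
    """Yield ('data', block) per block and ('parity', p) per full group."""
    parity = bytearray(block_size)
    count = 0
    for b in blocks:
        for i, byte in enumerate(b):
            parity[i] ^= byte          # update error-correcting info on the fly
        count += 1
        yield ('data', b)
        if count == GROUP:             # delay parity until enough data archived
            yield ('parity', bytes(parity))
            parity = bytearray(block_size)
            count = 0

blocks = [bytes([i]) * 8 for i in range(4)]
out = list(archive_vertical(blocks, 8))
print([kind for kind, _ in out])
```

RAIT, by contrast, would stripe each block across several drives at once; here a single drive sees a contiguous stream, which is what makes the policy attractive for small files.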
The impact of hyper-threading on processor resource utilization in production applications
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152743
S. Saini, Haoqiang Jin, R. Hood, David Barker, P. Mehrotra, R. Biswas
Abstract: Intel provides Hyper-Threading (HT) in processors based on its Pentium and Nehalem micro-architectures, such as the Westmere-EP. HT enables two threads to execute on each core in order to hide latencies related to data access. These two threads can execute simultaneously, filling unused stages in the functional-unit pipelines. To aid better understanding of HT-related issues, we collect Performance Monitoring Unit (PMU) data (instructions retired, unhalted core cycles, L2 and L3 cache hits and misses, vector and scalar floating-point operations, etc.). We then use the PMU data to calculate a new efficiency metric in order to quantify processor resource utilization and to compare that utilization between single-threading (ST) and HT modes. We also study performance gain using unhalted core cycles, code efficiency in using the processor's vector units, and the impact of HT mode on shared resources such as the L2 and L3 caches. Results using four full-scale, production-quality scientific applications from computational fluid dynamics (CFD) used by NASA scientists indicate that HT generally improves processor resource utilization efficiency, but does not necessarily translate into overall application performance gain.
Citations: 48
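The distinction the abstract draws — better resource utilization under HT without a matching application speedup — can be made concrete with a utilization proxy derived from PMU counters. The counter values below are invented, and instructions-per-cycle is only a stand-in for the paper's (unspecified) efficiency metric.

```python
# Hedged sketch: compare an ST run and an HT run of the same core using a
# PMU-derived utilization proxy (IPC) alongside wall-clock speedup.
# All counter values are hypothetical.

def ipc(instructions_retired, unhalted_core_cycles):
    """Instructions per unhalted core cycle, a common utilization proxy."""
    return instructions_retired / unhalted_core_cycles

# HT retires more instructions per cycle by filling pipeline bubbles,
# yet the wall-clock gain can be much smaller (shared caches, ports).
st = {'inst': 8.0e9, 'cycles': 6.0e9, 'seconds': 2.4}
ht = {'inst': 9.6e9, 'cycles': 6.0e9, 'seconds': 2.2}

util_gain = ipc(ht['inst'], ht['cycles']) / ipc(st['inst'], st['cycles'])
speedup = st['seconds'] / ht['seconds']
print(round(util_gain, 2), round(speedup, 2))
```

In this made-up example the utilization metric improves more than the runtime does, which mirrors the paper's qualitative finding.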
Enabling CUDA acceleration within virtual machines using rCUDA
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152718
J. Duato, Antonio J. Peña, F. Silla, J. C. Fernández, R. Mayo, E. S. Quintana‐Ortí
Abstract: The hardware and software advances of Graphics Processing Units (GPUs) have favored the development of GPGPU (General-Purpose Computation on GPUs) and its adoption in many scientific, engineering, and industrial areas. Thus, GPUs are increasingly being introduced in high-performance computing systems as well as in datacenters. Virtualization technologies are also receiving rising interest in these domains because of the acquisition and maintenance savings they bring. Several works on GPU virtualization exist; however, there is no standard solution allowing access to GPGPU capabilities from virtual machine environments such as VMware, Xen, VirtualBox, or KVM. This lack of a standard solution is delaying the integration of GPGPU into these domains. In this paper, we propose a first step towards a general and open-source approach for using GPGPU features within VMs. In particular, we describe the use of rCUDA, a GPGPU virtualization framework, to permit the execution of GPU-accelerated applications within virtual machines (VMs), thus enabling GPGPU capabilities in any virtualized environment. Our experiments with rCUDA in the context of KVM and VirtualBox on a system equipped with two NVIDIA GeForce 9800 GX2 cards illustrate the overhead introduced by the rCUDA middleware and demonstrate the feasibility and scalability of this general virtualization solution. Experimental results show that the overhead is proportional to the dataset size, while the scalability is similar to that of the native environment.
Citations: 80
Spectral evolution simulation on leading multi-socket, multicore platforms
Pub Date: 2011-12-18 | DOI: 10.1109/HIPC.2011.6152730
S. Tabik, P. Mimica, O. Plata, E. Zapata, L. F. Romero
Abstract: Spectral evolution simulations based on observed Very Long Baseline Interferometry (VLBI) radio maps are of paramount importance for understanding the nature of extragalactic objects in astrophysics. This work analyzes the performance and scaling of a spectral evolution algorithm on three leading multi-socket, multi-core architectures. We evaluate three parallel models with different levels of data sharing: a sharing approach, a privatizing approach, and a hybrid approach. Our experiments show that the data-privatizing model is reasonably efficient on medium-scale multi-socket, multi-core systems (up to 48 cores), while, regardless of algorithmic and scheduling optimizations, the sharing approach is unable to reach acceptable scalability on more than one socket. The hybrid model with a specific level of data sharing gives the best scalability over all the considered multi-socket, multi-core systems.
Citations: 0
High-level template for the task-based parallel wavefront pattern
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152717
Antonio J. Dios, R. Asenjo, A. Navarro, F. Corbera, E. Zapata
Abstract: With the arrival of multicore processors, it has become a matter of urgency to introduce parallel programming into mainstream computing. In emerging applications, one class of computational problem that challenges programmers is the wavefront pattern. A particular characteristic of this pattern is multi-dimensional streaming of computations that must follow a dependence pattern. The modern software stack for multicore systems offers task-based programming libraries such as TBB (Threading Building Blocks) that provide an execution model based on lightweight asynchronous tasks. We suggest that TBB provides useful features for improving the scalability of these kinds of codes, but at the cost of leaving some low-level task management details to the programmer. In this paper, we discuss such low-level task management issues and incorporate them into a high-level TBB-based template. The goal of the template is to improve programmer productivity so that a non-expert user can easily code complex wavefront problems without having to deal with task creation, synchronization, or scheduling mechanisms. With our template, the user only has to specify a definition file with the wavefront dependence pattern and the function that each task has to execute. We also describe our experience using the TBB template to code four complex, real wavefront problems. In these experiments, the template implementations reduced the programming effort by 25% to 50%, at the cost of increasing overhead by up to 5% compared to manual implementations of the same problems.
Citations: 11
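The task management that such a template hides — per-cell dependency counters, spawning a task when its count reaches zero — is the textbook way to run a wavefront. Below is a minimal sequential simulation of that scheme in Python (the paper's template is C++/TBB) for the common 2-D case where each cell depends on its north and west neighbours; the dependence pattern and body function stand in for the user's definition file.

```python
# Hedged sketch of counter-based wavefront scheduling: a cell becomes ready
# when all its predecessors (north and west neighbours here) have finished.

from collections import deque

def wavefront(n, m, body):
    """Run body(i, j) over an n x m grid in a dependence-respecting order."""
    counters = {(i, j): (i > 0) + (j > 0) for i in range(n) for j in range(m)}
    ready = deque([(0, 0)])                    # only the corner has no deps
    order = []
    while ready:
        i, j = ready.popleft()
        body(i, j)
        order.append((i, j))
        for succ in ((i + 1, j), (i, j + 1)):  # release south/east successors
            if succ in counters:
                counters[succ] -= 1
                if counters[succ] == 0:        # all dependencies satisfied
                    ready.append(succ)
    return order

result = {}
order = wavefront(3, 3, lambda i, j: result.setdefault((i, j), i + j))
print(len(order))
```

In a real task library the `ready` queue becomes task spawns and the counter decrements must be atomic; that bookkeeping is exactly what the paper's template takes off the programmer's hands.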
Parallel implementation of MOPSO on GPU using OpenCL and CUDA
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152719
J. Arun, Manoj Mishra, Sheshasayee V. Subramaniam
Abstract: GPUs have brought supercomputing to the desk by offering hundreds of processing cores at very low cost. This has motivated researchers to implement and test parallel solutions to compute-intensive problems on GPUs. Most real-world optimization problems are NP-hard and therefore compute intensive, and meta-heuristics are frequently used to solve them. Multi-Objective Particle Swarm Optimization (MOPSO) is one meta-heuristic that has attracted many researchers due to its accuracy and simplicity. In the last couple of years, many parallel implementations of MOPSO have been proposed in the literature; however, none has implemented and tested the performance of MOPSO on a GPU. In this paper, we describe our implementation of MOPSO on GPU using CUDA and OpenCL, two of the most popular GPU frameworks for writing parallel applications. The performance of both implementations has been compared with a sequential implementation of MOPSO through simulations. Results show that performance can be improved by 90 percent using these parallel implementations. We then present a parallel archiving technique and implement MOPSO on GPU with the proposed archiving technique using CUDA. Simulation results show that the parallel archiving technique further improves the speedup.
Citations: 16
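The per-particle update at the heart of (MO)PSO is what maps naturally onto one GPU thread per particle. The Python sketch below shows only that core velocity/position update with conventional textbook coefficients; the multi-objective leader selection and the paper's parallel archiving technique are omitted, and none of the constants come from the paper.

```python
# Hedged sketch of the PSO particle update (the per-particle kernel body):
# v <- W*v + C1*r1*(pbest - x) + C2*r2*(gbest - x);  x <- x + v

import random

W, C1, C2 = 0.7, 1.5, 1.5   # inertia and acceleration coefficients (textbook)

def step(pos, vel, pbest, gbest, rng):
    """One update of a single particle's position and velocity."""
    new_pos, new_vel = [], []
    for x, v, pb, gb in zip(pos, vel, pbest, gbest):
        v = W * v + C1 * rng.random() * (pb - x) + C2 * rng.random() * (gb - x)
        new_vel.append(v)
        new_pos.append(x + v)
    return new_pos, new_vel

rng = random.Random(0)       # seeded for reproducibility
pos, vel = [1.0, -2.0], [0.0, 0.0]
pos, vel = step(pos, vel, pbest=[0.0, 0.0], gbest=[0.0, 0.0], rng=rng)
print(len(pos))
```

On a GPU, each thread would run this body for its own particle with device-side random numbers; in MOPSO, `gbest` is replaced by a leader drawn from the non-dominated archive, which is where the paper's parallel archiving comes in.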
A multiresolution data model for improving simulation I/O performance
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152747
A. Foulks, R. Bergeron
Abstract: Numerical simulations running on very large high-performance computing clusters still suffer from the I/O bottleneck. The cost of communication can overwhelm the cost of computation, and scales inversely with the number of processors used in the cluster. In previous work we developed a multiresolution data model to help improve performance for visualizations of very large multi-dimensional scientific data sets. In our approach, the data is represented as a multi-level hierarchy, and reconstructive error analysis is used to identify the regions in the data where data loss is greatest. We have incorporated this data model into the OpenGGCM solar wind simulation environment. In this paper, we demonstrate that this approach can reduce I/O and improve the overall performance of a large numerical simulation environment.
Citations: 0
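The two ingredients named in this abstract — a multi-level hierarchy built by coarsening, and reconstructive error analysis to flag regions with the greatest data loss — can be shown in one dimension. The averaging rule and the max-error criterion below are illustrative choices, not necessarily the paper's; its data is multi-dimensional.

```python
# Hedged sketch: coarsen a field by pairwise averaging and score each
# region by the reconstruction error incurred if only the coarse value
# is stored. High-error regions would keep their fine-level data.

def coarsen(data):
    """One hierarchy level: halve resolution by averaging adjacent pairs."""
    return [(data[i] + data[i + 1]) / 2.0 for i in range(0, len(data), 2)]

def region_errors(data):
    """Max absolute error per pair when reconstructed from the coarse mean."""
    coarse = coarsen(data)
    return [max(abs(data[2 * i] - c), abs(data[2 * i + 1] - c))
            for i, c in enumerate(coarse)]

data = [1.0, 1.0, 2.0, 6.0, 3.0, 3.0, 0.0, 8.0]   # made-up 1-D field
errs = region_errors(data)
worst = max(range(len(errs)), key=errs.__getitem__)
print(worst, errs[worst])
```

Smooth regions (equal pairs here) reconstruct exactly and can be written at low resolution, which is where the I/O savings come from; only the high-error regions need full-resolution output.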
Multi-threaded UPC runtime with network endpoints: Design alternatives and evaluation on multi-core architectures
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152734
Miao Luo, Jithin Jose, S. Sur, D. Panda
Abstract: Multi-core architectures are becoming more and more popular in the High End Computing (HEC) era. Recent trends toward high-productivity computing, in conjunction with advanced multi-core and network architectures, have increased interest in Partitioned Global Address Space (PGAS) languages due to their high productivity and broad applicability. Unified Parallel C (UPC) is an emerging PGAS language. In this paper, we compare design alternatives for a high-performance, scalable UPC runtime on multi-core nodes from several aspects: performance, portability, interoperability, and support for irregular parallelism. Based on our analysis, we present a novel design of a multi-threaded UPC runtime that supports multiple network endpoints. Our runtime dramatically decreases network access contention, yielding 80% lower latency for fine-grained memget/memput operations and almost double the bandwidth for medium-size messages compared to the multi-threaded Berkeley UPC Runtime. Furthermore, the multi-endpoint design opens new doors for runtime optimizations, such as support for irregular parallelism: we utilize true network helper threads and load balancing via work stealing in the runtime. Our evaluation with novel benchmarks shows that our runtime can achieve 90% of peak efficiency, a factor of 1.3 better than the existing Berkeley UPC Runtime. To the best of our knowledge, this is the first work to propose a multi-network-endpoint-capable UPC runtime design for modern multi-core systems.
Citations: 15