2011 18th International Conference on High Performance Computing — Latest Publications

Adaptive memory power management techniques for HPC workloads
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152740
Karthik Elangovan, I. Rodero, M. Parashar, F. Guim, I. Hernandez
Abstract: The memory subsystem is responsible for a large fraction of the energy consumed by compute nodes in High Performance Computing (HPC) systems. The rapid increase in the number of cores has been accompanied by a corresponding increase in DRAM capacity and bandwidth, and as a result the memory system consumes a significant share of a compute node's power budget. Consequently, a broad research effort has focused on power management techniques that exploit DRAM low-power modes. However, memory power management continues to present many challenges. In this paper, we study the potential of Dynamic Voltage and Frequency Scaling (DVFS) of the memory subsystem, including the ability to select different frequencies for different memory channels. Our approach tunes voltage and frequency dynamically to maximize energy savings while keeping performance degradation within tolerable limits. We assume that HPC applications do not demand maximum bandwidth throughout their entire execution; these low-demand intervals allow the frequency to be tuned down, so applications can tolerate a reduction in bandwidth to save energy. We study application channel access patterns, and use these patterns to determine the additional energy savings that can be achieved by controlling the channels independently. We then evaluate the proposed DVFS algorithm using a novel hybrid evaluation methodology that combines simulation with executions on real hardware. Our results demonstrate the large potential of adaptive, DVFS-based memory power management techniques for HPC workloads.
Citations: 8
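The per-channel DVFS idea in this abstract — lowering a memory channel's frequency during low-bandwidth phases while bounding slowdown — can be sketched as a simple control loop. Everything below (the frequency steps, the bandwidth-per-MHz constant, the margin, and the sample demands) is hypothetical; the paper's actual algorithm and hardware interface are not reproduced here.

```python
# Hedged sketch: pick the lowest DRAM frequency per channel that still
# covers recent bandwidth demand plus a safety margin (hypothetical values).

FREQ_STEPS_MHZ = [800, 1066, 1333]   # assumed available DDR3 frequency steps
PEAK_BW_PER_MHZ = 0.016              # GB/s of deliverable bandwidth per MHz (assumed)
MARGIN = 1.25                        # headroom to keep slowdown tolerable

def pick_channel_freq(measured_bw_gbs):
    """Return the lowest frequency whose bandwidth covers demand * MARGIN."""
    needed = measured_bw_gbs * MARGIN
    for f in FREQ_STEPS_MHZ:
        if f * PEAK_BW_PER_MHZ >= needed:
            return f
    return FREQ_STEPS_MHZ[-1]        # saturate at the maximum frequency

# Channels are tuned independently, as the paper proposes.
demands = [2.0, 9.5, 14.0, 21.0]     # GB/s observed on four channels (made up)
settings = [pick_channel_freq(bw) for bw in demands]
print(settings)
```

A real implementation would sample memory-controller counters per epoch and program the frequency through the platform's power-management interface; the point of the sketch is only the demand-driven selection.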
STEAMEngine: Driving MapReduce provisioning in the cloud
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152649
Michael Cardosa, Piyush Narang, A. Chandra, Himabindu Pucha, Aameek Singh
Abstract: MapReduce has gained popularity as a distributed data-analysis paradigm, particularly in the cloud, where MapReduce jobs are run on virtual clusters. Provisioning MapReduce jobs in the cloud is an important problem for optimizing several user- and provider-side metrics, such as runtime, cost, throughput, energy, and load. In this paper, we present an intelligent provisioning framework called STEAMEngine, which consists of provisioning algorithms that optimize these metrics through a set of common building blocks. These building blocks enable spatio-temporal tradeoffs unique to MapReduce provisioning: along with its resource requirements (the spatial component), a MapReduce job's runtime (the temporal component) is a critical element for any provisioning algorithm. We also describe two novel provisioning algorithms, a user-driven performance optimization and a provider-driven energy optimization, that leverage these building blocks. Our experimental results on an Amazon EC2 cluster and a local Xen/Hadoop cluster show the benefits of STEAMEngine through improvements in performance and energy via the use of these algorithms and building blocks.
Citations: 31
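The spatio-temporal tradeoff this abstract describes — cluster size (spatial) against job runtime (temporal) — can be illustrated with a toy deadline-constrained provisioner. The runtime model, prices, and billing rule below are invented for illustration; this is not STEAMEngine's actual algorithm.

```python
# Hedged sketch of deadline-constrained MapReduce provisioning: choose the
# cluster size that meets a job deadline at minimum cost, under an assumed
# serial-plus-parallel runtime model and a simplified billing model.

SERIAL_S = 120.0        # non-parallelizable part of the job, seconds (assumed)
PARALLEL_S = 3600.0     # perfectly parallelizable work, seconds (assumed)
PRICE_PER_VM_HR = 0.10  # hypothetical on-demand VM price

def runtime_s(n_vms):
    return SERIAL_S + PARALLEL_S / n_vms

def cost_usd(n_vms):
    # Each VM is billed for the full job runtime (simplification).
    return n_vms * (runtime_s(n_vms) / 3600.0) * PRICE_PER_VM_HR

def provision(deadline_s, max_vms=64):
    """Cheapest cluster size meeting the deadline, or None if infeasible."""
    feasible = [n for n in range(1, max_vms + 1) if runtime_s(n) <= deadline_s]
    return min(feasible, key=cost_usd) if feasible else None

print(provision(600.0))   # a 10-minute deadline
```

Under this model, larger clusters shrink runtime but raise cost, so the optimizer picks the smallest feasible size; a provider-side energy objective would simply swap the cost function.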
Comparing archival policies for Blue Waters
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152428
Franck Cappello, Mathias Jacquelin, L. Marchal, Yves Robert, Marc Snir
Abstract: This paper introduces two new tape archival policies that can improve tape archive performance in certain regimes compared to the classical RAIT (Redundant Array of Independent Tapes) policy. The first policy, PARALLEL, still requires as many parallel tape drives as RAIT but pre-computes large data stripes that are written contiguously on tapes to increase write/read performance. The second policy, VERTICAL, writes contiguous data to a single tape, updating error-correcting information on the fly and delaying its archival until enough data has been archived; this reduces the number of tape drives used per user request to one. The performance of the three policies (RAIT, PARALLEL, and VERTICAL) is assessed through extensive simulations, using a hardware configuration and a distribution of I/O requests similar to those expected on the Blue Waters system. These simulations show that VERTICAL is the most suitable policy for small files, whereas PARALLEL should be used for files larger than 1 GB. We also demonstrate that RAIT never outperforms both proposed policies, and that a heterogeneous policy mixing VERTICAL and PARALLEL performs 10 times better than any other policy.
Citations: 1
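The VERTICAL policy's core mechanism — streaming data blocks to a single tape while folding each block into running error-correcting information that is archived only once a group is full — can be sketched with XOR parity. The group size and block layout are made up; the paper's actual coding scheme is not specified in the abstract.

```python
# Hedged sketch of the VERTICAL idea: one output stream per request, with
# parity updated on the fly and emitted only after a full group of blocks.

GROUP = 4  # blocks per parity group (hypothetical)

def archive_vertical(blocks, block_size):
    """Yield ('data', block) per block and ('parity', p) per full group."""
    parity = bytearray(block_size)
    count = 0
    for b in blocks:
        for i, byte in enumerate(b):
            parity[i] ^= byte          # update error-correcting info on the fly
        count += 1
        yield ('data', b)
        if count == GROUP:             # delay parity until enough data archived
            yield ('parity', bytes(parity))
            parity = bytearray(block_size)
            count = 0

blocks = [bytes([i]) * 8 for i in range(4)]
out = list(archive_vertical(blocks, 8))
print([kind for kind, _ in out])
```

RAIT, by contrast, would stripe each block across several drives at once; here a single drive sees a contiguous stream, which is what makes the policy attractive for small files.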
The impact of hyper-threading on processor resource utilization in production applications
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152743
S. Saini, Haoqiang Jin, R. Hood, David Barker, P. Mehrotra, R. Biswas
Abstract: Intel provides Hyper-Threading (HT) in processors based on its Pentium and Nehalem micro-architectures, such as the Westmere-EP. HT enables two threads to execute on each core in order to hide latencies related to data access. These two threads can execute simultaneously, filling unused stages in the functional-unit pipelines. To aid better understanding of HT-related issues, we collect Performance Monitoring Unit (PMU) data (instructions retired, unhalted core cycles, L2 and L3 cache hits and misses, vector and scalar floating-point operations, etc.). We then use the PMU data to calculate a new efficiency metric in order to quantify processor resource utilization and to compare that utilization between single-threading (ST) and HT modes. We also study performance gain using unhalted core cycles, code efficiency in using the processor's vector units, and the impact of HT mode on shared resources such as the L2 and L3 caches. Results using four full-scale, production-quality scientific applications from computational fluid dynamics (CFD) used by NASA scientists indicate that HT generally improves processor resource utilization efficiency, but does not necessarily translate into overall application performance gain.
Citations: 48
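The distinction the abstract draws — better resource utilization under HT without a matching application speedup — can be made concrete with a utilization proxy derived from PMU counters. The counter values below are invented, and instructions-per-cycle is only a stand-in for the paper's (unspecified) efficiency metric.

```python
# Hedged sketch: compare an ST run and an HT run of the same core using a
# PMU-derived utilization proxy (IPC) alongside wall-clock speedup.
# All counter values are hypothetical.

def ipc(instructions_retired, unhalted_core_cycles):
    """Instructions per unhalted core cycle, a common utilization proxy."""
    return instructions_retired / unhalted_core_cycles

# HT retires more instructions per cycle by filling pipeline bubbles,
# yet the wall-clock gain can be much smaller (shared caches, ports).
st = {'inst': 8.0e9, 'cycles': 6.0e9, 'seconds': 2.4}
ht = {'inst': 9.6e9, 'cycles': 6.0e9, 'seconds': 2.2}

util_gain = ipc(ht['inst'], ht['cycles']) / ipc(st['inst'], st['cycles'])
speedup = st['seconds'] / ht['seconds']
print(round(util_gain, 2), round(speedup, 2))
```

In this made-up example the utilization metric improves more than the runtime does, which mirrors the paper's qualitative finding.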
Enabling CUDA acceleration within virtual machines using rCUDA
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152718
J. Duato, Antonio J. Peña, F. Silla, J. C. Fernández, R. Mayo, E. S. Quintana‐Ortí
Abstract: The hardware and software advances of Graphics Processing Units (GPUs) have favored the development of GPGPU (General-Purpose Computation on GPUs) and its adoption in many scientific, engineering, and industrial areas. Thus, GPUs are increasingly being introduced in high-performance computing systems as well as in datacenters. Virtualization technologies are also receiving rising interest in these domains because of the acquisition and maintenance savings they bring. Several works on GPU virtualization exist; however, there is no standard solution allowing access to GPGPU capabilities from virtual machine environments such as VMware, Xen, VirtualBox, or KVM. This lack of a standard solution is delaying the integration of GPGPU into these domains. In this paper, we propose a first step towards a general and open-source approach for using GPGPU features within VMs. In particular, we describe the use of rCUDA, a GPGPU virtualization framework, to permit the execution of GPU-accelerated applications within virtual machines (VMs), thus enabling GPGPU capabilities in any virtualized environment. Our experiments with rCUDA in the context of KVM and VirtualBox on a system equipped with two NVIDIA GeForce 9800 GX2 cards illustrate the overhead introduced by the rCUDA middleware and demonstrate the feasibility and scalability of this general virtualization solution. Experimental results show that the overhead is proportional to the dataset size, while the scalability is similar to that of the native environment.
Citations: 80
Spectral evolution simulation on leading multi-socket, multicore platforms
Pub Date: 2011-12-18 | DOI: 10.1109/HIPC.2011.6152730
S. Tabik, P. Mimica, O. Plata, E. Zapata, L. F. Romero
Abstract: Spectral evolution simulations based on observed Very Long Baseline Interferometry (VLBI) radio maps are of paramount importance for understanding the nature of extragalactic objects in astrophysics. This work analyzes the performance and scaling of a spectral evolution algorithm on three leading multi-socket, multi-core architectures. We evaluate three parallel models with different levels of data sharing: a sharing approach, a privatizing approach, and a hybrid approach. Our experiments show that the data-privatizing model is reasonably efficient on medium-scale multi-socket, multi-core systems (up to 48 cores), while, regardless of algorithmic and scheduling optimizations, the sharing approach is unable to reach acceptable scalability on more than one socket. The hybrid model with a specific level of data sharing gives the best scalability over all the considered multi-socket, multi-core systems.
Citations: 0
High-level template for the task-based parallel wavefront pattern
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152717
Antonio J. Dios, R. Asenjo, A. Navarro, F. Corbera, E. Zapata
Abstract: With the arrival of multicore processors, it has become a matter of urgency to introduce parallel programming into mainstream computing. In emerging applications, one class of computational problem that challenges programmers is the wavefront pattern. A particular characteristic of this pattern is multi-dimensional streaming of computations that must follow a dependence pattern. The modern software stack for multicore systems offers task-based programming libraries such as TBB (Threading Building Blocks) that provide an execution model based on lightweight asynchronous tasks. We suggest that TBB provides useful features for improving the scalability of these kinds of codes, but at the cost of leaving some low-level task management details to the programmer. In this paper, we discuss such low-level task management issues and incorporate them into a high-level TBB-based template. The goal of the template is to improve programmer productivity so that a non-expert user can easily code complex wavefront problems without having to deal with task creation, synchronization, or scheduling mechanisms. With our template, the user only has to specify a definition file with the wavefront dependence pattern and the function that each task has to execute. We also describe our experience using the TBB template to code four complex, real wavefront problems. In these experiments, the template implementations reduced the programming effort by 25% to 50%, at the cost of increasing overhead by up to 5% compared to manual implementations of the same problems.
Citations: 11
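The task management that such a template hides — per-cell dependency counters, spawning a task when its count reaches zero — is the textbook way to run a wavefront. Below is a minimal sequential simulation of that scheme in Python (the paper's template is C++/TBB) for the common 2-D case where each cell depends on its north and west neighbours; the dependence pattern and body function stand in for the user's definition file.

```python
# Hedged sketch of counter-based wavefront scheduling: a cell becomes ready
# when all its predecessors (north and west neighbours here) have finished.

from collections import deque

def wavefront(n, m, body):
    """Run body(i, j) over an n x m grid in a dependence-respecting order."""
    counters = {(i, j): (i > 0) + (j > 0) for i in range(n) for j in range(m)}
    ready = deque([(0, 0)])                    # only the corner has no deps
    order = []
    while ready:
        i, j = ready.popleft()
        body(i, j)
        order.append((i, j))
        for succ in ((i + 1, j), (i, j + 1)):  # release south/east successors
            if succ in counters:
                counters[succ] -= 1
                if counters[succ] == 0:        # all dependencies satisfied
                    ready.append(succ)
    return order

result = {}
order = wavefront(3, 3, lambda i, j: result.setdefault((i, j), i + j))
print(len(order))
```

In a real task library the `ready` queue becomes task spawns and the counter decrements must be atomic; that bookkeeping is exactly what the paper's template takes off the programmer's hands.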
Parallel implementation of MOPSO on GPU using OpenCL and CUDA
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152719
J. Arun, Manoj Mishra, Sheshasayee V. Subramaniam
Abstract: GPUs have brought supercomputing to the desk by offering hundreds of processing cores at very low cost. This has motivated researchers to implement and test parallel solutions to compute-intensive problems on GPUs. Most real-world optimization problems are NP-hard and therefore compute intensive, and meta-heuristics are frequently used to solve them. Multi-Objective Particle Swarm Optimization (MOPSO) is one meta-heuristic that has attracted many researchers due to its accuracy and simplicity. In the last couple of years, many parallel implementations of MOPSO have been proposed in the literature; however, none has implemented and tested the performance of MOPSO on a GPU. In this paper, we describe our implementation of MOPSO on GPU using CUDA and OpenCL, two of the most popular GPU frameworks for writing parallel applications. The performance of both implementations has been compared with a sequential implementation of MOPSO through simulations. Results show that performance can be improved by 90 percent using these parallel implementations. We then present a parallel archiving technique and implement MOPSO on GPU with the proposed archiving technique using CUDA. Simulation results show that the parallel archiving technique further improves the speedup.
Citations: 16
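The per-particle update at the heart of (MO)PSO is what maps naturally onto one GPU thread per particle. The Python sketch below shows only that core velocity/position update with conventional textbook coefficients; the multi-objective leader selection and the paper's parallel archiving technique are omitted, and none of the constants come from the paper.

```python
# Hedged sketch of the PSO particle update (the per-particle kernel body):
# v <- W*v + C1*r1*(pbest - x) + C2*r2*(gbest - x);  x <- x + v

import random

W, C1, C2 = 0.7, 1.5, 1.5   # inertia and acceleration coefficients (textbook)

def step(pos, vel, pbest, gbest, rng):
    """One update of a single particle's position and velocity."""
    new_pos, new_vel = [], []
    for x, v, pb, gb in zip(pos, vel, pbest, gbest):
        v = W * v + C1 * rng.random() * (pb - x) + C2 * rng.random() * (gb - x)
        new_vel.append(v)
        new_pos.append(x + v)
    return new_pos, new_vel

rng = random.Random(0)       # seeded for reproducibility
pos, vel = [1.0, -2.0], [0.0, 0.0]
pos, vel = step(pos, vel, pbest=[0.0, 0.0], gbest=[0.0, 0.0], rng=rng)
print(len(pos))
```

On a GPU, each thread would run this body for its own particle with device-side random numbers; in MOPSO, `gbest` is replaced by a leader drawn from the non-dominated archive, which is where the paper's parallel archiving comes in.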
A multiresolution data model for improving simulation I/O performance
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152747
A. Foulks, R. Bergeron
Abstract: Numerical simulations running on very large high-performance computing clusters still suffer from the I/O bottleneck. The cost of communication can overwhelm the cost of computation, and scales inversely with the number of processors used in the cluster. In previous work we developed a multiresolution data model to help improve performance for visualizations of very large multi-dimensional scientific data sets. In our approach, the data is represented as a multi-level hierarchy, and reconstructive error analysis is used to identify the regions in the data where data loss is greatest. We have incorporated this data model into the OpenGGCM solar wind simulation environment. In this paper, we demonstrate that this approach can reduce I/O and improve the overall performance of a large numerical simulation environment.
Citations: 0
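The two ingredients named in this abstract — a multi-level hierarchy built by coarsening, and reconstructive error analysis to flag regions with the greatest data loss — can be shown in one dimension. The averaging rule and the max-error criterion below are illustrative choices, not necessarily the paper's; its data is multi-dimensional.

```python
# Hedged sketch: coarsen a field by pairwise averaging and score each
# region by the reconstruction error incurred if only the coarse value
# is stored. High-error regions would keep their fine-level data.

def coarsen(data):
    """One hierarchy level: halve resolution by averaging adjacent pairs."""
    return [(data[i] + data[i + 1]) / 2.0 for i in range(0, len(data), 2)]

def region_errors(data):
    """Max absolute error per pair when reconstructed from the coarse mean."""
    coarse = coarsen(data)
    return [max(abs(data[2 * i] - c), abs(data[2 * i + 1] - c))
            for i, c in enumerate(coarse)]

data = [1.0, 1.0, 2.0, 6.0, 3.0, 3.0, 0.0, 8.0]   # made-up 1-D field
errs = region_errors(data)
worst = max(range(len(errs)), key=errs.__getitem__)
print(worst, errs[worst])
```

Smooth regions (equal pairs here) reconstruct exactly and can be written at low resolution, which is where the I/O savings come from; only the high-error regions need full-resolution output.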
Multi-threaded UPC runtime with network endpoints: Design alternatives and evaluation on multi-core architectures
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152734
Miao Luo, Jithin Jose, S. Sur, D. Panda
Abstract: Multi-core architectures are becoming more and more popular in the High End Computing (HEC) era. Recent trends toward high-productivity computing, in conjunction with advanced multi-core and network architectures, have increased interest in Partitioned Global Address Space (PGAS) languages due to their high productivity and broad applicability. Unified Parallel C (UPC) is an emerging PGAS language. In this paper, we compare design alternatives for a high-performance, scalable UPC runtime on multi-core nodes from several aspects: performance, portability, interoperability, and support for irregular parallelism. Based on our analysis, we present a novel design of a multi-threaded UPC runtime that supports multiple network endpoints. Our runtime dramatically decreases network access contention, yielding 80% lower latency for fine-grained memget/memput operations and almost double the bandwidth for medium-size messages compared to the multi-threaded Berkeley UPC Runtime. Furthermore, the multi-endpoint design opens new doors for runtime optimizations, such as support for irregular parallelism: we utilize true network helper threads and load balancing via work stealing in the runtime. Our evaluation with novel benchmarks shows that our runtime can achieve 90% of peak efficiency, a factor of 1.3 better than the existing Berkeley UPC Runtime. To the best of our knowledge, this is the first work to propose a multi-network-endpoint-capable UPC runtime design for modern multi-core systems.
Citations: 15