2011 18th International Conference on High Performance Computing: Latest Publications

High performance cache block replication using re-reference probability in CMPs
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152739
Jinglei Wang, Dongsheng Wang, Haixia Wang, Y. Xue
Abstract: In a Chip Multiprocessor (CMP) with shared caches, the last-level cache (LLC) is distributed across all the cores. This increases on-chip communication delay and thus degrades processor performance. The LLC is also quite inefficient because of the large number of dead blocks, i.e., blocks that will not be referenced again before they are evicted. Replication can be provided in shared caches by copying blocks evicted from a core into that core's local LLC slice, using the space occupied by dead blocks to reduce access latency. However, naively replicating every evicted block yields limited performance benefit, because such replication does not take into account the reuse probability of the replicated blocks. This paper proposes Adaptive Probability Replication (APR), a mechanism that counts each block's accesses in the L2 cache slices and monitors how many evicted blocks have each access count, in order to estimate the re-reference probability of blocks over their lifetime at runtime. Using the predicted re-reference probability, APR applies a probabilistic replication policy and a probabilistic insertion policy: blocks are replicated with a probability, and inserted at a position, corresponding to their re-reference probability. We evaluate APR on a 16-core tiled CMP using the SPLASH-2 and PARSEC benchmarks. APR improves performance by 21% on average over a conventional shared cache design, by 17% over Victim Replication (VR), by 10% over Adaptive Selective Replication (ASR), and by 15% over Reactive NUCA (R-NUCA). The additional hardware cost of APR is well under 1% of an L2 cache slice.
Citations: 4
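The probabilistic replication decision described in the abstract can be illustrated with a small software model: track, per access-count bucket, how many evicted blocks were later reused, use that ratio as the estimated re-reference probability, and replicate with that probability. The class, counter widths, and bucket scheme below are assumptions made for illustration, not the authors' hardware design.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <random>

// Minimal software model of a probabilistic replication decision (illustrative only).
// Assumption: access counts saturate at 3 (2-bit counters), a common choice in
// re-reference prediction schemes; the paper's exact counter width is not stated here.
class ReplicationController {
public:
    // Called when a block with `accessCount` accesses is evicted from a remote LLC slice.
    // Returns true if the block should be replicated into the local slice.
    bool shouldReplicate(uint8_t accessCount, std::mt19937& rng) {
        uint8_t bucket = std::min<uint8_t>(accessCount, kMaxCount);
        ++evicted_[bucket];                      // histogram of evictions per access count
        double p = reReferenceProbability(bucket);
        std::bernoulli_distribution coin(p);     // replicate with probability p
        return coin(rng);
    }

    // Called when a replicated block with this eviction-time access count is re-referenced.
    void recordReuse(uint8_t accessCount) {
        ++reused_[std::min<uint8_t>(accessCount, kMaxCount)];
    }

private:
    static constexpr uint8_t kMaxCount = 3;

    // Estimated probability that a block evicted with this access count is referenced again.
    double reReferenceProbability(uint8_t bucket) const {
        if (evicted_[bucket] == 0) return 1.0;   // optimistic default with no history
        return std::min(1.0, static_cast<double>(reused_[bucket]) / evicted_[bucket]);
    }

    std::array<uint64_t, kMaxCount + 1> evicted_{};  // evictions seen per access-count bucket
    std::array<uint64_t, kMaxCount + 1> reused_{};   // re-references seen per bucket
};
```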
Parallel multiple precision division by a single precision divisor
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152712
Niall Emmart, C. Weems
Abstract: We report an algorithm for division of a multi-precision integer by a single-precision value using a graphics processing unit (GPU). Our algorithm combines a parallel version of Jebelean's exact division algorithm with a left-to-right algorithm for computing the borrow chain, relaxing the requirement of exactness. We also employ Takahashi's recently reported cyclic reduction technique [10] for GPU division to further enhance performance. The result is that our algorithm is asymptotically faster, at O(n/p + log p), than Takahashi's algorithm at O(n/p log p). We report results for dividends with precisions of 1024, 2048, and 4096 bits running on an NVIDIA GTX 480, and show that, for non-constant divisors, our algorithm is 20% slower at 1024 bits (due to startup overhead), 40% faster at 2048 bits, and 2.5 times faster at 4096 bits. For division by constants, with precomputed tables, our algorithm is faster at all sizes, with speedups ranging from 2.3 to 6 times.
Citations: 4
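For concreteness, here is the standard sequential loop for dividing a multi-precision integer by a single-precision divisor; the carried remainder is what makes the operation inherently serial and what the paper's GPU algorithm (Jebelean's exact division plus a left-to-right borrow chain) works around. The limb width and function name below are illustrative choices, not taken from the paper.

```cpp
#include <cstdint>
#include <vector>

// Sequential long division of a multi-precision integer (base 2^32 limbs, most-significant
// limb first) by a single-precision (one-limb) nonzero divisor. This is the O(n) serial
// baseline that the paper's GPU algorithm parallelizes; the paper's method itself differs.
std::vector<uint32_t> divideBySingle(const std::vector<uint32_t>& dividend,
                                     uint32_t divisor,
                                     uint32_t& remainder) {
    std::vector<uint32_t> quotient(dividend.size());
    uint64_t rem = 0;
    for (size_t i = 0; i < dividend.size(); ++i) {
        uint64_t cur = (rem << 32) | dividend[i];     // bring down the next limb
        quotient[i] = static_cast<uint32_t>(cur / divisor);
        rem = cur % divisor;                          // carried remainder serializes the loop
    }
    remainder = static_cast<uint32_t>(rem);
    return quotient;
}
```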
A machine learning-based approach for thread mapping on transactional memory applications
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152736
M. Castro, L. F. Góes, Christiane Pousa Ribeiro, M. Cole, Marcelo H. Cintra, J. Méhaut
Abstract: Thread mapping has been extensively used as a technique to efficiently exploit the memory hierarchy on modern chip multiprocessors. It places threads on cores so as to amortize memory latency and/or reduce memory contention. However, efficient thread mapping relies upon matching application behavior with system characteristics. Software Transactional Memory (STM) applications in particular introduce another dimension, owing to their runtime system support: existing STM systems implement several conflict detection and resolution mechanisms, and an STM application can behave differently under each combination of these mechanisms. In this paper we propose a machine learning-based approach to automatically infer a suitable thread mapping strategy for transactional memory applications. First, we profile several STM applications from the STAMP benchmark suite, considering application, STM system, and platform features, to build a set of input instances. Then this data feeds a machine learning algorithm, which produces a decision tree able to predict the most suitable thread mapping strategy for new, unobserved instances. Results show that our approach improves performance by up to 18.46% compared to the worst case and by up to 6.37% over the default Linux thread mapping strategy.
Citations: 53
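As a point of reference for the approach described above, the sketch below shows the kind of artifact the training step produces: a decision tree that maps profiled features to a thread mapping strategy. Everything in it (the feature set, thresholds, and strategy names) is hypothetical; the paper derives its tree automatically from STAMP profiling data rather than writing it by hand.

```cpp
// Illustrative stand-in for a learned decision tree; features and thresholds are invented.
struct StmProfile {
    double abortRatio;         // fraction of transactions that abort
    double sharedFootprintMB;  // working set shared between threads
    bool   eagerConflictDetection;
};

// Thread mapping strategies commonly distinguished in the literature.
enum class ThreadMapping { Compact, Scatter, RoundRobin };

ThreadMapping predictMapping(const StmProfile& p) {
    // High contention: keep conflicting threads close so they share cache and aborts are cheap.
    if (p.abortRatio > 0.30)
        return p.eagerConflictDetection ? ThreadMapping::Compact : ThreadMapping::RoundRobin;
    // Low contention but a large shared footprint: spread threads to use more aggregate cache.
    if (p.sharedFootprintMB > 8.0)
        return ThreadMapping::Scatter;
    // Otherwise a default placement is usually adequate.
    return ThreadMapping::RoundRobin;
}
```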
Optimizing multicore performance with message driven execution: A case study
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152744
Pritish Jetley, L. Kalé
Abstract: With the growing amount of parallelism available on today's multicore processors, achieving good performance at scale is challenging. We approach this issue through an alternative to traditional thread-based paradigms for writing shared-memory programs, namely message-driven multicore programming. We study a number of optimizations that improve the efficiency of message-driven programs on multicore architectures. In particular, we focus on the following runtime system-enabled optimizations: (i) grainsize control to effect a good concurrency-overhead tradeoff, (ii) dynamic balancing of processor load, (iii) low-overhead, asynchronous communication for lock-free and message-driven execution, and (iv) communication reduction through a novel chunked shared array abstraction. The practical impact of these optimizations is quantified through a parallel kd-tree construction program written in the message-driven paradigm. A comparison of the optimized code with a state-of-the-art parallel kd-tree construction program is also presented.
Citations: 1
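Optimization (i) above, grainsize control, can be illustrated with a short sketch: recursive decomposition stops once a work unit falls below a tunable grain, trading scheduling overhead against available concurrency. The task queue, kGrainSize value, and helper names are assumptions made for illustration; the paper's runtime is message driven rather than this toy queue.

```cpp
#include <cstddef>
#include <functional>
#include <queue>

constexpr std::size_t kGrainSize = 4096;       // hypothetical tuning knob

std::queue<std::function<void()>> taskQueue;   // stands in for a message-driven scheduler

void decompose(std::size_t begin, std::size_t end,
               const std::function<void(std::size_t, std::size_t)>& processLeaf) {
    if (end - begin <= kGrainSize) {
        processLeaf(begin, end);               // small enough: do the work directly
        return;
    }
    std::size_t mid = begin + (end - begin) / 2;
    // Each half becomes an independent unit of work; a larger grain means fewer,
    // cheaper units but less concurrency for the load balancer to exploit.
    taskQueue.push([=] { decompose(begin, mid, processLeaf); });
    taskQueue.push([=] { decompose(mid, end, processLeaf); });
}

void drain() {                                 // stand-in for the runtime's scheduler loop
    while (!taskQueue.empty()) {
        auto task = std::move(taskQueue.front());
        taskQueue.pop();
        task();
    }
}
```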
Dynamic selection of tile sizes
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152742
Sanket Tavarageri, L. Pouchet, J. Ramanujam, A. Rountev, P. Sadayappan
Abstract: Tiling is a key program transformation to achieve effective data reuse. But the performance of tiled programs can vary considerably with different tile sizes. Hence the selection of good tile sizes is crucial. Although there has been considerable research on analytical models for selecting tile sizes, they have not been shown to be effective in finding optimal tile sizes across a range of programs and target architectures. Auto-tuning is a viable alternative that is often used in practice, and involves the execution of different combinations of tile sizes in a systematic fashion to find the best ones. But this is sometimes infeasible, for instance when the program is to be run on unknown platforms (e.g., cloud environments). We propose a novel approach for generating code to enable dynamic tile size selection, based on monitoring the performance of a few loop iterations. The selection operates at run time on the "production" run, without any a priori knowledge of the execution environment. We discuss the theory and implementation of a parametric tiled code generator that enables run-time tile size tuning and describe a search strategy to determine effective tile sizes. Experimental results demonstrate the effectiveness of the approach.
Citations: 24
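The core run-time mechanism described in the abstract, timing a few iterations under each candidate tile size during the production run and keeping the fastest, can be sketched as follows. The candidate list, single tile parameter, and function names are simplifications assumed for illustration; the paper's parametric tiled code generator and search strategy are considerably more general.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Time a small slice of real work under each candidate tile size and return the fastest.
// The remaining iterations of the tiled loop nest would then run with the chosen size.
std::size_t pickTileSize(const std::vector<std::size_t>& candidates,
                         const std::function<void(std::size_t /*tile*/,
                                                  std::size_t /*iters*/)>& runTiledChunk,
                         std::size_t probeIters = 4) {
    using clock = std::chrono::steady_clock;
    std::size_t best = candidates.front();
    auto bestTime = clock::duration::max();
    for (std::size_t tile : candidates) {
        auto t0 = clock::now();
        runTiledChunk(tile, probeIters);     // execute a few iterations of the real loop
        auto elapsed = clock::now() - t0;
        if (elapsed < bestTime) { bestTime = elapsed; best = tile; }
    }
    return best;
}
```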
Scheduling diverse high performance computing systems with the goal of maximizing utilization
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152723
Tabitha K. Samuel, Troy Baer, R. G. Brook, M. Ezell, P. Kovatch
Abstract: High performance computing resources attract a wide range of computational users and a correspondingly wide range of job widths and lengths. For example, on the petaflop Cray XT5 machine, Kraken, users submit jobs ranging from a few hundred cores (capacity computing) to over one hundred thousand cores (capability computing). Traditionally it has been difficult to maintain high utilization while juggling such a diverse job mix. This paper explores four unique approaches to achieving our scheduling goal of maximizing utilization on four distinct resources at the National Institute for Computational Sciences. The resources include the petaflop machine Kraken; Athena, a 166 TF Cray XT4; a 4 TB shared-memory NUMA machine called Nautilus; and a GPU cluster called Keeneland.
Citations: 8
Building algorithmically nonstop fault tolerant MPI programs
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152716
Rui Wang, Erlin Yao, Mingyu Chen, Guangming Tan, P. Balaji, Darius Buntinas
Abstract: With the growing scale of high-performance computing (HPC) systems, today and even more so tomorrow, faults are the norm rather than the exception. HPC applications typically tolerate fail-stop failures under the stop-and-wait scheme, in which, even if only one processor fails, the whole system has to stop and wait for the recovery of the corrupted data. It is now a more-or-less accepted fact that the stop-and-wait scheme will not scale to the next generation of HPC systems. Inspired by the previous stop-and-wait algorithm-based fault tolerance (ABFT) recovery technique, we propose in this paper a nonstop fault tolerance scheme at the application level and describe its implementation. When a failure occurs during the execution of an application, we do not stop to wait for the recovery of the corrupted node; instead, we replace it with the corresponding redundant node and continue the execution. At the end of execution, the correct solution can be recovered algorithmically at very low cost. In order to implement the scheme, some new fault-tolerant features of the Message Passing Interface (MPI) have been investigated and utilized in the MPICH implementation of MPI. We also describe a case study using High Performance Linpack (HPL) with these new features and evaluate the performance of both our new scheme and ABFT recovery. Experimental results show the advantage of our new scheme over ABFT recovery even at small scale.
Citations: 19
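The recovery principle the scheme builds on can be shown in a few lines: with a checksum row appended to a matrix, the data held by a single failed node can be reconstructed from the checksum and the surviving rows. This is a generic sketch of classical ABFT checksum encoding under an assumed data layout, not the paper's nonstop scheme or its MPICH/HPL implementation.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;   // rows x cols, last row = column checksums

// Append a checksum row: each entry is the sum of the data entries in its column.
void appendChecksumRow(Matrix& a) {
    std::size_t cols = a.front().size();
    std::vector<double> checksum(cols, 0.0);
    for (const auto& row : a)
        for (std::size_t j = 0; j < cols; ++j) checksum[j] += row[j];
    a.push_back(checksum);
}

// Rebuild data row `lost` (not the checksum row) in place after its contents were corrupted:
// lost entry = column checksum minus the sum of the surviving data entries in that column.
void recoverRow(Matrix& a, std::size_t lost) {
    std::size_t cols = a.front().size();
    std::size_t checksumRow = a.size() - 1;
    for (std::size_t j = 0; j < cols; ++j) {
        double sumOfOthers = 0.0;
        for (std::size_t i = 0; i + 1 < a.size(); ++i)    // all data rows except the lost one
            if (i != lost) sumOfOthers += a[i][j];
        a[lost][j] = a[checksumRow][j] - sumOfOthers;
    }
}
```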
Supporting computational data model representation with high-performance I/O in parallel netCDF
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152746
Kui Gao, Chen Jin, A. Choudhary, W. Liao
Abstract: Parallel computational scientific applications are often characterized by their computation and communication patterns. From a storage and I/O perspective, these applications can also be grouped into separate data models based on the way data is organized and accessed during simulation, analysis, and visualization. Parallel netCDF is a popular library used in many scientific applications to store scientific datasets, and it provides high-performance parallel I/O. Although the metadata-rich netCDF file format can effectively store and describe regular multi-dimensional array datasets, it does not address the full range of current and future computational science data models. In this paper, we present a new storage scheme in Parallel netCDF to represent a broad variety of data models used in modern computational scientific applications. This scheme also allows concurrent metadata construction for different data objects from multiple groups of application processes, an important feature for obtaining a high degree of I/O parallelism with data models that exhibit irregular data distribution. Furthermore, we employ non-blocking I/O functions to aggregate irregularly distributed data requests into large, contiguous data requests and thereby achieve high-performance I/O. Using an adaptive mesh refinement data model as an example, we demonstrate that the proposed scheme produces scalable performance results for both data and metadata creation and access.
Citations: 8
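The non-blocking aggregation mentioned in the abstract follows PnetCDF's public iput/wait pattern: many small, irregular requests are posted and then flushed together so the library can combine them into large contiguous accesses. The sketch below shows that usage pattern only; it assumes ncid and varid come from an already-defined file and variable, omits error handling, and does not reflect the paper's internal storage-scheme changes.

```cpp
#include <cstddef>
#include <mpi.h>
#include <pnetcdf.h>
#include <vector>

// One irregular piece of a 2-D variable owned by this process.
struct SubBlock {
    MPI_Offset start[2];
    MPI_Offset count[2];
    const double* data;
};

void writeIrregularBlocks(int ncid, int varid, const std::vector<SubBlock>& blocks) {
    std::vector<int> requests(blocks.size());
    for (std::size_t i = 0; i < blocks.size(); ++i) {
        // Post a non-blocking write for one irregular piece; nothing hits the file yet.
        ncmpi_iput_vara_double(ncid, varid, blocks[i].start, blocks[i].count,
                               blocks[i].data, &requests[i]);
    }
    std::vector<int> statuses(blocks.size());
    // Flush all pending requests at once; PnetCDF can aggregate them into large,
    // contiguous accesses before handing them to MPI-IO.
    ncmpi_wait_all(ncid, static_cast<int>(requests.size()),
                   requests.data(), statuses.data());
}
```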
Hybrid algorithms for list ranking and graph connected components
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152655
D. Banerjee, Kishore Kothapalli
Abstract: The advent of multicore and many-core architectures has seen them deployed to speed up computations across several disciplines and application areas. Prominent examples include semi-numerical algorithms such as sorting, graph algorithms, image processing, scientific computations, and the like. In particular, using GPUs for general-purpose computation has attracted a lot of attention, given that GPUs can deliver more than one TFLOP of computing power at very low prices. In this work, we use a new model of multicore computing called hybrid multicore computing, in which the computation is performed simultaneously on a control device, such as a CPU, and an accelerator, such as a GPU. To this end, we use two case studies to explore the algorithmic and analytical issues in hybrid multicore computing. Our case studies involve two different ways of designing hybrid multicore algorithms. The main contribution of this paper is to address the issues related to the design of hybrid solutions. We show that our hybrid algorithm for list ranking is faster by 50% compared to the best known implementation [Z. Wei, J. JaJa; IPDPS 2010]. Similarly, our hybrid algorithm for graph connected components is faster by 25% compared to the best known GPU implementation [26].
Citations: 22
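To make the list-ranking case study concrete, here is the classical pointer-jumping formulation of the problem: given successor links, compute each node's distance to the tail of the list. It is the textbook primitive that GPU and hybrid CPU+GPU list-ranking algorithms start from, not the authors' hybrid algorithm; the self-loop tail convention and sequential loop are assumptions of this sketch.

```cpp
#include <cstddef>
#include <vector>

// Synchronous pointer jumping (Wyllie-style), written sequentially for clarity.
// next[i] == i marks the tail of the list; the result is each node's distance to the tail.
std::vector<std::size_t> listRank(std::vector<std::size_t> next) {
    const std::size_t n = next.size();
    std::vector<std::size_t> rank(n, 0);
    for (std::size_t i = 0; i < n; ++i)
        if (next[i] != i) rank[i] = 1;               // one hop unless already at the tail
    bool changed = true;
    while (changed) {                                // O(log n) rounds of pointer jumping
        changed = false;
        std::vector<std::size_t> nrank = rank, nnext = next;
        for (std::size_t i = 0; i < n; ++i) {
            if (next[i] != next[next[i]]) {
                nrank[i] = rank[i] + rank[next[i]];  // absorb the successor's partial rank
                nnext[i] = next[next[i]];            // jump over the successor
                changed = true;
            }
        }
        rank.swap(nrank);
        next.swap(nnext);
    }
    return rank;
}
```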
Maximizing throughput of jobs with multiple resource requirements
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152721
Venkatesan T. Chakaravarthy, Sambuddha Roy, Yogish Sabharwal, Neha Sengupta
Abstract: We consider the problem of scheduling jobs that require multiple resources such as memory, bandwidth, and processors. For each job, the input specifies a start time, a finish time, and a profit; the input also specifies the job's requirement for each resource. Each resource has a fixed capacity (called its bandwidth). A feasible solution is a subset of jobs such that, for any timeslot and any resource, the total requirement of the jobs active at that timeslot does not exceed the capacity of the resource. The goal is to maximize the total profit of the selected jobs. We present an approximation algorithm with provable guarantees, together with effective heuristics, for this problem. The algorithm has an approximation ratio of O(r), where r is the number of resources. We present an experimental evaluation of our algorithms that demonstrates their effectiveness.
Citations: 0
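The feasibility condition in the problem statement above translates directly into code: a selected subset of jobs is feasible only if, at every timeslot and for every resource, the active jobs' total demand stays within capacity. The checker below encodes just that definition, under assumed integer timeslots and half-open intervals; it is illustrative and unrelated to the paper's O(r)-approximation algorithm itself.

```cpp
#include <cstddef>
#include <vector>

struct Job {
    int start;
    int finish;                   // job is active in timeslots [start, finish)
    double profit;
    std::vector<double> demand;   // demand[k] = requirement for resource k
};

// Returns true iff no resource is oversubscribed at any timeslot in [0, horizon).
bool isFeasible(const std::vector<Job>& selected,
                const std::vector<double>& capacity,
                int horizon) {
    const std::size_t r = capacity.size();
    for (int t = 0; t < horizon; ++t) {
        std::vector<double> used(r, 0.0);
        for (const Job& job : selected) {
            if (job.start <= t && t < job.finish)
                for (std::size_t k = 0; k < r; ++k) used[k] += job.demand[k];
        }
        for (std::size_t k = 0; k < r; ++k)
            if (used[k] > capacity[k]) return false;   // some resource exceeds its bandwidth
    }
    return true;
}
```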