2011 18th International Conference on High Performance Computing: Latest Publications

High performance cache block replication using re-reference probability in CMPs
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152739
Jinglei Wang, Dongsheng Wang, Haixia Wang, Y. Xue
Abstract: In a Chip Multiprocessor (CMP) with shared caches, the last-level cache (LLC) is distributed across all the cores. This increases on-chip communication delay and thus degrades processor performance. The LLC is also quite inefficient because of the large number of dead blocks, i.e., blocks that will not be referenced again before they are evicted. Replication can be provided in shared caches by copying blocks evicted from a core into that core's local LLC slice, using the space occupied by dead blocks to reduce access latency. However, naively replicating every evicted block yields limited performance benefit, because such replication does not take into account the reuse probability of the replicated blocks. This paper proposes Adaptive Probability Replication (APR), a mechanism that counts each block's accesses in the L2 cache slices and monitors how many evicted blocks have each access count, in order to estimate the re-reference probability of blocks over their lifetime at runtime. Using the predicted re-reference probability, APR applies a probabilistic replication policy and a probabilistic insertion policy: blocks are replicated with a probability, and inserted at a position, corresponding to their re-reference probability. We evaluate APR on a 16-core tiled CMP using the SPLASH-2 and PARSEC benchmarks. APR improves performance by 21% on average over a conventional shared cache design, by 17% over Victim Replication (VR), by 10% over Adaptive Selective Replication (ASR), and by 15% over Reactive NUCA (R-NUCA). The additional hardware cost of APR is well under 1% of an L2 cache slice.
Citations: 4
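The probabilistic replication decision described in the abstract can be illustrated with a small software model: track, per access-count bucket, how many evicted blocks were later reused, use that ratio as the estimated re-reference probability, and replicate with that probability. The class, counter widths, and bucket scheme below are assumptions made for illustration, not the authors' hardware design.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <random>

// Minimal software model of a probabilistic replication decision (illustrative only).
// Assumption: access counts saturate at 3 (2-bit counters), a common choice in
// re-reference prediction schemes; the paper's exact counter width is not stated here.
class ReplicationController {
public:
    // Called when a block with `accessCount` accesses is evicted from a remote LLC slice.
    // Returns true if the block should be replicated into the local slice.
    bool shouldReplicate(uint8_t accessCount, std::mt19937& rng) {
        uint8_t bucket = std::min<uint8_t>(accessCount, kMaxCount);
        ++evicted_[bucket];                      // histogram of evictions per access count
        double p = reReferenceProbability(bucket);
        std::bernoulli_distribution coin(p);     // replicate with probability p
        return coin(rng);
    }

    // Called when a replicated block with this eviction-time access count is re-referenced.
    void recordReuse(uint8_t accessCount) {
        ++reused_[std::min<uint8_t>(accessCount, kMaxCount)];
    }

private:
    static constexpr uint8_t kMaxCount = 3;

    // Estimated probability that a block evicted with this access count is referenced again.
    double reReferenceProbability(uint8_t bucket) const {
        if (evicted_[bucket] == 0) return 1.0;   // optimistic default with no history
        return std::min(1.0, static_cast<double>(reused_[bucket]) / evicted_[bucket]);
    }

    std::array<uint64_t, kMaxCount + 1> evicted_{};  // evictions seen per access-count bucket
    std::array<uint64_t, kMaxCount + 1> reused_{};   // re-references seen per bucket
};
```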
Parallel multiple precision division by a single precision divisor
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152712
Niall Emmart, C. Weems
Abstract: We report an algorithm for division of a multi-precision integer by a single-precision value using a graphics processing unit (GPU). Our algorithm combines a parallel version of Jebelean's exact division algorithm with a left-to-right algorithm for computing the borrow chain, relaxing the requirement of exactness. We also employ Takahashi's recently reported cyclic reduction technique [10] for GPU division to further enhance performance. The result is that our algorithm is asymptotically faster, at O(n/p + log p), than Takahashi's algorithm at O(n/p log p). We report results for dividends with precisions of 1024, 2048, and 4096 bits running on an NVIDIA GTX 480, and show that, for non-constant divisors, our algorithm is 20% slower at 1024 bits (due to startup overhead), 40% faster at 2048 bits, and 2.5 times faster at 4096 bits. For division by constants, with precomputed tables, our algorithm is faster at all sizes, with speedups ranging from 2.3 to 6 times.
Citations: 4
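For concreteness, here is the standard sequential loop for dividing a multi-precision integer by a single-precision divisor; the carried remainder is what makes the operation inherently serial and what the paper's GPU algorithm (Jebelean's exact division plus a left-to-right borrow chain) works around. The limb width and function name below are illustrative choices, not taken from the paper.

```cpp
#include <cstdint>
#include <vector>

// Sequential long division of a multi-precision integer (base 2^32 limbs, most-significant
// limb first) by a single-precision (one-limb) nonzero divisor. This is the O(n) serial
// baseline that the paper's GPU algorithm parallelizes; the paper's method itself differs.
std::vector<uint32_t> divideBySingle(const std::vector<uint32_t>& dividend,
                                     uint32_t divisor,
                                     uint32_t& remainder) {
    std::vector<uint32_t> quotient(dividend.size());
    uint64_t rem = 0;
    for (size_t i = 0; i < dividend.size(); ++i) {
        uint64_t cur = (rem << 32) | dividend[i];     // bring down the next limb
        quotient[i] = static_cast<uint32_t>(cur / divisor);
        rem = cur % divisor;                          // carried remainder serializes the loop
    }
    remainder = static_cast<uint32_t>(rem);
    return quotient;
}
```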
A machine learning-based approach for thread mapping on transactional memory applications
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152736
M. Castro, L. F. Góes, Christiane Pousa Ribeiro, M. Cole, Marcelo H. Cintra, J. Méhaut
Abstract: Thread mapping has been extensively used as a technique to efficiently exploit the memory hierarchy on modern chip multiprocessors. It places threads on cores so as to amortize memory latency and/or reduce memory contention. However, efficient thread mapping relies upon matching application behavior with system characteristics. Software Transactional Memory (STM) applications in particular introduce another dimension, owing to their runtime system support: existing STM systems implement several conflict detection and resolution mechanisms, and an STM application can behave differently under each combination of these mechanisms. In this paper we propose a machine learning-based approach to automatically infer a suitable thread mapping strategy for transactional memory applications. First, we profile several STM applications from the STAMP benchmark suite, considering application, STM system, and platform features, to build a set of input instances. Then this data feeds a machine learning algorithm, which produces a decision tree able to predict the most suitable thread mapping strategy for new, unobserved instances. Results show that our approach improves performance by up to 18.46% compared to the worst case and by up to 6.37% over the default Linux thread mapping strategy.
Citations: 53
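As a point of reference for the approach described above, the sketch below shows the kind of artifact the training step produces: a decision tree that maps profiled features to a thread mapping strategy. Everything in it (the feature set, thresholds, and strategy names) is hypothetical; the paper derives its tree automatically from STAMP profiling data rather than writing it by hand.

```cpp
// Illustrative stand-in for a learned decision tree; features and thresholds are invented.
struct StmProfile {
    double abortRatio;         // fraction of transactions that abort
    double sharedFootprintMB;  // working set shared between threads
    bool   eagerConflictDetection;
};

// Thread mapping strategies commonly distinguished in the literature.
enum class ThreadMapping { Compact, Scatter, RoundRobin };

ThreadMapping predictMapping(const StmProfile& p) {
    // High contention: keep conflicting threads close so they share cache and aborts are cheap.
    if (p.abortRatio > 0.30)
        return p.eagerConflictDetection ? ThreadMapping::Compact : ThreadMapping::RoundRobin;
    // Low contention but a large shared footprint: spread threads to use more aggregate cache.
    if (p.sharedFootprintMB > 8.0)
        return ThreadMapping::Scatter;
    // Otherwise a default placement is usually adequate.
    return ThreadMapping::RoundRobin;
}
```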
Optimizing multicore performance with message driven execution: A case study
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152744
Pritish Jetley, L. Kalé
Abstract: With the growing amount of parallelism available on today's multicore processors, achieving good performance at scale is challenging. We approach this issue through an alternative to traditional thread-based paradigms for writing shared-memory programs, namely message-driven multicore programming. We study a number of optimizations that improve the efficiency of message-driven programs on multicore architectures. In particular, we focus on the following runtime system-enabled optimizations: (i) grainsize control to effect a good concurrency-overhead tradeoff, (ii) dynamic balancing of processor load, (iii) low-overhead, asynchronous communication for lock-free and message-driven execution, and (iv) communication reduction through a novel chunked shared array abstraction. The practical impact of these optimizations is quantified through a parallel kd-tree construction program written in the message-driven paradigm. A comparison of the optimized code with a state-of-the-art parallel kd-tree construction program is also presented.
Citations: 1
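Optimization (i) above, grainsize control, can be illustrated with a short sketch: recursive decomposition stops once a work unit falls below a tunable grain, trading scheduling overhead against available concurrency. The task queue, kGrainSize value, and helper names are assumptions made for illustration; the paper's runtime is message driven rather than this toy queue.

```cpp
#include <cstddef>
#include <functional>
#include <queue>

constexpr std::size_t kGrainSize = 4096;       // hypothetical tuning knob

std::queue<std::function<void()>> taskQueue;   // stands in for a message-driven scheduler

void decompose(std::size_t begin, std::size_t end,
               const std::function<void(std::size_t, std::size_t)>& processLeaf) {
    if (end - begin <= kGrainSize) {
        processLeaf(begin, end);               // small enough: do the work directly
        return;
    }
    std::size_t mid = begin + (end - begin) / 2;
    // Each half becomes an independent unit of work; a larger grain means fewer,
    // cheaper units but less concurrency for the load balancer to exploit.
    taskQueue.push([=] { decompose(begin, mid, processLeaf); });
    taskQueue.push([=] { decompose(mid, end, processLeaf); });
}

void drain() {                                 // stand-in for the runtime's scheduler loop
    while (!taskQueue.empty()) {
        auto task = std::move(taskQueue.front());
        taskQueue.pop();
        task();
    }
}
```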
Dynamic selection of tile sizes
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152742
Sanket Tavarageri, L. Pouchet, J. Ramanujam, A. Rountev, P. Sadayappan
Abstract: Tiling is a key program transformation to achieve effective data reuse. But the performance of tiled programs can vary considerably with different tile sizes. Hence the selection of good tile sizes is crucial. Although there has been considerable research on analytical models for selecting tile sizes, they have not been shown to be effective in finding optimal tile sizes across a range of programs and target architectures. Auto-tuning is a viable alternative that is often used in practice, and involves the execution of different combinations of tile sizes in a systematic fashion to find the best ones. But this is sometimes infeasible, for instance when the program is to be run on unknown platforms (e.g., cloud environments). We propose a novel approach for generating code to enable dynamic tile size selection, based on monitoring the performance of a few loop iterations. The selection operates at run time on the "production" run, without any a priori knowledge of the execution environment. We discuss the theory and implementation of a parametric tiled code generator that enables run-time tile size tuning and describe a search strategy to determine effective tile sizes. Experimental results demonstrate the effectiveness of the approach.
Citations: 24
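The core run-time mechanism described in the abstract, timing a few iterations under each candidate tile size during the production run and keeping the fastest, can be sketched as follows. The candidate list, single tile parameter, and function names are simplifications assumed for illustration; the paper's parametric tiled code generator and search strategy are considerably more general.

```cpp
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Time a small slice of real work under each candidate tile size and return the fastest.
// The remaining iterations of the tiled loop nest would then run with the chosen size.
std::size_t pickTileSize(const std::vector<std::size_t>& candidates,
                         const std::function<void(std::size_t /*tile*/,
                                                  std::size_t /*iters*/)>& runTiledChunk,
                         std::size_t probeIters = 4) {
    using clock = std::chrono::steady_clock;
    std::size_t best = candidates.front();
    auto bestTime = clock::duration::max();
    for (std::size_t tile : candidates) {
        auto t0 = clock::now();
        runTiledChunk(tile, probeIters);     // execute a few iterations of the real loop
        auto elapsed = clock::now() - t0;
        if (elapsed < bestTime) { bestTime = elapsed; best = tile; }
    }
    return best;
}
```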
Scheduling diverse high performance computing systems with the goal of maximizing utilization
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152723
Tabitha K. Samuel, Troy Baer, R. G. Brook, M. Ezell, P. Kovatch
Abstract: High performance computing resources attract a wide range of computational users and a correspondingly wide range of job widths and lengths. For example, on the petaflop Cray XT5 machine, Kraken, users submit jobs ranging from a few hundred cores (capacity computing) to over one hundred thousand cores (capability computing). Traditionally it has been difficult to maintain high utilization while juggling such a diverse job mix. This paper explores four unique approaches to achieving our scheduling goal of maximizing utilization on four distinct resources at the National Institute for Computational Sciences. The resources include the petaflop machine Kraken; Athena, a 166 TF Cray XT4; a 4 TB shared-memory NUMA machine called Nautilus; and a GPU cluster called Keeneland.
Citations: 8
Building algorithmically nonstop fault tolerant MPI programs
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152716
Rui Wang, Erlin Yao, Mingyu Chen, Guangming Tan, P. Balaji, Darius Buntinas
Abstract: With the growing scale of high-performance computing (HPC) systems, today and even more so tomorrow, faults are the norm rather than the exception. HPC applications typically tolerate fail-stop failures under the stop-and-wait scheme, in which, even if only one processor fails, the whole system has to stop and wait for the recovery of the corrupted data. It is now a more-or-less accepted fact that the stop-and-wait scheme will not scale to the next generation of HPC systems. Inspired by the previous stop-and-wait algorithm-based fault tolerance (ABFT) recovery technique, we propose in this paper a nonstop fault tolerance scheme at the application level and describe its implementation. When a failure occurs during the execution of an application, we do not stop to wait for the recovery of the corrupted node; instead, we replace it with the corresponding redundant node and continue the execution. At the end of execution, the correct solution can be recovered algorithmically at very low cost. In order to implement the scheme, some new fault-tolerant features of the Message Passing Interface (MPI) have been investigated and utilized in the MPICH implementation of MPI. We also describe a case study using High Performance Linpack (HPL) with these new features and evaluate the performance of both our new scheme and ABFT recovery. Experimental results show the advantage of our new scheme over ABFT recovery even at small scale.
Citations: 19
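The recovery principle the scheme builds on can be shown in a few lines: with a checksum row appended to a matrix, the data held by a single failed node can be reconstructed from the checksum and the surviving rows. This is a generic sketch of classical ABFT checksum encoding under an assumed data layout, not the paper's nonstop scheme or its MPICH/HPL implementation.

```cpp
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;   // rows x cols, last row = column checksums

// Append a checksum row: each entry is the sum of the data entries in its column.
void appendChecksumRow(Matrix& a) {
    std::size_t cols = a.front().size();
    std::vector<double> checksum(cols, 0.0);
    for (const auto& row : a)
        for (std::size_t j = 0; j < cols; ++j) checksum[j] += row[j];
    a.push_back(checksum);
}

// Rebuild data row `lost` (not the checksum row) in place after its contents were corrupted:
// lost entry = column checksum minus the sum of the surviving data entries in that column.
void recoverRow(Matrix& a, std::size_t lost) {
    std::size_t cols = a.front().size();
    std::size_t checksumRow = a.size() - 1;
    for (std::size_t j = 0; j < cols; ++j) {
        double sumOfOthers = 0.0;
        for (std::size_t i = 0; i + 1 < a.size(); ++i)    // all data rows except the lost one
            if (i != lost) sumOfOthers += a[i][j];
        a[lost][j] = a[checksumRow][j] - sumOfOthers;
    }
}
```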
Supporting computational data model representation with high-performance I/O in parallel netCDF
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152746
Kui Gao, Chen Jin, A. Choudhary, W. Liao
Abstract: Parallel computational scientific applications are often characterized by their computation and communication patterns. From a storage and I/O perspective, these applications can also be grouped into separate data models based on the way data is organized and accessed during simulation, analysis, and visualization. Parallel netCDF is a popular library used in many scientific applications to store scientific datasets, and it provides high-performance parallel I/O. Although the metadata-rich netCDF file format can effectively store and describe regular multi-dimensional array datasets, it does not address the full range of current and future computational science data models. In this paper, we present a new storage scheme in Parallel netCDF to represent a broad variety of data models used in modern computational scientific applications. This scheme also allows concurrent metadata construction for different data objects from multiple groups of application processes, an important feature for obtaining a high degree of I/O parallelism with data models that exhibit irregular data distribution. Furthermore, we employ non-blocking I/O functions to aggregate irregularly distributed data requests into large, contiguous data requests and thereby achieve high-performance I/O. Using an adaptive mesh refinement data model as an example, we demonstrate that the proposed scheme produces scalable performance results for both data and metadata creation and access.
Citations: 8
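The non-blocking aggregation mentioned in the abstract follows PnetCDF's public iput/wait pattern: many small, irregular requests are posted and then flushed together so the library can combine them into large contiguous accesses. The sketch below shows that usage pattern only; it assumes ncid and varid come from an already-defined file and variable, omits error handling, and does not reflect the paper's internal storage-scheme changes.

```cpp
#include <cstddef>
#include <mpi.h>
#include <pnetcdf.h>
#include <vector>

// One irregular piece of a 2-D variable owned by this process.
struct SubBlock {
    MPI_Offset start[2];
    MPI_Offset count[2];
    const double* data;
};

void writeIrregularBlocks(int ncid, int varid, const std::vector<SubBlock>& blocks) {
    std::vector<int> requests(blocks.size());
    for (std::size_t i = 0; i < blocks.size(); ++i) {
        // Post a non-blocking write for one irregular piece; nothing hits the file yet.
        ncmpi_iput_vara_double(ncid, varid, blocks[i].start, blocks[i].count,
                               blocks[i].data, &requests[i]);
    }
    std::vector<int> statuses(blocks.size());
    // Flush all pending requests at once; PnetCDF can aggregate them into large,
    // contiguous accesses before handing them to MPI-IO.
    ncmpi_wait_all(ncid, static_cast<int>(requests.size()),
                   requests.data(), statuses.data());
}
```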
Hybrid algorithms for list ranking and graph connected components
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152655
D. Banerjee, Kishore Kothapalli
Abstract: The advent of multicore and many-core architectures has seen them deployed to speed up computations across several disciplines and application areas. Prominent examples include semi-numerical algorithms such as sorting, graph algorithms, image processing, scientific computations, and the like. In particular, using GPUs for general-purpose computation has attracted a lot of attention, given that GPUs can deliver more than one TFLOP of computing power at very low prices. In this work, we use a new model of multicore computing called hybrid multicore computing, in which the computation is performed simultaneously on a control device, such as a CPU, and an accelerator, such as a GPU. To this end, we use two case studies to explore the algorithmic and analytical issues in hybrid multicore computing. Our case studies involve two different ways of designing hybrid multicore algorithms. The main contribution of this paper is to address the issues related to the design of hybrid solutions. We show that our hybrid algorithm for list ranking is faster by 50% compared to the best known implementation [Z. Wei, J. JaJa; IPDPS 2010]. Similarly, our hybrid algorithm for graph connected components is faster by 25% compared to the best known GPU implementation [26].
Citations: 22
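To make the list-ranking case study concrete, here is the classical pointer-jumping formulation of the problem: given successor links, compute each node's distance to the tail of the list. It is the textbook primitive that GPU and hybrid CPU+GPU list-ranking algorithms start from, not the authors' hybrid algorithm; the self-loop tail convention and sequential loop are assumptions of this sketch.

```cpp
#include <cstddef>
#include <vector>

// Synchronous pointer jumping (Wyllie-style), written sequentially for clarity.
// next[i] == i marks the tail of the list; the result is each node's distance to the tail.
std::vector<std::size_t> listRank(std::vector<std::size_t> next) {
    const std::size_t n = next.size();
    std::vector<std::size_t> rank(n, 0);
    for (std::size_t i = 0; i < n; ++i)
        if (next[i] != i) rank[i] = 1;               // one hop unless already at the tail
    bool changed = true;
    while (changed) {                                // O(log n) rounds of pointer jumping
        changed = false;
        std::vector<std::size_t> nrank = rank, nnext = next;
        for (std::size_t i = 0; i < n; ++i) {
            if (next[i] != next[next[i]]) {
                nrank[i] = rank[i] + rank[next[i]];  // absorb the successor's partial rank
                nnext[i] = next[next[i]];            // jump over the successor
                changed = true;
            }
        }
        rank.swap(nrank);
        next.swap(nnext);
    }
    return rank;
}
```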
Maximizing throughput of jobs with multiple resource requirements
Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152721
Venkatesan T. Chakaravarthy, Sambuddha Roy, Yogish Sabharwal, Neha Sengupta
Abstract: We consider the problem of scheduling jobs that require multiple resources such as memory, bandwidth, and processors. For each job, the input specifies a start time, a finish time, and a profit; the input also specifies the job's requirement for each resource. Each resource has a fixed capacity (called its bandwidth). A feasible solution is a subset of jobs such that, for any timeslot and any resource, the total requirement of the jobs active at that timeslot does not exceed the capacity of the resource. The goal is to maximize the total profit of the selected jobs. We present an approximation algorithm with provable guarantees, together with effective heuristics, for this problem. The algorithm has an approximation ratio of O(r), where r is the number of resources. We present an experimental evaluation of our algorithms that demonstrates their effectiveness.
Citations: 0
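The feasibility condition in the problem statement above translates directly into code: a selected subset of jobs is feasible only if, at every timeslot and for every resource, the active jobs' total demand stays within capacity. The checker below encodes just that definition, under assumed integer timeslots and half-open intervals; it is illustrative and unrelated to the paper's O(r)-approximation algorithm itself.

```cpp
#include <cstddef>
#include <vector>

struct Job {
    int start;
    int finish;                   // job is active in timeslots [start, finish)
    double profit;
    std::vector<double> demand;   // demand[k] = requirement for resource k
};

// Returns true iff no resource is oversubscribed at any timeslot in [0, horizon).
bool isFeasible(const std::vector<Job>& selected,
                const std::vector<double>& capacity,
                int horizon) {
    const std::size_t r = capacity.size();
    for (int t = 0; t < horizon; ++t) {
        std::vector<double> used(r, 0.0);
        for (const Job& job : selected) {
            if (job.start <= t && t < job.finish)
                for (std::size_t k = 0; k < r; ++k) used[k] += job.demand[k];
        }
        for (std::size_t k = 0; k < r; ++k)
            if (used[k] > capacity[k]) return false;   // some resource exceeds its bandwidth
    }
    return true;
}
```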