2011 18th International Conference on High Performance Computing: Latest Articles

Multi-model prediction for enhancing content locality in elastic server infrastructures
2011 18th International Conference on High Performance Computing · Pub Date: 2011-12-18 · DOI: 10.1109/HiPC.2011.6152728
Juan M. Tirado, Daniel Higuero, Florin Isaila, J. Carretero
Abstract: Infrastructures serving online applications experience dynamic workload variations driven by diverse factors such as popularity, marketing, periodic patterns, fads, trends, and events. Predictable factors such as trends, periodicity, or scheduled events allow for proactive resource provisioning in order to meet fluctuations in workloads. However, proactive resource provisioning requires prediction models that forecast future workload patterns. This paper proposes a multi-model prediction approach in which data are grouped into bins based on content locality, and an autoregressive prediction model is assigned to each locality-preserving bin. The prediction models can be identified and fitted in a computationally efficient way. We demonstrate experimentally that our multi-model approach improves locality over the uni-model approach, while achieving efficient resource provisioning and preserving high resource utilization and load balance.
Citations: 25
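The per-bin autoregressive idea above can be illustrated with a minimal sketch: fit an independent AR(p) model to each locality-preserving bin's request series by least squares, then forecast one step ahead per bin. This is an assumption-laden toy (the bin names, series, and fitting via `numpy.linalg.lstsq` are illustrative, not the authors' implementation):

```python
import numpy as np

def fit_ar(series, p):
    """Fit an AR(p) model by ordinary least squares; returns p coefficients."""
    # Column i holds the lag series[t - p + i] for each target y[t].
    X = np.column_stack([series[i:len(series) - p + i] for i in range(p)])
    y = series[p:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def predict_next(series, coeffs):
    """One-step-ahead forecast from the last p observations."""
    p = len(coeffs)
    return float(np.dot(series[-p:], coeffs))

# One independent AR model per locality-preserving bin (toy request counts).
bins = {
    "bin0": np.array([10.0, 12.0, 11.0, 13.0, 12.5, 14.0, 13.5, 15.0]),
    "bin1": np.array([100.0, 90.0, 95.0, 85.0, 88.0, 80.0, 83.0, 78.0]),
}
forecasts = {name: predict_next(s, fit_ar(s, p=2)) for name, s in bins.items()}
```

Fitting each bin separately keeps every model small, which is what makes identification computationally cheap relative to a single model over all content.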
Robust thread-level speculation
Pub Date: 2011-12-18 · DOI: 10.1109/HiPC.2011.6152737
Álvaro García-Yágüez, D. Ferraris, Arturo González-Escribano
Abstract: Robustness is a key issue for any runtime system that aims to speed up the execution of a program. However, robustness considerations are commonly overlooked when new software-based thread-level speculation (STLS) systems are proposed. This paper highlights the relevance of the problem, showing different situations in which the use of incorrect data can irreversibly alter the speculative execution of an algorithm, despite the efforts of a given STLS system to maintain sequential consistency. We show that the management of speculative exceptions is a common factor in these problems. Based on this fact, we propose a novel solution for handling speculative exceptions. Our solution eagerly tries to resolve the issue before the non-speculative thread arrives at the instruction that raised the exception. We compare our solution with a more conservative approach from the literature, both qualitatively, through a detailed analysis of the tradeoffs involved, and quantitatively, by evaluating the effects of both solutions on the execution of three different benchmarks on a real system. Both studies conclude that our solution handles speculative exceptions more efficiently. Under heavy loads intended to push an STLS system to its limits, our solution reduces execution times by up to 52.02% with respect to earlier proposals, and it does not affect performance when speculative exceptions do not appear. We believe that our proposal makes STLS systems robust enough to be used in production environments.
Citations: 4
A fast centralized computation routing algorithm for self-configuring NoC systems
Pub Date: 2011-12-18 · DOI: 10.1109/HiPC.2011.6152732
F. Triviño, F. J. Alfaro, J. L. Sánchez, J. Flich
Abstract: As technology evolves, networks-on-chip will need to survive manufacturing faults in order to sustain yield. An effective configuration strategy implies the design of an efficient routing infrastructure that enables fast and efficient configuration of the NoC system to route around faulty links and switches. The strategy must minimize resource overhead and guarantee that the entire system is deadlock free. A centralized approach, through a monitoring controller, is appealing because it provides global network visibility. This paper proposes a centralized routing configuration strategy that meets these requirements by means of a fast configuration algorithm for the most common failure patterns. The strategy is designed toward the goals of reduced configuration time and high coverage (maximum number of supported failure patterns). No extra resources (virtual channels) are needed for the effective final configuration of the system. Results show the effectiveness of the proposed configuration algorithm.
Citations: 2
Partial globalization of partitioned address spaces for zero-copy communication with shared memory
Pub Date: 2011-12-18 · DOI: 10.1109/HiPC.2011.6152733
Fangzhou Jiao, N. Mahajan, Jeremiah Willcock, A. Chauhan, A. Lumsdaine
Abstract: We have developed a high-level language, called Kanor, for declaratively specifying communication in parallel programs. Designed as an extension of C++, it coordinates partitioned address space programs written in the bulk synchronous parallel (BSP) style. Kanor's declarative semantics enable programmers to write correct and maintainable parallel applications, and its communication abstraction has been carefully designed to be amenable to compiler optimizations. While partitioned address space programming has several advantages, it needs special compiler optimizations to effectively leverage shared memory hardware when running on multicore machines. In this paper, we introduce such shared-memory optimizations in the context of Kanor. One major way we achieve these optimizations is by selectively moving some variables into a globally shared address space, a process that we term partial globalization. We identify scenarios in which such a transformation is beneficial, and present an algorithm to identify and correctly transform Kanor communication steps into zero-copy communication using hardware shared memory, introducing minimal synchronization. We then present a runtime strategy that complements the compiler algorithm and eliminates most of the runtime synchronization overhead using a copy-on-conflict technique. Finally, we show that our solution often performs much better than shared-memory optimized MPI, and never performs significantly worse than MPI, even in the presence of dependencies introduced by buffer sharing. The techniques in this paper demonstrate that it is possible to program in a partitioned address space style without sacrificing the performance advantages of hardware shared memory. To the best of our knowledge, no other automatic compiler techniques have been developed so far that achieve zero-copy communication from a partitioned address space program. We expect our results to be applicable beyond Kanor, to other partitioned address space programming environments such as MPI.
Citations: 8
A dynamic scheduling framework for emerging heterogeneous systems
Pub Date: 2011-12-18 · DOI: 10.1109/HiPC.2011.6152724
Vignesh T. Ravi, G. Agrawal
Abstract: A trend that has materialized and attracted much attention is the rise of increasingly heterogeneous computing platforms. It has become very common for a desktop or notebook computer to be equipped with both a multi-core CPU and a GPU. Application development that exploits the aggregate computing power of such an environment is a major challenge today. In particular, we need dynamic work distribution schemes that adapt to different computation and communication patterns in applications, and to various heterogeneous configurations. This paper describes a general dynamic scheduling framework for mapping applications with different communication patterns to heterogeneous architectures. We first make key observations about the architectural tradeoffs among heterogeneous resources and the communication pattern of an application, and then infer constraints for the dynamic scheduler. We then present a novel cost model for choosing the optimal chunk size in a heterogeneous configuration. Finally, based on the general framework and cost model, we provide optimized work distribution schemes to further improve performance.
Citations: 41
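The core of dynamic chunk-based work distribution can be sketched in a few lines: work is split into fixed-size chunks in a shared queue, and idle workers pull the next chunk, so a faster device (e.g. the GPU worker) automatically ends up processing more of the input. This is a generic sketch of the scheduling pattern, not the paper's framework or cost model; all names are illustrative:

```python
import queue
import threading

def dynamic_schedule(items, fn, n_workers, chunk_size):
    """Idle workers repeatedly pull the next fixed-size chunk from a shared
    queue; faster workers naturally claim more chunks (dynamic distribution)."""
    q = queue.Queue()
    for i in range(0, len(items), chunk_size):
        q.put((i, items[i:i + chunk_size]))
    results = [None] * len(items)  # each chunk writes disjoint slots

    def worker():
        while True:
            try:
                start, chunk = q.get_nowait()
            except queue.Empty:
                return  # no work left
            for j, x in enumerate(chunk):
                results[start + j] = fn(x)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The paper's contribution sits on top of this skeleton: its cost model picks `chunk_size` per device, trading scheduling overhead against load imbalance.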
Dynamic hosting management of web based applications over clouds
Pub Date: 2011-12-18 · DOI: 10.1109/HiPC.2011.6152731
Z. Abbasi, T. Mukherjee, G. Varsamopoulos, S. Gupta
Abstract: Dynamic Application Hosting Management (DAHM) allows clouds to dynamically host applications in data centers at different locations based on: (i) spatio-temporal variation of energy price, (ii) data center computing and cooling energy efficiency, (iii) Virtual Machine (VM) migration cost for the applications, and (iv) any SLA violations due to migration overhead or network delay. DAHM is complementary to the dynamic workload distribution problem and is modeled as a mixed integer program; online algorithms are developed to solve it. The algorithms are evaluated in a simulation study using realistic data and compared with performance-oriented application assignment, i.e., hosting each application at the data center with the lowest delay. Our simulation results indicate that DAHM can save up to 20% of cost while incurring only a nominal increase in SLA violations. The savings are obtained by exploiting the variation in cost efficiency as well as by reducing the total number of VMs employed to host applications.
Citations: 22
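The cost tradeoff DAHM navigates can be illustrated with a toy greedy stand-in for the mixed-integer formulation: per decision interval, pick the data center minimizing the sum of energy, migration, and SLA-violation costs. All field names, the cost terms, and the `sla_weight` knob are hypothetical simplifications, not the paper's actual model:

```python
def choose_host(data_centers, app, sla_weight):
    """Greedy per-interval stand-in for the DAHM MIP: pick the data center
    with the lowest combined energy + migration + SLA-violation cost."""
    def cost(dc):
        # (i)+(ii): energy price scaled by cooling overhead (PUE) and load.
        energy = dc["energy_price"] * dc["pue"] * app["load"]
        # (iii): pay a migration cost only if the app must move.
        migration = (app["vm_size"] * dc["migration_cost"]
                     if dc["name"] != app["current_host"] else 0.0)
        # (iv): penalize network delay beyond the SLA bound.
        sla = sla_weight * max(0.0, dc["delay"] - app["max_delay"])
        return energy + migration + sla
    return min(data_centers, key=cost)["name"]
```

Raising `sla_weight` recovers the performance-oriented baseline (always the lowest-delay site); lowering it lets cheap-energy sites win, which is the ~20% saving regime the paper reports.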
Implementing a hybrid SRAM / eDRAM NUCA architecture
Pub Date: 2011-12-18 · DOI: 10.1109/HiPC.2011.6152738
Javier Lira, Carlos Molina, D. Brooks, Antonio González
Abstract: Advances in technology have allowed DRAM-like structures, called embedded DRAM (eDRAM), to be integrated into the chip. This technology has already been successfully implemented in some GPUs and other graphics-intensive SoCs, such as game consoles. The most recent processor from IBM, POWER7, is the first general-purpose processor that integrates an eDRAM module on the chip. In this paper, we propose a hybrid cache architecture that exploits the main features of both memory technologies: the speed of SRAM and the high density of eDRAM. We demonstrate that, due to the high locality found in emerging applications, a high percentage of the data that enters the on-chip last-level cache is never accessed again before being evicted. Based on that observation, we propose a placement scheme in which re-accessed data blocks are stored in fast, but costly in terms of area and power, SRAM banks, while eDRAM banks store data blocks that have just arrived at the NUCA cache or were demoted from an SRAM bank. We show that a well-balanced SRAM / eDRAM NUCA cache can achieve performance similar to that of a NUCA cache composed of only SRAM banks, while reducing area by 15% and power consumption by 10%. Furthermore, we explore several alternatives for exploiting the area savings of the hybrid architecture, resulting in an overall performance improvement of 4%.
Citations: 15
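The placement policy described above — fills go to eDRAM, re-accessed blocks get promoted to SRAM, SRAM victims get demoted rather than dropped — can be modeled with two small LRU structures. A minimal behavioral sketch, assuming fully associative banks and illustrative sizes (not the paper's bank organization):

```python
from collections import OrderedDict

class HybridNuca:
    """Toy model of the proposed placement: incoming blocks fill into eDRAM;
    a block re-accessed while in eDRAM is promoted to SRAM; SRAM evictions
    are demoted back to eDRAM. Both banks use LRU replacement."""

    def __init__(self, sram_ways=4, edram_ways=8):
        self.sram = OrderedDict()   # most-recently-used at the end
        self.edram = OrderedDict()
        self.sram_ways, self.edram_ways = sram_ways, edram_ways

    def access(self, addr):
        if addr in self.sram:              # hit in the fast bank
            self.sram.move_to_end(addr)
            return "sram_hit"
        if addr in self.edram:             # re-access: promote to SRAM
            del self.edram[addr]
            self._insert_sram(addr)
            return "edram_hit"
        self._insert_edram(addr)           # miss: fill into eDRAM first
        return "miss"

    def _insert_sram(self, addr):
        self.sram[addr] = True
        if len(self.sram) > self.sram_ways:
            victim, _ = self.sram.popitem(last=False)
            self._insert_edram(victim)     # demote the LRU block, don't drop it

    def _insert_edram(self, addr):
        self.edram[addr] = True
        self.edram.move_to_end(addr)
        if len(self.edram) > self.edram_ways:
            self.edram.popitem(last=False)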
GVT algorithms and discrete event dynamics on 129K+ processor cores
Pub Date: 2011-12-18 · DOI: 10.1109/HiPC.2011.6152725
K. Perumalla, Alfred Park, V. Tipparaju
Abstract: Parallel discrete event simulation (PDES) represents a class of codes that are challenging to scale to large numbers of processors due to tight global timestamp ordering and fine-grained event execution. One of the critical factors in scaling PDES is the efficiency of the underlying global virtual time (GVT) algorithm needed for correctness of parallel execution and speed of progress. Although many GVT algorithms have been proposed previously, few have been designed for scalable asynchronous execution and none customized to exploit one-sided communication. Moreover, the detailed performance effects of actual GVT algorithm implementations on large platforms are unknown. Here, three major GVT algorithms intended for scalable execution on high-performance systems are studied: (1) a synchronous GVT algorithm that affords ease of implementation, (2) an asynchronous GVT algorithm that is more complex to implement but can relieve blocking latencies, and (3) a variant of the asynchronous GVT algorithm, proposed and studied here for the first time, that exploits one-sided communication on extant supercomputing platforms. Performance results are presented for implementations of these algorithms on up to 129,024 cores of a Cray XT5 system, exercised over a range of parameters: optimistic and conservative synchronization, fine- to medium-grained event computation, synthetic and non-synthetic applications, and different lookahead values. Performance to the tune of tens of billions of events executed per second is registered, exceeding the speed of any known PDES engine and showing that asynchronous GVT algorithms outperform state-of-the-art synchronous GVT algorithms. Detailed PDES-specific runtime metrics are presented to further the understanding of tightly coupled discrete event dynamics on massively parallel platforms.
Citations: 13
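The invariant every GVT algorithm computes is simple to state, even though computing it scalably is hard: GVT is a lower bound on the timestamp of any future event, taken over all logical processes and all messages still in transit, and it gates fossil collection of rollback state. A minimal sketch of the synchronous (barrier-style) variant's core reduction, with illustrative data shapes:

```python
def compute_gvt(local_virtual_times, in_transit_timestamps):
    """Synchronous GVT: once all processors reach a barrier, GVT is the
    minimum over every LP's local virtual time and the timestamp of every
    message still in flight (which could still cause a rollback)."""
    return min(list(local_virtual_times) + list(in_transit_timestamps))

def fossil_collect(saved_events, gvt):
    """Events with timestamps below GVT can never be rolled back;
    their saved state can be reclaimed."""
    return [e for e in saved_events if e["ts"] >= gvt]
```

The asynchronous variants studied in the paper compute the same minimum without a global barrier, which is what removes the blocking latency at 129K+ cores.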
Hybrid implementation of error diffusion dithering
Pub Date: 2011-12-18 · DOI: 10.1109/HiPC.2011.6152714
A. Deshpande, Ishan Misra, P J Narayanan
Abstract: Many image filtering operations provide ample parallelism, but progressive non-linear processing of images is among the hardest to parallelize due to long, sequential, and non-linear data dependencies. A typical example of such an operation is error diffusion dithering, exemplified by the Floyd-Steinberg algorithm. In this paper, we present its parallelization on multicore CPUs using a block-based approach and on the GPU using a pixel-based approach. We also present a hybrid approach in which the CPU and the GPU operate in parallel during the computation. High performance computing has traditionally been associated with high-end CPUs and GPUs; our focus is on everyday computers such as laptops and desktops, where significant compute power is available on the GPU as well as the CPU. Our implementation can dither an 8K × 8K image on an off-the-shelf laptop with an Nvidia 8600M GPU in about 400 milliseconds, whereas the sequential implementation on its CPU takes about 4 seconds.
Citations: 15
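The sequential dependency that makes this hard to parallelize is visible in the baseline algorithm itself: each pixel's quantization error is pushed onto not-yet-processed neighbors, so pixel (y, x) depends on its left and upper neighbors. A straightforward sequential Floyd-Steinberg sketch (1-bit output, standard 7/16, 3/16, 5/16, 1/16 weights):

```python
def floyd_steinberg(img):
    """Sequential Floyd-Steinberg dithering of a grayscale image (values
    0..255) to 1-bit. Each pixel's quantization error is diffused to the
    right, down-left, down, and down-right neighbors (7/16, 3/16, 5/16, 1/16),
    creating the long serial dependency chain the paper parallelizes."""
    h, w = len(img), len(img[0])
    px = [list(row) for row in img]  # mutable working copy
    for y in range(h):
        for x in range(w):
            old = px[y][x]
            new = 255 if old >= 128 else 0
            px[y][x] = new
            err = old - new
            if x + 1 < w:
                px[y][x + 1] += err * 7 / 16
            if y + 1 < h and x > 0:
                px[y + 1][x - 1] += err * 3 / 16
            if y + 1 < h:
                px[y + 1][x] += err * 5 / 16
            if y + 1 < h and x + 1 < w:
                px[y + 1][x + 1] += err * 1 / 16
    return px
```

The block- and pixel-based parallel schemes in the paper exploit the fact that pixels along an anti-diagonal wavefront have no mutual dependency and can be processed concurrently.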
Compute & memory optimizations for high-quality speech recognition on low-end GPU processors
Pub Date: 2011-12-18 · DOI: 10.1109/HiPC.2011.6152741
Kshitij Gupta, John Douglas Owens
Abstract: Gaussian Mixture Model (GMM) computations in modern automatic speech recognition systems are known to dominate the total processing time, and are both memory-bandwidth and compute intensive. Graphics processors (GPUs) are well suited for applications exhibiting data- and thread-level parallelism, such as GMM score computation. By exploiting temporal locality over successive frames of speech, we have previously presented a theoretical framework for modifying the traditional speech processing pipeline and obtaining significant savings in compute and memory bandwidth requirements, especially on resource-constrained devices such as those found in mobile platforms. In this paper we discuss in detail our implementation of two of the three techniques we previously proposed, and suggest guidelines for which technique is suitable for a given condition. For a medium-vocabulary dictation task consisting of 5k words, we reduce memory bandwidth by 80% for a 20% compute overhead without loss in accuracy using the first technique, and achieve memory and compute savings of 90% and 35%, respectively, for a 15% degradation in accuracy using the second technique. We achieve a 4× speed-up (to 6 times real-time performance) over the baseline on a low-end Nvidia 9400M GPU.
Citations: 12
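The workload being optimized above — GMM score computation — is a per-frame log-likelihood under many diagonal-covariance Gaussians, and the temporal-locality savings come from reusing scores across nearby frames. A generic sketch of the scorer plus a naive frame-skipping wrapper (one simple way to exploit temporal locality; this is not the paper's exact technique, and all shapes are illustrative):

```python
import numpy as np

def gmm_log_scores(frames, means, inv_vars, log_weights):
    """Log-likelihood of each frame under a diagonal-covariance GMM,
    via a log-sum-exp over mixture components.
    frames: (T, D); means, inv_vars: (M, D); log_weights: (M,)."""
    diff = frames[:, None, :] - means[None, :, :]              # (T, M, D)
    mahal = np.einsum("tmd,md->tm", diff * diff, inv_vars)     # (T, M)
    log_norm = 0.5 * (np.sum(np.log(inv_vars), axis=1)
                      - means.shape[1] * np.log(2 * np.pi))    # (M,)
    comp = log_weights + log_norm - 0.5 * mahal                # (T, M)
    m = comp.max(axis=1, keepdims=True)                        # stable LSE
    return m[:, 0] + np.log(np.exp(comp - m).sum(axis=1))      # (T,)

def scores_with_frame_skip(frames, k, means, inv_vars, log_weights):
    """Exploit temporal locality: score every k-th frame and reuse the result
    for intervening frames, cutting GMM compute and bandwidth by ~1/k at some
    accuracy cost (a toy bandwidth-saving scheme, not the paper's)."""
    out = np.empty(len(frames))
    for t0 in range(0, len(frames), k):
        s = gmm_log_scores(frames[t0:t0 + 1], means, inv_vars, log_weights)[0]
        out[t0:t0 + k] = s
    return out
```

Because successive speech frames are highly correlated, such reuse trades a bounded accuracy loss for large reductions in the Gaussian evaluations per second, which is the regime the paper's bandwidth/accuracy numbers quantify.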