2011 18th International Conference on High Performance Computing: Latest Articles

Multi-model prediction for enhancing content locality in elastic server infrastructures
2011 18th International Conference on High Performance Computing · Pub Date: 2011-12-18 · DOI: 10.1109/HiPC.2011.6152728
Juan M. Tirado, Daniel Higuero, Florin Isaila, J. Carretero
Abstract: Infrastructures serving online applications experience dynamic workload variations driven by diverse factors such as popularity, marketing, periodic patterns, fads, trends, and events. Predictable factors such as trends, periodicity, or scheduled events allow for proactive resource provisioning in order to meet fluctuations in workloads. However, proactive resource provisioning requires prediction models that forecast future workload patterns. This paper proposes a multi-model prediction approach in which data are grouped into bins based on content locality, and an autoregressive prediction model is assigned to each locality-preserving bin. The prediction models can be identified and fitted in a computationally efficient way. We demonstrate experimentally that our multi-model approach improves locality over the uni-model approach, while achieving efficient resource provisioning and preserving high resource utilization and load balance.
Citations: 25
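The per-bin autoregressive idea above can be illustrated with a minimal sketch: fit an independent AR(p) model to each locality-preserving bin's request series by least squares, then forecast one step ahead per bin. This is an assumption-laden toy (the bin names, series, and fitting via `numpy.linalg.lstsq` are illustrative, not the authors' implementation):

```python
import numpy as np

def fit_ar(series, p):
    """Fit an AR(p) model by ordinary least squares; returns p coefficients."""
    # Column i holds the lag series[t - p + i] for each target y[t].
    X = np.column_stack([series[i:len(series) - p + i] for i in range(p)])
    y = series[p:]
    coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coeffs

def predict_next(series, coeffs):
    """One-step-ahead forecast from the last p observations."""
    p = len(coeffs)
    return float(np.dot(series[-p:], coeffs))

# One independent AR model per locality-preserving bin (toy request counts).
bins = {
    "bin0": np.array([10.0, 12.0, 11.0, 13.0, 12.5, 14.0, 13.5, 15.0]),
    "bin1": np.array([100.0, 90.0, 95.0, 85.0, 88.0, 80.0, 83.0, 78.0]),
}
forecasts = {name: predict_next(s, fit_ar(s, p=2)) for name, s in bins.items()}
```

Fitting each bin separately keeps every model small, which is what makes identification computationally cheap relative to a single model over all content.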
Robust thread-level speculation
Pub Date: 2011-12-18 · DOI: 10.1109/HiPC.2011.6152737
Álvaro García-Yágüez, D. Ferraris, Arturo González-Escribano
Abstract: Robustness is a key issue for any runtime system that aims to speed up the execution of a program. However, robustness considerations are commonly overlooked when new software-based thread-level speculation (STLS) systems are proposed. This paper highlights the relevance of the problem, showing different situations in which the use of incorrect data can irreversibly alter the speculative execution of an algorithm, despite the efforts of a given STLS system to maintain sequential consistency. We show that the management of speculative exceptions is a common factor in these problems. Based on this fact, we propose a novel solution for handling speculative exceptions. Our solution eagerly tries to resolve the issue before the non-speculative thread arrives at the instruction that raised the exception. We compare our solution with a more conservative approach from the literature, both qualitatively, through a detailed analysis of the tradeoffs involved, and quantitatively, by evaluating the effects of both solutions on the execution of three different benchmarks on a real system. Both studies conclude that our solution handles speculative exceptions more efficiently. Under heavy loads intended to push an STLS system to its limits, our solution reduces execution times by up to 52.02% with respect to earlier proposals, and it does not affect performance when speculative exceptions do not appear. We believe that our proposal makes STLS systems robust enough to be used in production environments.
Citations: 4
A fast centralized computation routing algorithm for self-configuring NoC systems
Pub Date: 2011-12-18 · DOI: 10.1109/HiPC.2011.6152732
F. Triviño, F. J. Alfaro, J. L. Sánchez, J. Flich
Abstract: As technology evolves, networks-on-chip will need to survive manufacturing faults in order to sustain yield. An effective configuration strategy implies the design of an efficient routing infrastructure that enables fast and efficient configuration of the NoC system to route around faulty links and switches. The strategy must minimize resource overhead and guarantee that the entire system is deadlock free. A centralized approach, through a monitoring controller, is appealing because it provides global network visibility. This paper proposes a centralized routing configuration strategy that meets these requirements by means of a fast configuration algorithm for the most common failure patterns. The strategy is designed toward the goals of reduced configuration time and high coverage (maximum number of supported failure patterns). No extra resources (virtual channels) are needed for the effective final configuration of the system. Results show the effectiveness of the proposed configuration algorithm.
Citations: 2
Partial globalization of partitioned address spaces for zero-copy communication with shared memory
Pub Date: 2011-12-18 · DOI: 10.1109/HiPC.2011.6152733
Fangzhou Jiao, N. Mahajan, Jeremiah Willcock, A. Chauhan, A. Lumsdaine
Abstract: We have developed a high-level language, called Kanor, for declaratively specifying communication in parallel programs. Designed as an extension of C++, it coordinates partitioned address space programs written in the bulk synchronous parallel (BSP) style. Kanor's declarative semantics enable programmers to write correct and maintainable parallel applications, and its communication abstraction has been carefully designed to be amenable to compiler optimizations. While partitioned address space programming has several advantages, it needs special compiler optimizations to effectively leverage shared memory hardware when running on multicore machines. In this paper, we introduce such shared-memory optimizations in the context of Kanor. One major way we achieve these optimizations is by selectively moving some variables into a globally shared address space, a process that we term partial globalization. We identify scenarios in which such a transformation is beneficial, and present an algorithm to identify and correctly transform Kanor communication steps into zero-copy communication using hardware shared memory, introducing minimal synchronization. We then present a runtime strategy that complements the compiler algorithm and eliminates most of the runtime synchronization overhead using a copy-on-conflict technique. Finally, we show that our solution often performs much better than shared-memory optimized MPI, and never performs significantly worse than MPI, even in the presence of dependencies introduced by buffer sharing. The techniques in this paper demonstrate that it is possible to program in a partitioned address space style without sacrificing the performance advantages of hardware shared memory. To the best of our knowledge, no other automatic compiler techniques have been developed so far that achieve zero-copy communication from a partitioned address space program. We expect our results to be applicable beyond Kanor, to other partitioned address space programming environments such as MPI.
Citations: 8
A dynamic scheduling framework for emerging heterogeneous systems
Pub Date: 2011-12-18 · DOI: 10.1109/HiPC.2011.6152724
Vignesh T. Ravi, G. Agrawal
Abstract: A trend that has materialized and attracted much attention is the rise of increasingly heterogeneous computing platforms. It has become very common for a desktop or notebook computer to be equipped with both a multi-core CPU and a GPU. Application development that exploits the aggregate computing power of such an environment is a major challenge today. In particular, we need dynamic work distribution schemes that adapt to different computation and communication patterns in applications, and to various heterogeneous configurations. This paper describes a general dynamic scheduling framework for mapping applications with different communication patterns to heterogeneous architectures. We first make key observations about the architectural tradeoffs among heterogeneous resources and the communication pattern of an application, and then infer constraints for the dynamic scheduler. We then present a novel cost model for choosing the optimal chunk size in a heterogeneous configuration. Finally, based on the general framework and cost model, we provide optimized work distribution schemes to further improve performance.
Citations: 41
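The core of dynamic chunk-based work distribution can be sketched in a few lines: work is split into fixed-size chunks in a shared queue, and idle workers pull the next chunk, so a faster device (e.g. the GPU worker) automatically ends up processing more of the input. This is a generic sketch of the scheduling pattern, not the paper's framework or cost model; all names are illustrative:

```python
import queue
import threading

def dynamic_schedule(items, fn, n_workers, chunk_size):
    """Idle workers repeatedly pull the next fixed-size chunk from a shared
    queue; faster workers naturally claim more chunks (dynamic distribution)."""
    q = queue.Queue()
    for i in range(0, len(items), chunk_size):
        q.put((i, items[i:i + chunk_size]))
    results = [None] * len(items)  # each chunk writes disjoint slots

    def worker():
        while True:
            try:
                start, chunk = q.get_nowait()
            except queue.Empty:
                return  # no work left
            for j, x in enumerate(chunk):
                results[start + j] = fn(x)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

The paper's contribution sits on top of this skeleton: its cost model picks `chunk_size` per device, trading scheduling overhead against load imbalance.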
Dynamic hosting management of web based applications over clouds
Pub Date: 2011-12-18 · DOI: 10.1109/HiPC.2011.6152731
Z. Abbasi, T. Mukherjee, G. Varsamopoulos, S. Gupta
Abstract: Dynamic Application Hosting Management (DAHM) allows clouds to dynamically host applications in data centers at different locations based on: (i) spatio-temporal variation of energy price, (ii) data center computing and cooling energy efficiency, (iii) Virtual Machine (VM) migration cost for the applications, and (iv) any SLA violations due to migration overhead or network delay. DAHM is complementary to the dynamic workload distribution problem and is modeled as a mixed integer program; online algorithms are developed to solve it. The algorithms are evaluated in a simulation study using realistic data and compared with performance-oriented application assignment, i.e., hosting each application at the data center with the lowest delay. Our simulation results indicate that DAHM can save up to 20% of cost while incurring only a nominal increase in SLA violations. The savings are obtained by exploiting the variation in cost efficiency as well as by reducing the total number of VMs employed to host applications.
Citations: 22
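The cost tradeoff DAHM navigates can be illustrated with a toy greedy stand-in for the mixed-integer formulation: per decision interval, pick the data center minimizing the sum of energy, migration, and SLA-violation costs. All field names, the cost terms, and the `sla_weight` knob are hypothetical simplifications, not the paper's actual model:

```python
def choose_host(data_centers, app, sla_weight):
    """Greedy per-interval stand-in for the DAHM MIP: pick the data center
    with the lowest combined energy + migration + SLA-violation cost."""
    def cost(dc):
        # (i)+(ii): energy price scaled by cooling overhead (PUE) and load.
        energy = dc["energy_price"] * dc["pue"] * app["load"]
        # (iii): pay a migration cost only if the app must move.
        migration = (app["vm_size"] * dc["migration_cost"]
                     if dc["name"] != app["current_host"] else 0.0)
        # (iv): penalize network delay beyond the SLA bound.
        sla = sla_weight * max(0.0, dc["delay"] - app["max_delay"])
        return energy + migration + sla
    return min(data_centers, key=cost)["name"]
```

Raising `sla_weight` recovers the performance-oriented baseline (always the lowest-delay site); lowering it lets cheap-energy sites win, which is the ~20% saving regime the paper reports.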
Implementing a hybrid SRAM / eDRAM NUCA architecture
Pub Date: 2011-12-18 · DOI: 10.1109/HiPC.2011.6152738
Javier Lira, Carlos Molina, D. Brooks, Antonio González
Abstract: Advances in technology have allowed DRAM-like structures, called embedded DRAM (eDRAM), to be integrated into the chip. This technology has already been successfully implemented in some GPUs and other graphics-intensive SoCs, such as game consoles. The most recent processor from IBM, POWER7, is the first general-purpose processor that integrates an eDRAM module on the chip. In this paper, we propose a hybrid cache architecture that exploits the main features of both memory technologies: the speed of SRAM and the high density of eDRAM. We demonstrate that, due to the high locality found in emerging applications, a high percentage of the data that enters the on-chip last-level cache is never accessed again before being evicted. Based on that observation, we propose a placement scheme in which re-accessed data blocks are stored in fast, but costly in terms of area and power, SRAM banks, while eDRAM banks store data blocks that have just arrived at the NUCA cache or were demoted from an SRAM bank. We show that a well-balanced SRAM / eDRAM NUCA cache can achieve performance similar to that of a NUCA cache composed of only SRAM banks, while reducing area by 15% and power consumption by 10%. Furthermore, we explore several alternatives for exploiting the area savings of the hybrid architecture, resulting in an overall performance improvement of 4%.
Citations: 15
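The placement policy described above — fills go to eDRAM, re-accessed blocks get promoted to SRAM, SRAM victims get demoted rather than dropped — can be modeled with two small LRU structures. A minimal behavioral sketch, assuming fully associative banks and illustrative sizes (not the paper's bank organization):

```python
from collections import OrderedDict

class HybridNuca:
    """Toy model of the proposed placement: incoming blocks fill into eDRAM;
    a block re-accessed while in eDRAM is promoted to SRAM; SRAM evictions
    are demoted back to eDRAM. Both banks use LRU replacement."""

    def __init__(self, sram_ways=4, edram_ways=8):
        self.sram = OrderedDict()   # most-recently-used at the end
        self.edram = OrderedDict()
        self.sram_ways, self.edram_ways = sram_ways, edram_ways

    def access(self, addr):
        if addr in self.sram:              # hit in the fast bank
            self.sram.move_to_end(addr)
            return "sram_hit"
        if addr in self.edram:             # re-access: promote to SRAM
            del self.edram[addr]
            self._insert_sram(addr)
            return "edram_hit"
        self._insert_edram(addr)           # miss: fill into eDRAM first
        return "miss"

    def _insert_sram(self, addr):
        self.sram[addr] = True
        if len(self.sram) > self.sram_ways:
            victim, _ = self.sram.popitem(last=False)
            self._insert_edram(victim)     # demote the LRU block, don't drop it

    def _insert_edram(self, addr):
        self.edram[addr] = True
        self.edram.move_to_end(addr)
        if len(self.edram) > self.edram_ways:
            self.edram.popitem(last=False)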
GVT algorithms and discrete event dynamics on 129K+ processor cores
Pub Date: 2011-12-18 · DOI: 10.1109/HiPC.2011.6152725
K. Perumalla, Alfred Park, V. Tipparaju
Abstract: Parallel discrete event simulation (PDES) represents a class of codes that are challenging to scale to large numbers of processors due to tight global timestamp ordering and fine-grained event execution. One of the critical factors in scaling PDES is the efficiency of the underlying global virtual time (GVT) algorithm needed for correctness of parallel execution and speed of progress. Although many GVT algorithms have been proposed previously, few have been designed for scalable asynchronous execution and none customized to exploit one-sided communication. Moreover, the detailed performance effects of actual GVT algorithm implementations on large platforms are unknown. Here, three major GVT algorithms intended for scalable execution on high-performance systems are studied: (1) a synchronous GVT algorithm that affords ease of implementation, (2) an asynchronous GVT algorithm that is more complex to implement but can relieve blocking latencies, and (3) a variant of the asynchronous GVT algorithm, proposed and studied here for the first time, that exploits one-sided communication on extant supercomputing platforms. Performance results are presented for implementations of these algorithms on up to 129,024 cores of a Cray XT5 system, exercised over a range of parameters: optimistic and conservative synchronization, fine- to medium-grained event computation, synthetic and non-synthetic applications, and different lookahead values. Performance to the tune of tens of billions of events executed per second is registered, exceeding the speed of any known PDES engine and showing that asynchronous GVT algorithms outperform state-of-the-art synchronous GVT algorithms. Detailed PDES-specific runtime metrics are presented to further the understanding of tightly coupled discrete event dynamics on massively parallel platforms.
Citations: 13
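The invariant every GVT algorithm computes is simple to state, even though computing it scalably is hard: GVT is a lower bound on the timestamp of any future event, taken over all logical processes and all messages still in transit, and it gates fossil collection of rollback state. A minimal sketch of the synchronous (barrier-style) variant's core reduction, with illustrative data shapes:

```python
def compute_gvt(local_virtual_times, in_transit_timestamps):
    """Synchronous GVT: once all processors reach a barrier, GVT is the
    minimum over every LP's local virtual time and the timestamp of every
    message still in flight (which could still cause a rollback)."""
    return min(list(local_virtual_times) + list(in_transit_timestamps))

def fossil_collect(saved_events, gvt):
    """Events with timestamps below GVT can never be rolled back;
    their saved state can be reclaimed."""
    return [e for e in saved_events if e["ts"] >= gvt]
```

The asynchronous variants studied in the paper compute the same minimum without a global barrier, which is what removes the blocking latency at 129K+ cores.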
Hybrid implementation of error diffusion dithering
Pub Date: 2011-12-18 · DOI: 10.1109/HiPC.2011.6152714
A. Deshpande, Ishan Misra, P J Narayanan
Abstract: Many image filtering operations provide ample parallelism, but progressive non-linear processing of images is among the hardest to parallelize due to long, sequential, and non-linear data dependencies. A typical example of such an operation is error diffusion dithering, exemplified by the Floyd-Steinberg algorithm. In this paper, we present its parallelization on multicore CPUs using a block-based approach and on the GPU using a pixel-based approach. We also present a hybrid approach in which the CPU and the GPU operate in parallel during the computation. High performance computing has traditionally been associated with high-end CPUs and GPUs; our focus is on everyday computers such as laptops and desktops, where significant compute power is available on the GPU as well as the CPU. Our implementation can dither an 8K × 8K image on an off-the-shelf laptop with an Nvidia 8600M GPU in about 400 milliseconds, whereas the sequential implementation on its CPU takes about 4 seconds.
Citations: 15
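The sequential dependency that makes this hard to parallelize is visible in the baseline algorithm itself: each pixel's quantization error is pushed onto not-yet-processed neighbors, so pixel (y, x) depends on its left and upper neighbors. A straightforward sequential Floyd-Steinberg sketch (1-bit output, standard 7/16, 3/16, 5/16, 1/16 weights):

```python
def floyd_steinberg(img):
    """Sequential Floyd-Steinberg dithering of a grayscale image (values
    0..255) to 1-bit. Each pixel's quantization error is diffused to the
    right, down-left, down, and down-right neighbors (7/16, 3/16, 5/16, 1/16),
    creating the long serial dependency chain the paper parallelizes."""
    h, w = len(img), len(img[0])
    px = [list(row) for row in img]  # mutable working copy
    for y in range(h):
        for x in range(w):
            old = px[y][x]
            new = 255 if old >= 128 else 0
            px[y][x] = new
            err = old - new
            if x + 1 < w:
                px[y][x + 1] += err * 7 / 16
            if y + 1 < h and x > 0:
                px[y + 1][x - 1] += err * 3 / 16
            if y + 1 < h:
                px[y + 1][x] += err * 5 / 16
            if y + 1 < h and x + 1 < w:
                px[y + 1][x + 1] += err * 1 / 16
    return px
```

The block- and pixel-based parallel schemes in the paper exploit the fact that pixels along an anti-diagonal wavefront have no mutual dependency and can be processed concurrently.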
Compute & memory optimizations for high-quality speech recognition on low-end GPU processors
Pub Date: 2011-12-18 · DOI: 10.1109/HiPC.2011.6152741
Kshitij Gupta, John Douglas Owens
Abstract: Gaussian Mixture Model (GMM) computations in modern automatic speech recognition systems are known to dominate the total processing time, and are both memory-bandwidth and compute intensive. Graphics processors (GPUs) are well suited for applications exhibiting data- and thread-level parallelism, such as GMM score computation. By exploiting temporal locality over successive frames of speech, we have previously presented a theoretical framework for modifying the traditional speech processing pipeline and obtaining significant savings in compute and memory bandwidth requirements, especially on resource-constrained devices such as those found in mobile platforms. In this paper we discuss in detail our implementation of two of the three techniques we previously proposed, and suggest guidelines for which technique is suitable for a given condition. For a medium-vocabulary dictation task consisting of 5k words, we reduce memory bandwidth by 80% for a 20% compute overhead without loss in accuracy using the first technique, and achieve memory and compute savings of 90% and 35%, respectively, for a 15% degradation in accuracy using the second technique. We achieve a 4× speed-up (to 6 times real-time performance) over the baseline on a low-end Nvidia 9400M GPU.
Citations: 12
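The workload being optimized above — GMM score computation — is a per-frame log-likelihood under many diagonal-covariance Gaussians, and the temporal-locality savings come from reusing scores across nearby frames. A generic sketch of the scorer plus a naive frame-skipping wrapper (one simple way to exploit temporal locality; this is not the paper's exact technique, and all shapes are illustrative):

```python
import numpy as np

def gmm_log_scores(frames, means, inv_vars, log_weights):
    """Log-likelihood of each frame under a diagonal-covariance GMM,
    via a log-sum-exp over mixture components.
    frames: (T, D); means, inv_vars: (M, D); log_weights: (M,)."""
    diff = frames[:, None, :] - means[None, :, :]              # (T, M, D)
    mahal = np.einsum("tmd,md->tm", diff * diff, inv_vars)     # (T, M)
    log_norm = 0.5 * (np.sum(np.log(inv_vars), axis=1)
                      - means.shape[1] * np.log(2 * np.pi))    # (M,)
    comp = log_weights + log_norm - 0.5 * mahal                # (T, M)
    m = comp.max(axis=1, keepdims=True)                        # stable LSE
    return m[:, 0] + np.log(np.exp(comp - m).sum(axis=1))      # (T,)

def scores_with_frame_skip(frames, k, means, inv_vars, log_weights):
    """Exploit temporal locality: score every k-th frame and reuse the result
    for intervening frames, cutting GMM compute and bandwidth by ~1/k at some
    accuracy cost (a toy bandwidth-saving scheme, not the paper's)."""
    out = np.empty(len(frames))
    for t0 in range(0, len(frames), k):
        s = gmm_log_scores(frames[t0:t0 + 1], means, inv_vars, log_weights)[0]
        out[t0:t0 + k] = s
    return out
```

Because successive speech frames are highly correlated, such reuse trades a bounded accuracy loss for large reductions in the Gaussian evaluations per second, which is the regime the paper's bandwidth/accuracy numbers quantify.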