2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing: Latest Publications

BTL: A Framework for Measuring and Modeling Energy in Memory Hierarchies
I. Manousakis, Dimitrios S. Nikolopoulos
DOI: 10.1109/SBAC-PAD.2012.38 (https://doi.org/10.1109/SBAC-PAD.2012.38) | Published: 2012-10-24
Abstract: Understanding the energy efficiency of computing systems is paramount. Although processors remain the dominant energy consumers and the focal target of energy-aware optimization in computing systems, the memory subsystem dissipates substantial amounts of power, which at high densities may exceed 50% of total system power. The failure of DRAM to keep up with increasing processor speeds creates a two-pronged bottleneck for overall system energy efficiency. This paper presents a high-performance, autonomic power instrumentation setup to measure energy consumption in computing systems and accurately attribute energy to processors and components of the memory hierarchy. We provide a set of carefully engineered microbenchmarks that reveal the energy efficiency under different memory access patterns and stress the importance of minimizing costly data transfers that involve multiple levels of the system's memory hierarchy. Lastly, we present BTL (Bottom Line), a processor-specific model for deriving lower bounds of energy consumption. BTL predicts the minimum dynamic energy consumption for any workload, thus uncovering opportunities for energy optimization.
Citations: 14
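As a rough illustration of the lower-bound idea in this abstract (not the authors' BTL model), the sketch below charges every level of the memory hierarchy a minimum dynamic energy per access; the level names, energy coefficients, and access counts are hypothetical placeholders, not measured values from the paper.

```cpp
// Illustrative lower-bound dynamic-energy estimate: each memory level is
// charged a fixed minimum energy per access. All coefficients are
// hypothetical placeholders.
#include <cstdint>
#include <cstdio>

struct LevelProfile {
    const char* name;
    double min_energy_nj;   // assumed minimum dynamic energy per access (nJ)
    uint64_t accesses;      // accesses observed for the workload (e.g. via counters)
};

// Lower bound: no access can cost less than the per-level minimum.
double lower_bound_dynamic_energy_joules(const LevelProfile* levels, int n) {
    double total_nj = 0.0;
    for (int i = 0; i < n; ++i)
        total_nj += levels[i].min_energy_nj * static_cast<double>(levels[i].accesses);
    return total_nj * 1e-9;
}

int main() {
    LevelProfile levels[] = {
        {"L1",   0.5, 800000000ULL},   // placeholder numbers
        {"L2",   2.0, 120000000ULL},
        {"L3",   8.0,  30000000ULL},
        {"DRAM", 60.0,  5000000ULL},
    };
    std::printf("lower bound: %.3f J\n",
                lower_bound_dynamic_energy_joules(levels, 4));
}
```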
Efficient Sorting on the Tilera Manycore Architecture
Alessandro Morari, Antonino Tumeo, Oreste Villa, Simone Secchi, M. Valero
DOI: 10.1109/SBAC-PAD.2012.41 (https://doi.org/10.1109/SBAC-PAD.2012.41) | Published: 2012-10-24
Abstract: We present an efficient implementation of the radix sort algorithm for the Tilera TILEPro64 processor. The TILEPro64 is one of the first successful commercial manycore processors. It is composed of 64 tiles interconnected through multiple fast networks-on-chip and features a fully coherent, shared distributed cache. The architecture has a large degree of flexibility and allows various optimization strategies. We describe how we mapped the algorithm to this architecture. We present an in-depth analysis of the optimizations for each phase of the algorithm with respect to the processor's sustained performance. We discuss the overall throughput reached by our radix sort implementation (up to 132 MK/s) and show that it provides comparable or better performance-per-watt with respect to state-of-the-art implementations on x86 processors and graphics processing units.
Citations: 12
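A minimal shared-memory LSD radix sort in the spirit of this abstract, with per-thread digit histograms and a prefix sum over (digit, thread) counts; this is a generic std::thread sketch with assumed parameters (8-bit digits, four passes over 32-bit keys), not the TILEPro64-specific implementation described in the paper.

```cpp
// Generic parallel LSD radix sort sketch with per-thread digit histograms.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

void radix_sort(std::vector<uint32_t>& keys, unsigned num_threads = 4) {
    const unsigned RADIX = 256, PASSES = 4;           // 8-bit digits, 32-bit keys
    std::vector<uint32_t> buf(keys.size());
    for (unsigned pass = 0; pass < PASSES; ++pass) {
        const unsigned shift = pass * 8;
        // 1) per-thread histograms over disjoint chunks of the input
        std::vector<std::vector<size_t>> hist(num_threads, std::vector<size_t>(RADIX, 0));
        size_t chunk = (keys.size() + num_threads - 1) / num_threads;
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < num_threads; ++t)
            workers.emplace_back([&, t] {
                size_t lo = t * chunk, hi = std::min(keys.size(), lo + chunk);
                for (size_t i = lo; i < hi; ++i)
                    ++hist[t][(keys[i] >> shift) & 0xFF];
            });
        for (auto& w : workers) w.join();
        // 2) exclusive prefix sum over (digit, thread) gives each thread its
        //    starting write offset for every digit value
        std::vector<std::vector<size_t>> offset(num_threads, std::vector<size_t>(RADIX));
        size_t running = 0;
        for (unsigned d = 0; d < RADIX; ++d)
            for (unsigned t = 0; t < num_threads; ++t) {
                offset[t][d] = running;
                running += hist[t][d];
            }
        // 3) scatter: each thread writes its own chunk to stable positions
        workers.clear();
        for (unsigned t = 0; t < num_threads; ++t)
            workers.emplace_back([&, t] {
                size_t lo = t * chunk, hi = std::min(keys.size(), lo + chunk);
                auto pos = offset[t];                  // private copy of offsets
                for (size_t i = lo; i < hi; ++i)
                    buf[pos[(keys[i] >> shift) & 0xFF]++] = keys[i];
            });
        for (auto& w : workers) w.join();
        keys.swap(buf);
    }
}

int main() {
    std::vector<uint32_t> v = {42, 7, 3000000000u, 19, 7};
    radix_sort(v);
    for (uint32_t x : v) std::printf("%u ", x);
    std::printf("\n");
}
```

Ordering the prefix sum digit-major and thread-minor keeps the sort stable, since lower-numbered threads (earlier input positions) write before higher-numbered ones for the same digit.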
Beyond CPU Frequency Scaling for a Fine-grained Energy Control of HPC Systems
Ghislain Landry Tsafack Chetsa, L. Lefèvre, J. Pierson, P. Stolf, Georges Da Costa
DOI: 10.1109/SBAC-PAD.2012.32 (https://doi.org/10.1109/SBAC-PAD.2012.32) | Published: 2012-10-24
Abstract: Modern high-performance computing (HPC) subsystems - including processor, network, memory, and I/O - are provided with power management mechanisms. These include dynamic speed scaling and dynamic resource sleeping. Understanding the behavioral patterns of high-performance computing systems at runtime can lead to a multitude of optimization opportunities, including controlling and limiting their energy usage. In this paper, we present a general-purpose methodology for optimizing the energy performance of HPC systems, considering processor, disk, and network. We rely on the concept of an execution vector along with a partial phase recognition technique for on-the-fly dynamic management without any a priori knowledge of the workload. We demonstrate the effectiveness of our management policy under two real-life workloads. Experimental results show that our management policy, in comparison with baseline unmanaged execution, saves up to 24% of energy with less than 4% performance overhead for our real-life workloads.
Citations: 19
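One way to picture the "execution vector" and phase recognition mentioned in the abstract is as a vector of per-interval counter readings compared against the previous interval; the sketch below flags a phase change with a simple Manhattan-distance threshold, which is an illustrative assumption rather than the authors' actual recognition technique or policy.

```cpp
// Toy phase detector: an "execution vector" holds normalized per-interval
// activity for CPU, memory, disk, and network. A new phase is flagged when the
// distance to the previous interval exceeds a threshold; a runtime could then
// reconfigure speeds or sleep idle resources. Threshold and contents are
// illustrative assumptions.
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdio>

using ExecVector = std::array<double, 4>;   // {cpu, mem, disk, net}, each in [0,1]

double distance(const ExecVector& a, const ExecVector& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) d += std::fabs(a[i] - b[i]);
    return d;
}

struct PhaseDetector {
    ExecVector last{};
    double threshold = 0.5;    // assumed sensitivity
    bool first = true;

    // Returns true when the workload appears to have entered a new phase.
    bool update(const ExecVector& current) {
        bool changed = !first && distance(current, last) > threshold;
        last = current;
        first = false;
        return changed;
    }
};

int main() {
    PhaseDetector det;
    ExecVector samples[] = {{0.90, 0.20, 0.00, 0.10},
                            {0.88, 0.25, 0.00, 0.10},
                            {0.20, 0.10, 0.80, 0.05}};   // compute phase, then I/O phase
    for (const auto& s : samples)
        std::printf("new phase: %s\n", det.update(s) ? "yes" : "no");
}
```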
Divergence Analysis with Affine Constraints
Diogo Sampaio, R. M. Souza, Caroline Collange, Fernando Magno Quintão Pereira
DOI: 10.1109/SBAC-PAD.2012.22 (https://doi.org/10.1109/SBAC-PAD.2012.22) | Published: 2012-10-24
Abstract: The rising popularity of graphics processing units is bringing renewed interest in code optimization techniques for SIMD processors. Many of these optimizations rely on divergence analyses, which classify variables as uniform, if they have the same value on every thread, or divergent, if they might not. This paper introduces a new kind of divergence analysis that is able to represent variables as affine functions of thread identifiers. We have implemented this analysis in Ocelot, an open-source compiler, and use it to analyze a suite of 177 CUDA kernels from well-known benchmarks. We can mark about one fourth of all program variables as affine functions of thread identifiers. In addition to the novel divergence analysis, we also introduce the notion of a divergence-aware register allocator. This allocator uses information from our analysis to either rematerialize affine variables or to move uniform variables to shared memory. As a testament to its effectiveness, our divergence-aware allocator produces GPU code that is 29.70% faster than the code produced by Ocelot's register allocator. Divergence analysis with affine constraints has been publicly available in the Ocelot compiler since June 2012.
Citations: 15
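To make "affine function of the thread identifier" concrete, the sketch below tracks each value as a*tid + b when possible and marks it divergent otherwise, propagating that information through additions and multiplications; it is a simplified stand-in for the transfer functions of a real divergence analysis, not the Ocelot implementation.

```cpp
// Simplified abstract domain for divergence analysis with affine constraints:
// a value is tracked as a*tid + b when possible, otherwise marked divergent.
// A value is "uniform" when a == 0. Requires C++17 for std::optional.
#include <cstdio>
#include <optional>

struct Affine { long a, b; };             // value = a*tid + b
using AbsVal = std::optional<Affine>;     // nullopt == divergent

AbsVal add(AbsVal x, AbsVal y) {
    if (!x || !y) return std::nullopt;
    return Affine{x->a + y->a, x->b + y->b};
}

AbsVal mul(AbsVal x, AbsVal y) {
    if (!x || !y) return std::nullopt;
    if (x->a == 0) return Affine{x->b * y->a, x->b * y->b};   // uniform * affine
    if (y->a == 0) return Affine{x->a * y->b, x->b * y->b};   // affine * uniform
    return std::nullopt;                                      // tid*tid is not affine in tid
}

int main() {
    AbsVal tid  = Affine{1, 0};                         // the thread identifier itself
    AbsVal four = Affine{0, 4};                         // a uniform constant
    AbsVal idx  = add(mul(four, tid), Affine{0, 128});  // 4*tid + 128
    if (idx) std::printf("idx is affine: %ld*tid + %ld\n", idx->a, idx->b);
    AbsVal sq = mul(tid, tid);                          // divergent (non-affine)
    std::printf("tid*tid affine? %s\n", sq ? "yes" : "no");
}
```

An allocator like the one described could rematerialize `idx` from `tid` instead of spilling it, since its value is a cheap affine expression of the thread identifier.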
Exploiting Concurrent GPU Operations for Efficient Work Stealing on Multi-GPUs
J. F. Lima, T. Gautier, N. Maillard, Vincent Danjean
DOI: 10.1109/SBAC-PAD.2012.28 (https://doi.org/10.1109/SBAC-PAD.2012.28) | Published: 2012-10-24
Abstract: The race for Exascale computing has naturally led current technologies to converge towards multi-CPU/multi-GPU computers, based on thousands of CPUs and GPUs interconnected by PCI-Express buses or interconnection networks. To exploit this high computing power, programmers have to solve the issue of scheduling parallel programs on hybrid architectures. And, since the performance of a GPU increases at a much faster rate than the throughput of a PCI bus, data transfers must be managed efficiently by the scheduler. This paper targets multi-GPU compute nodes, where several GPUs are connected to the same machine. To overcome the data transfer limitations on such platforms, the available software computes, usually before execution, a mapping of the tasks that respects their dependencies and minimizes the global data transfers. Such an approach is too rigid and cannot adapt the execution to possible variations of the system or of the application's load. We propose a solution that is orthogonal to those mentioned above: extensions of the Xkaapi software stack that make it possible to exploit the full performance of a multi-GPU system through asynchronous GPU tasks. Xkaapi schedules tasks using a standard work-stealing algorithm, and the runtime efficiently exploits concurrent GPU operations. The runtime extensions make it possible to overlap data transfers with task execution on the current generation of GPUs. We demonstrate that this overlapping capability is at least as important as computing a scheduling decision for reducing the completion time of a parallel program. Our experiments on two dense linear algebra problems (matrix product and Cholesky factorization) show that our solution is highly competitive with other software based on static scheduling. Moreover, we are able to sustain peak performance (approx. 310 GFlop/s) on DGEMM, even for matrices that cannot be stored entirely in one GPU's memory. With eight GPUs, we achieve a speed-up of 6.74 with respect to a single GPU. The performance of our Cholesky factorization, with more complex dependencies between tasks, outperforms the state-of-the-art single-GPU MAGMA code.
Citations: 29
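The work-stealing half of this design can be pictured as one deque per worker, popped from the back by its owner and stolen from the front by idle workers; the sketch below is a mutex-protected toy version under those assumptions, not Xkaapi's runtime, and it does not model the GPU transfer/compute overlap that the paper adds on top.

```cpp
// Minimal work-stealing scheduler sketch: each worker owns a deque of tasks,
// pops from the back of its own deque, and steals from the front of a random
// victim's deque when idle. Mutex-protected for brevity; real runtimes use
// lock-free deques and, per the paper, overlap GPU transfers with execution.
#include <atomic>
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <random>
#include <thread>
#include <vector>

struct WorkerQueue {
    std::deque<std::function<void()>> tasks;
    std::mutex m;
    void push(std::function<void()> t) {
        std::lock_guard<std::mutex> g(m);
        tasks.push_back(std::move(t));
    }
    bool pop_local(std::function<void()>& t) {    // owner end: LIFO
        std::lock_guard<std::mutex> g(m);
        if (tasks.empty()) return false;
        t = std::move(tasks.back()); tasks.pop_back(); return true;
    }
    bool steal(std::function<void()>& t) {        // thief end: FIFO
        std::lock_guard<std::mutex> g(m);
        if (tasks.empty()) return false;
        t = std::move(tasks.front()); tasks.pop_front(); return true;
    }
};

// One scheduling step: run local work if any, otherwise try a random victim.
bool schedule_step(std::vector<WorkerQueue>& q, std::size_t self, std::mt19937& rng) {
    std::function<void()> task;
    if (q[self].pop_local(task)) { task(); return true; }
    std::size_t victim = rng() % q.size();
    if (victim != self && q[victim].steal(task)) { task(); return true; }
    return false;
}

int main() {
    const int kTasks = 1000;
    std::vector<WorkerQueue> queues(4);
    std::atomic<int> done{0};
    for (int i = 0; i < kTasks; ++i)
        queues[i % queues.size()].push([&done] { done.fetch_add(1); });
    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < queues.size(); ++w)
        workers.emplace_back([&, w] {
            std::mt19937 rng(static_cast<unsigned>(w));
            while (done.load() < kTasks) schedule_step(queues, w, rng);
        });
    for (auto& t : workers) t.join();
    std::printf("completed %d tasks\n", done.load());
}
```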
Data and Instruction Uniformity in Minimal Multi-threading
Teo Milanez, Caroline Collange, Fernando Magno Quintão Pereira, Wagner Meira Jr, R. Ferreira
DOI: 10.1109/SBAC-PAD.2012.21 (https://doi.org/10.1109/SBAC-PAD.2012.21) | Published: 2012-10-24
Abstract: Simultaneous Multi-Threading (SMT) is a hardware model in which different threads share the same instruction fetching unit. This model is a compromise between high parallelism and low hardware cost. Minimal Multi-Threading (MMT) is a technique recently proposed to share instructions and execution between threads in an SMT machine. In this paper we propose new ways to explore redundancies in the MMT execution model. First, we propose and evaluate a new thread reconvergence heuristic that handles function calls better than previous approaches. Second, we demonstrate the existence of substantial regularity in inter-thread memory access patterns. We validate our results on the four data-parallel applications present in the PARSEC benchmark suite. The new thread reconvergence heuristic is, on average, 82% more efficient than MMT's original reconvergence method. Furthermore, about 69% to 87% of all memory addresses are either the same for all threads or affine expressions of the thread identifier. This observation motivates the design of newly proposed hardware that benefits from regularity in inter-thread memory accesses.
Citations: 3
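The 69% to 87% figure refers to addresses that are either identical across threads or affine in the thread identifier; the sketch below classifies one access site given the addresses issued by each thread in a group. It is an illustrative software check under that definition, not the hardware proposed in the paper.

```cpp
// Classify the addresses issued by a group of threads at one access site:
// "Same" if all threads touch one address, "Affine" if addr(t) = base + t*stride,
// otherwise "Irregular".
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

enum class Pattern { Same, Affine, Irregular };

Pattern classify(const std::vector<uintptr_t>& addr) {   // addr[t] for thread t
    if (addr.size() < 2) return Pattern::Same;
    const intptr_t stride =
        static_cast<intptr_t>(addr[1]) - static_cast<intptr_t>(addr[0]);
    for (std::size_t t = 1; t < addr.size(); ++t) {
        intptr_t d = static_cast<intptr_t>(addr[t]) - static_cast<intptr_t>(addr[t - 1]);
        if (d != stride) return Pattern::Irregular;       // not a single affine pattern
    }
    return stride == 0 ? Pattern::Same : Pattern::Affine;
}

int main() {
    std::vector<uintptr_t> strided = {0x1000, 0x1008, 0x1010, 0x1018};  // base + 8*t
    std::vector<uintptr_t> shared  = {0x2000, 0x2000, 0x2000, 0x2000};  // uniform address
    std::printf("strided affine? %d\n", classify(strided) == Pattern::Affine);
    std::printf("shared same?    %d\n", classify(shared)  == Pattern::Same);
}
```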
ACCGen: An Automatic ArchC Compiler Generator
R. Auler, P. Centoducatte, E. Borin
DOI: 10.1109/SBAC-PAD.2012.33 (https://doi.org/10.1109/SBAC-PAD.2012.33) | Published: 2012-10-24
Abstract: The current level of circuit integration has led to complex designs encompassing full systems on a single chip, known as Systems-on-a-Chip (SoCs). In order to predict the best design options and reduce design costs, designers are required to perform a large design space exploration in the early stages of the design. To speed up this process, Electronic Design Automation (EDA) tools are employed to model and experiment with the system. ArchC is an Architecture Description Language (ADL) and a set of tools that can be leveraged to automatically build SoC simulators based on high-level system models, enabling easy and fast design space exploration in early stages of the design. Currently, ArchC is capable of automatically generating hardware simulators, assemblers, and linkers for a given architecture model. In this work, we present ACCGen, an automatic compiler generator for ArchC and the missing link in the automatic generation of compiler tool chains for ArchC. Our experimental results show that compilers generated by ACCGen are correct for MiBench applications. They also compare the quality of the generated code with that of LLVM and gcc, two well-known open-source compilers. We also show that ACCGen is fast and has little impact on the design space exploration turnaround time, allowing the designer, using an easy and fully automated workflow, to completely assess the outcome of architectural changes in less than 2 minutes.
Citations: 7
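At its core, generating a compiler back end from an architecture description means turning that description into instruction-selection and emission tables; the toy emitter below drives assembly output from a small table for a hypothetical three-address ISA. The mnemonics and table format are invented for illustration and are not ArchC or ACCGen syntax.

```cpp
// Toy table-driven instruction emitter: a machine description is reduced to a
// map from generic operations to assembly templates. In a generator like
// ACCGen, such a table would itself be produced from the architecture model.
#include <cstdio>
#include <map>
#include <string>

struct InsnTemplate { std::string pattern; };   // "%d", "%s", "%t" are operand slots

// Hypothetical ISA description fragment (not ArchC syntax).
const std::map<std::string, InsnTemplate> kIsaTable = {
    {"add", {"add  %d, %s, %t"}},
    {"sub", {"sub  %d, %s, %t"}},
    {"mul", {"mul  %d, %s, %t"}},
};

std::string emit(const std::string& op, const std::string& d,
                 const std::string& s, const std::string& t) {
    auto it = kIsaTable.find(op);
    if (it == kIsaTable.end()) return "; unsupported op: " + op;
    std::string out = it->second.pattern;
    auto subst = [&out](const std::string& from, const std::string& to) {
        std::string::size_type p = out.find(from);
        if (p != std::string::npos) out.replace(p, from.size(), to);
    };
    subst("%d", d); subst("%s", s); subst("%t", t);
    return out;
}

int main() {
    std::printf("%s\n", emit("add", "r1", "r2", "r3").c_str());   // add  r1, r2, r3
}
```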
VPC: Scalable, Low Downtime Checkpointing for Virtual Clusters
Peng Lu, B. Ravindran, Changsoo Kim
DOI: 10.1109/SBAC-PAD.2012.31 (https://doi.org/10.1109/SBAC-PAD.2012.31) | Published: 2012-10-24
Abstract: A virtual cluster (VC) consists of multiple virtual machines (VMs) running on different physical hosts, interconnected by a virtual network. A fault-tolerant protocol and mechanism are essential to the VC's availability and usability. We present Virtual Predict Checkpointing (VPC), a lightweight, globally consistent checkpointing mechanism, which checkpoints the VC for immediate restoration after VM failures. By predicting the checkpoint-caused page faults during each checkpointing interval, VPC reduces the solo VM downtime further than traditional incremental checkpointing approaches. In addition, VPC uses a globally consistent checkpointing algorithm, which preserves the global consistency of the VMs' execution and communication states, and only saves the updated memory pages during each checkpointing interval to reduce the entire VC downtime. Our implementation reveals that, compared with past VC checkpointing/migration solutions including VNsnap, VPC reduces the solo VM downtime by as much as 45% under the NPB benchmark, and reduces the entire VC downtime by as much as 50% under the NPB distributed program. Additionally, VPC incurs a memory overhead of no more than 9%. In all cases, VPC's performance overhead is less than 16%.
Citations: 9
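Incremental checkpointing of the kind VPC builds on saves only the pages written since the previous checkpoint; the sketch below tracks dirty pages in a bitmap and copies just those pages. The page size, data structures, and explicit write() call are simplifying assumptions: a VMM would use write protection or hardware dirty bits, and VPC additionally predicts which pages will fault in the next interval, which is not modeled here.

```cpp
// Incremental checkpoint sketch: only pages marked dirty since the previous
// checkpoint are copied into the checkpoint image, then the bitmap is cleared.
#include <cstddef>
#include <map>
#include <vector>

constexpr std::size_t kPageSize = 4096;   // assumed page size

struct GuestMemory {
    std::vector<unsigned char> mem;
    std::vector<bool> dirty;
    explicit GuestMemory(std::size_t pages)
        : mem(pages * kPageSize), dirty(pages, false) {}

    void write(std::size_t addr, unsigned char value) {
        mem[addr] = value;
        dirty[addr / kPageSize] = true;               // record the dirtied page
    }
};

// Copies only dirty pages into the image, then clears the dirty bitmap.
void incremental_checkpoint(GuestMemory& g,
                            std::map<std::size_t, std::vector<unsigned char>>& image) {
    for (std::size_t p = 0; p < g.dirty.size(); ++p) {
        if (!g.dirty[p]) continue;
        image[p].assign(g.mem.begin() + p * kPageSize,
                        g.mem.begin() + (p + 1) * kPageSize);
        g.dirty[p] = false;
    }
}

int main() {
    GuestMemory g(16);                                // 16 pages of guest memory
    std::map<std::size_t, std::vector<unsigned char>> image;
    g.write(5 * kPageSize + 10, 0xAB);                // dirties page 5 only
    incremental_checkpoint(g, image);                 // image now holds page 5
    return image.count(5) == 1 ? 0 : 1;
}
```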
Using Heterogeneous Networks to Improve Energy Efficiency in Direct Coherence Protocols for Many-Core CMPs
Alberto Ros, Ricardo Fernández Pascual, M. Acacio
DOI: 10.1109/SBAC-PAD.2012.23 (https://doi.org/10.1109/SBAC-PAD.2012.23) | Published: 2012-10-24
Abstract: Direct coherence protocols have recently been proposed as an alternative to directory-based protocols for keeping cache coherence in many-core CMPs. Differently from directory-based protocols, in direct coherence the cache responsible for providing the requested data in case of a cache miss (i.e., the owner cache) is also tasked with keeping the updated directory information and serializing the different accesses to the block by all cores. This way, these protocols send requests directly to the owner cache, thus avoiding the indirection caused by accessing a separate directory (usually in the home node). A hints mechanism ensures a high hit rate when predicting the current owner of a block for sending requests, but at the price of significantly increasing network traffic and, consequently, energy consumption. In this work, we show that using a heterogeneous interconnection network composed of two kinds of links is enough to drastically reduce the energy consumed by hint messages, obtaining significant improvements in energy efficiency.
Citations: 2
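The core mechanism is routing hint traffic, which is not on the critical path of a miss, over a second class of lower-power links while latency-critical requests and data stay on the fast links; the tiny sketch below captures that routing decision as a classification function. The message and link classes are illustrative assumptions, not the protocol's actual message types.

```cpp
// Toy link-selection policy for a heterogeneous NoC: owner hints take the
// low-power links, latency-critical coherence traffic takes the fast links.
#include <cstdio>

enum class MsgType { Request, DataReply, OwnershipHint };
enum class LinkClass { Fast, LowPower };

LinkClass select_link(MsgType t) {
    // Hints only update owner predictions, so they tolerate extra latency.
    return t == MsgType::OwnershipHint ? LinkClass::LowPower : LinkClass::Fast;
}

int main() {
    std::printf("hint -> %s\n",
                select_link(MsgType::OwnershipHint) == LinkClass::LowPower
                    ? "low-power link" : "fast link");
    std::printf("request -> %s\n",
                select_link(MsgType::Request) == LinkClass::Fast
                    ? "fast link" : "low-power link");
}
```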
Parallelizing Information Set Generation for Game Tree Search Applications
M. Richards, Abhishek K. Gupta, O. Sarood, L. Kalé
DOI: 10.1109/SBAC-PAD.2012.42 (https://doi.org/10.1109/SBAC-PAD.2012.42) | Published: 2012-10-24
Abstract: Information Set Generation (ISG) is the identification of the set of paths in an imperfect-information game tree that are consistent with a player's observations. The ability to reason about the possible history is critical to the performance of game-playing agents. ISG represents a class of combinatorial search problems that is computationally intensive and challenging to parallelize efficiently. In this paper, we address the parallelization of information set generation in the context of Kriegspiel (partially observable chess). We implement the algorithm on top of a general-purpose combinatorial search engine and discuss its performance using datasets from real game instances in addition to benchmarks. Further, we demonstrate the effect of load balancing strategies, problem sizes, and computational granularity (grain size parameters) on performance. We achieve speedups of over 500 on 1,024 processors, far exceeding previous scalability results for game tree search applications.
Citations: 1
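Grain-size control in parallel combinatorial search usually means expanding small subtrees sequentially and spawning tasks only above a depth or size threshold; the sketch below applies that idea to a generic tree enumeration with std::async. The tree shape, leaf criterion, and cutoff are illustrative assumptions, not the search engine or Kriegspiel-specific logic used in the paper.

```cpp
// Parallel tree enumeration with a grain-size cutoff: subtrees at or below
// `grain_depth` are expanded sequentially; larger subtrees spawn async tasks.
#include <cstdint>
#include <cstdio>
#include <future>
#include <vector>

struct Node { int depth; int branching; };

std::vector<Node> children(const Node& n) {          // stand-in for move generation
    return std::vector<Node>(n.branching, Node{n.depth - 1, n.branching});
}

uint64_t count_leaves_seq(const Node& n) {
    if (n.depth == 0) return 1;                      // leaf, e.g. one consistent path
    uint64_t total = 0;
    for (const Node& c : children(n)) total += count_leaves_seq(c);
    return total;
}

uint64_t count_leaves_par(const Node& n, int grain_depth) {
    if (n.depth <= grain_depth)                      // coarse grain: stay sequential
        return count_leaves_seq(n);
    std::vector<std::future<uint64_t>> futs;
    for (const Node& c : children(n))
        futs.push_back(std::async(std::launch::async, count_leaves_par, c, grain_depth));
    uint64_t total = 0;
    for (auto& f : futs) total += f.get();
    return total;
}

int main() {
    Node root{6, 4};                                 // depth-6 tree, branching factor 4
    std::printf("leaves: %llu\n",
                static_cast<unsigned long long>(count_leaves_par(root, 4)));  // 4^6 = 4096
}
```

Raising the grain-depth cutoff trades parallelism for lower task-management overhead, which is exactly the granularity knob the abstract studies.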