2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing: Latest Publications

BTL: A Framework for Measuring and Modeling Energy in Memory Hierarchies
I. Manousakis, Dimitrios S. Nikolopoulos
DOI: 10.1109/SBAC-PAD.2012.38 (https://doi.org/10.1109/SBAC-PAD.2012.38) | Published: 2012-10-24
Abstract: Understanding the energy efficiency of computing systems is paramount. Although processors remain the dominant energy consumers and the focal target of energy-aware optimization in computing systems, the memory subsystem dissipates substantial amounts of power, which at high densities may exceed 50% of total system power. The failure of DRAM to keep up with increasing processor speeds creates a two-pronged bottleneck for overall system energy efficiency. This paper presents a high-performance, autonomic power instrumentation setup to measure energy consumption in computing systems and accurately attribute energy to processors and components of the memory hierarchy. We provide a set of carefully engineered microbenchmarks that reveal the energy efficiency under different memory access patterns and stress the importance of minimizing costly data transfers that involve multiple levels of the system's memory hierarchy. Lastly, we present BTL (Bottom Line), a processor-specific model for deriving lower bounds of energy consumption. BTL predicts the minimum dynamic energy consumption for any workload, thus uncovering opportunities for energy optimization.
Citations: 14
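As a rough illustration of the lower-bound idea in this abstract (not the authors' BTL model), the sketch below charges every level of the memory hierarchy a minimum dynamic energy per access; the level names, energy coefficients, and access counts are hypothetical placeholders, not measured values from the paper.

```cpp
// Illustrative lower-bound dynamic-energy estimate: each memory level is
// charged a fixed minimum energy per access. All coefficients are
// hypothetical placeholders.
#include <cstdint>
#include <cstdio>

struct LevelProfile {
    const char* name;
    double min_energy_nj;   // assumed minimum dynamic energy per access (nJ)
    uint64_t accesses;      // accesses observed for the workload (e.g. via counters)
};

// Lower bound: no access can cost less than the per-level minimum.
double lower_bound_dynamic_energy_joules(const LevelProfile* levels, int n) {
    double total_nj = 0.0;
    for (int i = 0; i < n; ++i)
        total_nj += levels[i].min_energy_nj * static_cast<double>(levels[i].accesses);
    return total_nj * 1e-9;
}

int main() {
    LevelProfile levels[] = {
        {"L1",   0.5, 800000000ULL},   // placeholder numbers
        {"L2",   2.0, 120000000ULL},
        {"L3",   8.0,  30000000ULL},
        {"DRAM", 60.0,  5000000ULL},
    };
    std::printf("lower bound: %.3f J\n",
                lower_bound_dynamic_energy_joules(levels, 4));
}
```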
Efficient Sorting on the Tilera Manycore Architecture
Alessandro Morari, Antonino Tumeo, Oreste Villa, Simone Secchi, M. Valero
DOI: 10.1109/SBAC-PAD.2012.41 (https://doi.org/10.1109/SBAC-PAD.2012.41) | Published: 2012-10-24
Abstract: We present an efficient implementation of the radix sort algorithm for the Tilera TILEPro64 processor. The TILEPro64 is one of the first successful commercial manycore processors. It is composed of 64 tiles interconnected through multiple fast networks-on-chip and features a fully coherent, shared distributed cache. The architecture has a large degree of flexibility and allows various optimization strategies. We describe how we mapped the algorithm to this architecture. We present an in-depth analysis of the optimizations for each phase of the algorithm with respect to the processor's sustained performance. We discuss the overall throughput reached by our radix sort implementation (up to 132 MK/s) and show that it provides comparable or better performance-per-watt with respect to state-of-the-art implementations on x86 processors and graphics processing units.
Citations: 12
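A minimal shared-memory LSD radix sort in the spirit of this abstract, with per-thread digit histograms and a prefix sum over (digit, thread) counts; this is a generic std::thread sketch with assumed parameters (8-bit digits, four passes over 32-bit keys), not the TILEPro64-specific implementation described in the paper.

```cpp
// Generic parallel LSD radix sort sketch with per-thread digit histograms.
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <thread>
#include <vector>

void radix_sort(std::vector<uint32_t>& keys, unsigned num_threads = 4) {
    const unsigned RADIX = 256, PASSES = 4;           // 8-bit digits, 32-bit keys
    std::vector<uint32_t> buf(keys.size());
    for (unsigned pass = 0; pass < PASSES; ++pass) {
        const unsigned shift = pass * 8;
        // 1) per-thread histograms over disjoint chunks of the input
        std::vector<std::vector<size_t>> hist(num_threads, std::vector<size_t>(RADIX, 0));
        size_t chunk = (keys.size() + num_threads - 1) / num_threads;
        std::vector<std::thread> workers;
        for (unsigned t = 0; t < num_threads; ++t)
            workers.emplace_back([&, t] {
                size_t lo = t * chunk, hi = std::min(keys.size(), lo + chunk);
                for (size_t i = lo; i < hi; ++i)
                    ++hist[t][(keys[i] >> shift) & 0xFF];
            });
        for (auto& w : workers) w.join();
        // 2) exclusive prefix sum over (digit, thread) gives each thread its
        //    starting write offset for every digit value
        std::vector<std::vector<size_t>> offset(num_threads, std::vector<size_t>(RADIX));
        size_t running = 0;
        for (unsigned d = 0; d < RADIX; ++d)
            for (unsigned t = 0; t < num_threads; ++t) {
                offset[t][d] = running;
                running += hist[t][d];
            }
        // 3) scatter: each thread writes its own chunk to stable positions
        workers.clear();
        for (unsigned t = 0; t < num_threads; ++t)
            workers.emplace_back([&, t] {
                size_t lo = t * chunk, hi = std::min(keys.size(), lo + chunk);
                auto pos = offset[t];                  // private copy of offsets
                for (size_t i = lo; i < hi; ++i)
                    buf[pos[(keys[i] >> shift) & 0xFF]++] = keys[i];
            });
        for (auto& w : workers) w.join();
        keys.swap(buf);
    }
}

int main() {
    std::vector<uint32_t> v = {42, 7, 3000000000u, 19, 7};
    radix_sort(v);
    for (uint32_t x : v) std::printf("%u ", x);
    std::printf("\n");
}
```

Ordering the prefix sum digit-major and thread-minor keeps the sort stable, since lower-numbered threads (earlier input positions) write before higher-numbered ones for the same digit.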
Beyond CPU Frequency Scaling for a Fine-grained Energy Control of HPC Systems
Ghislain Landry Tsafack Chetsa, L. Lefèvre, J. Pierson, P. Stolf, Georges Da Costa
DOI: 10.1109/SBAC-PAD.2012.32 (https://doi.org/10.1109/SBAC-PAD.2012.32) | Published: 2012-10-24
Abstract: Modern high-performance computing (HPC) subsystems - including processor, network, memory, and I/O - are provided with power management mechanisms. These include dynamic speed scaling and dynamic resource sleeping. Understanding the behavioral patterns of high-performance computing systems at runtime can lead to a multitude of optimization opportunities, including controlling and limiting their energy usage. In this paper, we present a general-purpose methodology for optimizing the energy performance of HPC systems, considering processor, disk, and network. We rely on the concept of an execution vector along with a partial phase recognition technique for on-the-fly dynamic management without any a priori knowledge of the workload. We demonstrate the effectiveness of our management policy under two real-life workloads. Experimental results show that our management policy, in comparison with baseline unmanaged execution, saves up to 24% of energy with less than 4% performance overhead for our real-life workloads.
Citations: 19
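One way to picture the "execution vector" and phase recognition mentioned in the abstract is as a vector of per-interval counter readings compared against the previous interval; the sketch below flags a phase change with a simple Manhattan-distance threshold, which is an illustrative assumption rather than the authors' actual recognition technique or policy.

```cpp
// Toy phase detector: an "execution vector" holds normalized per-interval
// activity for CPU, memory, disk, and network. A new phase is flagged when the
// distance to the previous interval exceeds a threshold; a runtime could then
// reconfigure speeds or sleep idle resources. Threshold and contents are
// illustrative assumptions.
#include <array>
#include <cmath>
#include <cstddef>
#include <cstdio>

using ExecVector = std::array<double, 4>;   // {cpu, mem, disk, net}, each in [0,1]

double distance(const ExecVector& a, const ExecVector& b) {
    double d = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) d += std::fabs(a[i] - b[i]);
    return d;
}

struct PhaseDetector {
    ExecVector last{};
    double threshold = 0.5;    // assumed sensitivity
    bool first = true;

    // Returns true when the workload appears to have entered a new phase.
    bool update(const ExecVector& current) {
        bool changed = !first && distance(current, last) > threshold;
        last = current;
        first = false;
        return changed;
    }
};

int main() {
    PhaseDetector det;
    ExecVector samples[] = {{0.90, 0.20, 0.00, 0.10},
                            {0.88, 0.25, 0.00, 0.10},
                            {0.20, 0.10, 0.80, 0.05}};   // compute phase, then I/O phase
    for (const auto& s : samples)
        std::printf("new phase: %s\n", det.update(s) ? "yes" : "no");
}
```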
Divergence Analysis with Affine Constraints
Diogo Sampaio, R. M. Souza, Caroline Collange, Fernando Magno Quintão Pereira
DOI: 10.1109/SBAC-PAD.2012.22 (https://doi.org/10.1109/SBAC-PAD.2012.22) | Published: 2012-10-24
Abstract: The rising popularity of graphics processing units is bringing renewed interest in code optimization techniques for SIMD processors. Many of these optimizations rely on divergence analyses, which classify variables as uniform, if they have the same value on every thread, or divergent, if they might not. This paper introduces a new kind of divergence analysis that is able to represent variables as affine functions of thread identifiers. We have implemented this analysis in Ocelot, an open-source compiler, and use it to analyze a suite of 177 CUDA kernels from well-known benchmarks. We can mark about one fourth of all program variables as affine functions of thread identifiers. In addition to the novel divergence analysis, we also introduce the notion of a divergence-aware register allocator. This allocator uses information from our analysis to either rematerialize affine variables or to move uniform variables to shared memory. As a testament to its effectiveness, our divergence-aware allocator produces GPU code that is 29.70% faster than the code produced by Ocelot's register allocator. Divergence analysis with affine constraints has been publicly available in the Ocelot compiler since June 2012.
Citations: 15
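To make "affine function of the thread identifier" concrete, the sketch below tracks each value as a*tid + b when possible and marks it divergent otherwise, propagating that information through additions and multiplications; it is a simplified stand-in for the transfer functions of a real divergence analysis, not the Ocelot implementation.

```cpp
// Simplified abstract domain for divergence analysis with affine constraints:
// a value is tracked as a*tid + b when possible, otherwise marked divergent.
// A value is "uniform" when a == 0. Requires C++17 for std::optional.
#include <cstdio>
#include <optional>

struct Affine { long a, b; };             // value = a*tid + b
using AbsVal = std::optional<Affine>;     // nullopt == divergent

AbsVal add(AbsVal x, AbsVal y) {
    if (!x || !y) return std::nullopt;
    return Affine{x->a + y->a, x->b + y->b};
}

AbsVal mul(AbsVal x, AbsVal y) {
    if (!x || !y) return std::nullopt;
    if (x->a == 0) return Affine{x->b * y->a, x->b * y->b};   // uniform * affine
    if (y->a == 0) return Affine{x->a * y->b, x->b * y->b};   // affine * uniform
    return std::nullopt;                                      // tid*tid is not affine in tid
}

int main() {
    AbsVal tid  = Affine{1, 0};                         // the thread identifier itself
    AbsVal four = Affine{0, 4};                         // a uniform constant
    AbsVal idx  = add(mul(four, tid), Affine{0, 128});  // 4*tid + 128
    if (idx) std::printf("idx is affine: %ld*tid + %ld\n", idx->a, idx->b);
    AbsVal sq = mul(tid, tid);                          // divergent (non-affine)
    std::printf("tid*tid affine? %s\n", sq ? "yes" : "no");
}
```

An allocator like the one described could rematerialize `idx` from `tid` instead of spilling it, since its value is a cheap affine expression of the thread identifier.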
Exploiting Concurrent GPU Operations for Efficient Work Stealing on Multi-GPUs
J. F. Lima, T. Gautier, N. Maillard, Vincent Danjean
DOI: 10.1109/SBAC-PAD.2012.28 (https://doi.org/10.1109/SBAC-PAD.2012.28) | Published: 2012-10-24
Abstract: The race for Exascale computing has naturally led current technologies to converge towards multi-CPU/multi-GPU computers, based on thousands of CPUs and GPUs interconnected by PCI-Express buses or interconnection networks. To exploit this high computing power, programmers have to solve the issue of scheduling parallel programs on hybrid architectures. And, since the performance of a GPU increases at a much faster rate than the throughput of a PCI bus, data transfers must be managed efficiently by the scheduler. This paper targets multi-GPU compute nodes, where several GPUs are connected to the same machine. To overcome the data transfer limitations on such platforms, the available software computes, usually before execution, a mapping of the tasks that respects their dependencies and minimizes the global data transfers. Such an approach is too rigid and cannot adapt the execution to possible variations of the system or of the application's load. We propose a solution that is orthogonal to those mentioned above: extensions of the Xkaapi software stack that make it possible to exploit the full performance of a multi-GPU system through asynchronous GPU tasks. Xkaapi schedules tasks using a standard work-stealing algorithm, and the runtime efficiently exploits concurrent GPU operations. The runtime extensions make it possible to overlap data transfers with task execution on the current generation of GPUs. We demonstrate that this overlapping capability is at least as important as computing a scheduling decision for reducing the completion time of a parallel program. Our experiments on two dense linear algebra problems (matrix product and Cholesky factorization) show that our solution is highly competitive with other software based on static scheduling. Moreover, we are able to sustain peak performance (approx. 310 GFlop/s) on DGEMM, even for matrices that cannot be stored entirely in one GPU's memory. With eight GPUs, we achieve a speed-up of 6.74 with respect to a single GPU. The performance of our Cholesky factorization, with more complex dependencies between tasks, outperforms the state-of-the-art single-GPU MAGMA code.
Citations: 29
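The work-stealing half of this design can be pictured as one deque per worker, popped from the back by its owner and stolen from the front by idle workers; the sketch below is a mutex-protected toy version under those assumptions, not Xkaapi's runtime, and it does not model the GPU transfer/compute overlap that the paper adds on top.

```cpp
// Minimal work-stealing scheduler sketch: each worker owns a deque of tasks,
// pops from the back of its own deque, and steals from the front of a random
// victim's deque when idle. Mutex-protected for brevity; real runtimes use
// lock-free deques and, per the paper, overlap GPU transfers with execution.
#include <atomic>
#include <cstdio>
#include <deque>
#include <functional>
#include <mutex>
#include <random>
#include <thread>
#include <vector>

struct WorkerQueue {
    std::deque<std::function<void()>> tasks;
    std::mutex m;
    void push(std::function<void()> t) {
        std::lock_guard<std::mutex> g(m);
        tasks.push_back(std::move(t));
    }
    bool pop_local(std::function<void()>& t) {    // owner end: LIFO
        std::lock_guard<std::mutex> g(m);
        if (tasks.empty()) return false;
        t = std::move(tasks.back()); tasks.pop_back(); return true;
    }
    bool steal(std::function<void()>& t) {        // thief end: FIFO
        std::lock_guard<std::mutex> g(m);
        if (tasks.empty()) return false;
        t = std::move(tasks.front()); tasks.pop_front(); return true;
    }
};

// One scheduling step: run local work if any, otherwise try a random victim.
bool schedule_step(std::vector<WorkerQueue>& q, std::size_t self, std::mt19937& rng) {
    std::function<void()> task;
    if (q[self].pop_local(task)) { task(); return true; }
    std::size_t victim = rng() % q.size();
    if (victim != self && q[victim].steal(task)) { task(); return true; }
    return false;
}

int main() {
    const int kTasks = 1000;
    std::vector<WorkerQueue> queues(4);
    std::atomic<int> done{0};
    for (int i = 0; i < kTasks; ++i)
        queues[i % queues.size()].push([&done] { done.fetch_add(1); });
    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < queues.size(); ++w)
        workers.emplace_back([&, w] {
            std::mt19937 rng(static_cast<unsigned>(w));
            while (done.load() < kTasks) schedule_step(queues, w, rng);
        });
    for (auto& t : workers) t.join();
    std::printf("completed %d tasks\n", done.load());
}
```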
Data and Instruction Uniformity in Minimal Multi-threading
Teo Milanez, Caroline Collange, Fernando Magno Quintão Pereira, Wagner Meira Jr, R. Ferreira
DOI: 10.1109/SBAC-PAD.2012.21 (https://doi.org/10.1109/SBAC-PAD.2012.21) | Published: 2012-10-24
Abstract: Simultaneous Multi-Threading (SMT) is a hardware model in which different threads share the same instruction fetching unit. This model is a compromise between high parallelism and low hardware cost. Minimal Multi-Threading (MMT) is a technique recently proposed to share instructions and execution between threads in an SMT machine. In this paper we propose new ways to explore redundancies in the MMT execution model. First, we propose and evaluate a new thread reconvergence heuristic that handles function calls better than previous approaches. Second, we demonstrate the existence of substantial regularity in inter-thread memory access patterns. We validate our results on the four data-parallel applications present in the PARSEC benchmark suite. The new thread reconvergence heuristic is, on average, 82% more efficient than MMT's original reconvergence method. Furthermore, about 69% to 87% of all memory addresses are either the same for all threads or affine expressions of the thread identifier. This observation motivates the design of newly proposed hardware that benefits from regularity in inter-thread memory accesses.
Citations: 3
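The 69% to 87% figure refers to addresses that are either identical across threads or affine in the thread identifier; the sketch below classifies one access site given the addresses issued by each thread in a group. It is an illustrative software check under that definition, not the hardware proposed in the paper.

```cpp
// Classify the addresses issued by a group of threads at one access site:
// "Same" if all threads touch one address, "Affine" if addr(t) = base + t*stride,
// otherwise "Irregular".
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

enum class Pattern { Same, Affine, Irregular };

Pattern classify(const std::vector<uintptr_t>& addr) {   // addr[t] for thread t
    if (addr.size() < 2) return Pattern::Same;
    const intptr_t stride =
        static_cast<intptr_t>(addr[1]) - static_cast<intptr_t>(addr[0]);
    for (std::size_t t = 1; t < addr.size(); ++t) {
        intptr_t d = static_cast<intptr_t>(addr[t]) - static_cast<intptr_t>(addr[t - 1]);
        if (d != stride) return Pattern::Irregular;       // not a single affine pattern
    }
    return stride == 0 ? Pattern::Same : Pattern::Affine;
}

int main() {
    std::vector<uintptr_t> strided = {0x1000, 0x1008, 0x1010, 0x1018};  // base + 8*t
    std::vector<uintptr_t> shared  = {0x2000, 0x2000, 0x2000, 0x2000};  // uniform address
    std::printf("strided affine? %d\n", classify(strided) == Pattern::Affine);
    std::printf("shared same?    %d\n", classify(shared)  == Pattern::Same);
}
```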
ACCGen: An Automatic ArchC Compiler Generator
R. Auler, P. Centoducatte, E. Borin
DOI: 10.1109/SBAC-PAD.2012.33 (https://doi.org/10.1109/SBAC-PAD.2012.33) | Published: 2012-10-24
Abstract: The current level of circuit integration has led to complex designs encompassing full systems on a single chip, known as Systems-on-a-Chip (SoCs). In order to predict the best design options and reduce design costs, designers are required to perform a large design space exploration in the early stages of the design. To speed up this process, Electronic Design Automation (EDA) tools are employed to model and experiment with the system. ArchC is an Architecture Description Language (ADL) and a set of tools that can be leveraged to automatically build SoC simulators based on high-level system models, enabling easy and fast design space exploration in early stages of the design. Currently, ArchC is capable of automatically generating hardware simulators, assemblers, and linkers for a given architecture model. In this work, we present ACCGen, an automatic compiler generator for ArchC and the missing link in the automatic generation of compiler tool chains for ArchC. Our experimental results show that compilers generated by ACCGen are correct for MiBench applications. They also compare the quality of the generated code with that of LLVM and gcc, two well-known open-source compilers. We also show that ACCGen is fast and has little impact on the design space exploration turnaround time, allowing the designer, using an easy and fully automated workflow, to completely assess the outcome of architectural changes in less than 2 minutes.
Citations: 7
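At its core, generating a compiler back end from an architecture description means turning that description into instruction-selection and emission tables; the toy emitter below drives assembly output from a small table for a hypothetical three-address ISA. The mnemonics and table format are invented for illustration and are not ArchC or ACCGen syntax.

```cpp
// Toy table-driven instruction emitter: a machine description is reduced to a
// map from generic operations to assembly templates. In a generator like
// ACCGen, such a table would itself be produced from the architecture model.
#include <cstdio>
#include <map>
#include <string>

struct InsnTemplate { std::string pattern; };   // "%d", "%s", "%t" are operand slots

// Hypothetical ISA description fragment (not ArchC syntax).
const std::map<std::string, InsnTemplate> kIsaTable = {
    {"add", {"add  %d, %s, %t"}},
    {"sub", {"sub  %d, %s, %t"}},
    {"mul", {"mul  %d, %s, %t"}},
};

std::string emit(const std::string& op, const std::string& d,
                 const std::string& s, const std::string& t) {
    auto it = kIsaTable.find(op);
    if (it == kIsaTable.end()) return "; unsupported op: " + op;
    std::string out = it->second.pattern;
    auto subst = [&out](const std::string& from, const std::string& to) {
        std::string::size_type p = out.find(from);
        if (p != std::string::npos) out.replace(p, from.size(), to);
    };
    subst("%d", d); subst("%s", s); subst("%t", t);
    return out;
}

int main() {
    std::printf("%s\n", emit("add", "r1", "r2", "r3").c_str());   // add  r1, r2, r3
}
```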
VPC: Scalable, Low Downtime Checkpointing for Virtual Clusters
Peng Lu, B. Ravindran, Changsoo Kim
DOI: 10.1109/SBAC-PAD.2012.31 (https://doi.org/10.1109/SBAC-PAD.2012.31) | Published: 2012-10-24
Abstract: A virtual cluster (VC) consists of multiple virtual machines (VMs) running on different physical hosts, interconnected by a virtual network. A fault-tolerant protocol and mechanism are essential to the VC's availability and usability. We present Virtual Predict Checkpointing (VPC), a lightweight, globally consistent checkpointing mechanism, which checkpoints the VC for immediate restoration after VM failures. By predicting the checkpoint-caused page faults during each checkpointing interval, VPC reduces the solo VM downtime further than traditional incremental checkpointing approaches. In addition, VPC uses a globally consistent checkpointing algorithm, which preserves the global consistency of the VMs' execution and communication states, and only saves the updated memory pages during each checkpointing interval to reduce the entire VC downtime. Our implementation reveals that, compared with past VC checkpointing/migration solutions including VNsnap, VPC reduces the solo VM downtime by as much as 45% under the NPB benchmark, and reduces the entire VC downtime by as much as 50% under the NPB distributed program. Additionally, VPC incurs a memory overhead of no more than 9%. In all cases, VPC's performance overhead is less than 16%.
Citations: 9
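Incremental checkpointing of the kind VPC builds on saves only the pages written since the previous checkpoint; the sketch below tracks dirty pages in a bitmap and copies just those pages. The page size, data structures, and explicit write() call are simplifying assumptions: a VMM would use write protection or hardware dirty bits, and VPC additionally predicts which pages will fault in the next interval, which is not modeled here.

```cpp
// Incremental checkpoint sketch: only pages marked dirty since the previous
// checkpoint are copied into the checkpoint image, then the bitmap is cleared.
#include <cstddef>
#include <map>
#include <vector>

constexpr std::size_t kPageSize = 4096;   // assumed page size

struct GuestMemory {
    std::vector<unsigned char> mem;
    std::vector<bool> dirty;
    explicit GuestMemory(std::size_t pages)
        : mem(pages * kPageSize), dirty(pages, false) {}

    void write(std::size_t addr, unsigned char value) {
        mem[addr] = value;
        dirty[addr / kPageSize] = true;               // record the dirtied page
    }
};

// Copies only dirty pages into the image, then clears the dirty bitmap.
void incremental_checkpoint(GuestMemory& g,
                            std::map<std::size_t, std::vector<unsigned char>>& image) {
    for (std::size_t p = 0; p < g.dirty.size(); ++p) {
        if (!g.dirty[p]) continue;
        image[p].assign(g.mem.begin() + p * kPageSize,
                        g.mem.begin() + (p + 1) * kPageSize);
        g.dirty[p] = false;
    }
}

int main() {
    GuestMemory g(16);                                // 16 pages of guest memory
    std::map<std::size_t, std::vector<unsigned char>> image;
    g.write(5 * kPageSize + 10, 0xAB);                // dirties page 5 only
    incremental_checkpoint(g, image);                 // image now holds page 5
    return image.count(5) == 1 ? 0 : 1;
}
```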
Using Heterogeneous Networks to Improve Energy Efficiency in Direct Coherence Protocols for Many-Core CMPs
Alberto Ros, Ricardo Fernández Pascual, M. Acacio
DOI: 10.1109/SBAC-PAD.2012.23 (https://doi.org/10.1109/SBAC-PAD.2012.23) | Published: 2012-10-24
Abstract: Direct coherence protocols have recently been proposed as an alternative to directory-based protocols for keeping cache coherence in many-core CMPs. Differently from directory-based protocols, in direct coherence the cache responsible for providing the requested data in case of a cache miss (i.e., the owner cache) is also tasked with keeping the updated directory information and serializing the different accesses to the block by all cores. This way, these protocols send requests directly to the owner cache, thus avoiding the indirection caused by accessing a separate directory (usually in the home node). A hints mechanism ensures a high hit rate when predicting the current owner of a block for sending requests, but at the price of significantly increasing network traffic and, consequently, energy consumption. In this work, we show that using a heterogeneous interconnection network composed of two kinds of links is enough to drastically reduce the energy consumed by hint messages, obtaining significant improvements in energy efficiency.
Citations: 2
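The core mechanism is routing hint traffic, which is not on the critical path of a miss, over a second class of lower-power links while latency-critical requests and data stay on the fast links; the tiny sketch below captures that routing decision as a classification function. The message and link classes are illustrative assumptions, not the protocol's actual message types.

```cpp
// Toy link-selection policy for a heterogeneous NoC: owner hints take the
// low-power links, latency-critical coherence traffic takes the fast links.
#include <cstdio>

enum class MsgType { Request, DataReply, OwnershipHint };
enum class LinkClass { Fast, LowPower };

LinkClass select_link(MsgType t) {
    // Hints only update owner predictions, so they tolerate extra latency.
    return t == MsgType::OwnershipHint ? LinkClass::LowPower : LinkClass::Fast;
}

int main() {
    std::printf("hint -> %s\n",
                select_link(MsgType::OwnershipHint) == LinkClass::LowPower
                    ? "low-power link" : "fast link");
    std::printf("request -> %s\n",
                select_link(MsgType::Request) == LinkClass::Fast
                    ? "fast link" : "low-power link");
}
```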
Parallelizing Information Set Generation for Game Tree Search Applications
M. Richards, Abhishek K. Gupta, O. Sarood, L. Kalé
DOI: 10.1109/SBAC-PAD.2012.42 (https://doi.org/10.1109/SBAC-PAD.2012.42) | Published: 2012-10-24
Abstract: Information Set Generation (ISG) is the identification of the set of paths in an imperfect-information game tree that are consistent with a player's observations. The ability to reason about the possible history is critical to the performance of game-playing agents. ISG represents a class of combinatorial search problems that is computationally intensive and challenging to parallelize efficiently. In this paper, we address the parallelization of information set generation in the context of Kriegspiel (partially observable chess). We implement the algorithm on top of a general-purpose combinatorial search engine and discuss its performance using datasets from real game instances in addition to benchmarks. Further, we demonstrate the effect of load balancing strategies, problem sizes, and computational granularity (grain size parameters) on performance. We achieve speedups of over 500 on 1,024 processors, far exceeding previous scalability results for game tree search applications.
Citations: 1
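Grain-size control in parallel combinatorial search usually means expanding small subtrees sequentially and spawning tasks only above a depth or size threshold; the sketch below applies that idea to a generic tree enumeration with std::async. The tree shape, leaf criterion, and cutoff are illustrative assumptions, not the search engine or Kriegspiel-specific logic used in the paper.

```cpp
// Parallel tree enumeration with a grain-size cutoff: subtrees at or below
// `grain_depth` are expanded sequentially; larger subtrees spawn async tasks.
#include <cstdint>
#include <cstdio>
#include <future>
#include <vector>

struct Node { int depth; int branching; };

std::vector<Node> children(const Node& n) {          // stand-in for move generation
    return std::vector<Node>(n.branching, Node{n.depth - 1, n.branching});
}

uint64_t count_leaves_seq(const Node& n) {
    if (n.depth == 0) return 1;                      // leaf, e.g. one consistent path
    uint64_t total = 0;
    for (const Node& c : children(n)) total += count_leaves_seq(c);
    return total;
}

uint64_t count_leaves_par(const Node& n, int grain_depth) {
    if (n.depth <= grain_depth)                      // coarse grain: stay sequential
        return count_leaves_seq(n);
    std::vector<std::future<uint64_t>> futs;
    for (const Node& c : children(n))
        futs.push_back(std::async(std::launch::async, count_leaves_par, c, grain_depth));
    uint64_t total = 0;
    for (auto& f : futs) total += f.get();
    return total;
}

int main() {
    Node root{6, 4};                                 // depth-6 tree, branching factor 4
    std::printf("leaves: %llu\n",
                static_cast<unsigned long long>(count_leaves_par(root, 4)));  // 4^6 = 4096
}
```

Raising the grain-depth cutoff trades parallelism for lower task-management overhead, which is exactly the granularity knob the abstract studies.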