2016 28th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD): Latest Publications

Energy Aware Dynamic Provisioning for Heterogeneous Data Centers
Violaine Villebonnet, Georges Da Costa, L. Lefèvre, J. Pierson, P. Stolf
DOI: 10.1109/SBAC-PAD.2016.34 (published 2016-10-26)
Abstract: The huge amount of energy consumed by data centers represents a limiting factor in their operation. Many of these infrastructures are over-provisioned, so a significant portion of this energy is consumed by inactive servers that stay powered on even when load is low. Although servers have become more energy-efficient over time, their idle power consumption remains high. To tackle this issue, we consider a data center with a heterogeneous infrastructure composed of different machine types – from low-power processors to classical powerful servers – in order to improve its energy proportionality. We develop a dynamic provisioning algorithm that takes into account the various characteristics of the architectures composing the infrastructure: their performance, energy consumption, and on/off reactivity. Based on future load information, it makes resource reconfiguration decisions that affect the infrastructure over multiple time horizons. Our algorithm reacts to load evolutions and maintains perfect Quality of Service (QoS) while remaining energy-efficient. We evaluate our approach with profiling data from real hardware, and the experiments show that our dynamic provisioning brings significant energy savings compared to classical data center operation.
Citations: 17
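The core decision the abstract describes – cover a predicted load with the most energy-efficient mix of heterogeneous machine types – can be sketched as a greedy allocator. This is an illustrative sketch only, not the paper's algorithm; the machine profiles, capacities, and `max_count` limits below are hypothetical.

```python
# Hypothetical sketch of energy-aware provisioning: greedily cover a
# predicted load using the machine types with the best energy cost per
# unit of capacity first. Not the paper's algorithm; profiles are made up.
import math
from dataclasses import dataclass

@dataclass
class MachineType:
    name: str
    capacity: float      # load units one machine can serve
    power_active: float  # watts at full load
    max_count: int       # machines of this type available

# Hypothetical profiles: low-power boards vs. a classical server.
TYPES = [
    MachineType("arm-board", capacity=50.0, power_active=6.0, max_count=8),
    MachineType("xeon-server", capacity=1000.0, power_active=200.0, max_count=4),
]

def provision(predicted_load: float) -> dict:
    """Pick machine counts, best energy-per-unit-of-capacity first."""
    plan = {t.name: 0 for t in TYPES}
    remaining = predicted_load
    for t in sorted(TYPES, key=lambda t: t.power_active / t.capacity):
        n = min(t.max_count, math.ceil(remaining / t.capacity)) if remaining > 0 else 0
        plan[t.name] = n
        remaining -= n * t.capacity
    return plan

# 600 load units: eight boards cover 400, one server covers the rest.
print(provision(600.0))  # {'arm-board': 8, 'xeon-server': 1}
```

The paper's algorithm additionally weighs on/off switching reactivity and future load forecasts, which this static sketch omits.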
Dynamic Inter-Thread Vectorization Architecture: Extracting DLP from TLP
Sajith Kalathingal, Caroline Collange, B. N. Swamy, André Seznec
DOI: 10.1109/SBAC-PAD.2016.11 (published 2016-10-26)
Abstract: Threads of Single-Program Multiple-Data (SPMD) applications often execute the same instructions on different data. We propose the Dynamic Inter-Thread Vectorization Architecture (DITVA) to leverage this implicit data-level parallelism in SPMD applications by assembling dynamic vector instructions at runtime. DITVA extends an SIMD-enabled in-order SMT processor with an inter-thread vectorization execution mode. In this mode, multiple scalar threads running in lockstep share a single instruction stream, and their respective instruction instances are aggregated into SIMD instructions. To balance thread- and data-level parallelism, threads are statically grouped into fixed-size, independently scheduled warps. DITVA leverages existing SIMD units and maintains binary compatibility with existing CPU architectures. Our evaluation on SPMD applications from the PARSEC and Rodinia OpenMP benchmarks shows that a 4-warp × 4-lane 4-issue DITVA architecture with a realistic bank-interleaved cache achieves 1.55× higher performance than a 4-thread 4-issue SMT architecture with AVX instructions while fetching and issuing 51% fewer instructions, for an overall 24% energy reduction.
Citations: 9
MR-Advisor: A Comprehensive Tuning Tool for Advising HPC Users to Accelerate MapReduce Applications on Supercomputers
Md. Wasi-ur-Rahman, Nusrat S. Islam, Xiaoyi Lu, D. Shankar, D. Panda
DOI: 10.1109/SBAC-PAD.2016.33 (published 2016-10-01)
Abstract: MapReduce is the most popular parallel computing framework for big data processing, enabling massive scalability across distributed computing environments. Advanced RDMA-based designs of Hadoop MapReduce have been proposed that alleviate the performance bottlenecks of default Hadoop MapReduce by leveraging RDMA. The Spark data processing engine, meanwhile, provides fast execution of MapReduce applications through in-memory processing. Performance optimization for these contemporary big data processing frameworks on modern High-Performance Computing (HPC) systems is a formidable task because of the numerous configuration possibilities in each of them. In this paper, we propose MR-Advisor, a comprehensive tuning tool for MapReduce. MR-Advisor is generalized to provide performance optimizations for Hadoop, Spark, and RDMA-enhanced Hadoop MapReduce designs over different file systems such as HDFS, Lustre, and Tachyon. Performance evaluations reveal that, with MR-Advisor's suggested values, job execution performance can be enhanced by up to 58% over the current best-practice values for user-level configuration parameters. To the best of our knowledge, this is the first tool that supports tuning for both Apache Hadoop and Spark, as well as the RDMA- and Lustre-based advanced designs.
Citations: 12
Performance-Aware Device Driver Architecture for Signal Processing
Stefan Sydow, Mohannad Nabelsee, Anselm Busse, Sebastian Koch, Helge Parzyjegla
DOI: 10.1109/SBAC-PAD.2016.17 (published 2016-10-01)
Abstract: The growing computational power of modern CPUs allows increasingly complex signal processing applications to be implemented and executed on general-purpose processors and operating systems. In this regard, the application's architecture, its design, and its operating system integration directly affect the maximum achievable processing bandwidth. In this paper, we present alternative driver architectures for signal processing applications that differ in how processing stages are distributed between kernel space and user space. Using the processing of ADS-B air traffic radio signals for civil aviation as a case study, we evaluate the performance of the design alternatives on a Linux system and quantify their strengths and weaknesses with respect to data overhead, usage of vector units, applicable compiler optimizations, and cache behavior. Based on our results, we determine the best design choice and derive guidelines for the development of efficient signal processing applications.
Citations: 1
Building a Low Latency, Highly Associative DRAM Cache with the Buffered Way Predictor
Zhe Wang, Daniel A. Jiménez, Zhang Tao, G. Loh, Yuan Xie
DOI: 10.1109/SBAC-PAD.2016.22 (published 2016-10-01)
Abstract: The emerging die-stacked DRAM technology allows computer architects to design a last-level cache (LLC) with high memory bandwidth and large capacity. There are four key requirements for DRAM cache design: minimizing on-chip tag storage overhead, optimizing access latency, improving hit rate, and reducing off-chip traffic. These requirements seem mutually incompatible. For example, to reduce tag storage overhead, the recently proposed LH-cache co-locates tags and data in the same DRAM cache row, while the Alloy Cache alloys data and tags in the same cache line in a direct-mapped design. These ideas, however, either incur significant tag lookup latency or sacrifice hit rate for hit latency. To satisfy all four requirements, we propose the Buffered Way Predictor (BWP). The BWP predicts the way ID of a DRAM cache request with high accuracy and coverage, allowing data and tag to be fetched back to back; the data read latency can thus be completely hidden, so DRAM cache hits have low access latency. The BWP technique is designed for highly associative block-based DRAM caches and achieves a low miss rate and low off-chip traffic. Our evaluation with multi-programmed workloads and a 128MB DRAM cache shows that a 128KB BWP achieves a 76.2% hit rate. The BWP improves performance by 8.8% and 12.3% compared to LH-cache and Alloy Cache, respectively.
Citations: 5
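The mechanism the abstract relies on – predict which way holds the block so the data read can be issued before the tag check completes – can be illustrated with a tiny last-used-way table. This is a hypothetical software sketch, not the BWP hardware design; the table size and indexing scheme are made up.

```python
# Hypothetical sketch of a way predictor: a small direct-mapped table
# remembers the last way ID observed for each block address hash, so the
# speculative data fetch and the tag lookup can proceed back to back.
# Not the BWP hardware design; sizing and indexing are illustrative.

class WayPredictor:
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [0] * entries  # predicted way ID per table entry

    def _index(self, block_addr):
        # Simple modulo hash of the block address into the table.
        return block_addr % self.entries

    def predict(self, block_addr):
        """Way ID to fetch data from speculatively."""
        return self.table[self._index(block_addr)]

    def update(self, block_addr, actual_way):
        """Train the predictor with the way found by the tag check."""
        self.table[self._index(block_addr)] = actual_way

# Usage: read the predicted way speculatively, verify against the tag,
# and retrain on a misprediction.
wp = WayPredictor()
wp.update(0x1234, 5)
assert wp.predict(0x1234) == 5
```

In hardware, a misprediction costs an extra DRAM cache access, which is why the paper stresses high prediction accuracy and coverage.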
Breadth-First Search on Heterogeneous Platforms: A Case of Study on Social Networks
Luis Remis, M. Garzarán, R. Asenjo, A. Navarro
DOI: 10.1109/SBAC-PAD.2016.23 (published 2016-10-01)
Abstract: Breadth-First Search (BFS) is the core of many graph analysis algorithms and is used in many problems, such as social network analysis, computer network analysis, and data organization. BFS is an iterative algorithm whose irregular behavior makes it quite challenging to parallelize. Several approaches implement efficient BFS algorithms for multicore architectures and for graphics processors, but how to distribute the work between the main cores and the accelerators remains an open problem. In this paper, we assess several approaches to performing BFS on different heterogeneous architectures (high-end and embedded mobile processors composed of a multi-core CPU and an integrated GPU) with a focus on social network graphs. In particular, we propose two heterogeneous approaches that exploit both devices. The first, called Selective, selects on which device to execute each iteration. It is based on a previous approach, but we have adapted it to take advantage of the features of social network graphs (fewer but more unbalanced iterations). The second, referred to as Concurrent, allows specific iterations to execute concurrently on both devices. Our heterogeneous implementations can be up to 1.56× faster and 1.32× more energy efficient than the best CPU-only or GPU-only baseline. We have also found that for a highly memory-bound problem like BFS, CPU-GPU collaborative execution is limited by the shared memory-bus bandwidth.
Citations: 7
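The Selective idea – pick, per BFS level, the device expected to process the current frontier fastest – can be sketched as a level-synchronous BFS with a per-level dispatch decision. The threshold and work estimate below are hypothetical, not the paper's policy; on real hardware each level would run as a CPU or GPU kernel rather than this Python loop.

```python
# Hypothetical sketch of "Selective" heterogeneous BFS: each level is
# dispatched to the device expected to handle it best. GPUs tend to win
# on large frontiers with lots of parallel work; CPUs on small ones.
# Threshold and work model are illustrative, not the paper's.

def choose_device(frontier_size, avg_degree, gpu_threshold=10_000):
    """Return 'gpu' when the frontier exposes enough parallel work."""
    parallel_work = frontier_size * avg_degree
    return "gpu" if parallel_work >= gpu_threshold else "cpu"

def bfs_selective(graph, source):
    """Level-synchronous BFS; records which device handled each level."""
    visited = {source}
    frontier = [source]
    schedule = []  # device chosen per level
    while frontier:
        avg_degree = sum(len(graph[v]) for v in frontier) / len(frontier)
        schedule.append(choose_device(len(frontier), avg_degree))
        # On real hardware this expansion runs as a CPU or GPU kernel.
        next_frontier = []
        for v in frontier:
            for w in graph[v]:
                if w not in visited:
                    visited.add(w)
                    next_frontier.append(w)
        frontier = next_frontier
    return visited, schedule
```

The paper's Concurrent variant goes further and splits a single level across both devices, which this per-level dispatcher does not model.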
An Image Processing VLIW Architecture for Real-Time Depth Detection
D. Iorga, R. Nane, Yi Lu, E. V. Dalen, K. Bertels
DOI: 10.1109/SBAC-PAD.2016.28 (published 2016-10-01)
Abstract: Numerous applications for mobile devices require 3D vision capabilities, which in turn require depth detection, since it enables evaluating an object's distance, position, and shape. Despite the increasing popularity of depth detection algorithms, available solutions need expensive hardware and/or additional ASICs, which are not suitable for low-cost commodity hardware devices. In this paper, we propose a low-cost, low-power embedded solution that provides high-speed depth detection. We extend an existing off-the-shelf VLIW image processor and perform algorithmic and architectural optimizations to achieve the required real-time performance. Experimental results show that by adding different functional units and adjusting the algorithm to take full advantage of them, a 640x480 image pair with 64 disparities can be processed at 36.75 fps on a single processor instance, an improvement of 23% over the best state-of-the-art image processor.
Citations: 4
MAGC: A Mapping Approach for GPU Clusters
S. Mirsadeghi, Iman Faraji, A. Afsahi
DOI: 10.1109/SBAC-PAD.2016.15 (published 2016-10-01)
Abstract: GPU accelerators have been increasingly used in modern heterogeneous HPC clusters, offering high performance and energy efficiency. Such heterogeneous GPU clusters, consisting of multiple CPU cores and GPU devices, have become the platform of choice for many HPC applications. The communication channels among these processing elements expose different latency and bandwidth characteristics, so efficient utilization of communication channels becomes an important factor in achieving higher inter-process communication performance. In this paper, we exploit topology awareness for better utilization of communication channels in GPU clusters. We first discuss the challenges associated with topology-aware mapping in GPU clusters, and then propose MAGC, a Mapping Approach for GPU Clusters. MAGC seeks to improve total communication performance by jointly considering both the CPU-to-CPU and GPU-to-GPU communications of the application and the CPU and GPU physical topologies of the underlying GPU cluster. It provides a unified framework for topology-aware process-to-core mapping and GPU-to-process assignment across a GPU cluster. We study the potential benefits of MAGC with two different mapping algorithms: (a) the Scotch graph mapping library, and (b) a heuristic designed to explicitly consider maximum congestion. We evaluate our design through extensive experiments at the micro-benchmark and application levels on two GPU clusters with different GPU types and topologies. We have developed a micro-benchmark suite to model various communication patterns among CPU cores and among GPU devices. For application results, we use the molecular dynamics simulator HOOMD-blue. Micro-benchmark results show that we can achieve up to 91.4% improvement in communication time. At the application level, we can achieve up to 8% performance improvement.
Citations: 2
On the Dark Silicon Automatic Evaluation on Multicore Processors
Tony Santos, Ana Silva, Liana Duenha, R. Santos, E. Moreno, R. Azevedo
DOI: 10.1109/SBAC-PAD.2016.29 (published 2016-10-01)
Abstract: The advent of dark silicon, a result of the limits of Dennard scaling, has forced modern processor designs to reduce the fraction of chip area that can run at maximum clock frequency, shrinking the free gains from Moore's law. This work introduces a less conservative dark silicon estimate based on the power density and technology process of chip components, so that designers can explore architectural resources to mitigate it. We implemented our dark silicon estimation tool on top of MultiExplorer and evaluated it on a set of Intel Pentium and AMD K8/10 multicore processors built on transistor technologies from 90nm down to 32nm. Our contributions are twofold: (1) our experiments show dark silicon estimates of up to 8.26% of chip area compared to a baseline 90nm real processor, and when we evaluated clock frequency behavior under Dennard scaling we obtained up to 15.65% dark silicon on chip area; (2) we designed a dark-silicon-aware Design Space Exploration (DSE) strategy and showed that it can minimize dark chip area while increasing performance at design time. Our DSE results found dark-silicon-free multicore platforms while providing a 3.6× speedup.
Citations: 5
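The quantity being estimated – the share of chip area that must stay powered off so the chip stays within its power budget – reduces, in the simplest uniform-power-density model, to a one-line formula. This is a back-of-the-envelope sketch with hypothetical numbers, far simpler than the per-component power-density model the paper builds.

```python
# Back-of-the-envelope sketch (hypothetical numbers): the dark silicon
# fraction is the share of area that must be power-gated when the chip's
# total power exceeds its thermal/power budget, assuming power is shed
# proportionally to area (uniform power density). The paper's model is
# per-component and less conservative than this.

def dark_silicon_fraction(total_power_w, power_budget_w):
    """Fraction of chip area to keep dark under a uniform-density model."""
    if total_power_w <= power_budget_w:
        return 0.0  # everything fits in the budget; no dark area
    return 1.0 - power_budget_w / total_power_w

# Example: 100 W of logic under an 80 W budget leaves 20% of area dark.
print(dark_silicon_fraction(100.0, 80.0))  # 0.2
```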
Designing High Performance Heterogeneous Broadcast for Streaming Applications on GPU Clusters
Ching-Hsiang Chu, Khaled Hamidouche, H. Subramoni, Akshay Venkatesh, B. Elton, D. Panda
DOI: 10.1109/SBAC-PAD.2016.16 (published 2016-10-01)
Abstract: High-performance streaming applications are beginning to leverage the compute power offered by graphics processing units (GPUs) and the high network throughput offered by high-performance interconnects such as InfiniBand (IB) to boost their performance and scalability. These applications rely heavily on broadcast operations to move data, stored in host memory, from a single source – typically live – to multiple GPU-based computing sites. While homogeneous broadcast designs take advantage of the IB hardware multicast feature to boost their performance, their heterogeneous counterparts require explicit data movement between host and GPU, which significantly hurts overall performance. There is a dearth of efficient heterogeneous broadcast designs for streaming applications, especially on emerging multi-GPU configurations. In this work, we propose novel techniques to fully and conjointly take advantage of NVIDIA GPUDirect RDMA (GDR), CUDA inter-process communication (IPC), and IB hardware multicast to design high-performance heterogeneous broadcast operations for modern multi-GPU systems. We propose intra-node, topology-aware schemes to maximize the performance benefits while minimizing the utilization of valuable PCIe resources, and we optimize the communication pipeline by overlapping the GDR + IB hardware multicast operations with CUDA IPC operations. Compared to existing solutions, our designs show up to 3× improvement in the latency of a heterogeneous broadcast operation. Our designs also show up to 67% improvement in the execution time of a streaming benchmark on a GPU-dense Cray CS-Storm system with 88 GPUs.
Citations: 3