{"title":"Polly-ACC Transparent compilation to heterogeneous hardware","authors":"T. Grosser, T. Hoefler","doi":"10.1145/2925426.2926286","DOIUrl":"https://doi.org/10.1145/2925426.2926286","url":null,"abstract":"Programming today's increasingly complex heterogeneous hardware is difficult, as it commonly requires the use of data-parallel languages, pragma annotations, specialized libraries, or DSL compilers. Adding explicit accelerator support into a larger code base is not only costly, but also introduces additional complexity that hinders long-term maintenance. We propose a new heterogeneous compiler that brings us closer to the dream of automatic accelerator mapping. Starting from a sequential compiler IR, we automatically generate a hybrid executable that - in combination with a new data management system - transparently offloads suitable code regions. Our approach is almost regression free for a wide range of applications while improving a range of compute kernels as well as two full SPEC CPU applications. We expect our work to reduce the initial cost of accelerator usage and to free developer time to investigate algorithmic changes.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115323136","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mini-Ckpts: Surviving OS Failures in Persistent Memory","authors":"David Fiala, F. Mueller, Kurt B. Ferreira, C. Engelmann","doi":"10.1145/2925426.2926295","DOIUrl":"https://doi.org/10.1145/2925426.2926295","url":null,"abstract":"Concern is growing in the high-performance computing (HPC) community on the reliability of future extreme-scale systems. Current efforts have focused on application fault-tolerance rather than the operating system (OS), despite the fact that recent studies have suggested that failures in OS memory may be more likely. The OS is critical to a system's correct and efficient operation of the node and processes it governs---and the parallel nature of HPC applications means any single node failure generally forces all processes of this application to terminate due to tight communication in HPC. Therefore, the OS itself must be capable of tolerating failures in a robust system. In this work, we introduce mini-ckpts, a framework which enables application survival despite the occurrence of a fatal OS failure or crash. mini-ckpts achieves this tolerance by ensuring that the critical data describing a process is preserved in persistent memory prior to the failure. Following the failure, the OS is rejuvenated via a warm reboot and the application continues execution effectively making the failure and restart transparent. The mini-ckpts rejuvenation and recovery process is measured to take between three to six seconds and has a failure-free overhead of between 3-5% for a number of key HPC workloads. In contrast to current fault-tolerance methods, this work ensures that the operating and runtime systems can continue in the presence of faults. This is a much finer-grained and dynamic method of fault-tolerance than the current coarse-grained application-centric methods. Handling faults at this level has the potential to greatly reduce overheads and enables mitigation of additional faults.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"142 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123457138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TokenTLB: A Token-Based Page Classification Approach","authors":"Albert Esteve, Alberto Ros, A. Robles, M. E. Gómez, J. Duato","doi":"10.1145/2925426.2926280","DOIUrl":"https://doi.org/10.1145/2925426.2926280","url":null,"abstract":"Classifying memory accesses into private or shared data has become a fundamental approach to achieving efficiency and scalability in multi- and many-core systems. Since most memory accesses in both sequential and parallel applications are either private (accessed only by one core) or read-only (not written) data, devoting the full cost of coherence to every memory access results in sub-optimal performance and limits the scalability and efficiency of the multiprocessor. This work proposes TokenTLB, a page classification approach based on exchange and count of tokens. The key observation behind our proposal is that, opposed to coherence management, data classification meets all the benefits of a token-based approach without the burden of complex arbitration mechanisms, which has discouraged the implementation of token-based coherence protocols in commodity systems. Token counting on TLBs is a natural and efficient way for classifying memory pages. It does not require the use of complex and undesirable persistent requests or arbitration, since when two or more TLBs race for accessing a page, tokens are appropriately distributed classifying the page as shared. TokenTLB also favors shareability of translation information among TLBs, which improves system performance and constrains much of the TLB traffic compared to other broadcast-based approaches. It is achieved by requiring only TLBs holding extra tokens provide them along with the page translation (about one response per TLB miss). TokenTLB effectively increases blocks classified as private up to 61.1% while allowing read-only detection (24.4% shared-read-only blocks). When TokenTLB is applied to optimize the directory, it reduces the dynamic energy consumed by the cache hierarchy by nearly 27.3% over the baseline.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121981689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Transposition of Sparse Data Structures","authors":"Hao Wang, Weifeng Liu, Kaixi Hou, Wu-chun Feng","doi":"10.1145/2925426.2926291","DOIUrl":"https://doi.org/10.1145/2925426.2926291","url":null,"abstract":"Many applications in computational sciences and social sciences exploit sparsity and connectivity of acquired data. Even though many parallel sparse primitives such as sparse matrix-vector (SpMV) multiplication have been extensively studied, some other important building blocks, e.g., parallel transposition for sparse matrices and graphs, have not received the attention they deserve. In this paper, we first identify that the transposition operation can be a bottleneck of some fundamental sparse matrix and graph algorithms. Then, we revisit the performance and scalability of parallel transposition approaches on x86-based multi-core and many-core processors. Based on the insights obtained, we propose two new parallel transposition algorithms: ScanTrans and MergeTrans. The experimental results show that our ScanTrans method achieves an average of 2.8-fold (up to 6.2-fold) speedup over the parallel transposition in the latest vendor-supplied library on an Intel multi-core CPU platform, and the MergeTrans approach achieves on average of 3.4-fold (up to 11.7-fold) speedup on an Intel Xeon Phi many-core processor.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"192 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120951314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Multiplication in Binary Fields on GPUs via Register Cache","authors":"Eli Ben-Sasson, Matan Hamilis, M. Silberstein, Eran Tromer","doi":"10.1145/2925426.2926259","DOIUrl":"https://doi.org/10.1145/2925426.2926259","url":null,"abstract":"Finite fields of characteristic 2 -- \"binary fields\" -- are used in a variety of applications in cryptography and data storage. Multiplication of two finite field elements is a fundamental operation and a well-known computational bottleneck in many of these applications, as they often require multiplication of a large number of elements. In this work we focus on accelerating multiplication in \"large\" binary fields of sizes greater than 232. We devise a new parallel algorithm optimized for execution on GPUs. This algorithm makes it possible to multiply large number of finite field elements, and achieves high performance via bit-slicing and fine-grained parallelization. The key to the efficient implementation of the algorithm is a novel performance optimization methodology we call the register cache. This methodology speeds up an algorithm that caches its input in shared memory by transforming the code to use per-thread registers instead. We show how to replace shared memory accesses with the shuffle() intra-warp communication instruction, thereby significantly reducing or even eliminating shared memory accesses. We thoroughly analyze the register cache approach and characterize its benefits and limitations. We apply the register cache methodology to the implementation of the binary finite field multiplication algorithm on GPUs. We achieve up to 138x speedup for fields of size 232 over the popular, highly optimized Number Theory Library (NTL) [26], which uses the specialized CLMUL CPU instruction, and over 30x for larger fields of size below 2256. Our register cache implementation enables up to 50% higher performance compared to the traditional shared-memory based design.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132516700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TurboTiling: Leveraging Prefetching to Boost Performance of Tiled Codes","authors":"Sanyam Mehta, R. Garg, Nishad Trivedi, P. Yew","doi":"10.1145/2925426.2926288","DOIUrl":"https://doi.org/10.1145/2925426.2926288","url":null,"abstract":"Loop tiling or blocking improves temporal locality by dividing the problem domain into tiles and then repeatedly accessing the data within a tile. While this reduces reuse, it also leads to an often ignored side-effect: breaking the streaming data access pattern. As a result, tiled codes are unable to exploit the sophisticated hardware prefetchers in present-day processors to extract extra performance. In this work, we propose a tiling algorithm to leverage prefetching to boost the performance of tiled codes. To achieve this, we propose to tile for the last-level cache as opposed to tiling for higher levels of cache as generally recommended. This approach not only exposes streaming access patterns in the tiled code that are amenable for prefetching, but also allows for a reduction in the off-chip traffic to memory (and therefore, better scaling with the number of cores). As a result, although we tile for the last level cache, we effectively access the data in the higher levels of cache because the data is prefetched in time for computation. To achieve this, we propose an algorithm to select a tile size that aims to maximize data reuse and minimize conflict misses in the shared last-level cache in modern multi-core processors. We find that the combined effect of tiling for the last-level cache and effective hardware prefetching gives significant improvement over existing tiling algorithms that target higher level L1/L2 caches and do not leverage the hardware prefetchers. When run on an Intel 8-core machine using different problem sizes, it achieves an average improvement of 27% and 48% for smaller and larger problem sizes, respectively, over the best tile sizes selected by state-of-the-art algorithms.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"118 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133340832","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards an Adaptive Multi-Power-Source Datacenter","authors":"Longjun Liu, Hongbin Sun, Chao Li, Yang Hu, Nanning Zheng, Tao Li","doi":"10.1145/2925426.2926276","DOIUrl":"https://doi.org/10.1145/2925426.2926276","url":null,"abstract":"Big data and cloud computing are accelerating the capacity growth of datacenters all over the world. Their energy costs and environmental issues have pushed datacenter operators to explore and integrate alternative energy sources, such as various renewable energy supplies and energy storage devices. Designing datacenters powered by multi-power supplies in the smart grid environment is becoming a promising trend in the next few decades. However, gracefully provisioning various power sources and efficiently manage them in datacenter is a significant challenge. In this paper, we explore an unconventional fine-grained power distribution architecture for multi-source powered datacenters. We thoroughly investigate how to deliver and manage multiple power sources from the power generation plant outside of the datacenter to datacenter inside. We then propose a novel Power Switch Network (PSN) for datacenters. PSN is a reconfigurable multi-power-source distribution architecture which enables datacenter to distribute various power sources with a fine-grained manner. Moreover, a tailored machine learning based power sources management framework is proposed for PSN to dynamically select different power sources and optimize user-demanded performance metrics. Compared with the conventional single-switch system, evaluation results show that PSN could improve solar energy utilization by 39.6%, reduce utility power cost by 11.1% and improve workload performance by 33.8%, meanwhile enhancing battery lifetime by 9.3%. We expect that our work could provide valuable guidelines for the emerging multi-power-source datacenter to improve their efficiency, sustainability and economy.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125158915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Sparse Matrix-Vector Multiplication for Large-Scale Data Analytics","authors":"Daniele Buono, F. Petrini, Fabio Checconi, Xing Liu, Xinyu Que, Chris Long, Tai-Ching Tuan","doi":"10.1145/2925426.2926278","DOIUrl":"https://doi.org/10.1145/2925426.2926278","url":null,"abstract":"Sparse Matrix-Vector multiplication (SpMV) is a fundamental kernel, used by a large class of numerical algorithms. Emerging big-data and machine learning applications are propelling a renewed interest in SpMV algorithms that can tackle massive amount of unstructured data---rapidly approaching the TeraByte range---with predictable, high performance. In this paper we describe a new methodology to design SpMV algorithms for shared memory multiprocessors (SMPs) that organizes the original SpMV algorithm into two distinct phases. In the first phase we build a scaled matrix, that is reduced in the second phase, providing numerous opportunities to exploit memory locality. Using this methodology, we have designed two algorithms. Our experiments on irregular big-data matrices (an order of magnitude larger than the current state of the art) show a quasi-optimal scaling on a large-scale POWER8 SMP system, with an average performance speedup of 3.8x, when compared to an equally optimized version of the CSR algorithm. In terms of absolute performance, with our implementation, the POWER8 SMP system is comparable to a 256-node cluster. In terms of size, it can process matrices with up to 68 billion edges, an order of magnitude larger than state-of-the-art clusters.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"41 10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121250211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Simulation and Analysis Engine for Scale-Out Workloads","authors":"Nadav Chachmon, Daniel Richins, R. Cohn, M. Christensson, Wenzhi Cui, V. Reddi","doi":"10.1145/2925426.2926293","DOIUrl":"https://doi.org/10.1145/2925426.2926293","url":null,"abstract":"We introduce a system-level Simulation and Analysis Engine (SAE) framework based on dynamic binary instrumentation for fine-grained and customizable instruction-level introspection of everything that executes on the processor. SAE can instrument the BIOS, kernel, drivers, and user processes. It can also instrument multiple systems simultaneously using a single instrumentation interface, which is essential for studying scale-out applications. SAE is an x86 instruction set simulator designed specifically to enable rapid prototyping, evaluation, and validation of architectural extensions and program analysis tools using its flexible APIs. It is fast enough to execute full platform workloads---a modern operating system can boot in a few minutes---thus enabling research, evaluation, and validation of complex functionalities related to multicore configurations, virtualization, security, and more. To reach high speeds, SAE couples tightly with a virtual platform and employs both a just-in-time (JIT) compiler that helps simulate simple instructions efficiently and a fast interpreter for simulating new or complex instructions. We describe SAE's architecture and instrumentation engine design and show the framework's usefulness for single- and multi-system architectural and program analysis studies.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122390679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Galaxyfly: A Novel Family of Flexible-Radix Low-Diameter Topologies for Large-Scales Interconnection Networks","authors":"Fei Lei, Dezun Dong, Xiangke Liao, Xing Su, Cunlu Li","doi":"10.1145/2925426.2926275","DOIUrl":"https://doi.org/10.1145/2925426.2926275","url":null,"abstract":"Interconnection network plays an essential role in the architecture of large-scale high performance computing (HPC) systems. In the paper, we construct a novel family of low-diameter topologies, Galaxyfly, using techniques of algebraic graphs over finite fields. Galaxyfly is guaranteed to retain a small constant diameter while achieving a flexible tradeoff between network scale and bisection bandwidth. Galaxyfly lowers the demands for high radix of network routers and is able to utilize routers with merely moderate radix to build exascale interconnection networks. We present effective congestion-aware routing algorithms for Galaxyfly by exploring its algebraic property. We conduct extensive simulations and analysis to evaluate the performance, cost and power consumption of Galaxyfly against state-of-the-art topologies. The results show that our design achieves better performance than most existing topologies under various routing algorithms and traffic patterns, and is cost-effective to deploy for exascale HPC systems.","PeriodicalId":422112,"journal":{"name":"Proceedings of the 2016 International Conference on Supercomputing","volume":"269 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120871790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}