{"title":"Design for scalability in enterprise SSDs","authors":"Arash Tavakkol, M. Arjomand, H. Sarbazi-Azad","doi":"10.1145/2628071.2628098","DOIUrl":"https://doi.org/10.1145/2628071.2628098","url":null,"abstract":"Solid State Drives (SSDs) have recently emerged as a high speed random access alternative to classical magnetic disks. To date, SSD designs have been largely based on multichannel bus architecture that confronts serious scalability problems in high-end enterprise SSDs with dozens of flash memory chips and a gigabyte host interface. This forces the community to rapidly change the bus-based inter-flash standards to respond to ever increasing application demands. In this paper, we first give a deep look at how different flash parameters and SSD internal designs affect the actual performance and scalability of the conventional architecture. Our experiments show that SSD performance improvement through either enhancing intra-chip parallelism or increasing the number of flash units is limited by frequent contentions occurred on the shared channels. Our discussion will be followed up by presenting and evaluating a network-based protocol adopted for flash communications in SSDs that addresses design constraints of the multi-channel bus architecture. This protocol leverages the properties of interconnection networks to attain a high performance SSD. Further, we will show and discuss that using this communication paradigm not only helps to obtain better SSD backend latency and throughput, but also to lower the variance of response time compared to the conventional designs. In addition, greater number of flash chips can be added with much less concerns on board-level signal integrity challenges including channels' maximum capacitive load, output drivers' slew rate, and impedance control.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121827607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Coarrays in GNU Fortran","authors":"A. Fanfarillo, T. Burnus, V. Cardellini, S. Filippone, D. Nagle, D. Rouson","doi":"10.1145/2628071.2671427","DOIUrl":"https://doi.org/10.1145/2628071.2671427","url":null,"abstract":"Coarray Fortran is a set of features of the Fortran 2008 standard which makes Fortran a PGAS language. Currently, the coarray support is provided mainly by commercial compilers like Cray and Intel. In this work we present two coarray implementations on the GNU Fortran compiler. We present a performance comparison between our coarray implementations and those provided by Cray and Intel. Such comparison includes synthetic benchmarks and real, commonly used, scientific applications.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124082649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ATCache: Reducing DRAM cache latency via a small SRAM tag cache","authors":"Cheng-Chieh Huang, V. Nagarajan","doi":"10.1145/2628071.2628089","DOIUrl":"https://doi.org/10.1145/2628071.2628089","url":null,"abstract":"3D-stacking technology has enabled the option of embedding a large DRAM onto the processor. Prior works have proposed to use this as a DRAM cache. Because of its large size (a DRAM cache can be in the order of hundreds of megabytes), the total size of the tags associated with it can also be quite large (in the order of tens of megabytes). The large size of the tags has created a problem. Should we maintain the tags in the DRAM and pay the cost of a costly tag access in the critical path? Or should we maintain the tags in the faster SRAM by paying the area cost of a large SRAM for this purpose? Prior works have primarily chosen the former and proposed a variety of techniques for reducing the cost of a DRAM tag access. In this paper, we first establish (with the help of a study) that maintaining the tags in SRAM, because of its smaller access latency, leads to overall better performance. Motivated by this study, we ask if it is possible to maintain tags in SRAM without incurring high area overhead. Our key idea is simple. We propose to cache the tags in a small SRAM tag cache — we show that there is enough spatial and temporal locality amongst tag accesses to merit this idea. We propose the ATCache which is a small SRAM tag cache. Similar to a conventional cache, the ATCache caches recently accessed tags to exploit temporal locality; it exploits spatial locality by prefetching tags from nearby cache sets. In order to avoid the high miss latency and cache pollution caused by excessive prefetching, we use a simple technique to throttle the number of sets prefetched. Our proposed ATCache (which consumes 0.4% of overall tag size) can satisfy over 60% of DRAM cache tag accesses on average.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117027520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An event-based language for dynamic binary translation frameworks","authors":"S. Makarov, Angela Demke Brown, Ashvin Goel","doi":"10.1145/2628071.2671420","DOIUrl":"https://doi.org/10.1145/2628071.2671420","url":null,"abstract":"Dynamic binary translation (DBT) frameworks such as DynamoRIO [1] or Granary apply just-in-time rewriting techniques to allow pervasive instrumentation of a target program, for applications such as instruction-level profiling or watchpoints. This is a powerful approach, but analysis tools based on DBT frameworks are difficult to develop. Client modules written using a DBT framework must specify the instrumentation to perform on each basic block of the target program, and make use of explicit synchronization when aggregating data from multiple threads of a program. This can result in hundreds of lines of code for even simple analysis tools.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132355491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"KLA: A new algorithmic paradigm for parallel graph computations","authors":"Harshvardhan, Adam Fidel, N. Amato, Lawrence Rauchwerger","doi":"10.1145/2628071.2628091","DOIUrl":"https://doi.org/10.1145/2628071.2628091","url":null,"abstract":"This paper proposes a new algorithmic paradigm — k-level asynchronous (KLA) — that bridges level-synchronous and asynchronous paradigms for processing graphs. The KLA paradigm enables the level of asynchrony in parallel graph algorithms to be parametrically varied from none (level-synchronous) to full (asynchronous). The motivation is to improve execution times through an appropriate trade-off between the use of fewer, but more expensive global synchronizations, as in level-synchronous algorithms, and more, but less expensive local synchronizations (and perhaps also redundant work), as in asynchronous algorithms. We show how common patterns in graph algorithms can be expressed in the KLA pardigm and provide techniques for determining k, the number of asynchronous steps allowed between global synchronizations. Results of an implementation of KLA in the STAPL Graph Library show excellent scalability on up to 96K cores and improvements of 10× or more over level-synchronous and asynchronous versions for graph algorithms such as breadth-first search, PageRank, k-core decomposition and others on certain classes of real-world graphs.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115257062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SM-centric transformation: Circumventing hardware restrictions for flexible GPU scheduling","authors":"Bo Wu, Guoyang Chen, Dong Li, Xipeng Shen, J. Vetter","doi":"10.1145/2628071.2628130","DOIUrl":"https://doi.org/10.1145/2628071.2628130","url":null,"abstract":"To circumvent the limitation from the hardware scheduler on GPU, we create an SM-centric transformation technique. This technique enables complete control of the mapping between tasks and streaming multi-processors (SMs), and enables controlling the number of active thread blocks on each SM. Results show that our approach achieves better speedup than previous ones with kernel co-run cases.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115296845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Processing big data graphs on memory-restricted systems","authors":"Harshvardhan, N. Amato, Lawrence Rauchwerger","doi":"10.1145/2628071.2671429","DOIUrl":"https://doi.org/10.1145/2628071.2671429","url":null,"abstract":"With the advent of big-data, processing large graphs quickly has become increasingly important. Most existing approaches either utilize in-memory processing techniques, which can only process graphs that fit completely in RAM, or disk-based techniques that sacrifice performance. Contribution. In this work, we propose a novel RAM-Disk hybrid approach to graph processing that can scale well from a single shared-memory node to large distributed-memory systems. It works by partitioning the graph into subgraphs that fit in RAM and uses a paging-like technique to load subgraphs. We show that without modifying the algorithms, this approach can scale from small memory-constrained systems (such as tablets) to large-scale distributed machines with 16, 000+ cores.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"1 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128921055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data-reuse optimizations for pipelined tiling with parametric tile sizes","authors":"Alexandre Isoard","doi":"10.1145/2628071.2671425","DOIUrl":"https://doi.org/10.1145/2628071.2671425","url":null,"abstract":"Todays' hardware diversity exacerbates the need for optimizing compilers. A problem that arises when exploiting hardware accelerators (FPGA, GPU, dedicated boards) is how to automatically perform kernel/function offloading or outlining (as opposed to function inlining). The principle is to outsource part of the computation (the kernel to be performed on the accelerator) to a more efficient but more specialized hardware. This requires static analysis to identify the kernel input (data read) and output (data produced) and code generation for the kernel itself, the associated transfers, and the synchronization with the rest of the code (on the host CPU). In general, such tasks are done by the developer who is required to explicit the communications, allocate and size the intermediate buffers, and segment the kernel into fitting chunks of computation. When a single kernel is offloaded in a three-phases process (i.e., upload, compute, store back), such programming remains feasible: for GPUs, the developers can use OpenCL or CUDA, or rely on higherlevel abstractions, such as the directives of OpenACC1 or the garbage collector mechanisms of SPOC2. However, in some cases, it is necessary to decompose a kernel into a sequence of smaller kernels (to get blocking algorithms, thanks to loop tiling) that are optimized with pipelined communications and data reuse among blocks (tiles). The choice of tile sizes is driven by hardware capabilities such as memory bandwidth, memory size and organization, computational power, and such codes are extremely hard to obtain without automation and some cost model. The contribution supported by this abstract and the associated poster is a parametric (w.r.t. tile size) analysis technique to perform these steps, including inter-tile data reuse and pipelining, using polyhedral optimizations3. It has been presented at the IMPACT'14 workshop [2].","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"70 1-2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120913754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"kMAF: Automatic kernel-level management of thread and data affinity","authors":"M. Diener, E. Cruz, P. Navaux, Anselm Busse, Hans-Ulrich Heiß","doi":"10.1145/2628071.2628085","DOIUrl":"https://doi.org/10.1145/2628071.2628085","url":null,"abstract":"One of the main challenges for parallel architectures is the increasing complexity of the memory hierarchy, which consists of several levels of private and shared caches, as well as interconnections between separate memories in NUMA machines. To make full use of this hierarchy, it is necessary to improve the locality of memory accesses by reducing accesses to remote caches and memories, and using local ones instead. Two techniques can be used to increase the memory access locality: executing threads and processes that access shared data close to each other in the memory hierarchy (thread affinity), and placing the memory pages they access on the NUMA node they are executing on (data affinity). Most related work in this area focuses on either thread or data affinity, but not both, which limits the improvements. Other mechanisms require expensive operations, such as memory access traces or binary analysis, require changes to hardware or work only on specific parallel APIs. In this paper, we introduce kMAF, a mechanism that automatically manages thread and data affinity on the kernel level. The memory access behavior of the running application is determined during its execution by analyzing its page faults. This information is used by kMAF to migrate threads and memory pages, such that the overall memory access locality is optimized. Extensive evaluation with 27 benchmarks from 4 benchmark suites shows substantial performance improvements, with results close to an oracle mechanism. Execution time was reduced by up to 35.7% (13.8% on average), while energy efficiency was improved by up to 34.6% (9.3% on average).","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116523706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"COLORIS: A dynamic cache partitioning system using page coloring","authors":"Y. Ye, R. West, Zhuoqun Cheng, Ye Li","doi":"10.1145/2628071.2628104","DOIUrl":"https://doi.org/10.1145/2628071.2628104","url":null,"abstract":"Shared caches in multicore processors are subject to contention from co-running threads. The resultant interference can lead to highly-variable performance for individual applications. This is particularly problematic for real-time applications, requiring predictable timing guarantees. Previous work has applied page coloring techniques to partition a shared cache, so that conflict misses are minimized amongst co-running workloads. However, prior page coloring techniques have not addressed the problem of partitioning a cache on over-committed processors where there are more executable threads than cores. Similarly, page coloring techniques have not proven efficient at adapting the cache partition sizes for threads with varying memory demands. This paper presents a memory management framework called COLORIS, which provides support for both static and dynamic cache partitioning using page coloring. COLORIS supports novel policies to reconfigure the assignment of page colors amongst application threads in over-committed systems. For quality-of-service (QoS), COLORIS monitors the cache miss rates of running applications and triggers re-partitioning of the cache to prevent miss rates exceeding applications-specific ranges. This paper presents the design and evaluation of COLORIS as applied to Linux. We show the efficiency and effectiveness of COLORIS to color memory pages for a set of SPEC CPU2006 workloads, thereby enhancing performance isolation over existing page coloring techniques.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133785550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}