{"title":"Design for scalability in enterprise SSDs","authors":"Arash Tavakkol, M. Arjomand, H. Sarbazi-Azad","doi":"10.1145/2628071.2628098","DOIUrl":"https://doi.org/10.1145/2628071.2628098","url":null,"abstract":"Solid State Drives (SSDs) have recently emerged as a high speed random access alternative to classical magnetic disks. To date, SSD designs have been largely based on multichannel bus architecture that confronts serious scalability problems in high-end enterprise SSDs with dozens of flash memory chips and a gigabyte host interface. This forces the community to rapidly change the bus-based inter-flash standards to respond to ever increasing application demands. In this paper, we first give a deep look at how different flash parameters and SSD internal designs affect the actual performance and scalability of the conventional architecture. Our experiments show that SSD performance improvement through either enhancing intra-chip parallelism or increasing the number of flash units is limited by frequent contentions occurred on the shared channels. Our discussion will be followed up by presenting and evaluating a network-based protocol adopted for flash communications in SSDs that addresses design constraints of the multi-channel bus architecture. This protocol leverages the properties of interconnection networks to attain a high performance SSD. Further, we will show and discuss that using this communication paradigm not only helps to obtain better SSD backend latency and throughput, but also to lower the variance of response time compared to the conventional designs. In addition, greater number of flash chips can be added with much less concerns on board-level signal integrity challenges including channels' maximum capacitive load, output drivers' slew rate, and impedance control.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121827607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Coarrays in GNU Fortran","authors":"A. Fanfarillo, T. Burnus, V. Cardellini, S. Filippone, D. Nagle, D. Rouson","doi":"10.1145/2628071.2671427","DOIUrl":"https://doi.org/10.1145/2628071.2671427","url":null,"abstract":"Coarray Fortran is a set of features of the Fortran 2008 standard which makes Fortran a PGAS language. Currently, the coarray support is provided mainly by commercial compilers like Cray and Intel. In this work we present two coarray implementations on the GNU Fortran compiler. We present a performance comparison between our coarray implementations and those provided by Cray and Intel. Such comparison includes synthetic benchmarks and real, commonly used, scientific applications.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124082649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ATCache: Reducing DRAM cache latency via a small SRAM tag cache","authors":"Cheng-Chieh Huang, V. Nagarajan","doi":"10.1145/2628071.2628089","DOIUrl":"https://doi.org/10.1145/2628071.2628089","url":null,"abstract":"3D-stacking technology has enabled the option of embedding a large DRAM onto the processor. Prior works have proposed to use this as a DRAM cache. Because of its large size (a DRAM cache can be in the order of hundreds of megabytes), the total size of the tags associated with it can also be quite large (in the order of tens of megabytes). The large size of the tags has created a problem. Should we maintain the tags in the DRAM and pay the cost of a costly tag access in the critical path? Or should we maintain the tags in the faster SRAM by paying the area cost of a large SRAM for this purpose? Prior works have primarily chosen the former and proposed a variety of techniques for reducing the cost of a DRAM tag access. In this paper, we first establish (with the help of a study) that maintaining the tags in SRAM, because of its smaller access latency, leads to overall better performance. Motivated by this study, we ask if it is possible to maintain tags in SRAM without incurring high area overhead. Our key idea is simple. We propose to cache the tags in a small SRAM tag cache — we show that there is enough spatial and temporal locality amongst tag accesses to merit this idea. We propose the ATCache which is a small SRAM tag cache. Similar to a conventional cache, the ATCache caches recently accessed tags to exploit temporal locality; it exploits spatial locality by prefetching tags from nearby cache sets. In order to avoid the high miss latency and cache pollution caused by excessive prefetching, we use a simple technique to throttle the number of sets prefetched. Our proposed ATCache (which consumes 0.4% of overall tag size) can satisfy over 60% of DRAM cache tag accesses on average.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117027520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An event-based language for dynamic binary translation frameworks","authors":"S. Makarov, Angela Demke Brown, Ashvin Goel","doi":"10.1145/2628071.2671420","DOIUrl":"https://doi.org/10.1145/2628071.2671420","url":null,"abstract":"Dynamic binary translation (DBT) frameworks such as DynamoRIO [1] or Granary apply just-in-time rewriting techniques to allow pervasive instrumentation of a target program, for applications such as instruction-level profiling or watchpoints. This is a powerful approach, but analysis tools based on DBT frameworks are difficult to develop. Client modules written using a DBT framework must specify the instrumentation to perform on each basic block of the target program, and make use of explicit synchronization when aggregating data from multiple threads of a program. This can result in hundreds of lines of code for even simple analysis tools.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132355491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"KLA: A new algorithmic paradigm for parallel graph computations","authors":"Harshvardhan, Adam Fidel, N. Amato, Lawrence Rauchwerger","doi":"10.1145/2628071.2628091","DOIUrl":"https://doi.org/10.1145/2628071.2628091","url":null,"abstract":"This paper proposes a new algorithmic paradigm — k-level asynchronous (KLA) — that bridges level-synchronous and asynchronous paradigms for processing graphs. The KLA paradigm enables the level of asynchrony in parallel graph algorithms to be parametrically varied from none (level-synchronous) to full (asynchronous). The motivation is to improve execution times through an appropriate trade-off between the use of fewer, but more expensive global synchronizations, as in level-synchronous algorithms, and more, but less expensive local synchronizations (and perhaps also redundant work), as in asynchronous algorithms. We show how common patterns in graph algorithms can be expressed in the KLA pardigm and provide techniques for determining k, the number of asynchronous steps allowed between global synchronizations. Results of an implementation of KLA in the STAPL Graph Library show excellent scalability on up to 96K cores and improvements of 10× or more over level-synchronous and asynchronous versions for graph algorithms such as breadth-first search, PageRank, k-core decomposition and others on certain classes of real-world graphs.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115257062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SM-centric transformation: Circumventing hardware restrictions for flexible GPU scheduling","authors":"Bo Wu, Guoyang Chen, Dong Li, Xipeng Shen, J. Vetter","doi":"10.1145/2628071.2628130","DOIUrl":"https://doi.org/10.1145/2628071.2628130","url":null,"abstract":"To circumvent the limitation from the hardware scheduler on GPU, we create an SM-centric transformation technique. This technique enables complete control of the mapping between tasks and streaming multi-processors (SMs), and enables controlling the number of active thread blocks on each SM. Results show that our approach achieves better speedup than previous ones with kernel co-run cases.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115296845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Processing big data graphs on memory-restricted systems","authors":"Harshvardhan, N. Amato, Lawrence Rauchwerger","doi":"10.1145/2628071.2671429","DOIUrl":"https://doi.org/10.1145/2628071.2671429","url":null,"abstract":"With the advent of big-data, processing large graphs quickly has become increasingly important. Most existing approaches either utilize in-memory processing techniques, which can only process graphs that fit completely in RAM, or disk-based techniques that sacrifice performance. Contribution. In this work, we propose a novel RAM-Disk hybrid approach to graph processing that can scale well from a single shared-memory node to large distributed-memory systems. It works by partitioning the graph into subgraphs that fit in RAM and uses a paging-like technique to load subgraphs. We show that without modifying the algorithms, this approach can scale from small memory-constrained systems (such as tablets) to large-scale distributed machines with 16, 000+ cores.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"1 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128921055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data-reuse optimizations for pipelined tiling with parametric tile sizes","authors":"Alexandre Isoard","doi":"10.1145/2628071.2671425","DOIUrl":"https://doi.org/10.1145/2628071.2671425","url":null,"abstract":"Todays' hardware diversity exacerbates the need for optimizing compilers. A problem that arises when exploiting hardware accelerators (FPGA, GPU, dedicated boards) is how to automatically perform kernel/function offloading or outlining (as opposed to function inlining). The principle is to outsource part of the computation (the kernel to be performed on the accelerator) to a more efficient but more specialized hardware. This requires static analysis to identify the kernel input (data read) and output (data produced) and code generation for the kernel itself, the associated transfers, and the synchronization with the rest of the code (on the host CPU). In general, such tasks are done by the developer who is required to explicit the communications, allocate and size the intermediate buffers, and segment the kernel into fitting chunks of computation. When a single kernel is offloaded in a three-phases process (i.e., upload, compute, store back), such programming remains feasible: for GPUs, the developers can use OpenCL or CUDA, or rely on higherlevel abstractions, such as the directives of OpenACC1 or the garbage collector mechanisms of SPOC2. However, in some cases, it is necessary to decompose a kernel into a sequence of smaller kernels (to get blocking algorithms, thanks to loop tiling) that are optimized with pipelined communications and data reuse among blocks (tiles). The choice of tile sizes is driven by hardware capabilities such as memory bandwidth, memory size and organization, computational power, and such codes are extremely hard to obtain without automation and some cost model. The contribution supported by this abstract and the associated poster is a parametric (w.r.t. tile size) analysis technique to perform these steps, including inter-tile data reuse and pipelining, using polyhedral optimizations3. It has been presented at the IMPACT'14 workshop [2].","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"70 1-2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120913754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"kMAF: Automatic kernel-level management of thread and data affinity","authors":"M. Diener, E. Cruz, P. Navaux, Anselm Busse, Hans-Ulrich Heiß","doi":"10.1145/2628071.2628085","DOIUrl":"https://doi.org/10.1145/2628071.2628085","url":null,"abstract":"One of the main challenges for parallel architectures is the increasing complexity of the memory hierarchy, which consists of several levels of private and shared caches, as well as interconnections between separate memories in NUMA machines. To make full use of this hierarchy, it is necessary to improve the locality of memory accesses by reducing accesses to remote caches and memories, and using local ones instead. Two techniques can be used to increase the memory access locality: executing threads and processes that access shared data close to each other in the memory hierarchy (thread affinity), and placing the memory pages they access on the NUMA node they are executing on (data affinity). Most related work in this area focuses on either thread or data affinity, but not both, which limits the improvements. Other mechanisms require expensive operations, such as memory access traces or binary analysis, require changes to hardware or work only on specific parallel APIs. In this paper, we introduce kMAF, a mechanism that automatically manages thread and data affinity on the kernel level. The memory access behavior of the running application is determined during its execution by analyzing its page faults. This information is used by kMAF to migrate threads and memory pages, such that the overall memory access locality is optimized. Extensive evaluation with 27 benchmarks from 4 benchmark suites shows substantial performance improvements, with results close to an oracle mechanism. Execution time was reduced by up to 35.7% (13.8% on average), while energy efficiency was improved by up to 34.6% (9.3% on average).","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116523706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"COLORIS: A dynamic cache partitioning system using page coloring","authors":"Y. Ye, R. West, Zhuoqun Cheng, Ye Li","doi":"10.1145/2628071.2628104","DOIUrl":"https://doi.org/10.1145/2628071.2628104","url":null,"abstract":"Shared caches in multicore processors are subject to contention from co-running threads. The resultant interference can lead to highly-variable performance for individual applications. This is particularly problematic for real-time applications, requiring predictable timing guarantees. Previous work has applied page coloring techniques to partition a shared cache, so that conflict misses are minimized amongst co-running workloads. However, prior page coloring techniques have not addressed the problem of partitioning a cache on over-committed processors where there are more executable threads than cores. Similarly, page coloring techniques have not proven efficient at adapting the cache partition sizes for threads with varying memory demands. This paper presents a memory management framework called COLORIS, which provides support for both static and dynamic cache partitioning using page coloring. COLORIS supports novel policies to reconfigure the assignment of page colors amongst application threads in over-committed systems. For quality-of-service (QoS), COLORIS monitors the cache miss rates of running applications and triggers re-partitioning of the cache to prevent miss rates exceeding applications-specific ranges. This paper presents the design and evaluation of COLORIS as applied to Linux. We show the efficiency and effectiveness of COLORIS to color memory pages for a set of SPEC CPU2006 workloads, thereby enhancing performance isolation over existing page coloring techniques.","PeriodicalId":263670,"journal":{"name":"2014 23rd International Conference on Parallel Architecture and Compilation (PACT)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133785550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}