Latest publications: 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)

A scalable processing-in-memory accelerator for parallel graph processing
Junwhan Ahn, Sungpack Hong, S. Yoo, O. Mutlu, Kiyoung Choi
{"title":"A scalable processing-in-memory accelerator for parallel graph processing","authors":"Junwhan Ahn, Sungpack Hong, S. Yoo, O. Mutlu, Kiyoung Choi","doi":"10.1145/2749469.2750386","DOIUrl":"https://doi.org/10.1145/2749469.2750386","url":null,"abstract":"The explosion of digital data and the ever-growing need for fast data analysis have made in-memory big-data processing in computer systems increasingly important. In particular, large-scale graph processing is gaining attention due to its broad applicability from social science to machine learning. However, scalable hardware design that can efficiently process large graphs in main memory is still an open problem. Ideally, cost-effective and scalable graph processing systems can be realized by building a system whose performance increases proportionally with the sizes of graphs that can be stored in the system, which is extremely challenging in conventional systems due to severe memory bandwidth limitations. In this work, we argue that the conventional concept of processing-in-memory (PIM) can be a viable solution to achieve such an objective. The key modern enabler for PIM is the recent advancement of the 3D integration technology that facilitates stacking logic and memory dies in a single package, which was not available when the PIM concept was originally examined. In order to take advantage of such a new technology to enable memory-capacity-proportional performance, we design a programmable PIM accelerator for large-scale graph processing called Tesseract. Tesseract is composed of (1) a new hardware architecture that fully utilizes the available memory bandwidth, (2) an efficient method of communication between different memory partitions, and (3) a programming interface that reflects and exploits the unique hardware design. It also includes two hardware prefetchers specialized for memory access patterns of graph processing, which operate based on the hints provided by our programming model. Our comprehensive evaluations using five state-of-the-art graph processing workloads with large real-world graphs show that the proposed architecture improves average system performance by a factor of ten and achieves 87% average energy reduction over conventional systems.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"29 1","pages":"105-117"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80303922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 722
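To make the partitioned programming style concrete, here is a minimal sketch of the idea the abstract describes: the graph is split across memory partitions and small computations are sent to the partition owning the target vertex instead of pulling its data across the memory bus. This is not Tesseract's actual API; the partitioning function and PageRank workload are illustrative assumptions.

```python
# Illustrative sketch (not Tesseract's actual API): vertex data lives in
# per-partition memory cubes; instead of pulling remote vertex data, each
# partition enqueues small function-call messages at the partition that
# owns the target vertex, mirroring the remote-call style the abstract's
# programming interface implies.
from collections import defaultdict

NUM_PARTITIONS = 4
owner = lambda v: v % NUM_PARTITIONS  # hypothetical partitioning function

def pagerank_step(graph, rank, out_deg):
    """One synchronous PageRank iteration with partition-local queues."""
    inbox = defaultdict(list)            # messages destined for each partition
    for src, dsts in graph.items():
        contrib = rank[src] / out_deg[src]
        for dst in dsts:
            # "send computation to the data": enqueue (dst, value) at dst's owner
            inbox[owner(dst)].append((dst, contrib))
    new_rank = {v: 0.15 for v in rank}
    for part in range(NUM_PARTITIONS):   # each partition drains its own queue
        for dst, contrib in inbox[part]:
            new_rank[dst] += 0.85 * contrib
    return new_rank

graph = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}
out_deg = {v: len(d) for v, d in graph.items()}
rank = {v: 1.0 for v in graph}
for _ in range(10):
    rank = pagerank_step(graph, rank, out_deg)
print(rank)
```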
Towards sustainable in-situ server systems in the big data era
Chao Li, Yang Hu, Longjun Liu, Juncheng Gu, Mingcong Song, Xiaoyao Liang, Jingling Yuan, Tao Li
{"title":"Towards sustainable in-situ server systems in the big data era","authors":"Chao Li, Yang Hu, Longjun Liu, Juncheng Gu, Mingcong Song, Xiaoyao Liang, Jingling Yuan, Tao Li","doi":"10.1145/2749469.2750381","DOIUrl":"https://doi.org/10.1145/2749469.2750381","url":null,"abstract":"Recent years have seen an explosion of data volumes from a myriad of distributed sources such as ubiquitous cameras and various sensors. The challenges of analyzing these geographically dispersed datasets are increasing due to the significant data movement overhead, time-consuming data aggregation, and escalating energy needs. Rather than constantly move a tremendous amount of raw data to remote warehouse-scale computing systems for processing, it would be beneficial to leverage in-situ server systems (InS) to pre-process data, i.e., bringing computation to where the data is located. This paper takes the first step towards designing server clusters for data processing in the field. We investigate two representative in-situ computing applications, where data is normally generated from environmentally sensitive areas or remote places that lack established utility infrastructure. These very special operating environments of in-situ servers urge us to explore standalone (i.e., off-grid) systems that offer the opportunity to benefit from local, self-generated energy sources. In this work we implement a heavily instrumented proof-of-concept prototype called InSURE: in-situ server systems using renewable energy. We develop a novel energy buffering mechanism and a unique joint spatio-temporal power management strategy to coordinate standalone power supplies and in-situ servers. We present detailed deployment experiences to quantify how our design fits with in-situ processing in the real world. Overall, InSURE yields 20%~60% improvements over a state-of-the-art baseline. It maintains impressive control effectiveness in under-provisioned environment and can economically scale along with the data processing needs. The proposed design is well complementary to today's grid-connected cloud data centers and provides competitive cost-effectiveness.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"112 1","pages":"14-26"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79340326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 54
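As a rough illustration of what coordinating a standalone supply, an energy buffer, and server power caps involves, here is a toy control loop. All numbers and the policy itself are hypothetical assumptions, not InSURE's actual mechanism.

```python
# Toy model (hypothetical numbers, not InSURE's actual policy): coordinate a
# renewable supply, an energy buffer (battery), and server power caps so the
# standalone cluster never draws more than supply plus buffer can deliver.
def power_caps(supply_watts, battery_wh, server_max_w, hours_per_step=1.0):
    caps, battery = [], battery_wh        # buffer starts full
    for supply in supply_watts:
        # Spend from the buffer only when supply alone can't run the servers.
        budget = supply + battery / hours_per_step
        cap = min(server_max_w, budget)
        battery += (supply - cap) * hours_per_step    # charge or discharge
        battery = max(0.0, min(battery, battery_wh))  # clamp to capacity
        caps.append(cap)
    return caps

# One day of solar supply (W) at 3-hour granularity, 200 Wh buffer, 150 W cluster:
print(power_caps([0, 50, 180, 220, 160, 40, 0, 0], 200.0, 150.0, 3.0))
```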
Data reorganization in memory using 3D-stacked DRAM
Berkin Akin, F. Franchetti, J. Hoe
{"title":"Data reorganization in memory using 3D-stacked DRAM","authors":"Berkin Akin, F. Franchetti, J. Hoe","doi":"10.1145/2749469.2750397","DOIUrl":"https://doi.org/10.1145/2749469.2750397","url":null,"abstract":"In this paper we focus on common data reorganization operations such as shuffle, pack/unpack, swap, transpose, and layout transformations. Although these operations simply relocate the data in the memory, they are costly on conventional systems mainly due to inefficient access patterns, limited data reuse and roundtrip data traversal throughout the memory hierarchy. This paper presents a two pronged approach for efficient data reorganization, which combines (i) a proposed DRAM-aware reshape accelerator integrated within 3D-stacked DRAM, and (ii) a mathematical framework that is used to represent and optimize the reorganization operations. We evaluate our proposed system through two major use cases. First, we demonstrate the reshape accelerator in performing a physical address remapping via data layout transform to utilize the internal parallelism/locality of the 3D-stacked DRAM structure more efficiently for general purpose workloads. Then, we focus on offloading and accelerating commonly used data reorganization routines selected from the Intel Math Kernel Library package. We evaluate the energy and performance benefits of our approach by comparing it against existing optimized implementations on state-of-the-art GPUs and CPUs. For the various test cases, in-memory data reorganization provides orders of magnitude performance and energy efficiency improvements via low overhead hardware.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"78 1","pages":"131-143"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78513816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 178
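A flavor of how a layout transform can be expressed as pure address remapping: the sketch below permutes address bits so that a matrix transpose requires no data-dependent computation at all. The paper's actual framework is based on permutation formalisms over DRAM address streams; this simplified stand-in only illustrates the principle.

```python
# Sketch: a layout transform expressed as a permutation of address bits, in
# the spirit of the paper's mathematical framework (the exact formalism there
# is richer; this is a simplified stand-in).
def permute_bits(addr, perm):
    """Route bit perm[i] of the old address to bit i of the new one."""
    return sum(((addr >> src) & 1) << dst for dst, src in enumerate(perm))

# Example: an 8x8 row-major matrix uses a 6-bit address [r2 r1 r0 c2 c1 c0].
# Swapping the row and column bit groups yields [c2 c1 c0 r2 r1 r0], i.e. a
# transpose done purely by address remapping.
perm = [3, 4, 5, 0, 1, 2]          # new bit i <- old bit perm[i]
old = list(range(64))
new = [0] * 64
for a, v in enumerate(old):
    new[permute_bits(a, perm)] = v
assert new[1 * 8 + 5] == old[5 * 8 + 1]   # element (5,1) moved to (1,5)
```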
FaultHound: Value-locality-based soft-fault tolerance
Nitin, I. Pomeranz, T. N. Vijaykumar
{"title":"FaultHound: Value-locality-based soft-fault tolerance","authors":"Nitin, I. Pomeranz, T. N. Vijaykumar","doi":"10.1145/2749469.2750372","DOIUrl":"https://doi.org/10.1145/2749469.2750372","url":null,"abstract":"Soft error susceptibility is a growing concern with continued CMOS scaling. Previous work explores full- and partial-redundancy schemes in hardware and software for soft-fault tolerance. However, full-redundancy schemes incur high performance and energy overheads whereas partial-redundancy schemes achieve low coverage. An initial study, called Perturbation Based Fault Screening (PBFS), explores exploiting value locality to provide hints of soft faults whenever a value falls outside its neighborhood. PBFS employs bit-mask filters to capture value neighborhoods. However, PBFS achieves low coverage; straightforwardly improving the coverage results in high false-positive rates, and performance and energy overheads. We propose FaultHound, a value-locality-based soft-fault tolerance scheme, which employs five mechanisms to address PBFS's limitations: (1) a scheme to cluster the filters via an inverted organization of the filter tables to reinforce learning and reduce the false-positive rates; (2) a learning scheme for ignoring the delinquent bit positions that raise repeated false alarms, to reduce further the false-positive rate; (3) a light-weight predecessor replay scheme instead of a full rollback to reduce the performance and energy penalty of the remaining false positives; (4) a simple scheme to distinguish rename faults, which require rollback instead of replay for recovery, from false positives to avoid unnecessary rollback penalty; and (5) a detection scheme, which avoids rollback, for the load-store queue which is not covered by our replay. Using simulations, we show that while PBFS achieves either low coverage (30%), or high false-positive rates (8%) with high performance overheads (97%), FaultHound achieves higher coverage (75%) and lower false-positive rates (3%) with lower performance and energy overheads (10% and 25%).","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"1 1","pages":"668-681"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88649731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 8
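To see what a bit-mask value-neighborhood filter does, here is a minimal sketch of the PBFS-style idea that FaultHound builds on. It does not model FaultHound's inverted filter tables or its replay machinery; the class and its learning rule are illustrative assumptions.

```python
# Minimal sketch of the PBFS-style idea FaultHound builds on (not FaultHound's
# inverted filter tables): track which bits of a value have ever varied; a new
# value that flips a so-far-stable bit falls outside the learned neighborhood
# and raises a soft-fault hint.
class BitMaskFilter:
    def __init__(self, width=32):
        self.width = width
        self.base = None      # first value observed
        self.varied = 0       # bit positions seen to change (the "neighborhood")

    def observe(self, value):
        """Return True if value is within the learned neighborhood."""
        if self.base is None:
            self.base = value
            return True
        flipped = (value ^ self.base) & ((1 << self.width) - 1)
        suspicious = flipped & ~self.varied   # flips in previously-stable bits
        self.varied |= flipped                # learn: widen the neighborhood
        return suspicious == 0

f = BitMaskFilter()
for v in [0x1000, 0x1004, 0x1008, 0x100C]:
    f.observe(v)          # warm-up: each newly-varying bit alarms once, then is learned
print(f.observe(0x9010))  # False: a never-varied high bit flipped -> fault hint
```

The warm-up false alarms visible in this toy mirror the coverage/false-positive tension the abstract describes.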
Flexible auto-refresh: Enabling scalable and energy-efficient DRAM refresh reductions
Ishwar Bhati, Zeshan A. Chishti, Shih-Lien Lu, B. Jacob
{"title":"Flexible auto-refresh: Enabling scalable and energy-efficient DRAM refresh reductions","authors":"Ishwar Bhati, Zeshan A. Chishti, Shih-Lien Lu, B. Jacob","doi":"10.1145/2749469.2750408","DOIUrl":"https://doi.org/10.1145/2749469.2750408","url":null,"abstract":"DRAM cells require periodic refreshing to preserve data. In JEDEC DDRx devices, a refresh operation is performed via an auto-refresh command, which refreshes multiple rows in multiple banks simultaneously. The internal implementation of auto-refresh is completely opaque outside the DRAM - all the memory controller can do is to instruct the DRAM to refresh itself - the DRAM handles all else, in particular determining which rows in which banks are to be refreshed. This is in conflict with a large body of research on reducing the refresh overhead, in which the memory controller needs fine-grained control over which regions of the memory are refreshed. For example, prior works exploit the fact that a subset of DRAM rows can be refreshed at a slower rate than other rows due to access rate or retention period variations. However, such row-granularity approaches cannot use the standard auto-refresh command, which refreshes an entire batch of rows at once and does not permit skipping of rows. Consequently, prior schemes are forced to use explicit sequences of activate (ACT) and precharge (PRE) operations to mimic row-level refreshing. The drawback is that, compared to using JEDEC's auto-refresh mechanism, using explicit ACT and PRE commands is inefficient, both in terms of performance and power. In this paper, we show that even when skipping a high percentage of refresh operations, existing row-granurality refresh techniques are mostly ineffective due to the inherent efficiency disparity between ACT/PRE and the JEDEC auto-refresh mechanism. We propose a modification to the DRAM that extends its existing control-register access protocol to include the DRAM's internal refresh counter. We also introduce a new “dummy refresh” command that skips refresh operations and simply increments the internal counter. We show that these modifications allow a memory controller to reduce as many refreshes as in prior work, while achieving significant energy and performance advantages by using auto-refresh most of the time.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"9 1","pages":"235-246"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85175731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 97
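The dummy-refresh mechanism is simple enough to model directly. Below is a toy controller loop under the paper's proposed interface: the refresh counter is visible, so an auto-refresh whose entire batch of rows can be skipped is replaced by a dummy refresh that only advances the counter. Batch size, row count, and the retention profile are assumptions for illustration.

```python
# Toy controller model of the paper's proposal (simplified; real DDRx timing
# and bank structure omitted): the DRAM's internal refresh counter is visible,
# so the controller replaces an auto-refresh whose whole batch of rows is
# skippable (e.g., long-retention rows) with a "dummy refresh" that merely
# increments the counter.
ROWS, BATCH = 1024, 8

def refresh_tick(counter, skippable):
    """One refresh interval: return (command, new_counter)."""
    batch = range(counter * BATCH, (counter + 1) * BATCH)
    cmd = "DUMMY_REF" if all(r in skippable for r in batch) else "AUTO_REF"
    return cmd, (counter + 1) % (ROWS // BATCH)

skippable = set(range(0, 512))     # hypothetical: first half needs 1/4 the rate
counter, cmds = 0, []
for _ in range(ROWS // BATCH):
    cmd, counter = refresh_tick(counter, skippable)
    cmds.append(cmd)
print(cmds.count("DUMMY_REF"), "of", len(cmds), "refreshes skipped")
```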
Efficient execution of memory access phases using dataflow specialization
C. Ho, Sung Jin Kim, K. Sankaralingam
{"title":"Efficient execution of memory access phases using dataflow specialization","authors":"C. Ho, Sung Jin Kim, K. Sankaralingam","doi":"10.1145/2749469.2750390","DOIUrl":"https://doi.org/10.1145/2749469.2750390","url":null,"abstract":"This paper identifies a new opportunity for improving the efficiency of a processor core: memory access phases of programs. These are dynamic regions of programs where most of the instructions are devoted to memory access or address computation. These occur naturally in programs because of workload properties, or when employing an in-core accelerator, we get induced phases where the code execution on the core is access code. We observe such code requires an OOO core's dataflow and dynamism to run fast and does not execute well on an in-order processor. However, an OOO core consumes much power, effectively increasing energy consumption and reducing the energy efficiency of in-core accelerators. We develop an execution model called memory access dataflow (MAD) that encodes dataflow computation, event-condition-action rules, and explicit actions. Using it we build a specialized engine that provides an OOO core's performance but at a fraction of the power. Such an engine can serve as a general way for any accelerator to execute its respective induced phase, thus providing a common interface and implementation for current and future accelerators. We have designed and implemented MAD in RTL, and we demonstrate its generality and flexibility by integration with four diverse accelerators (SSE, DySER, NPU, and C-Cores). Our quantitative results show, relative to in-order, 2-wide OOO, and 4-wide OOO, MAD provides 2.4×, 1.4× and equivalent performance respectively. It provides 0.8×, 0.6× and 0.4× lower energy.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"16 1","pages":"118-130"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88352103","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 30
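Event-condition-action rules are the least familiar piece of the MAD model, so here is a minimal software analogue: events trigger condition checks that fire actions, which may emit further events. The real engine encodes such rules in hardware; the rule format and strided-load example below are illustrative assumptions.

```python
# Minimal event-condition-action engine in the spirit of the MAD execution
# model (the real engine is hardware; names here are illustrative). An event
# is a tag plus payload; each rule fires its action when its event arrives
# and its condition holds, possibly emitting new events.
from collections import deque

def run(rules, initial_events):
    queue, log = deque(initial_events), []
    while queue:
        tag, payload = queue.popleft()
        for event_tag, cond, action in rules:
            if event_tag == tag and cond(payload):
                log.append(action.__name__)
                queue.extend(action(payload))
    return log

# Strided-load address generation as ECA rules: keep issuing loads until
# the address walks past the end of the array.
END = 64
def issue_load(addr):   return [("load_done", addr + 8)]   # next address
def stop(addr):         return []

rules = [
    ("load_done", lambda a: a < END,  issue_load),
    ("load_done", lambda a: a >= END, stop),
]
print(run(rules, [("load_done", 0)]))
```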
Computer performance microscopy with Shim
Xi Yang, S. Blackburn, K. McKinley
{"title":"Computer performance microscopy with Shim","authors":"Xi Yang, S. Blackburn, K. McKinley","doi":"10.1145/2749469.2750401","DOIUrl":"https://doi.org/10.1145/2749469.2750401","url":null,"abstract":"Developers and architects spend a lot of time trying to understand and eliminate performance problems. Unfortunately, the root causes of many problems occur at a fine granularity that existing continuous profiling and direct measurement approaches cannot observe. This paper presents the design and implementation of Shim, a continuous profiler that samples at resolutions as fine as 15 cycles; three to five orders of magnitude finer than current continuous profilers. Shim's fine-grain measurements reveal new behaviors, such as variations in instructions per cycle (IPC) within the execution of a single function. A Shim observer thread executes and samples autonomously on unutilized hardware. To sample, it reads hardware performance counters and memory locations that store software state. Shim improves its accuracy by automatically detecting and discarding samples affected by measurement skew. We measure Shim's observer effects and show how to analyze them. When on a separate core, Shim can continuously observe one software signal with a 2% overhead at a ~1200 cycle resolution. At an overhead of 61%, Shim samples one software signal on the same core with SMT at a ~15 cycle resolution. Modest hardware changes could significantly reduce overheads and add greater analytical capability to Shim. We vary prefetching and DVFS policies in case studies that show the diagnostic power of fine-grain IPC and memory bandwidth results. By repurposing existing hardware, we deliver a practical tool for fine-grain performance microscopy for developers and architects.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"106 1","pages":"170-184"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88115909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 25
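The observer-plus-skew-filtering structure can be sketched in user space: spin reading a signal, timestamp each read, and discard samples whose own read took anomalously long. This is only an analogue of Shim's loop (Shim reads hardware performance counters from a dedicated thread); the threshold and fake signal are assumptions.

```python
# User-space analogue of Shim's observer loop (illustrative only; Shim reads
# hardware performance counters from a dedicated observer thread). Samples
# whose own read took anomalously long are discarded as skewed, mirroring
# the paper's automatic skew detection.
import time

def observe(read_signal, n_samples, skew_ns=2000):
    kept, dropped = [], 0
    for _ in range(n_samples):
        t0 = time.perf_counter_ns()
        value = read_signal()
        t1 = time.perf_counter_ns()
        if t1 - t0 <= skew_ns:           # fast read -> trustworthy sample
            kept.append((t0, value))
        else:                            # preempted or stalled -> skewed
            dropped += 1
    return kept, dropped

counter = {"n": 0}
def fake_signal():                       # stand-in for a performance counter
    counter["n"] += 1
    return counter["n"]

kept, dropped = observe(fake_signal, 100_000)
print(f"kept {len(kept)} samples, dropped {dropped} as skewed")
```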
Warped-Compression: Enabling power efficient GPUs through register compression
Sangpil Lee, Keunsoo Kim, Gunjae Koo, Hyeran Jeon, W. Ro, M. Annavaram
{"title":"Warped-Compression: Enabling power efficient GPUs through register compression","authors":"Sangpil Lee, Keunsoo Kim, Gunjae Koo, Hyeran Jeon, W. Ro, M. Annavaram","doi":"10.1145/2749469.2750417","DOIUrl":"https://doi.org/10.1145/2749469.2750417","url":null,"abstract":"This paper presents Warped-Compression, a warp-level register compression scheme for reducing GPU power consumption. This work is motivated by the observation that the register values of threads within the same warp are similar, namely the arithmetic differences between two successive thread registers is small. Removing data redundancy of register values through register compression reduces the effective register width, thereby enabling power reduction opportunities. GPU register files are huge as they are necessary to keep concurrent execution contexts and to enable fast context switching. As a result register file consumes a large fraction of the total GPU chip power. GPU design trends show that the register file size will continue to increase to enable even more thread level parallelism. To reduce register file data redundancy warped-compression uses low-cost and implementationefficient base-delta-immediate (BDI) compression scheme, that takes advantage of banked register file organization used in GPUs. Since threads within a warp write values with strong similarity, BDI can quickly compress and decompress by selecting either a single register, or one of the register banks, as the primary base and then computing delta values of all the other registers, or banks. Warped-compression can be used to reduce both dynamic and leakage power. By compressing register values, each warp-level register access activates fewer register banks, which leads to reduction in dynamic power. When fewer banks are used to store the register content, leakage power can be reduced by power gating the unused banks. Evaluation results show that register compression saves 25% of the total register file power consumption.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"23 1","pages":"502-514"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87615899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 101
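BDI compression itself is compact enough to show directly. The sketch below compresses one warp's 32 register values against a single base; the paper's hardware additionally allows per-bank bases and multiple delta widths, which this simplified version omits.

```python
# Sketch of base-delta-immediate (BDI) compression across one warp's register
# values (simplified: single base, fixed delta width; the paper also allows
# per-bank bases). Values within a warp tend to be similar, so deltas fit in
# far fewer bits than the 32-bit originals.
def bdi_compress(values, delta_bits=8):
    base = values[0]
    lo, hi = -(1 << (delta_bits - 1)), (1 << (delta_bits - 1)) - 1
    deltas = [v - base for v in values]
    if all(lo <= d <= hi for d in deltas):
        return ("compressed", base, deltas)  # one full base + narrow deltas
    return ("raw", values)                   # fall back: not compressible

def bdi_decompress(blob):
    if blob[0] == "raw":
        return blob[1]
    _, base, deltas = blob
    return [base + d for d in deltas]

# Thread-indexed addresses in a warp differ by 4 -> highly compressible:
regs = [0x80000000 + 4 * lane for lane in range(32)]
blob = bdi_compress(regs)
assert bdi_decompress(blob) == regs
print(blob[0])  # "compressed"
```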
BlueDBM: An appliance for Big Data analytics
S. Jun, Ming Liu, Sungjin Lee, Jamey Hicks, J. Ankcorn, Myron King, Shuotao Xu, Arvind
{"title":"BlueDBM: An appliance for Big Data analytics","authors":"S. Jun, Ming Liu, Sungjin Lee, Jamey Hicks, J. Ankcorn, Myron King, Shuotao Xu, Arvind","doi":"10.1145/2749469.2750412","DOIUrl":"https://doi.org/10.1145/2749469.2750412","url":null,"abstract":"Complex data queries, because of their need for random accesses, have proven to be slow unless all the data can be accommodated in DRAM. There are many domains, such as genomics, geological data and daily twitter feeds where the datasets of interest are 5TB to 20 TB. For such a dataset, one would need a cluster with 100 servers, each with 128GB to 256GBs of DRAM, to accommodate all the data in DRAM. On the other hand, such datasets could be stored easily in the flash memory of a rack-sized cluster. Flash storage has much better random access performance than hard disks, which makes it desirable for analytics workloads. In this paper we present BlueDBM, a new system architecture which has flash-based storage with in-store processing capability and a low-latency high-throughput inter-controller network. We show that BlueDBM outperforms a flash-based system without these features by a factor of 10 for some important applications. While the performance of a ram-cloud system falls sharply even if only 5%~10% of the references are to the secondary storage, this sharp performance degradation is not an issue in BlueDBM. BlueDBM presents an attractive point in the cost-performance trade-off for Big Data analytics.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"17 1","pages":"1-13"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87494477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 174
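The sharp-degradation claim follows from simple average-latency arithmetic, sketched below with hypothetical tier latencies (the paper's measured numbers differ): a DRAM cluster that spills even 5% of references to disk is already slower on average than a uniformly flash-based design.

```python
# Back-of-envelope for the abstract's claim (hypothetical latencies): if even
# 5% of references go to disk-backed secondary storage, average access time
# is dominated by the slow tier, while a uniformly flash-based design stays
# predictable.
DRAM_NS, FLASH_NS, DISK_NS = 0.1e3, 100e3, 10e6   # 100 ns, 100 us, 10 ms

def avg_latency(miss_frac, slow_ns, fast_ns=DRAM_NS):
    return (1 - miss_frac) * fast_ns + miss_frac * slow_ns

print(avg_latency(0.05, DISK_NS) / 1e3, "us")   # DRAM + 5% disk -> ~500 us
print(avg_latency(1.00, FLASH_NS) / 1e3, "us")  # all-flash      -> 100 us
```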
Coherence protocol for transparent management of scratchpad memories in shared memory manycore architectures
Lluc Alvarez, L. Vilanova, Miquel Moretó, Marc Casas, Marc González, X. Martorell, N. Navarro, E. Ayguadé, M. Valero
{"title":"Coherence protocol for transparent management of scratchpad memories in shared memory manycore architectures","authors":"Lluc Alvarez, L. Vilanova, Miquel Moretó, Marc Casas, Marc González, X. Martorell, N. Navarro, E. Ayguadé, M. Valero","doi":"10.1145/2872887.2750411","DOIUrl":"https://doi.org/10.1145/2872887.2750411","url":null,"abstract":"The increasing number of cores in manycore architectures causes important power and scalability problems in the memory subsystem. One solution is to introduce scratchpad memories alongside the cache hierarchy, forming a hybrid memory system. Scratchpad memories are more power-efficient than caches and they do not generate coherence traffic, but they suffer from poor programmability. A good way to hide the programmability difficulties to the programmer is to give the compiler the responsibility of generating code to manage the scratchpad memories. Unfortunately, compilers do not succeed in generating this code in the presence of random memory accesses with unknown aliasing hazards. This paper proposes a coherence protocol for the hybrid memory system that allows the compiler to always generate code to manage the scratchpad memories. In coordination with the compiler, memory accesses that may access stale copies of data are identified and diverted to the valid copy of the data. The proposal allows the architecture to be exposed to the programmer as a shared memory manycore, maintaining the programming simplicity of shared memory models and preserving backwards compatibility. In a 64-core manycore, the coherence protocol adds overheads of 4% in performance, 8% in network traffic and 9% in energy consumption to enable the usage of the hybrid memory system that, compared to a cache-based system, achieves a speedup of 1.14x and reduces on-chip network traffic and energy consumption by 29% and 17%, respectively.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"125 1","pages":"720-732"},"PeriodicalIF":0.0,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73307901","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 32
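The diversion mechanism can be pictured as a lookup on potentially aliasing accesses. The toy below only illustrates that lookup; the real protocol performs it in hardware using compiler-emitted hints, and the address ranges and values here are hypothetical.

```python
# Toy model of the diversion idea (the real protocol works in hardware with
# compiler-emitted hints; this only illustrates the lookup). Ranges mapped
# into a core's scratchpad have their valid copy there; a possibly-aliasing
# access is checked and diverted to the valid copy instead of reading a
# stale cached one.
scratchpad_map = [(0x1000, 0x2000, "spm0")]   # (start, end, owning scratchpad)

def load(addr, memory, scratchpads):
    for start, end, spm in scratchpad_map:
        if start <= addr < end:               # valid copy lives in a scratchpad
            return scratchpads[spm][addr]     # divert the access
    return memory[addr]                       # otherwise: normal cached path

memory = {0x0500: 7, 0x1800: 0}               # 0x1800's cached copy is stale
scratchpads = {"spm0": {0x1800: 42}}          # the up-to-date copy
assert load(0x0500, memory, scratchpads) == 7
assert load(0x1800, memory, scratchpads) == 42
```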