Adaptive Row Addressing for Cost-Efficient Parallel Memory Protocols in Large-Capacity Memories
Dmitry Knyaginin, Vassilis D. Papaefstathiou, P. Stenström
In Proceedings of the Second International Symposium on Memory Systems (2016). DOI: 10.1145/2989081.2989103

Modern commercial workloads drive a continuous demand for larger and still low-latency main memories. JEDEC member companies indicate that parallel memory protocols will remain key to such memories, though widening the bus (increasing the pin count) to address larger capacities would cause multiple issues, ultimately reducing the speed (the peak data rate) and cost-efficiency of the protocols. To stay high-speed and cost-efficient, parallel memory protocols should therefore address larger capacities using the available number of pins. This is accomplished by multiplexing the pins to transfer each address in multiple bus cycles, implementing Multi-Cycle Addressing (MCA). However, the additional address-transfer cycles can significantly worsen performance and energy efficiency. This paper contributes the concept of adaptive row addressing, which comprises row-address caching to reduce the number of address-transfer cycles, enhanced by row-address prefetching and an adaptive row-access priority policy to improve state-of-the-art memory schedulers. For a case-study MCA protocol, the paper shows that the proposed concept improves: i) the read latency by 7.5% on average and up to 12.5%, and ii) the system-level performance and energy efficiency by 5.5% on average and up to 6.5%. In this way, adaptive row addressing makes the MCA protocol as efficient as an idealistic protocol of the same speed but with enough pins to transfer each row address in a single bus cycle.
{"title":"Analytical Study on Bandwidth Efficiency of Heterogeneous Memory Systems","authors":"Amin Farmahini Farahani, D. Roberts, N. Jayasena","doi":"10.1145/2989081.2989089","DOIUrl":"https://doi.org/10.1145/2989081.2989089","url":null,"abstract":"Heterogeneous memory systems integrate different memory technologies to balance design requirements such as bandwidth, capacity, and cost. Performance of these systems depends heavily on memory hierarchy organization, memory attributes, and application characteristics. In this paper, we present analytical bandwidth models for a range of heterogeneous memory systems composed of DRAM and non-volatile memory (NVM). Our models enable exploring heterogeneous memory systems with different organizations and attributes. Using the models, we study the bandwidth efficiency of heterogeneous memory systems to provide insights into the bandwidth bottlenecks of these systems under different application characteristics. Our analytical results highlight the importance of NVM read-write bandwidth asymmetry and DRAM-NVM bandwidth asymmetry in bandwidth efficiency. Specifically, in flat non-uniform memory access (NUMA) systems, the read bandwidth is maximized when a certain portion of bandwidth is delivered by DRAM and that portion depends on multiple factors including DRAM and NVM bandwidth attributes and application bandwidth characteristics. In DRAM-cache-based systems, when the hit rate is low, the impact of the DRAM cache organization on the read bandwidth is minimal. However, at higher hit rates and NVM bandwidths, the impact of the cache organization on sustained read bandwidth becomes pronounced.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"8 31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124638153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Photonic Interconnects for Interposer-based 2.5D/3D Integrated Systems on a Chip","authors":"P. Grani, R. Proietti, V. Akella, S. Yoo","doi":"10.1145/2989081.2989111","DOIUrl":"https://doi.org/10.1145/2989081.2989111","url":null,"abstract":"Instead of a single die per chip (package) multiple dies stacked vertically (3D) and placed on an interposer (2.5D), is emerging as the building block for the future in both mobile and high-performance applications. We identify that bandwidth, energy per bit, and the ability to support a large amount of memory, are the key requirements of the inter-die Network on Chip (NoC). We propose to use an interposer with optical interconnections exploiting AWGR (Arrayed Waveguide Grating Router) wavelength routing to realize a 16x16 photonic NoC with a bisection bandwidth of 16 Tb/s. We propose a baseline network, which consumes 2.81 pJ/bit assuming 100% utilization. We show that the power is dominated by the electro-optical interface of the transmitter, which can be reduced by a more aggressive design that improves the energy per bit to 0.437 pJ/bit at 100% utilization. The networks exhibit very low latency as they are based on an optical crossbar topology and are scalable to multiple chips to support 1 TB of memory or more to meet the requirements of Exascale computing.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126538480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data-Centric Computing Frontiers: A Survey On Processing-In-Memory","authors":"P. Siegl, R. Buchty, Mladen Berekovic","doi":"10.1145/2989081.2989087","DOIUrl":"https://doi.org/10.1145/2989081.2989087","url":null,"abstract":"A major shift from compute-centric to data-centric computing systems can be perceived, as novel big data workloads like cognitive computing and machine learning strongly enforce embarrassingly parallel and highly efficient processor architectures. With Moore's law having surrendered, innovative architectural concepts as well as technologies are urgently required, to enable a path for tackling exascale and beyond -- even though current computing systems face the inevitable instruction-level parallelism, power, memory, and bandwidth walls. As part of any computing system, the general perception of memories depicts unreliability, power hungriness and slowness, resulting in a future prospective bottleneck. The latter being an outcome of a pin limitation derived by packaging constraints, an unexploited tremendous row bandwidth is determinable, which off-chip diminishes to a bare minimum. Building upon a shift towards data-centric computing systems, the near-memory processing concept seems to be most promising, since power efficiency and computing performance increase by co-locating tasks on bandwidth-rich in-memory processing units, whereas data motion mitigates by the avoidance of entire memory hierarchies. By considering the umbrella of near-data processing as the urgent required breakthrough for future computing systems, this survey presents its derivations with a special emphasis on Processing-In-Memory (PIM), highlighting historical achievements in technology as well as architecture while depicting its advantages and obstacles.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127949297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AWARD: Approximation-aWAre Restore in Further Scaling DRAM
Xianwei Zhang, Youtao Zhang, B. Childers, Jun Yang
In Proceedings of the Second International Symposium on Memory Systems (2016). DOI: 10.1145/2989081.2989127

Further DRAM scaling is becoming more and more challenging, making the restore operation a serious issue in the near future. Fortunately, a wide range of modern applications can tolerate error or inexactness, providing a new dimension for mitigating the slow-restore issue: acceptable quality-of-service (QoS) loss in such applications can be traded for faster restore operations, and thus for performance and energy improvements. In this extended research abstract, we briefly explore restore-based approximate computing in DRAM and present a preliminary evaluation of the resulting QoS degradation and performance speedup. We show that restore-based approximate computing is challenging, and that dedicated error-correction/tolerance techniques are needed to balance QoS and performance.
{"title":"Analyzing allocation behavior for multi-level memory","authors":"G. Voskuilen, Arun Rodrigues, S. Hammond","doi":"10.1145/2989081.2989116","DOIUrl":"https://doi.org/10.1145/2989081.2989116","url":null,"abstract":"Managing multi-level memories will require different policies from those used for cache hierarchies, as memory technologies differ in latency, bandwidth, and volatility. To this end we analyze application data allocations and main memory accesses to determine whether an application-driven approach to managing a multi-level memory system comprising stacked and conventional DRAM is viable. Our early analysis shows that the approach is viable, but some applications may require dynamic allocations (i.e., migration) while others are amenable to static allocation.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116413343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Level Memory Policies: What You Add Is More Important Than What You Take Out","authors":"S. Hammond, Arun Rodrigues, G. Voskuilen","doi":"10.1145/2989081.2989117","DOIUrl":"https://doi.org/10.1145/2989081.2989117","url":null,"abstract":"Multi-Level Memory (MLM) will be an increasingly common organization for main memory. Hybrid main memories that combine conventional DDR and \"fast\" memory will allow higher peak bandwidth at an attainable cost. However, the chief hurdle for MLM systems is the management of data placement. While user-directed placement may work for some applications, it imposes a heavy burden on the programmer. To avoid this burden while still benefiting from MLM, we propose a number of automated management policies. Our results show that several possible policies offer performance and implementation trade offs. Also, unlike conventional cache or paged memory policies, the addition policy is much more important than the replacement policy.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133507029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Twin-Load: Bridging the Gap between Conventional Direct-Attached and Buffer-on-Board Memory Systems","authors":"Zehan Cui, Tianyue Lu, S. Mckee, Mingyu Chen, Haiyang Pan, Yuan Ruan","doi":"10.1145/2989081.2989106","DOIUrl":"https://doi.org/10.1145/2989081.2989106","url":null,"abstract":"Conventional systems with direct-attached DRAM struggle to meet growing memory capacity demands: the number of channels is limited by pin count, and the number of modules per channel is limited by signal integrity issues. Recent buffer-on-board (BOB) designs move some memory controller functionality to a separate buffer chip, which lets them support larger capacities (by adding more DRAM or denser, non-volatile components). Nonetheless, lower-cost, lower-latency, direct-attached DRAM still represents a better price-performance solution for many applications. Most processors exclusively implement either the direct-attached or the BOB approach. Combining both technologies within one processor has obvious benefits, but current memory-interface requirements complicate this straightforward solution. The standard DRAM interface is DDR, which requires data to be returned at a fixed latency. In contrast, the BOB interface supports diverse memory technologies precisely because it allows asynchrony. We propose Twin-Load technology to enable one processor to support both direct-attached and BOB memory. We show how to use Twin-Load to support BOB memory over standard DDR interfaces with minimal processor modifications. We build an asynchronous protocol over the existing, synchronous interface by splitting each memory read into twinned loads. The first acts as a prefetch to the buffer chip, and the second asynchronously fetches the data. We describe three methods for generating twinned loads, each leveraging different layers of the system stack.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"33 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133107227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Co-DIMM: Inter-Socket Data Sharing via a Common DIMM Channel
Ke Zhang, Lei-Ping Yu, Yisong Chang, Ran Zhao, Hongxia Zhang, Lixin Zhang, Mingyu Chen, S. Mckee
In Proceedings of the Second International Symposium on Memory Systems (2016). DOI: 10.1145/2989081.2989112

To improve computing density, modern datacenters widely deploy server chassis with several processor sockets integrated as independent nodes. Distributed applications processing enormous datasets on such systems require frequent inter-node communication, yet data sharing among the on-board socket nodes of the same server chassis via commodity networking and inter-socket connection technologies is inefficient. To address this problem, we propose Co-DIMM, which enables inter-socket data sharing via normal memory-access instructions. Co-DIMM eliminates the overheads of protocol-stack processing and of moving data through the network. Instead of sharing data through centralized shared memory based on NUMA inter-socket connections, DDR switches allow Co-DIMM ownership to be changed dynamically to support asynchronous producer-consumer data sharing. We implement Co-DIMM on a custom in-house FPGA-based platform; preliminary results show that data-sharing latency between two sockets is as low as 1.33 μs. We present potential Co-DIMM usage scenarios and discuss implementation challenges.
{"title":"Concurrent Dynamic Memory Coalescing on GoblinCore-64 Architecture","authors":"Xi Wang, John D. Leidel, Yong Chen","doi":"10.1145/2989081.2989128","DOIUrl":"https://doi.org/10.1145/2989081.2989128","url":null,"abstract":"The majority of modern microprocessors are architected to utilize multi-level data caches as a primary optimization to reduce the latency and increase the perceived bandwidth from an application. The spatial and temporal locality provided by data caches work well in conjunction with applications that access memory in a linear fashion. However, applications that exhibit random or non-deterministic memory access patterns often induce a significant number of data cache misses, thus reducing the natural performance benefit from the data cache. In response to the performance penalties inherently present with non-deterministic applications, we have constructed a unique memory hierarchy within the GoblinCore-64 (GC64) architecture explicitly designed to exploit memory performance from irregular memory access patterns. The GC64 architecture combines a RISC-V-based core coupled with latency-hiding architectural features to a memory hierarchy with Hybrid Memory Cube (HMC) devices. In order to cope with the inherent non-determinism of applications and to exploit the packetized interface presented by the HMC device, we develop a methodology and associated implementation of a dynamic memory coalescing unit for the GC64 memory hierarchy that permits us to statistically sample memory requests from non-deterministic applications and coalesce them into the largest possible HMC payload requests. In this work, we present two parallel methodologies and associated implementations for coalescing non-deterministic memory requests into the largest potential HMC request by constructing a binary tree representation of the live memory requests from disparate cores. We present the coalesced HMC memory request results from applications that exhibit linear and non-linear memory request patterns compiled for a RISC-V core in contrast with a traditional memory hierarchy.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114280829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}