Adaptive Row Addressing for Cost-Efficient Parallel Memory Protocols in Large-Capacity Memories
Dmitry Knyaginin, Vassilis D. Papaefstathiou, P. Stenström
In Proceedings of the Second International Symposium on Memory Systems (2016). DOI: 10.1145/2989081.2989103

Modern commercial workloads drive a continuous demand for larger and still low-latency main memories. JEDEC member companies indicate that parallel memory protocols will remain key to such memories, though widening the bus (increasing the pin count) to address larger capacities would cause multiple issues, ultimately reducing the speed (the peak data rate) and cost-efficiency of the protocols. To stay high-speed and cost-efficient, parallel memory protocols should therefore address larger capacities using the available number of pins. This is accomplished by multiplexing the pins to transfer each address in multiple bus cycles, implementing Multi-Cycle Addressing (MCA). However, the additional address-transfer cycles can significantly worsen performance and energy efficiency. This paper contributes the concept of adaptive row addressing, which comprises row-address caching to reduce the number of address-transfer cycles, enhanced by row-address prefetching and an adaptive row-access priority policy to improve state-of-the-art memory schedulers. For a case-study MCA protocol, the paper shows that the proposed concept improves: i) the read latency by 7.5% on average and up to 12.5%, and ii) the system-level performance and energy efficiency by 5.5% on average and up to 6.5%. In this way, adaptive row addressing makes the MCA protocol as efficient as an idealistic protocol of the same speed but with enough pins to transfer each row address in a single bus cycle.
{"title":"Analytical Study on Bandwidth Efficiency of Heterogeneous Memory Systems","authors":"Amin Farmahini Farahani, D. Roberts, N. Jayasena","doi":"10.1145/2989081.2989089","DOIUrl":"https://doi.org/10.1145/2989081.2989089","url":null,"abstract":"Heterogeneous memory systems integrate different memory technologies to balance design requirements such as bandwidth, capacity, and cost. Performance of these systems depends heavily on memory hierarchy organization, memory attributes, and application characteristics. In this paper, we present analytical bandwidth models for a range of heterogeneous memory systems composed of DRAM and non-volatile memory (NVM). Our models enable exploring heterogeneous memory systems with different organizations and attributes. Using the models, we study the bandwidth efficiency of heterogeneous memory systems to provide insights into the bandwidth bottlenecks of these systems under different application characteristics. Our analytical results highlight the importance of NVM read-write bandwidth asymmetry and DRAM-NVM bandwidth asymmetry in bandwidth efficiency. Specifically, in flat non-uniform memory access (NUMA) systems, the read bandwidth is maximized when a certain portion of bandwidth is delivered by DRAM and that portion depends on multiple factors including DRAM and NVM bandwidth attributes and application bandwidth characteristics. In DRAM-cache-based systems, when the hit rate is low, the impact of the DRAM cache organization on the read bandwidth is minimal. However, at higher hit rates and NVM bandwidths, the impact of the cache organization on sustained read bandwidth becomes pronounced.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"8 31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124638153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Photonic Interconnects for Interposer-based 2.5D/3D Integrated Systems on a Chip","authors":"P. Grani, R. Proietti, V. Akella, S. Yoo","doi":"10.1145/2989081.2989111","DOIUrl":"https://doi.org/10.1145/2989081.2989111","url":null,"abstract":"Instead of a single die per chip (package) multiple dies stacked vertically (3D) and placed on an interposer (2.5D), is emerging as the building block for the future in both mobile and high-performance applications. We identify that bandwidth, energy per bit, and the ability to support a large amount of memory, are the key requirements of the inter-die Network on Chip (NoC). We propose to use an interposer with optical interconnections exploiting AWGR (Arrayed Waveguide Grating Router) wavelength routing to realize a 16x16 photonic NoC with a bisection bandwidth of 16 Tb/s. We propose a baseline network, which consumes 2.81 pJ/bit assuming 100% utilization. We show that the power is dominated by the electro-optical interface of the transmitter, which can be reduced by a more aggressive design that improves the energy per bit to 0.437 pJ/bit at 100% utilization. The networks exhibit very low latency as they are based on an optical crossbar topology and are scalable to multiple chips to support 1 TB of memory or more to meet the requirements of Exascale computing.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126538480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data-Centric Computing Frontiers: A Survey On Processing-In-Memory","authors":"P. Siegl, R. Buchty, Mladen Berekovic","doi":"10.1145/2989081.2989087","DOIUrl":"https://doi.org/10.1145/2989081.2989087","url":null,"abstract":"A major shift from compute-centric to data-centric computing systems can be perceived, as novel big data workloads like cognitive computing and machine learning strongly enforce embarrassingly parallel and highly efficient processor architectures. With Moore's law having surrendered, innovative architectural concepts as well as technologies are urgently required, to enable a path for tackling exascale and beyond -- even though current computing systems face the inevitable instruction-level parallelism, power, memory, and bandwidth walls. As part of any computing system, the general perception of memories depicts unreliability, power hungriness and slowness, resulting in a future prospective bottleneck. The latter being an outcome of a pin limitation derived by packaging constraints, an unexploited tremendous row bandwidth is determinable, which off-chip diminishes to a bare minimum. Building upon a shift towards data-centric computing systems, the near-memory processing concept seems to be most promising, since power efficiency and computing performance increase by co-locating tasks on bandwidth-rich in-memory processing units, whereas data motion mitigates by the avoidance of entire memory hierarchies. By considering the umbrella of near-data processing as the urgent required breakthrough for future computing systems, this survey presents its derivations with a special emphasis on Processing-In-Memory (PIM), highlighting historical achievements in technology as well as architecture while depicting its advantages and obstacles.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127949297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AWARD: Approximation-aWAre Restore in Further Scaling DRAM
Xianwei Zhang, Youtao Zhang, B. Childers, Jun Yang
In Proceedings of the Second International Symposium on Memory Systems (2016). DOI: 10.1145/2989081.2989127

Further DRAM scaling is becoming more and more challenging, making the restore operation a serious issue in the near future. Fortunately, a wide range of modern applications can tolerate error or inexactness, providing a new dimension for mitigating the slow-restore issue: acceptable quality-of-service (QoS) loss in such applications can be traded for faster restore operations, and thus for performance and energy improvements. In this extended research abstract, we briefly explore restore-based approximate computing in DRAM and present a preliminary evaluation of the resulting QoS degradation and performance speedup. We show that restore-based approximate computing is challenging, and that dedicated error-correction/tolerance techniques are needed to balance QoS and performance.
{"title":"Analyzing allocation behavior for multi-level memory","authors":"G. Voskuilen, Arun Rodrigues, S. Hammond","doi":"10.1145/2989081.2989116","DOIUrl":"https://doi.org/10.1145/2989081.2989116","url":null,"abstract":"Managing multi-level memories will require different policies from those used for cache hierarchies, as memory technologies differ in latency, bandwidth, and volatility. To this end we analyze application data allocations and main memory accesses to determine whether an application-driven approach to managing a multi-level memory system comprising stacked and conventional DRAM is viable. Our early analysis shows that the approach is viable, but some applications may require dynamic allocations (i.e., migration) while others are amenable to static allocation.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116413343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Level Memory Policies: What You Add Is More Important Than What You Take Out","authors":"S. Hammond, Arun Rodrigues, G. Voskuilen","doi":"10.1145/2989081.2989117","DOIUrl":"https://doi.org/10.1145/2989081.2989117","url":null,"abstract":"Multi-Level Memory (MLM) will be an increasingly common organization for main memory. Hybrid main memories that combine conventional DDR and \"fast\" memory will allow higher peak bandwidth at an attainable cost. However, the chief hurdle for MLM systems is the management of data placement. While user-directed placement may work for some applications, it imposes a heavy burden on the programmer. To avoid this burden while still benefiting from MLM, we propose a number of automated management policies. Our results show that several possible policies offer performance and implementation trade offs. Also, unlike conventional cache or paged memory policies, the addition policy is much more important than the replacement policy.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133507029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Twin-Load: Bridging the Gap between Conventional Direct-Attached and Buffer-on-Board Memory Systems","authors":"Zehan Cui, Tianyue Lu, S. Mckee, Mingyu Chen, Haiyang Pan, Yuan Ruan","doi":"10.1145/2989081.2989106","DOIUrl":"https://doi.org/10.1145/2989081.2989106","url":null,"abstract":"Conventional systems with direct-attached DRAM struggle to meet growing memory capacity demands: the number of channels is limited by pin count, and the number of modules per channel is limited by signal integrity issues. Recent buffer-on-board (BOB) designs move some memory controller functionality to a separate buffer chip, which lets them support larger capacities (by adding more DRAM or denser, non-volatile components). Nonetheless, lower-cost, lower-latency, direct-attached DRAM still represents a better price-performance solution for many applications. Most processors exclusively implement either the direct-attached or the BOB approach. Combining both technologies within one processor has obvious benefits, but current memory-interface requirements complicate this straightforward solution. The standard DRAM interface is DDR, which requires data to be returned at a fixed latency. In contrast, the BOB interface supports diverse memory technologies precisely because it allows asynchrony. We propose Twin-Load technology to enable one processor to support both direct-attached and BOB memory. We show how to use Twin-Load to support BOB memory over standard DDR interfaces with minimal processor modifications. We build an asynchronous protocol over the existing, synchronous interface by splitting each memory read into twinned loads. The first acts as a prefetch to the buffer chip, and the second asynchronously fetches the data. We describe three methods for generating twinned loads, each leveraging different layers of the system stack.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"33 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133107227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Co-DIMM: Inter-Socket Data Sharing via a Common DIMM Channel
Ke Zhang, Lei-Ping Yu, Yisong Chang, Ran Zhao, Hongxia Zhang, Lixin Zhang, Mingyu Chen, S. Mckee
In Proceedings of the Second International Symposium on Memory Systems (2016). DOI: 10.1145/2989081.2989112

To improve computing density, modern datacenters widely deploy server chassis with several processor sockets integrated as independent nodes. Distributed applications processing enormous datasets on such systems require frequent inter-node communication, yet data sharing among the on-board socket nodes of the same server chassis via commodity networking and inter-socket connection technologies is inefficient. To address this problem, we propose Co-DIMM, which enables inter-socket data sharing via normal memory-access instructions. Co-DIMM eliminates the overheads of protocol-stack processing and of moving data through the network. Instead of sharing data through centralized shared memory based on NUMA inter-socket connections, DDR switches allow Co-DIMM ownership to be changed dynamically to support asynchronous producer-consumer data sharing. We implement Co-DIMM on a custom in-house FPGA-based platform; preliminary results show that data-sharing latency between two sockets is as low as 1.33 μs. We present potential Co-DIMM usage scenarios and discuss implementation challenges.
{"title":"Concurrent Dynamic Memory Coalescing on GoblinCore-64 Architecture","authors":"Xi Wang, John D. Leidel, Yong Chen","doi":"10.1145/2989081.2989128","DOIUrl":"https://doi.org/10.1145/2989081.2989128","url":null,"abstract":"The majority of modern microprocessors are architected to utilize multi-level data caches as a primary optimization to reduce the latency and increase the perceived bandwidth from an application. The spatial and temporal locality provided by data caches work well in conjunction with applications that access memory in a linear fashion. However, applications that exhibit random or non-deterministic memory access patterns often induce a significant number of data cache misses, thus reducing the natural performance benefit from the data cache. In response to the performance penalties inherently present with non-deterministic applications, we have constructed a unique memory hierarchy within the GoblinCore-64 (GC64) architecture explicitly designed to exploit memory performance from irregular memory access patterns. The GC64 architecture combines a RISC-V-based core coupled with latency-hiding architectural features to a memory hierarchy with Hybrid Memory Cube (HMC) devices. In order to cope with the inherent non-determinism of applications and to exploit the packetized interface presented by the HMC device, we develop a methodology and associated implementation of a dynamic memory coalescing unit for the GC64 memory hierarchy that permits us to statistically sample memory requests from non-deterministic applications and coalesce them into the largest possible HMC payload requests. In this work, we present two parallel methodologies and associated implementations for coalescing non-deterministic memory requests into the largest potential HMC request by constructing a binary tree representation of the live memory requests from disparate cores. We present the coalesced HMC memory request results from applications that exhibit linear and non-linear memory request patterns compiled for a RISC-V core in contrast with a traditional memory hierarchy.","PeriodicalId":283512,"journal":{"name":"Proceedings of the Second International Symposium on Memory Systems","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114280829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}