{"title":"Near Data Processing: Impact and Optimization of 3D Memory System Architecture on the Uncore","authors":"S. M. Hassan, S. Yalamanchili, S. Mukhopadhyay","doi":"10.1145/2818950.2818952","DOIUrl":"https://doi.org/10.1145/2818950.2818952","url":null,"abstract":"A promising recent development that can provide continued scaling of performance is the ability to stack multiple DRAM layers on a multi-core processor die. This paper analyzes the interaction between the interconnection network and the memory hierarchy in such systems, and its impact on system performance. We explore the design considerations of a 3D system with DRAM-on-processor stacking and note that full advantages of 3D can only be achieved by configuring the memory with high number of channels. This significantly increases memory level parallelism which results in decreasing the traffic per DRAM bank, reducing their queuing delays, but increasing it on the interconnection network, making remote accesses expensive. To reduce the latency and traffic on the network, we propose restructuring the memory hierarchy to a memory-side cache organization and also explore the effects of various address translations and OS page allocation strategies. Our results indicate that a carefully designed 3D memory system can already improve performance by 25-35% without looking towards new sophisticated techniques.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126881457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Opportunities to Upgrade Main Memory","authors":"D. Resnick","doi":"10.1145/2818950.2818960","DOIUrl":"https://doi.org/10.1145/2818950.2818960","url":null,"abstract":"Hybrid Memory Cube (HMC), in production by Micron Technology, is a new DRAM component that has multiple advantages over current parts including higher bandwidth, lower energy, abstract and more pin efficient interface and other benefits. The memory technology can be used as a base for even further improvements, including upgrading memory scalability to multiple terabytes and terabyte per second bandwidths per processor and resilience such that even large supercomputers with 100s of petabytes of memory will have reliable memory systems. Future systems, from desktops up, will have memory systems of multiple levels, including DRAM and non-volatile (NAND?) components that are both first-level memory capabilities, along with DRAM or SRAM scratch memory such that total data motion is greatly reduced. The result can be improved system performance and reduced system power.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134085878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Writing without Disturb on Phase Change Memories by Integrating Coding and Layout Design","authors":"A. Eslami, Alfredo J. Velasco, Alireza Vahid, Georgios Mappouras, A. Calderbank, Daniel J. Sorin","doi":"10.1145/2818950.2818962","DOIUrl":"https://doi.org/10.1145/2818950.2818962","url":null,"abstract":"We integrate coding techniques and layout design to eliminate write-disturb in phase change memories (PCMs), while enhancing lifetime and host-visible capacity. We first propose a checkerboard configuration for cell layout to eliminate write-disturb while doubling the memory lifetime. We then introduce two methods to jointly design Write-Once-Memory (WOM) codes and layout. The first WOM-layout design improves the lifetime by more than double without compromising the host-visible capacity. The second design applies WOM codes to even more dense layouts to achieve both lifetime and capacity gains. The constructions demonstrate that substantial improvements to lifetime and host-visible capacity are possible by co-designing coding and cell layout in PCM.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129399825","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bringing Modern Hierarchical Memory Systems Into Focus: A study of architecture and workload factors on system performance","authors":"Paul Tschirhart, Jim Stevens, Zeshan A. Chishti, Shih-Lien Lu, B. Jacob","doi":"10.1145/2818950.2818975","DOIUrl":"https://doi.org/10.1145/2818950.2818975","url":null,"abstract":"The increasing size of workloads has led to the development of new technologies and architectures that are intended to help address the capacity limitations of DRAM main memories. The proposed solutions fall into two categories: those that re-engineer Flash-based SSDs to further improve storage system performance and those that incorporate non-volatile technology into a Hybrid main memory system. These developments have blurred the line between the storage and memory systems. In this paper, we examine the differences between these two approaches to gain insight into the types of applications and memory technologies that benefit the most from these different architectural approaches. In particular this work utilizes full system simulation to examine the impact of workload randomness on system performance, the impact of backing store latency on system performance, and how the different implementations utilize system resources differently. We find that the software overhead incurred by storage based implementations can account for almost 50% of the overall access latency. As a result, backing store technologies that have an access latency up to 25 microseconds tend to perform better when implemented as part of the main memory system. We also see that high degrees of random access can exacerbate the software overhead problem and lead to large performance advantages for the Hybrid main memory approach. Meanwhile, the page replacement algorithm utilized by the OS in the storage approach results in considerably better performance on highly sequential workloads at the cost of greater pressure on the cache.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"80 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133346276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Opportunities and Challenges of Performing Vector Operations inside the DRAM","authors":"M. Alves, P. C. Santos, M. Diener, L. Carro","doi":"10.1145/2818950.2818953","DOIUrl":"https://doi.org/10.1145/2818950.2818953","url":null,"abstract":"In order to overcome the low memory bandwidth and the high energy costs associated with the data transfer between the processor and the main memory, proposals on near-data computing started to gain acceptance in systems ranging from embedded architectures to high performance computing. The main previous approaches propose application specific hardware or require a large amount of logic. Moreover, most proposals require algorithm changes and do not make use of the full parallelism available on the DRAM devices. These issues limits the adoption and the performance of near-data computing. In this paper, we propose to implement vector instructions directly inside the DRAM devices, which we call the Memory Vector Extensions (MVX). This balanced approach reduces data movement between the DRAM to the processor while requiring a low amount of hardware to achieve good performance. Comparing to current vector operations present on processors, our proposal enable performance gains of up to 97x and reduces the energy consumption by up to 70x of the full system.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"324 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115844387","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Achieving Yield, Density and Performance Effective DRAM at Extreme Technology Sizes","authors":"B. Childers, Jun Yang, Youtao Zhang","doi":"10.1145/2818950.2818963","DOIUrl":"https://doi.org/10.1145/2818950.2818963","url":null,"abstract":"For over forty years, DRAM has been the most compelling choice for main memory. It is a well understood commodity technology that strikes an ideal balance between cost, performance, capacity and energy. Yet, as DRAM scales to the extremes of deep submicron technology, it faces a critical challenge with the impact of process variation (PV) on chip yield: PV in the transistor and capacitor used to hold a bit of information, along with other components, can cause critical requirements to be violated, including retention capability, cell reliability and operational timing. The challenges of retention and reliability are well known. However, the latter challenge has received significantly less attention---the impact of operational timing violations due to PV on DRAM yield. This challenge stands as an equal to the others in achieving sufficient yield for continued commodity production of DRAM. In this paper, we argue that timing requirements must be relaxed and exposed on a per-location basis for management by the memory sub-system architecture to overcome the challenge to yield from timing. This \"soft yield\" approach trades exposed timing variability for enhanced yield, without harming chip density. Because relaxing and exposing variable timing can lead to application performance loss, a suite of techniques must be developed by the architecture community to mitigate the loss. We raise awareness of this problem and suggest directions where solutions may be found.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"216 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128168891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Energy Efficient Scale-In Clusters with In-Storage Processing for Big-Data Analytics","authors":"I. Choi, Yang-Suk Kee","doi":"10.1145/2818950.2818983","DOIUrl":"https://doi.org/10.1145/2818950.2818983","url":null,"abstract":"Big data drives a computing paradigm shift. Due to enormous data volumes, data-intensive programming frameworks are pervasive and scale-out clusters are widespread. As a result, data-movement energy dominates overall energy consumption and this will get worse with a technology scaling. We propose scale-in clusters with In-Storage Processing (ISP) devices that would enable energy efficient computing for big-data analytics. ISP devices eliminate/reduce data movements towards CPUs and execute tasks more energy-efficiently. Thus, with energy efficient computing near data and higher throughput enabled, clusters with ISP can achieve more than quadruple energy efficiency with fewer number of nodes as compared to the energy efficiency of similarly performing its counter-part scale-out clusters.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130023882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Herniated Hash Tables: Exploiting Multi-Level Phase Change Memory for In-Place Data Expansion","authors":"Zhaoxia Deng, Lunkai Zhang, D. Franklin, F. Chong","doi":"10.1145/2818950.2818981","DOIUrl":"https://doi.org/10.1145/2818950.2818981","url":null,"abstract":"Hash tables are a commonly used data structure used in many algorithms and applications. As applications and data scale, the efficient implementation of hash tables becomes increasingly important and challenging. In particular, memory capacity becomes increasingly important and entries can become asymmetrically chained across hash buckets. This chaining prevents two forms of parallelism: memory-level parallelism (allowing multiple prefetch requests to overlap) and memory-computation parallelism (allowing computation to overlap memory operations). We propose, herniated hash tables, a technique that exploits multi-level phase change memory (PCM) storage to expand storage at each hash bucket and increase parallelism without increasing physical space. The technique works by increasing the number of bits stored within the same resistance range of an individual PCM cell. We pack more data into the same bit by decreasing noise margins, and we pay for this higher density with higher latency reads and writes that resolve the more accurate resistance values. Furthermore, our organization, coupled with an addressing and prefetching scheme, increases memory parallelism of the herniated datastructure. We simulate our system with a variety of hash table applications and evaluate the density and performance benefits in comparison to a number of baseline systems. Compared with conventional chained hash tables on single-level PCM, herniated hash tables can achieve 4.8x density on a 4-level PCM while achieving up to 67% performance improvement.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127914061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SIMT-based Logic Layers for Stacked DRAM Architectures: A Prototype","authors":"C. Kersey, S. Yalamanchili, Hyesoon Kim","doi":"10.1145/2818950.2818954","DOIUrl":"https://doi.org/10.1145/2818950.2818954","url":null,"abstract":"Stacked DRAM products are now available, and the likelihood of future products combining DRAM stacks with custom logic layers seems high. The near-memory processor in such a system will have to be energy efficient, latency tolerant, and capable of exploiting both high memory-level parallelism and high memory bandwidth. We believe that single-instruction-multiple-thread (SIMT) processors are uniquely suited to this task, and for the purpose of evaluating this claim have produced an FPGA-based prototype.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126091851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Inefficiencies in the Cache Hierarchy: A Sensitivity Study of Cacheline Size with Mobile Workloads","authors":"A. Laer, William Wang, C. D. Emmons","doi":"10.1145/2818950.2818980","DOIUrl":"https://doi.org/10.1145/2818950.2818980","url":null,"abstract":"With the rising number of cores in mobile devices, the cache hierarchy in mobile application processors gets deeper, and the cache size gets bigger. However, the cacheline size remained relatively constant over the last decade in mobile application processors. In this work, we investigate whether the cacheline size in mobile application processors is due for a refresh, by looking at inefficiencies in the cache hierarchy which tend to be exacerbated when increasing the cacheline size: false sharing and cacheline utilization. Firstly, we look at false sharing, which is more likely to arise at larger cacheline sizes and can severely impact performance. False sharing occurs when non-shared data structures, mapped onto the same cacheline, are being accessed by threads running on different cores, causing avoidable invalidations and subsequent misses. False sharing has been found in various places such as scientific workloads and real applications. We find that whilst increasing the cacheline size does increase false sharing, it still is negligible when compared to known cases of false sharing in scientific workloads, due to the limited level of thread-level parallelism in mobile workloads. Secondly, we look at cacheline utilization which measures the number of bytes in a cacheline actually used by the processor. This effect has been investigated under various names for a multitude of server and desktop applications. As a low cacheline utilization implies that very little of the fetched cachelines was used by the processor, this causes waste in bandwidth and energy in moving data across the memory hierarchy. The energy cost associated with data movements is much higher compared to logic operations, increasing the need for cache efficiency, especially in the case of an energy-constrained platform like a mobile device. We find that the cacheline utilization of mobile workloads is low in general, decreasing when increasing the cacheline size. When increasing the cacheline size from 64 bytes to 128 bytes, the number of misses will be reduced by 10%--30%, depending on the workload. However, because of the low cacheline utilization, this more than doubles the amount of unused traffic to the L1 caches. Using the cacheline utilization as a metric in this way, illustrates an important point. If a change in cacheline size would only be assessed on its local effects, we find that this change in cacheline size will only have advantages as the miss rate decreases. However, at system level, this change will increase the stress on the bus and increase the amount of wasted energy due to unused traffic. 
Using cacheline utilization as a metric underscores the need for system-level research when changing characteristics of the cache hierarchy.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115619118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
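A minimal sketch of the cacheline-utilization metric as the abstract describes it (the fraction of fetched bytes actually touched); the access trace, access size, and the infinite-capacity simplification are our own assumptions, for illustration only. On a sparse strided trace it shows the system-level effect the authors highlight: larger lines fetch more bytes per miss while the touched fraction shrinks.

```python
# Cacheline-utilization sketch: used bytes / fetched bytes over a trace.
# Assumes every distinct line is fetched once (no capacity pressure).

from collections import defaultdict

def utilization(trace, line_size, access_size=4):
    touched = defaultdict(set)       # line base address -> touched byte offsets
    for addr in trace:
        base = addr - addr % line_size
        off = addr - base
        touched[base].update(range(off, min(off + access_size, line_size)))
    fetched = len(touched) * line_size
    return sum(len(offs) for offs in touched.values()) / fetched

if __name__ == "__main__":
    # Sparse strided accesses standing in for pointer-heavy mobile code (assumed).
    trace = [i * 192 for i in range(1000)]
    for ls in (64, 128):
        print(f"{ls:>3}-byte lines: utilization = {utilization(trace, ls):.2f}")
```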