{"title":"High Performance Computing Co-Design Strategies","authors":"J. Ang","doi":"10.1145/2818950.2818959","DOIUrl":"https://doi.org/10.1145/2818950.2818959","url":null,"abstract":"The MEMSYS Call for Papers contains this passage: Many of the problems we see in the memory system are cross-disciplinary in nature -- their solution would likely require work at all levels, from applications to circuits. Thus, while the scope of the problem is memory, the scope of the solutions will be much wider. The Department of Energy's (DOE) high performance computing (HPC) community is thinking about how to define, support and execute work at all levels for the development of future supercomputers to run our portfolio of mission applications. Borrowing a concept from embedded computing, the DOE HPC community is calling our work at all levels co-design [1]. Co-design for embedded computing is focused on hardware/software partitioning of activities to execute a well-defined task within specific constraints. Co-design for general-purpose HPC has many dimensions for both the work to be performed and the constraints, e.g. hardware designs, runtime software, applications and algorithms. The subject of this extended abstract is a description of two alternative DOE HPC co-design strategies. While DOE co-design efforts include more than the memory system, as noted in the MEMSYS call, the memory system impacts applications, circuits and all levels between.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123830751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Anatomy of GPU Memory System for Multi-Application Execution","authors":"Adwait Jog, Onur Kayiran, Tuba Kesten, Ashutosh Pattnaik, Evgeny Bolotin, Niladrish Chatterjee, S. Keckler, M. Kandemir, C. Das","doi":"10.1145/2818950.2818979","DOIUrl":"https://doi.org/10.1145/2818950.2818979","url":null,"abstract":"As GPUs make headway in the computing landscape spanning mobile platforms, supercomputers, cloud and virtual desktop platforms, supporting concurrent execution of multiple applications in GPUs becomes essential for unlocking their full potential. However, unlike CPUs, multi-application execution in GPUs is little explored. In this paper, we study the memory system of GPUs in a concurrently executing multi-application environment. We first present an analytical performance model for many-threaded architectures and show that the common use of misses-per-kilo-instruction (MPKI) as a proxy for performance is not accurate without considering the bandwidth usage of applications. We characterize the memory interference of applications and discuss the limitations of existing memory schedulers in mitigating this interference. We extend the analytical model to multiple applications and identify the key metrics to control various performance metrics. We conduct extensive simulations using an enhanced version of GPGPU-Sim targeted for concurrently executing multiple applications, and show that memory scheduling decisions based on MPKI and bandwidth information are more effective in enhancing throughput compared to the traditional FR-FCFS and the recently proposed RR FR-FCFS policies.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"68 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125378806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Omitting Refresh: A Case Study for Commodity and Wide I/O DRAMs","authors":"Matthias Jung, Éder F. Zulian, Deepak M. Mathew, M. Herrmann, Christian Brugger, C. Weis, N. Wehn","doi":"10.1145/2818950.2818964","DOIUrl":"https://doi.org/10.1145/2818950.2818964","url":null,"abstract":"Dynamic Random Access Memories (DRAM) have a big impact on performance and contribute significantly to the total power consumption in systems ranging from mobile devices to servers. Up to half of the power consumption of future high density DRAM devices will be caused by refresh commands. Moreover, not only the refresh rate does depend on the device capacity, but it strongly depends on the temperature as well. In case of 3D integration of MPSoCs with Wide I/O DRAMs the power density and thermal dissipation are increased dramatically. Hence, in 3D-DRAM even more DRAM refresh operations are required. To master these challenges, clever DRAM refresh strategies are mandatory either on hardware or on software level using new or already available infrastructures and implementations, such as Partial Array Self Refresh (PASR) or Temperature Compensated Self Refresh (TCSR). In this paper, we show that for dedicated applications refresh can be disabled completely without or with negligible impact on the application performance. This is possible if it is assured that either the lifetime of the data is shorter than the currently required DRAM refresh period or if the application can tolerate bit errors to some degree in a given time window.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129897125","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"k-Means Clustering on Two-Level Memory Systems","authors":"M. A. Bender, Jonathan W. Berry, S. Hammond, Branden J. Moore, Benjamin Moseley, C. Phillips","doi":"10.1145/2818950.2818977","DOIUrl":"https://doi.org/10.1145/2818950.2818977","url":null,"abstract":"In recent work we quantified the anticipated performance boost when a sorting algorithm is modified to leverage user-addressable \"near-memory,\" which we call scratchpad. This architectural feature is expected in the Intel Knight's Landing processors that will be used in DOE's next large-scale supercomputer. This paper expands our analytical study of the scratchpad to consider k-means clustering, a classical data-analysis technique that is ubiquitous in the literature and in practice. We present new theoretical results using the model introduced in [13], which measures memory transfers and assumes that computations are memory-bound. Our theoretical results indicate that scratchpad-aware versions of k-means clustering can expect performance boosts for high-dimensional instances with relatively few cluster centers. These constraints may limit the practical impact of scratch-pad for k-means acceleration, so we discuss their origins and practical implications. We corroborate our theory with experimental runs on a system instrumented to mimic one with scratchpad memory. We also contribute a semi-formalization of the computational properties that are necessary and sufficient to predict a performance boost from scratchpad-aware variants of algorithms. We have observed and studied these properties in the context of sorting, and now clustering. We conclude with some thoughts on the application of these properties to new areas. Specifically, we believe that dense linear algebra has similar properties to k-means, while sparse linear algebra and FFT computations are more similar to sorting. The sparse operations are more common in scientific computing, so we expect scratchpad to have significant impact in that area.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121216289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MMC: a Many-core Memory Connection Model","authors":"C. Ding, Hao Lu, Chencheng Ye","doi":"10.1145/2818950.2818958","DOIUrl":"https://doi.org/10.1145/2818950.2818958","url":null,"abstract":"This extended abstract formulates a model of parallel performance called MMC. It gives the theoretical upper bound of parallel performance based on three factors: the processing capacity, the network capacity, and the memory capacity.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134506276","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding Energy Aspects of Processing-near-Memory for HPC Workloads","authors":"Hyojong Kim, Hyesoon Kim, S. Yalamanchili, Arun Rodrigues","doi":"10.1145/2818950.2818985","DOIUrl":"https://doi.org/10.1145/2818950.2818985","url":null,"abstract":"Interests in the concept of processing-near-memory (PNM) have been reignited with recent improvements of the 3D integration technology. In this work, we analyze the energy consumption characteristics of a system which comprises a conventional processor and a 3D memory stack with fully-programmable cores. We construct a high-level analytical energy model based on the underlying architecture and the technology with which each component is built. From the preliminary experiments with 11 HPC benchmarks from Mantevo benchmark suite, we observed that misses per kilo instructions (MPKI) of last-level cache (LLC) is one of the most important characteristics in determining the friendliness of the application to the PNM execution.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130869836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Data Centric Perspective on Memory Placement","authors":"Y. Birk, O. Mencer","doi":"10.1145/2818950.2818956","DOIUrl":"https://doi.org/10.1145/2818950.2818956","url":null,"abstract":"In this paper, we focus on memory in its role as a channel for passing information from one instruction to another; in particular, in conjunction with spatial or dataflow computing architectures, wherein the computing elements are laid out like an assembly plant. We point out the opportunity to dramatically increase effective data access bandwidth by going from a centralized memory array model with a few ports to numerous tiny buffers that can be accessed concurrently. The penalty is loss in access flexibility, but this flexibility is often a by-product of the memory organization rather than a true need. The improvements in hardware reconfiguration speed and resolution, combined with definition of standard buffer queuing and routing capabilities and efforts by tool designers and application developers are likely to extend the applicability of those architectures, offering dramatic power-cost-performance advantages.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128685486","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling Data Movement in the Memory Hierarchy in HPC Systems","authors":"Aditya M. Deshpande, J. Draper","doi":"10.1145/2818950.2818972","DOIUrl":"https://doi.org/10.1145/2818950.2818972","url":null,"abstract":"Increasing core counts and cache sizes in modern processors are causing data movement across the memory hierarchy to increase. With High Performance Computing (HPC) systems becoming more and more energy constrained, improving energy efficiency is becoming a necessity. Given its significant impact on system energy efficiency, the data movement costs in terms of energy and performance cannot be neglected. Conventional techniques for modeling and analyzing data movement across the memory hierarchy have proven to be inadequate in helping computer architects and system designers to optimize data movement. Our work is a position statement emphasizing the need for more detailed data movement modeling tools that better quantify how data movement across the memory hierarchy during application execution affects energy and performance. The hope is that exposing more detailed characteristics about the data movement would enable designers to optimize applications and architectures for minimizing data movement and in turn reduce energy and perhaps even increase performance.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133540195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HpMC: An Energy-aware Management System of Multi-level Memory Architectures","authors":"Chun-Yi Su, D. Roberts, E. León, K. Cameron, B. D. Supinski, G. Loh, Dimitrios S. Nikolopoulos","doi":"10.1145/2818950.2818974","DOIUrl":"https://doi.org/10.1145/2818950.2818974","url":null,"abstract":"DRAM technology faces density and power challenges to increase capacity because of limitations of physical cell design. To overcome these limitations, system designers are exploring alternative solutions that combine DRAM and emerging NVRAM technologies. Previous work on heterogeneous memories focuses, mainly, on two system designs: PCache, a hierarchical, inclusive memory system, and HRank, a flat, non-inclusive memory system. We demonstrate that neither of these designs can universally achieve high performance and energy efficiency across a suite of HPC workloads. In this work, we investigate the impact of a number of multi-level memory designs on the performance, power, and energy consumption of applications. To achieve this goal and overcome the limited number of available tools to study heterogeneous memories, we created HMsim, an infrastructure that enables n-level, heterogeneous memory studies by leveraging existing memory simulators. We, then, propose HpMC, a new memory controller design that combines the best aspects of existing management policies to improve performance and energy. Our energy-aware memory management system dynamically switches between PCache and HRank based on the temporal locality of applications. Our results show that HpMC reduces energy consumption from 13% to 45% compared to PCache and HRank, while providing the same bandwidth and higher capacity than a conventional DRAM system.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127213158","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Shared Last-Level Caches and The Case for Longer Timeslices","authors":"Viacheslav V. Fedorov, A. Reddy, Paul V. Gratz","doi":"10.1145/2818950.2818968","DOIUrl":"https://doi.org/10.1145/2818950.2818968","url":null,"abstract":"Memory performance is important in modern systems. Contention at various levels in memory hierarchy can lead to significant application performance degradation due to interference. Further, modern, large, last-level caches (LLC) have fill times greater than the OS scheduling window. When several threads are running concurrently and timesharing the CPU cores, they may never be able to load their working sets into the cache before being rescheduled, thus permanently stuck in the \"cold-start\" regime. We show that by increasing the system scheduling timeslice length it is possible to amortize the cache cold-start penalty due to the multitasking and improve application performance by 10--15%.","PeriodicalId":389462,"journal":{"name":"Proceedings of the 2015 International Symposium on Memory Systems","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126580304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}