{"title":"Persistent Memory: The Value to HPC and the Challenges","authors":"A. Rudoff","doi":"10.1145/3145617.3158213","DOIUrl":"https://doi.org/10.1145/3145617.3158213","url":null,"abstract":"This paper provides an overview of the expected value of emerging persistent memory technologies to high performance computing (HPC) use cases. These values are somewhat speculative at the time of writing, based on what has been announced by vendors to become available over the next year, but we describe the potential value to HPC as well as some of the challenges in using persistent memory. The enabling work being done in the software ecosystem, applicable to HPC, is also described.","PeriodicalId":131928,"journal":{"name":"Proceedings of the Workshop on Memory Centric Programming for HPC","volume":"313 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128697711","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Principles of Memory-Centric Programming for High Performance Computing","authors":"Yonghong Yan, R. Brightwell, Xian-He Sun","doi":"10.1145/3145617.3158212","DOIUrl":"https://doi.org/10.1145/3145617.3158212","url":null,"abstract":"The memory wall challenge -- the growing disparity between CPU speed and memory speed -- has been one of the most critical and long-standing challenges in computing. For high performance computing, programming to achieve efficient execution of parallel applications often requires more tuning and optimization efforts to improve data and memory access than for managing parallelism. The situation is further complicated by the recent expansion of the memory hierarchy, which is becoming deeper and more diversified with the adoption of new memory technologies and architectures such as 3D-stacked memory, non-volatile random-access memory (NVRAM), and hybrid software and hardware caches. The authors believe it is important to elevate the notion of memory-centric programming, with relevance to the compute-centric or data-centric programming paradigms, to utilize the unprecedented and ever-elevating modern memory systems. Memory-centric programming refers to the notion and techniques of exposing the hardware memory system and its hierarchy, which could include DRAM and NUMA regions, shared and private caches, scratch pad, 3D-stacked memory, non-volatile memory, and remote memory, to the programmer via portable programming abstractions and APIs. These interfaces seek to improve the dialogue between programmers and system software, and to enable compiler optimizations, runtime adaptation, and hardware reconfiguration with regard to data movement, beyond what can be achieved using existing parallel programming APIs. In this paper, we provide an overview of memory-centric programming concepts and principles for high performance computing.","PeriodicalId":131928,"journal":{"name":"Proceedings of the Workshop on Memory Centric Programming for HPC","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124114197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating GPGPU Memory Performance Through the C-AMAT Model","authors":"Ning Zhang, Chuntao Jiang, Xian-He Sun, S. Song","doi":"10.1145/3145617.3158214","DOIUrl":"https://doi.org/10.1145/3145617.3158214","url":null,"abstract":"General Purpose Graphics Processing Units (GPGPUs) have become a popular platform to accelerate high performance applications. Although they provide exceptional computing power, GPGPUs impose significant pressure on the off-chip memory system. Evaluating, understanding, and improving GPGPU data access delay has become an important research topic in high-performance computing. In this study, we utilize the newly proposed GPGPU/C-AMAT (Concurrent Average Memory Access Time) model to quantitatively evaluate GPGPU memory performance. Specifically, we extend the current C-AMAT model to include a GPGPU-specific modeling component and then provide its evaluation results.","PeriodicalId":131928,"journal":{"name":"Proceedings of the Workshop on Memory Centric Programming for HPC","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121143289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NUMA Distance for Heterogeneous Memory","authors":"Sean Williams, Latchesar Ionkov, M. Lang","doi":"10.1145/3145617.3145620","DOIUrl":"https://doi.org/10.1145/3145617.3145620","url":null,"abstract":"Experience with Intel Xeon Phi suggests that NUMA alone is inadequate for assignment of pages to devices in heterogeneous memory systems. We argue that this is because NUMA is based on a single distance metric between all domains (i.e., number of devices \"in between\" the domains), while relationships between heterogeneous domains can and should be characterized by multiple metrics (e.g., latency, bandwidth, capacity). We therefore propose elaborating the concept of NUMA distance to give better and more intuitive control of placement of pages, while retaining most of the simplicity of the NUMA abstraction. This can be based on minor modification of the Linux kernel, with the possibility for further development by hardware vendors.","PeriodicalId":131928,"journal":{"name":"Proceedings of the Workshop on Memory Centric Programming for HPC","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126852476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bit Contiguous Memory Allocation for Processing In Memory","authors":"John D. Leidel","doi":"10.1145/3145617.3145618","DOIUrl":"https://doi.org/10.1145/3145617.3145618","url":null,"abstract":"Given the recent resurgence of research into processing in or near memory systems, we find an ever increasing need to augment traditional system software tools in order to make efficient use of the PIM hardware abstractions. One such architecture, the Micron In-Memory Intelligence (IMI) DRAM, provides a unique processing capability within the sense amp stride of a traditional 2D DRAM architecture. This accumulator processing circuit has the ability to compute both horizontally and vertically on pitch within the array. This unique processing capability requires a memory allocator that provides physical bit locality in order to ensure numerical consistency. In this work we introduce a new memory allocation methodology that provides bit contiguous allocation mechanisms for horizontal and vertical memory allocations for the Micron IMI DRAM device architecture. Our methodology drastically reduces the complexity by which to find new, unallocated memory blocks by combining a sparse matrix representation of the array with dense continuity vectors that represent the relative probability of finding candidate free blocks. We demonstrate our methodology using a set of pathological and standard benchmark applications in both horizontal and vertical memory modes.","PeriodicalId":131928,"journal":{"name":"Proceedings of the Workshop on Memory Centric Programming for HPC","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128096908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Beyond 16GB: Out-of-Core Stencil Computations","authors":"István Z. Reguly, G. Mudalige, M. Giles","doi":"10.1145/3145617.3145619","DOIUrl":"https://doi.org/10.1145/3145617.3145619","url":null,"abstract":"Stencil computations are a key class of applications, widely used in the scientific computing community, and a class that has particularly benefited from performance improvements on architectures with high memory bandwidth. Unfortunately, such architectures come with a limited amount of fast memory, which is limiting the size of the problems that can be efficiently solved. In this paper, we address this challenge by applying the well-known cache-blocking tiling technique to large scale stencil codes implemented using the OPS domain specific language, such as CloverLeaf 2D, CloverLeaf 3D, and OpenSBLI. We introduce a number of techniques and optimisations to help manage data resident in fast memory, and minimise data movement. Evaluating our work on Intel's Knights Landing Platform as well as NVIDIA P100 GPUs, we demonstrate that it is possible to solve 3 times larger problems than the on-chip memory size with at most 15% loss in efficiency.","PeriodicalId":131928,"journal":{"name":"Proceedings of the Workshop on Memory Centric Programming for HPC","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115210528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of the Workshop on Memory Centric Programming for HPC","authors":"","doi":"10.1145/3145617","DOIUrl":"https://doi.org/10.1145/3145617","url":null,"abstract":"","PeriodicalId":131928,"journal":{"name":"Proceedings of the Workshop on Memory Centric Programming for HPC","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124132200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}