{"title":"SMT-Aware Instantaneous Footprint Optimization","authors":"Probir Roy, Xu Liu, S. Song","doi":"10.1145/2907294.2907308","DOIUrl":"https://doi.org/10.1145/2907294.2907308","url":null,"abstract":"Modern architectures employ simultaneous multithreading (SMT) to increase thread-level parallelism. SMT threads share many functional units and the entire memory hierarchy of a physical core. Without a careful code design, SMT threads can easily contend with each other for these shared resources, causing severe performance degradation. Minimizing SMT thread contention for HPC applications running on dedicated platforms is very challenging because they typically spawn threads within Single Program Multiple Data (SPMD) models. Since these threads have similar resource requirements, their contention cannot be easily mitigated through simple thread scheduling. To address this important issue, we first vigorously conduct a systematic performance evaluation on a wide-range of representative HPC and CMP applications on three mainstream SMT architectures, and quantify their performance sensitivity to SMT effects. Then we introduce a simple scheme for SMT-aware code optimization which aims to reduce the memory contention across SMT threads. Finally, we develop a lightweight performance tool, named SMTAnalyzer, to effectively identify the optimization opportunities in the source code of multithreaded programs. Experiments on three SMT architectures (i.e., Intel Xeon, IBM POWER7, and Intel Xeon Phi) demonstrate that our proposed SMT-aware optimization scheme can significantly improve the performance for general HPC applications.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"2 8","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91419059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: High Performance Networks","authors":"P. Balaji","doi":"10.1145/3257969","DOIUrl":"https://doi.org/10.1145/3257969","url":null,"abstract":"","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74968796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"With Extreme Scale Computing the Rules Have Changed","authors":"J. Dongarra","doi":"10.1007/978-3-319-42432-3_1","DOIUrl":"https://doi.org/10.1007/978-3-319-42432-3_1","url":null,"abstract":"","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"105 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80690988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Panruo Wu, Dong Li, Zizhong Chen, J. Vetter, Sparsh Mittal
{"title":"Algorithm-Directed Data Placement in Explicitly Managed Non-Volatile Memory","authors":"Panruo Wu, Dong Li, Zizhong Chen, J. Vetter, Sparsh Mittal","doi":"10.1145/2907294.2907321","DOIUrl":"https://doi.org/10.1145/2907294.2907321","url":null,"abstract":"The emergence of many non-volatile memory (NVM) techniques is poised to revolutionize main memory systems because of the relatively high capacity and low lifetime power consumption of NVM. However, to avoid the typical limitation of NVM as the main memory, NVM is usually combined with DRAM to form a hybrid NVM/DRAM system to gain the benefits of each. However, this integrated memory system raises a question on how to manage data placement and movement across NVM and DRAM, which is critical for maximizing the benefits of this integration. The existing solutions have several limitations, which obstruct adoption of these solutions in the high performance computing (HPC) domain. In particular, they cannot take advantage of application semantics, thus losing critical optimization opportunities and demanding extensive hardware extensions; they implement persistent semantics for resilience purpose while suffering large performance and energy overhead. In this paper, we re-examine the current hybrid memory designs from the HPC perspective, and aim to leverage the knowledge of numerical algorithms to direct data placement. With explicit algorithm management and limited hardware support, we optimize data movement between NVM and DRAM, improve data locality, and implement a relaxed memory persistency scheme in NVM. Our work demonstrates significant benefits of integrating algorithm knowledge into the hybrid memory design to achieve multi-dimensional optimization (performance, energy, and resilience) in HPC.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"91 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80991788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implications of Heterogeneous Memories in Next Generation Server Systems","authors":"Ada Gavrilovska","doi":"10.1145/2907294.2911993","DOIUrl":"https://doi.org/10.1145/2907294.2911993","url":null,"abstract":"Next generation datacenter and exascale machines will include significantly larger amounts of memory, greater heterogeneity in the performance, persistence or sharing properties of the memory components they encompass, and increase in the relative cost and complexity of the data paths in the resulting memory topology. This poses several challenges to the systems software stacks managing these memory-centric platform designs. First, technology advances in novel memory technologies shift the data access bottlenecks into the software stack. Second, current systems software lacks capabilities to bridge the multi-dimensional non-uniformity in the memory subsystem to the dynamic nature of the workloads it must support. In addition, current memory management solutions have limited ability to explicitly reason about the costs and tradeoffs associated with data movement operations, leading to limited efficiency of their interconnect use. To address these problems, next generation systems software stacks require new data structures, abstractions and mechanisms in order to enable new levels of efficiency in the data placement, movement, and transformation decisions that govern the underlying memory use. In this talk, I will present our approach to rearchitecting systems software and services in response to both node-level and system-wide memory heterogeneity and scale, particularly concerning the presence of non-volatile memories, and will demonstrate the resulting performance and efficiency gains using several scientific and data-intensive workloads.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"81842240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Self-configuring Software-defined Overlay Bypass for Seamless Inter- and Intra-cloud Virtual Networking","authors":"Kyu-Young Jeong, R. Figueiredo","doi":"10.1145/2907294.2907318","DOIUrl":"https://doi.org/10.1145/2907294.2907318","url":null,"abstract":"Many techniques have been proposed to provide, transparently, the abstraction of a layer-2 virtual network environment within a provider, e.g. by leveraging Software-Defined Networking (SDN). However, cloud providers often constrain layer-2 communication across instances; furthermore, SDN integration and layer-2 messaging between distinct domains distributed across the Internet is not possible, hindering the ability for tenants to deploy their virtual networks across providers. In contrast, overlay networks provide a flexible foundation for inter-cloud virtual private networking (VPN), by tunneling virtual network traffic through private, authenticated end-to-end overlay links. However, overlays inherently incur network virtualization overheads, including header encapsulation and user/kernel boundary crossing. This paper proposes a novel system -- VIAS (VIrtualization Acceleration over SDN) -- that delivers the flexibility of overlays for inter-cloud virtual private networking, while transparently applying SDN techniques (available in existing OpenFlow hardware or software switches) to selectively bypass overlay tunneling and achieve near-native performance for TCP/UDP flows within a provider. Architecturally, VIAS is unique in how it integrates SDN and overlay controllers in a distributed fashion to coordinate the management of virtual network links and flows. The approach is self-organizing, whereby overlay nodes can detect that peer endpoints are in the same network and program bypass flows between OpenFlow switches. While generally applicable, VIAS in particular applies to nested VMs/containers across cloud providers, supporting seamless communication within and across providers. VIAS has been implemented as an extension to an existing virtual network overlay platform (IP-over-P2P, IPOP) by integrating OpenFlow controller functionality with distributed overlay controllers. We evaluate the performance of VIAS in realistic cloud environments using an implementation based on IPOP, the RYU SDN framework, Open vSwitch, and LXC containers across various cloud environment including Amazon, Google compute engine, and CloudLab.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"55 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85405468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Cloud and Resource Management","authors":"Ming Zhao","doi":"10.1145/3257974","DOIUrl":"https://doi.org/10.1145/3257974","url":null,"abstract":"","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"4 7","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91417381","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SWAT: A Programmable, In-Memory, Distributed, High-Performance Computing Platform","authors":"M. Grossman, Vivek Sarkar","doi":"10.1145/2907294.2907307","DOIUrl":"https://doi.org/10.1145/2907294.2907307","url":null,"abstract":"The field of data analytics is currently going through a renaissance as a result of ever-increasing dataset sizes, the value of the models that can be trained from those datasets, and a surge in flexible, distributed programming models. In particular, the Apache Hadoop and Spark programming systems, as well as their supporting projects (e.g. HDFS, SparkSQL), have greatly simplified the analysis and transformation of datasets whose size exceeds the capacity of a single machine. While these programming models facilitate the use of distributed systems to analyze large datasets, they have been plagued by performance issues. The I/O performance bottlenecks of Hadoop are partially responsible for the creation of Spark. Performance bottlenecks in Spark due to the JVM object model, garbage collection, interpreted/managed execution, and other abstraction layers are responsible for the creation of additional optimization layers, such as Project Tungsten. Indeed, the Project Tungsten issue tracker states that the \"majority of Spark workloads are not bottlenecked by I/O or network, but rather CPU and memory\". In this work, we address the CPU and memory performance bottlenecks that exist in Apache Spark by accelerating user-written computational kernels using accelerators. We refer to our approach as Spark With Accelerated Tasks (SWAT). SWAT is an accelerated data analytics (ADA) framework that enables programmers to natively execute Spark applications on high performance hardware platforms with co-processors, while continuing to write their applications in a JVM-based language like Java or Scala. Runtime code generation creates OpenCL kernels from JVM bytecode, which are then executed on OpenCL accelerators. In our work we emphasize 1) full compatibility with a modern, existing, and accepted data analytics platform, 2) an asynchronous, event-driven, and resource-aware runtime, 3) multi-GPU memory management and caching, and 4) ease-of-use and programmability. Our performance evaluation demonstrates up to 3.24x overall application speedup relative to Spark across six machine learning benchmarks, with a detailed investigation of these performance improvements.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"33 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75880588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Keynote Address","authors":"Jack Lange","doi":"10.1145/3257968","DOIUrl":"https://doi.org/10.1145/3257968","url":null,"abstract":"","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"68 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76229097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"IMPACC: A Tightly Integrated MPI+OpenACC Framework Exploiting Shared Memory Parallelism","authors":"Jungwon Kim, Seyong Lee, J. Vetter","doi":"10.1145/2907294.2907302","DOIUrl":"https://doi.org/10.1145/2907294.2907302","url":null,"abstract":"We propose IMPACC, an MPI+OpenACC framework for heterogeneous accelerator clusters. IMPACC tightly integrates MPI and OpenACC, while exploiting the shared memory parallelism in the target system. IMPACC dynamically adapts the input MPI+OpenACC applications on the target heterogeneous accelerator clusters to fully exploit target system-specific features. IMPACC provides the programmers with the unified virtual address space, automatic NUMA-friendly task-device mapping, efficient integrated communication routines, seamless streamlining of asynchronous executions, and transparent memory sharing. We have implemented IMPACC and evaluated its performance using three heterogeneous accelerator systems, including Titan supercomputer. Results show that IMPACC can achieve easier programming, higher performance, and better scalability than the current MPI+OpenACC model.","PeriodicalId":20515,"journal":{"name":"Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing","volume":"89 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2016-05-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80388217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}