{"title":"Enabling Transformers to Understand Low-Level Programs","authors":"Z. Guo, William S. Moses","doi":"10.1109/HPEC55821.2022.9926313","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926313","url":null,"abstract":"Unlike prior approaches to machine learning, Transformer models can first be trained on a large corpus of unlabeled data with a generic objective and then fine-tuned on a smaller task-specific dataset. This versatility has led to both larger models and larger datasets and, consequently, to breakthroughs in the field of natural language processing. Generic program optimization presently operates on low-level programs such as LLVM IR. Unlike high-level languages (e.g., C, Python, Java), which have seen initial success in machine-learning analyses, lower-level languages tend to be more verbose and repetitive, since they must precisely specify program behavior, expose microarchitectural details, and make explicit the properties needed for optimization, all of which makes them difficult for machine learning models. In this work, we apply transfer learning to low-level (LLVM) programs and study how low-level programs can be made more amenable to Transformer models through various techniques, including preprocessing, infix/prefix operators, and information deduplication. We evaluate the effectiveness of these techniques through a series of ablation studies on the task of translating C to both unoptimized (-O0) and optimized (-O1) LLVM IR. 
On the AnghaBench dataset, our model achieves a 49.57% verbatim match and BLEU score of 87.68 against Clang -O0 and 38.73% verbatim match and BLEU score of 77.03 against Clang -O1.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127382338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
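The preprocessing and deduplication techniques named in the abstract can be illustrated with a toy example: canonically renaming SSA value identifiers so that semantically identical IR snippets share one surface form, shrinking the vocabulary a sequence model must learn. This is a hypothetical sketch; the function name and regex are ours, not the paper's pipeline:

```python
import re

def normalize_llvm_identifiers(ir_text):
    """Rename %-prefixed SSA value names (%tmp, %42, ...) to a canonical
    sequence (%v0, %v1, ...) in order of first appearance, so that
    semantically identical snippets map to one token sequence."""
    mapping = {}
    def rename(match):
        name = match.group(0)
        if name not in mapping:
            mapping[name] = "%v{}".format(len(mapping))
        return mapping[name]
    # Local identifiers in LLVM IR are %-prefixed names or numbers.
    return re.sub(r"%[A-Za-z0-9._]+", rename, ir_text)

ir = "%tmp = add i32 %a, %b\n%tmp2 = mul i32 %tmp, %a"
print(normalize_llvm_identifiers(ir))
# %v0 = add i32 %v1, %v2
# %v3 = mul i32 %v0, %v1
```

Any IR text that differs only in its choice of local value names normalizes to the same string under this transform.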
{"title":"Site-Wide HPC Data Center Demand Response","authors":"D. Wilson, I. Paschalidis, A. Coskun","doi":"10.1109/HPEC55821.2022.9926322","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926322","url":null,"abstract":"As many electricity markets are trending towards greater renewable energy generation, there will be an increased need for electrical grids to cooperatively balance electricity supply and demand. Data centers are one large consumer of electricity on a global scale, and they are well-suited to act as a grid load stabilizer by performing “demand response.” Prior investigations in this space have demonstrated how data centers can continue to meet their users' quality of service (QoS) needs by modeling relationships between cluster job queues, server power properties, and application performance. While server power is a major factor in data center power consumption, other components such as cooling systems contribute a non-negligible amount of electricity demand. This work proposes using a simple site-wide (i.e., including all components of the data center) power model on top of QoS-aware demand response solutions to achieve the QoS benefits of those solutions while improving the cost-saving opportunities in demand response. 
We demonstrate 1.3x cost savings compared to QoS-aware demand response policies that do not utilize site-wide power models, and show similar savings in cases of severely under-predicted site-wide power consumption if 1.5x relaxed QoS constraints are allowed.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130751615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
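A minimal sketch of what layering a site-wide power model onto a demand-response policy might look like; the PUE-style factor, fixed overhead, and penalty structure are illustrative assumptions, not the paper's model:

```python
def site_wide_power(server_power_kw, pue=1.4, fixed_overhead_kw=50.0):
    """Toy site-wide power model: IT load scaled by a PUE-like factor for
    cooling and power distribution, plus a fixed facility overhead.
    All parameter values are illustrative, not from the paper."""
    return pue * server_power_kw + fixed_overhead_kw

def demand_response_cost(power_kw, target_kw, price_per_kwh, penalty_per_kwh):
    """Cost of one one-hour interval: energy cost plus a penalty for
    exceeding the demand-response target."""
    excess_kw = max(0.0, power_kw - target_kw)
    return power_kw * price_per_kwh + excess_kw * penalty_per_kwh

site_kw = site_wide_power(400.0)  # ~610 kW once cooling/overhead is included
cost = demand_response_cost(site_kw, 600.0, 0.10, 0.50)  # ~66 (energy + penalty)
```

The point of the sketch: a policy that budgets only server power (400 kW) against a 600 kW target would miss that the site actually exceeds the target once cooling is counted, which is exactly the gap a site-wide model closes.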
{"title":"RaiderSTREAM: Adapting the STREAM Benchmark to Modern HPC Systems","authors":"Michael Beebe, Brody Williams, Stephen Devaney, John D. Leidel, Yong Chen, Stephen Poole","doi":"10.1109/HPEC55821.2022.9926292","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926292","url":null,"abstract":"Sustained memory bandwidth is a common bottleneck in scientific applications: runtime is often dominated by the speed at which data can be loaded from memory into the CPU and results written back, particularly for increasingly critical data-intensive workloads. The prevalence of irregular memory access patterns within these applications, exemplified by kernels such as those found in sparse matrix and graph applications, significantly degrades the achievable performance of a system's memory hierarchy. As such, it is highly desirable to be able to accurately measure a given memory hierarchy's sustainable memory bandwidth when designing applications as well as future high-performance computing (HPC) systems. STREAM is a de facto standard benchmark for measuring sustained memory bandwidth and has garnered widespread adoption. In this work, we discuss current limitations of the STREAM benchmark in the context of high-performance and scientific computing. 
We then introduce a new version of STREAM, called RaiderSTREAM, built on the OpenSHMEM and MPI programming models in tandem with OpenMP, that includes additional kernels that better model irregular memory access patterns, addressing these shortcomings.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"73 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121132069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
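For context, the classic STREAM Triad kernel and an irregular, gather-style variant of the kind RaiderSTREAM targets can be sketched as follows (a pure-Python illustration of the access patterns only; RaiderSTREAM itself builds on OpenSHMEM/MPI with OpenMP):

```python
def stream_triad(a, b, c, scalar):
    """Classic STREAM Triad: a[i] = b[i] + scalar * c[i], unit stride."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]

def gather_triad(a, b, c, idx, scalar):
    """Irregular variant: reads of c are indirected through an index
    array, mimicking sparse-matrix and graph access patterns."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[idx[i]]

a = [0.0] * 3
stream_triad(a, [1.0, 1.0, 1.0], [10.0, 20.0, 30.0], 2.0)
print(a)  # [21.0, 41.0, 61.0]
gather_triad(a, [1.0, 1.0, 1.0], [10.0, 20.0, 30.0], [2, 0, 1], 2.0)
print(a)  # [61.0, 21.0, 41.0]
```

The indirection through `idx` is what defeats hardware prefetchers and degrades sustained bandwidth relative to the unit-stride original.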
{"title":"Towards Hardware Accelerated Garbage Collection with Near-Memory Processing","authors":"Samuel Thomas, Jiwon Choe, Ofir Gordon, E. Petrank, T. Moreshet, M. Herlihy, R. I. Bahar","doi":"10.1109/HPEC55821.2022.9926323","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926323","url":null,"abstract":"Garbage collection is widely available in popular programming languages, yet it may incur high performance overheads in applications. Prior works have proposed specialized hardware acceleration implementations to offload garbage collection overheads from the main processor, but these solutions have yet to be implemented in practice. In this paper, we propose using off-the-shelf hardware to accelerate off-the-shelf garbage collection algorithms. Furthermore, our work is latency oriented as opposed to other works that focus on bandwidth. We demonstrate that we can get a 2x performance improvement in some workloads and a 2.3x reduction in LLC traffic by integrating generic Near-Memory Processing (NMP) into the built-in Java garbage collector. We will discuss architectural implications of these results and consider directions for future work.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"130 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122923955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HTC: Hybrid vertex-parallel and edge-parallel Triangle Counting","authors":"Li Zeng, Kang Yang, Haoran Cai, Jinhua Zhou, Rongqian Zhao, Xin Chen","doi":"10.1109/HPEC55821.2022.9926383","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926383","url":null,"abstract":"Graph algorithms (e.g., triangle counting) are widely used to find the deep association of data in various real-world applications such as friend recommendation and junk mail detection. However, even when using the massive parallelism of GPUs, existing methods fail to run triangle counting queries efficiently on various large graphs. In this paper, we propose a fast hybrid algorithm, HTC, which can utilize both the vertex-parallel and edge-parallel paradigms and deliver much better performance on GPU. Different from current GPU implementations, HTC adaptively selects different parallel paradigms for different vertices. In addition, a bitwise intersection on a segmented bitmap is proposed in place of naive binary search. Furthermore, preprocessing techniques like graph reordering and recursive clipping are adopted to optimize the graph structure. Extensive experiments show that HTC outperforms all state-of-the-art triangle counting implementations on GPU by 1.2x~42x.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115318421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
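A serial toy model of the hybrid idea: orient edges low-to-high, then count triangles per vertex either with set intersections (vertex-parallel style) or with bitwise ANDs over neighbor bitmaps (in the spirit of the paper's segmented-bitmap intersection). The degree threshold here is an arbitrary stand-in for HTC's adaptive selection, not its actual heuristic:

```python
def count_triangles_hybrid(adj, degree_threshold=4):
    """Count triangles in an undirected graph given as {vertex: set of neighbors}.
    Edges are oriented low->high so each triangle is counted exactly once.
    Low-out-degree vertices use set intersection; high-out-degree vertices
    use bitwise AND on neighbor bitmaps."""
    out = {u: sorted(v for v in nbrs if v > u) for u, nbrs in adj.items()}
    bitmap = {u: sum(1 << v for v in nbrs) for u, nbrs in out.items()}
    total = 0
    for u, nbrs in out.items():
        if len(nbrs) <= degree_threshold:
            # vertex-parallel style: one intersection task per vertex
            s = set(nbrs)
            for v in nbrs:
                total += len(s.intersection(out[v]))
        else:
            # edge-parallel/bitmap style: one bitwise-AND task per edge
            for v in nbrs:
                total += bin(bitmap[u] & bitmap[v]).count("1")
    return total

k4 = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2}}
print(count_triangles_hybrid(k4))  # 4 triangles in the complete graph K4
```

Both paths compute the same quantity; on a GPU the choice matters because it controls how work is distributed across threads and how memory is accessed.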
{"title":"Towards Fast GPU-based Sparse DNN Inference: A Hybrid Compute Model","authors":"Shaoxian Xu, Minkang Wu, Long Zheng, Zhiyuan Shao, Xiangyu Ye, Xiaofei Liao, Hai Jin","doi":"10.1109/HPEC55821.2022.9926290","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926290","url":null,"abstract":"As the model scale of Deep Neural Networks (DNNs) increases, the memory and computational cost of DNNs become overwhelmingly large. Sparse Deep Neural Networks (SpDNNs) are promising to cope with this challenge by using fewer weights while preserving the accuracy. However, the sparsity of SpDNN models makes it difficult to run them efficiently on GPUs. To stimulate technical advances for improving the efficiency of SpDNN inference, the MIT/IEEE/Amazon GraphChallenge proposed the SpDNN Challenge in 2019. In this paper, we present a hybrid compute model to improve the efficiency of Sparse Matrix Multiplications (SpMMs), the core computation of SpDNN inference. First, the given sparse weight matrix is divided into many (sparse and dense) submatrices. For sparse submatrices, we leverage compile-time data embedding to compile the sparse data together with their corresponding computations into instructions, so that the number of random accesses can be reduced significantly. For dense submatrices, we follow the traditional computing mode where the data is obtained from memory to exploit the high memory bandwidth of the GPU. This hybrid compute model effectively balances the memory and instruction bottlenecks, and offers more scheduling opportunities to overlap computing operations and memory accesses on the GPU. To determine whether a submatrix is sparse, we present a cost model to estimate its time cost under the traditional computing mode and the data-embedded computing mode in an accurate and efficient manner. Once the computing mode for all submatrices is determined, customized code is generated for the SpDNN inference. 
Experimental results on the SpDNN Challenge benchmarks show that our approach achieves up to 197.86 tera-edges per second inference throughput on a single NVIDIA A100 GPU. Compared to the 2021 and 2020 champions, our approach offers up to 6.37x and 89.94x speedups on a single GPU, respectively. We also implement a 16-GPU version, showing up to 9.49x and 80.11x speedups over the former 16-GPU baselines of the 2021 and 2020 champions.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126281743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
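The tiling-and-classification step described above can be sketched as follows; the density threshold stands in for the paper's cost model, and the tile size and mode names are ours:

```python
def partition_and_classify(matrix, block, density_threshold=0.5):
    """Split a weight matrix (list of rows) into block x block tiles and pick
    a compute mode per tile: 'embed' (compile nonzeros into instructions) for
    sparse tiles, 'memory' (fetch from memory) for dense ones. The density
    cut-off is an illustrative stand-in for the paper's cost model."""
    rows, cols = len(matrix), len(matrix[0])
    tiles = []
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            cells = [(r, c)
                     for r in range(r0, min(r0 + block, rows))
                     for c in range(c0, min(c0 + block, cols))]
            nnz = sum(1 for r, c in cells if matrix[r][c] != 0)
            mode = "memory" if nnz / len(cells) >= density_threshold else "embed"
            tiles.append(((r0, c0), mode))
    return tiles

w = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 0],
     [0, 0, 0, 0]]
print(partition_and_classify(w, 2))
# [((0, 0), 'memory'), ((0, 2), 'embed'), ((2, 0), 'embed'), ((2, 2), 'embed')]
```

After classification, each "embed" tile's nonzeros would be baked into generated code, while "memory" tiles keep the usual load-from-memory SpMM path.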
{"title":"Processing Particle Data Flows with SmartNICs","authors":"Jianshen Liu, C. Maltzahn, M. Curry, C. Ulmer","doi":"10.1109/HPEC55821.2022.9926325","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926325","url":null,"abstract":"Many distributed applications implement complex data flows and need a flexible mechanism for routing data between producers and consumers. Recent advances in programmable network interface cards, or SmartNICs, represent an opportunity to offload data-flow tasks into the network fabric, thereby freeing the hosts to perform other work. System architects in this space face multiple questions about the best way to leverage SmartNICs as processing elements in data flows. In this paper, we advocate the use of Apache Arrow as a foundation for implementing data-flow tasks on SmartNICs. We report on our experiences adapting a partitioning algorithm for particle data to Apache Arrow and measure the on-card processing performance for the BlueField-2 SmartNIC. Our experiments confirm that the BlueField-2's (de)compression hardware can have a significant impact on in-transit workflows where data must be unpacked, processed, and repacked.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121928033","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Fast Crash-Consistent Cluster Checkpointing","authors":"Andrew Wood, Moshik Hershcovitch, Ilias Ennmouri, Weiyu Zong, Saurav Chennuri, S. Cohen, S. Sundararaman, Daniel Waddington, Peter Chin","doi":"10.1109/HPEC55821.2022.9926330","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926330","url":null,"abstract":"Machine learning models are expensive to train: they require costly high-compute hardware and have long training times. Therefore, models are extra sensitive to program faults or unexpected system crashes, which can erase hours if not days' worth of work. While there are plenty of strategies designed to mitigate the risk of unexpected system downtime, the most popular strategy in machine learning is checkpointing: periodically saving the state of the model to persistent storage. Checkpointing is an effective strategy; however, it requires carefully balancing two factors: how often a checkpoint is made (the checkpointing schedule) and the cost of creating a checkpoint itself. In this paper, we leverage the Python Memory Manager (PyMM), which provides Python support for persistent memory, together with emerging persistent memory technology (Optane DC), to accelerate the checkpointing operation while maintaining crash consistency. We first show that when checkpointing models, PyMM with persistent memory can save from minutes to days of checkpointing runtime. We then further optimize the checkpointing operation with PyMM and demonstrate our approach with the KMeans and Gaussian Mixture Model algorithms on two real-world datasets, MNIST and MusicNet. Through evaluation, we show that these two algorithms achieve checkpointing speedups of between 10x and 75x for KMeans and over 3x for GMM against the current state-of-the-art checkpointing approaches. 
We also verify that our solution recovers from crashes, while traditional approaches cannot.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121144400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
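One standard way to reason about the checkpointing-schedule trade-off the abstract describes is Young's approximation for the interval that minimizes expected lost work. This rule of thumb is not from the paper, but it shows why cutting per-checkpoint cost (e.g., with persistent memory) permits much more frequent checkpoints; the failure rate below is an illustrative assumption:

```python
import math

def young_checkpoint_interval(checkpoint_cost_s, mtbf_s):
    """Young's approximation: the checkpoint interval minimizing expected
    lost work is roughly sqrt(2 * C * MTBF), where C is the cost of taking
    one checkpoint and MTBF is the mean time between failures."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

mtbf = 24 * 3600.0  # assume one failure per day on average (illustrative)
slow = young_checkpoint_interval(60.0, mtbf)  # ~3220 s between checkpoints
fast = young_checkpoint_interval(2.0, mtbf)   # ~588 s: far less work at risk
```

Dropping the checkpoint cost from 60 s to 2 s shrinks the optimal interval by a factor of sqrt(30), so both the checkpoint overhead and the window of re-computable work fall together.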
{"title":"Trends in Energy Estimates for Computing in AI/Machine Learning Accelerators, Supercomputers, and Compute-Intensive Applications","authors":"S. Shankar, A. Reuther","doi":"10.1109/HPEC55821.2022.9926296","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.9926296","url":null,"abstract":"We examine the computational energy requirements of different systems driven by the geometrical scaling law (known as Moore's law or Dennard scaling for geometry) and the increasing use of Artificial Intelligence/Machine Learning (AI/ML) over the last decade. With more scientific and technology applications based on data-driven discovery, machine learning methods, especially deep neural networks, have become widely used. In order to enable such applications, both hardware accelerators and advanced AI/ML methods have led to the introduction of new architectures, system designs, algorithms, and software. Our analysis of energy trends indicates three important observations: 1) Energy efficiency due to geometrical scaling is slowing down; 2) The energy efficiency at the bit-level does not translate into efficiency at the instruction-level, or at the system-level for a variety of systems, especially for large-scale AI/ML accelerators or supercomputers; 3) At the application level, general-purpose AI/ML methods can be computationally energy intensive, offsetting the gains in energy from geometrical scaling and special purpose accelerators. 
Further, our analysis provides specific pointers for integrating energy efficiency with performance analysis for enabling high-performance and sustainable computing in the future.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125718243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SuperCloud Lite in the Cloud - lightweight, secure, self-service, on-demand mechanisms for creating customizable research computing environments","authors":"Kelsie Edie, K. Keville, Lauren Milechin, Chris Hill","doi":"10.1109/HPEC55821.2022.10089529","DOIUrl":"https://doi.org/10.1109/HPEC55821.2022.10089529","url":null,"abstract":"We describe and examine an automation for deploying on-demand, OAuth2 secured virtual machine instances. Our approach does not require any expert security and web service knowledge to create a secure instance. The approach allows non-experts to launch web-accessible virtual machine services that are automatically secured through OAuth2 authentication, an authentication standard widely employed in academic and enterprise environments. We demonstrate the approach through an example of creating secure commercial cloud instances of the MIT SuperCloud modern research computing oriented software stack. A small example of a use case is examined and compared with native MIT SuperCloud experience as a preliminary evaluation. The example illustrates several useful features. It retains OAuth2 security guarantees and leverages a simple OAuth2 proxy architecture that in turn employs simple DNS based service limits to manage access to the proxy service. The system has the potential to provide a default secure environment in which access is, in theory, limited to a narrow trust circle. It leverages WebSockets to provide a pure browser enabled, zero install base service. For the user, it is entirely self-service so that a non-expert, non-privileged user can launch instances, while supporting access to a familiar environment on a broad selection of hardware, including high-end GPUs and isolated bare-metal resources. The environment includes pre-configured browser based desktop GUI and notebook configurations. It can provide the option of end-user privileged access to the VM for flexible customization. 
It integrates with a simplified cost-monitoring and machine management framework that provides visibility to commercial cloud charges and some budget guard rails, and supports instance stop, restart, and pausing features to allow intermittent use and cost reduction.","PeriodicalId":200071,"journal":{"name":"2022 IEEE High Performance Extreme Computing Conference (HPEC)","volume":"124 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133886132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}