Latest Publications from SC20: International Conference for High Performance Computing, Networking, Storage and Analysis

[Copyright notice]
DOI: 10.1109/sc41405.2020.00002 | Published: 2020-11-01
Citations: 0

Efficient Tiled Sparse Matrix Multiplication through Matrix Signatures
Süreyya Emre Kurt, Aravind Sukumaran-Rajam, F. Rastello, P. Sadayappan
DOI: 10.1109/SC41405.2020.00091 | Published: 2020-11-01
Abstract: Tiling is a key technique to reduce data movement in matrix computations. While tiling is well understood and widely used for dense matrix/tensor computations, effective tiling of sparse matrix computations remains a challenging problem. This paper proposes a novel method to efficiently summarize the impact of the sparsity structure of a matrix on achievable data reuse as a one-dimensional signature, which is then used to build an analytical cost model for tile size optimization for sparse matrix computations. The proposed model-driven approach to sparse tiling is evaluated on two key sparse matrix kernels: Sparse Matrix - Dense Matrix Multiplication (SpMM) and Sampled Dense-Dense Matrix Multiplication (SDDMM). Experimental results demonstrate that model-based tiled SpMM and SDDMM achieve high performance relative to the current state-of-the-art.
Citations: 16

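To make the notion of tiling a sparse kernel concrete, here is a minimal sketch of SpMM (C = A * B, with A sparse in CSR form) that processes the dense matrix B in column tiles for data reuse. This illustrates the generic technique only; the signature-based tile-size selection the paper proposes is not reproduced here, and the `tile` parameter is an arbitrary illustrative choice.

```python
# Minimal sketch of tiled SpMM (C = A * B) with A in CSR form.
# Generic illustration of column tiling for data reuse, NOT the
# paper's signature-based tile-size optimization.

def spmm_tiled(indptr, indices, data, B, n_cols, tile=2):
    """A is CSR (indptr, indices, data); B is a dense row-major
    list of lists with n_cols columns. Returns C as list of lists."""
    n_rows = len(indptr) - 1
    C = [[0.0] * n_cols for _ in range(n_rows)]
    # Process B's columns in tiles so a tile of B can stay
    # cache-resident while we stream over A's nonzeros.
    for c0 in range(0, n_cols, tile):
        c1 = min(c0 + tile, n_cols)
        for i in range(n_rows):
            for p in range(indptr[i], indptr[i + 1]):
                k, v = indices[p], data[p]
                for j in range(c0, c1):
                    C[i][j] += v * B[k][j]
    return C

# 2x3 sparse A = [[1, 0, 2], [0, 3, 0]] times a 3x2 dense B.
indptr, indices, data = [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(spmm_tiled(indptr, indices, data, B, 2))  # [[3.0, 2.0], [0.0, 3.0]]
```

The tile size trades off reuse of B against the working-set size; the paper's contribution is choosing it analytically from the sparsity structure rather than by exhaustive search.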
GVPROF: A Value Profiler for GPU-Based Clusters
K. Zhou, Yueming Hao, J. Mellor-Crummey, Xiaozhu Meng, Xu Liu
DOI: 10.1109/SC41405.2020.00093 | Published: 2020-11-01
Abstract: GPGPUs are widely used in high-performance computing systems to accelerate scientific and machine learning workloads. Developing efficient GPU kernels is critically important to obtain “bare-metal” performance on GPU-based clusters. In this paper, we describe the design and implementation of GVPROF, the first value profiler that pinpoints value-related inefficiencies in applications running on NVIDIA GPU-based clusters. The novelty of GVPROF resides in its ability to detect temporal and spatial value redundancies, which provides useful information to guide code optimization. GVPROF can monitor production multi-node multi-GPU executions in clusters. Our experiments with well-known GPU benchmarks and HPC applications show that GVPROF incurs acceptable overhead and scales to large executions. Using GVPROF, we optimized several HPC and machine learning workloads on one NVIDIA V100 GPU. In one case study of LAMMPS, optimizations based on information from GVPROF led to whole-program speedups ranging from 1.37x on a single GPU to 1.08x on 64 GPUs.
Citations: 16

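"Temporal value redundancy" means the same value is repeatedly stored to or loaded from the same location, wasting bandwidth. The sketch below detects it on a simple (address, value) trace; this trace format is a simplifying assumption for illustration, not GVPROF's actual GPU instrumentation.

```python
# Illustrative detector for temporal value redundancy on a memory
# trace of (address, value) pairs. The trace format is a hypothetical
# simplification, not GVPROF's real binary instrumentation.

def temporal_redundancy(trace):
    """Fraction of accesses that observe the value already held at
    the same address (redundant stores/loads)."""
    last = {}
    redundant = 0
    for addr, value in trace:
        if last.get(addr) == value:
            redundant += 1  # same value seen again at this address
        last[addr] = value
    return redundant / len(trace) if trace else 0.0

trace = [(0x10, 1.0), (0x10, 1.0), (0x20, 2.0), (0x10, 1.0), (0x20, 3.0)]
print(temporal_redundancy(trace))  # 0.4 (2 of 5 accesses are redundant)
```

A high redundancy fraction suggests a kernel could cache the value in registers or skip the write entirely, which is the style of optimization the profiler's reports are meant to guide.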
RDMP-KV: Designing Remote Direct Memory Persistence based Key-Value Stores with PMEM
Tianxi Li, D. Shankar, Shashank Gugnani, Xiaoyi Lu
DOI: 10.1109/SC41405.2020.00056 | Published: 2020-11-01
Abstract: Byte-addressable persistent memory (PMEM) can be directly manipulated by Remote Direct Memory Access (RDMA) capable networks. However, existing studies to combine RDMA and PMEM cannot deliver the desired performance due to their PMEM-oblivious communication protocols. In this paper, we propose novel PMEM-aware RDMA-based communication protocols for persistent key-value stores, referred to as Remote Direct Memory Persistence based Key-Value stores (RDMP-KV). RDMP-KV employs a hybrid ‘server-reply/server-bypass’ approach to ‘durably’ store individual key-value objects on PMEM-equipped servers. RDMP-KV’s runtime can easily adapt to existing (server-assisted durability) and emerging (appliance durability) RDMA-capable interconnects, while ensuring server scalability through a lightweight consistency scheme. Performance evaluations show that RDMP-KV can improve the server-side performance with different persistent key-value storage architectures by up to 22x, as compared with PMEM-oblivious RDMA-‘Server-Reply’ protocols. Our evaluations also show that RDMP-KV outperforms a distributed PMEM-based filesystem by up to 65% and a recent RDMA-to-PMEM framework by up to 71%.
Citations: 2

An Efficient and Non-Intrusive GPU Scheduling Framework for Deep Learning Training Systems
Shaoqi Wang, O. J. Gonzalez, Xiaobo Zhou, Thomas Williams, Brian D. Friedman, Martin Havemann, Thomas Y. C. Woo
DOI: 10.1109/SC41405.2020.00094 | Published: 2020-11-01
Abstract: Efficient GPU scheduling is the key to minimizing the execution time of Deep Learning (DL) training workloads. DL training system schedulers typically allocate a fixed number of GPUs to each job, which inhibits high resource utilization and often extends the overall training time. The recent introduction of schedulers that can dynamically reallocate GPUs has achieved better cluster efficiency. This dynamic nature, however, introduces additional overhead by terminating and restarting jobs or requires modification to the DL training frameworks. We propose and develop an efficient, non-intrusive GPU scheduling framework that employs a combination of an adaptive GPU scheduler and an elastic GPU allocation mechanism to reduce the completion time of DL training workloads and improve resource utilization. Specifically, the adaptive GPU scheduler includes a scheduling algorithm that uses training job progress information to determine the most efficient allocation and reallocation of GPUs for incoming and running jobs at any given time. The elastic GPU allocation mechanism works in concert with the scheduler. It offers a lightweight and non-intrusive method to reallocate GPUs based on a “SideCar” process that temporarily stops and restarts the job’s DL training process with a different number of GPUs. We implemented the scheduling framework as plugins in Kubernetes and conducted evaluations on two 16-GPU clusters with multiple training jobs based on TensorFlow. Results show that our proposed scheduling framework reduces the overall execution time and the average job completion time by up to 45% and 63%, respectively, compared to the Kubernetes default scheduler. Compared to a termination-based scheduler, our framework reduces the overall execution time and the average job completion time by up to 20% and 37%, respectively.
Citations: 16

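One way to picture progress-driven reallocation is a greedy loop that hands each free GPU to the job with the best marginal throughput gain. This toy sketch illustrates that general idea only; the paper's adaptive scheduler and its SideCar stop/restart mechanism are substantially more involved, and the speedup curves below are invented numbers.

```python
# Toy sketch of progress-aware GPU allocation: give each free GPU to
# the job whose throughput improves most from one more GPU. This is a
# generic illustration, not the paper's scheduling algorithm; the
# speedup curves are hypothetical.

def allocate(jobs, total_gpus):
    """jobs: {name: curve}, where curve[g] is the job's throughput
    with g GPUs (curve[0] == 0). Returns {name: gpus_assigned}."""
    alloc = {name: 0 for name in jobs}
    for _ in range(total_gpus):
        def marginal(n):
            # gain from giving job n one more GPU (or -inf if its
            # curve has no data beyond the current allocation)
            g = alloc[n]
            return jobs[n][g + 1] - jobs[n][g] if g + 1 < len(jobs[n]) else float("-inf")
        best = max(jobs, key=marginal)
        alloc[best] += 1
    return alloc

curves = {
    "a": [0, 10, 18, 24, 28],  # scales well, diminishing returns
    "b": [0, 7, 12, 16, 19],
}
print(allocate(curves, 4))  # {'a': 3, 'b': 1}
```

Greedy allocation against concave speedup curves is a classic heuristic; the hard part the paper addresses is measuring those curves online from job progress and switching GPU counts without expensive job restarts.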
Improving All-to-Many Personalized Communication in Two-Phase I/O
Qiao Kang, R. Ross, R. Latham, Sunwoo Lee, Ankit Agrawal, A. Choudhary, W. Liao
DOI: 10.1109/SC41405.2020.00014 | Published: 2020-11-01
Abstract: As modern parallel computers enter the exascale era, the communication cost for redistributing requests becomes a significant bottleneck in MPI-IO routines. The communication kernel for request redistribution, which has an all-to-many personalized communication pattern for application programs with a large number of noncontiguous requests, plays an essential role in the overall performance. This paper explores the available communication kernels for two-phase I/O communication. We generalize the spread-out algorithm to adapt to the all-to-many communication pattern of two-phase I/O by reducing the communication straggler effect. Communication throttling methods that reduce communication contention for asynchronous MPI implementation are adopted to improve communication performance further. Experimental results are presented using different communication kernels running on Cray XC40 Cori and IBM AC922 Summit supercomputers with different I/O patterns. Our study shows that adjusting communication kernel algorithms for different I/O patterns can improve the end-to-end performance up to 10 times compared with default MPI-IO implementations.
Citations: 8

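For background, the classic spread-out schedule for all-to-all personalized exchange staggers communication so no receiver is hit by multiple senders in the same round: in round r, rank i sends to (i + r) mod p. The sketch below shows that contention-free structure; the paper's contribution is generalizing it to the all-to-many pattern of two-phase I/O, which this simple version does not capture.

```python
# Classic "spread-out" schedule for all-to-all personalized exchange
# among p ranks: in round r, rank i sends to (i + r) % p, so each
# round's destinations form a permutation (one sender per receiver).
# Background illustration only; the paper generalizes this to
# all-to-MANY patterns, which is not shown here.

def spread_out_schedule(p):
    """Return rounds[r][i] = destination rank of sender i in round r."""
    return [[(i + r) % p for i in range(p)] for r in range(p)]

rounds = spread_out_schedule(4)
# Every round is contention-free: no receiver gets two messages at
# once, which avoids hot spots and straggler rounds.
for dests in rounds:
    assert sorted(dests) == list(range(4))
print(rounds[1])  # [1, 2, 3, 0]
```

When only a subset of ranks are receivers (the I/O aggregators), a naive schedule concentrates senders on few targets in early rounds; spreading the targets evenly across rounds is what reduces the straggler effect the abstract mentions.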
Job Characteristics on Large-Scale Systems: Long-Term Analysis, Quantification, and Implications
Tirthak Patel, Zhengchun Liu, R. Kettimuthu, P. Rich, W. Allcock, Devesh Tiwari
DOI: 10.1109/SC41405.2020.00088 | Published: 2020-11-01
Abstract: HPC workload analysis and resource consumption characteristics are the key to driving better operation practices, system procurement decisions, and designing effective resource management techniques. Unfortunately, the HPC community does not have easy accessibility to long-term introspective workload analysis and characterization for production-scale HPC systems. This study bridges this gap by providing detailed long-term quantification, characterization, and analysis of job characteristics on two supercomputers: Intrepid and Mira. This study is one of the largest of its kind, covering trends and characteristics for over three billion compute hours, 750 thousand jobs, and spanning a decade. We confirm several pieces of long-held conventional wisdom, and identify many previously undiscovered trends and their implications. We also introduce a learning-based technique to predict the resource requirement of future jobs with high accuracy, using features available prior to the job submission and without requiring any application-specific tracing or application-intrusive instrumentation.
Citations: 35

FatPaths: Routing in Supercomputers and Data Centers when Shortest Paths Fall Short
Maciej Besta, Marcel Schneider, Marek Konieczny, Karolina Cynk, Erik Henriksson, S. D. Girolamo, Ankit Singla, T. Hoefler
DOI: 10.1109/SC41405.2020.00031 | Published: 2020-11-01
Abstract: We introduce FatPaths: a simple, generic, and robust routing architecture that enables state-of-the-art low-diameter topologies such as Slim Fly to achieve unprecedented performance. FatPaths targets Ethernet stacks in both HPC supercomputers as well as cloud data centers and clusters. FatPaths exposes and exploits the rich (“fat”) diversity of both minimal and non-minimal paths for high-performance multi-pathing. Moreover, FatPaths uses a redesigned “purified” transport layer that removes virtually all TCP performance issues (e.g., the slow start), and incorporates flowlet switching, a technique used to prevent packet reordering in TCP networks, to enable very simple and effective load balancing. Our design enables recent low-diameter topologies to outperform powerful Clos designs, achieving 15% higher net throughput at 2x lower latency for comparable cost. FatPaths will significantly accelerate Ethernet clusters that form more than 50% of the Top500 list and it may become a standard routing scheme for modern topologies. Extended paper version: https://arxiv.org/abs/1906.10885
Citations: 14

Alita: Comprehensive Performance Isolation through Bias Resource Management for Public Clouds
Quan Chen, Shuai Xue, Shang Zhao, Shanpei Chen, Yihao Wu, Yu Xu, Zhuo Song, Tao Ma, Yong Yang, M. Guo
DOI: 10.1109/SC41405.2020.00036 | Published: 2020-11-01
Abstract: The tenants of public cloud platforms share hardware resources on the same node, resulting in the potential for performance interference (or malicious attacks). A tenant is able to degrade the performance of its neighbors on the same node significantly through overuse of the shared memory bus, last level cache (LLC)/memory bandwidth, and power. To eliminate such unfairness, we propose Alita, a runtime system consisting of an online interference identifier and adaptive interference eliminator. The interference identifier monitors hardware and system-level event statistics to identify resource polluters. The eliminator improves the performance of normal applications by throttling only the resource usage of polluters. Specifically, Alita adopts bus lock sparsification, bias LLC/bandwidth isolation, and selective power throttling to throttle the resource usage of polluters. Results for an experimental platform and an in-production cloud platform with 30,000 nodes demonstrate that Alita significantly improves the performance of co-located virtual machines in the presence of resource polluters based on system-level knowledge.
Citations: 16

Cost-Aware Prediction of Uncorrected DRAM Errors in the Field
Isaac Boixaderas, D. Zivanovic, Sergi Moré, Javier Bartolome, David Vicente, Marc Casas, P. Carpenter, Petar Radojkovic, E. Ayguadé
DOI: 10.1109/SC41405.2020.00065 | Published: 2020-11-01
Abstract: This paper presents and evaluates a method to predict DRAM uncorrected errors, a leading cause of hardware failures in large-scale HPC clusters. The method uses a random forest classifier, which was trained and evaluated using error logs from two years of production of the MareNostrum 3 supercomputer. By enabling the system to take measures to mitigate node failures, our method reduces lost compute time by up to 57%, a net saving of 21,000 node-hours per year. We release all source code as open source. We also discuss and clarify aspects of methodology that are essential for a DRAM prediction method to be useful in practice. We explain why standard evaluation metrics, such as precision and recall, are insufficient, and base the evaluation on a cost-benefit analysis. This methodology can help ensure that any DRAM error predictor is free from training bias and has a clear cost-benefit calculation.
Citations: 15
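The cost-benefit framing the abstract argues for can be made concrete with a small calculation: a predictor pays off only if the node-hours saved by mitigating true positives exceed the node-hours spent acting on every positive prediction, false alarms included. The sketch below shows that arithmetic with invented numbers; it is not the paper's model or its measured values.

```python
# Sketch of cost-benefit evaluation for a failure predictor: net
# node-hours saved after paying the mitigation cost for every
# positive prediction (true and false alike). All numbers are
# hypothetical, not the paper's measurements.

def net_saving(true_pos, false_pos, hours_lost_per_failure,
               mitigation_cost_hours):
    """Node-hours saved minus node-hours spent on mitigations."""
    saved = true_pos * hours_lost_per_failure
    spent = (true_pos + false_pos) * mitigation_cost_hours
    return saved - spent

# 40 correctly predicted failures at 100 node-hours each, against
# 200 false alarms at 5 node-hours of mitigation apiece.
print(net_saving(40, 200, 100, 5))  # 2800
```

This is why precision and recall alone are insufficient: the same precision can yield a positive or negative net saving depending on the relative costs of a missed failure and a needless mitigation.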