Latest Publications from SC20: International Conference for High Performance Computing, Networking, Storage and Analysis

[Copyright notice]
DOI: 10.1109/sc41405.2020.00002 | Published: 2020-11-01
Citations: 0

Efficient Tiled Sparse Matrix Multiplication through Matrix Signatures
Süreyya Emre Kurt, Aravind Sukumaran-Rajam, F. Rastello, P. Sadayappan
DOI: 10.1109/SC41405.2020.00091 | Published: 2020-11-01
Abstract: Tiling is a key technique to reduce data movement in matrix computations. While tiling is well understood and widely used for dense matrix/tensor computations, effective tiling of sparse matrix computations remains a challenging problem. This paper proposes a novel method to efficiently summarize the impact of the sparsity structure of a matrix on achievable data reuse as a one-dimensional signature, which is then used to build an analytical cost model for tile size optimization for sparse matrix computations. The proposed model-driven approach to sparse tiling is evaluated on two key sparse matrix kernels: Sparse Matrix - Dense Matrix Multiplication (SpMM) and Sampled Dense-Dense Matrix Multiplication (SDDMM). Experimental results demonstrate that model-based tiled SpMM and SDDMM achieve high performance relative to the current state-of-the-art.
Citations: 16

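To make the notion of tiling a sparse kernel concrete, here is a minimal sketch of SpMM (C = A * B, with A sparse in CSR form) that processes the dense matrix B in column tiles for data reuse. This illustrates the generic technique only; the signature-based tile-size selection the paper proposes is not reproduced here, and the `tile` parameter is an arbitrary illustrative choice.

```python
# Minimal sketch of tiled SpMM (C = A * B) with A in CSR form.
# Generic illustration of column tiling for data reuse, NOT the
# paper's signature-based tile-size optimization.

def spmm_tiled(indptr, indices, data, B, n_cols, tile=2):
    """A is CSR (indptr, indices, data); B is a dense row-major
    list of lists with n_cols columns. Returns C as list of lists."""
    n_rows = len(indptr) - 1
    C = [[0.0] * n_cols for _ in range(n_rows)]
    # Process B's columns in tiles so a tile of B can stay
    # cache-resident while we stream over A's nonzeros.
    for c0 in range(0, n_cols, tile):
        c1 = min(c0 + tile, n_cols)
        for i in range(n_rows):
            for p in range(indptr[i], indptr[i + 1]):
                k, v = indices[p], data[p]
                for j in range(c0, c1):
                    C[i][j] += v * B[k][j]
    return C

# 2x3 sparse A = [[1, 0, 2], [0, 3, 0]] times a 3x2 dense B.
indptr, indices, data = [0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(spmm_tiled(indptr, indices, data, B, 2))  # [[3.0, 2.0], [0.0, 3.0]]
```

The tile size trades off reuse of B against the working-set size; the paper's contribution is choosing it analytically from the sparsity structure rather than by exhaustive search.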
GVPROF: A Value Profiler for GPU-Based Clusters
K. Zhou, Yueming Hao, J. Mellor-Crummey, Xiaozhu Meng, Xu Liu
DOI: 10.1109/SC41405.2020.00093 | Published: 2020-11-01
Abstract: GPGPUs are widely used in high-performance computing systems to accelerate scientific and machine learning workloads. Developing efficient GPU kernels is critically important to obtain “bare-metal” performance on GPU-based clusters. In this paper, we describe the design and implementation of GVPROF, the first value profiler that pinpoints value-related inefficiencies in applications running on NVIDIA GPU-based clusters. The novelty of GVPROF resides in its ability to detect temporal and spatial value redundancies, which provides useful information to guide code optimization. GVPROF can monitor production multi-node multi-GPU executions in clusters. Our experiments with well-known GPU benchmarks and HPC applications show that GVPROF incurs acceptable overhead and scales to large executions. Using GVPROF, we optimized several HPC and machine learning workloads on one NVIDIA V100 GPU. In one case study of LAMMPS, optimizations based on information from GVPROF led to whole-program speedups ranging from 1.37x on a single GPU to 1.08x on 64 GPUs.
Citations: 16

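"Temporal value redundancy" means the same value is repeatedly stored to or loaded from the same location, wasting bandwidth. The sketch below detects it on a simple (address, value) trace; this trace format is a simplifying assumption for illustration, not GVPROF's actual GPU instrumentation.

```python
# Illustrative detector for temporal value redundancy on a memory
# trace of (address, value) pairs. The trace format is a hypothetical
# simplification, not GVPROF's real binary instrumentation.

def temporal_redundancy(trace):
    """Fraction of accesses that observe the value already held at
    the same address (redundant stores/loads)."""
    last = {}
    redundant = 0
    for addr, value in trace:
        if last.get(addr) == value:
            redundant += 1  # same value seen again at this address
        last[addr] = value
    return redundant / len(trace) if trace else 0.0

trace = [(0x10, 1.0), (0x10, 1.0), (0x20, 2.0), (0x10, 1.0), (0x20, 3.0)]
print(temporal_redundancy(trace))  # 0.4 (2 of 5 accesses are redundant)
```

A high redundancy fraction suggests a kernel could cache the value in registers or skip the write entirely, which is the style of optimization the profiler's reports are meant to guide.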
RDMP-KV: Designing Remote Direct Memory Persistence based Key-Value Stores with PMEM
Tianxi Li, D. Shankar, Shashank Gugnani, Xiaoyi Lu
DOI: 10.1109/SC41405.2020.00056 | Published: 2020-11-01
Abstract: Byte-addressable persistent memory (PMEM) can be directly manipulated by Remote Direct Memory Access (RDMA) capable networks. However, existing studies to combine RDMA and PMEM cannot deliver the desired performance due to their PMEM-oblivious communication protocols. In this paper, we propose novel PMEM-aware RDMA-based communication protocols for persistent key-value stores, referred to as Remote Direct Memory Persistence based Key-Value stores (RDMP-KV). RDMP-KV employs a hybrid ‘server-reply/server-bypass’ approach to ‘durably’ store individual key-value objects on PMEM-equipped servers. RDMP-KV’s runtime can easily adapt to existing (server-assisted durability) and emerging (appliance durability) RDMA-capable interconnects, while ensuring server scalability through a lightweight consistency scheme. Performance evaluations show that RDMP-KV can improve the server-side performance with different persistent key-value storage architectures by up to 22x, as compared with PMEM-oblivious RDMA-‘Server-Reply’ protocols. Our evaluations also show that RDMP-KV outperforms a distributed PMEM-based filesystem by up to 65% and a recent RDMA-to-PMEM framework by up to 71%.
Citations: 2

An Efficient and Non-Intrusive GPU Scheduling Framework for Deep Learning Training Systems
Shaoqi Wang, O. J. Gonzalez, Xiaobo Zhou, Thomas Williams, Brian D. Friedman, Martin Havemann, Thomas Y. C. Woo
DOI: 10.1109/SC41405.2020.00094 | Published: 2020-11-01
Abstract: Efficient GPU scheduling is the key to minimizing the execution time of Deep Learning (DL) training workloads. DL training system schedulers typically allocate a fixed number of GPUs to each job, which inhibits high resource utilization and often extends the overall training time. The recent introduction of schedulers that can dynamically reallocate GPUs has achieved better cluster efficiency. This dynamic nature, however, introduces additional overhead by terminating and restarting jobs or requires modification to the DL training frameworks. We propose and develop an efficient, non-intrusive GPU scheduling framework that employs a combination of an adaptive GPU scheduler and an elastic GPU allocation mechanism to reduce the completion time of DL training workloads and improve resource utilization. Specifically, the adaptive GPU scheduler includes a scheduling algorithm that uses training job progress information to determine the most efficient allocation and reallocation of GPUs for incoming and running jobs at any given time. The elastic GPU allocation mechanism works in concert with the scheduler. It offers a lightweight and non-intrusive method to reallocate GPUs based on a “SideCar” process that temporarily stops and restarts the job’s DL training process with a different number of GPUs. We implemented the scheduling framework as plugins in Kubernetes and conducted evaluations on two 16-GPU clusters with multiple training jobs based on TensorFlow. Results show that our proposed scheduling framework reduces the overall execution time and the average job completion time by up to 45% and 63%, respectively, compared to the Kubernetes default scheduler. Compared to a termination-based scheduler, our framework reduces the overall execution time and the average job completion time by up to 20% and 37%, respectively.
Citations: 16

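One way to picture progress-driven reallocation is a greedy loop that hands each free GPU to the job with the best marginal throughput gain. This toy sketch illustrates that general idea only; the paper's adaptive scheduler and its SideCar stop/restart mechanism are substantially more involved, and the speedup curves below are invented numbers.

```python
# Toy sketch of progress-aware GPU allocation: give each free GPU to
# the job whose throughput improves most from one more GPU. This is a
# generic illustration, not the paper's scheduling algorithm; the
# speedup curves are hypothetical.

def allocate(jobs, total_gpus):
    """jobs: {name: curve}, where curve[g] is the job's throughput
    with g GPUs (curve[0] == 0). Returns {name: gpus_assigned}."""
    alloc = {name: 0 for name in jobs}
    for _ in range(total_gpus):
        def marginal(n):
            # gain from giving job n one more GPU (or -inf if its
            # curve has no data beyond the current allocation)
            g = alloc[n]
            return jobs[n][g + 1] - jobs[n][g] if g + 1 < len(jobs[n]) else float("-inf")
        best = max(jobs, key=marginal)
        alloc[best] += 1
    return alloc

curves = {
    "a": [0, 10, 18, 24, 28],  # scales well, diminishing returns
    "b": [0, 7, 12, 16, 19],
}
print(allocate(curves, 4))  # {'a': 3, 'b': 1}
```

Greedy allocation against concave speedup curves is a classic heuristic; the hard part the paper addresses is measuring those curves online from job progress and switching GPU counts without expensive job restarts.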
Improving All-to-Many Personalized Communication in Two-Phase I/O
Qiao Kang, R. Ross, R. Latham, Sunwoo Lee, Ankit Agrawal, A. Choudhary, W. Liao
DOI: 10.1109/SC41405.2020.00014 | Published: 2020-11-01
Abstract: As modern parallel computers enter the exascale era, the communication cost for redistributing requests becomes a significant bottleneck in MPI-IO routines. The communication kernel for request redistribution, which has an all-to-many personalized communication pattern for application programs with a large number of noncontiguous requests, plays an essential role in the overall performance. This paper explores the available communication kernels for two-phase I/O communication. We generalize the spread-out algorithm to adapt to the all-to-many communication pattern of two-phase I/O by reducing the communication straggler effect. Communication throttling methods that reduce communication contention for asynchronous MPI implementation are adopted to improve communication performance further. Experimental results are presented using different communication kernels running on Cray XC40 Cori and IBM AC922 Summit supercomputers with different I/O patterns. Our study shows that adjusting communication kernel algorithms for different I/O patterns can improve the end-to-end performance up to 10 times compared with default MPI-IO implementations.
Citations: 8

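For background, the classic spread-out schedule for all-to-all personalized exchange staggers communication so no receiver is hit by multiple senders in the same round: in round r, rank i sends to (i + r) mod p. The sketch below shows that contention-free structure; the paper's contribution is generalizing it to the all-to-many pattern of two-phase I/O, which this simple version does not capture.

```python
# Classic "spread-out" schedule for all-to-all personalized exchange
# among p ranks: in round r, rank i sends to (i + r) % p, so each
# round's destinations form a permutation (one sender per receiver).
# Background illustration only; the paper generalizes this to
# all-to-MANY patterns, which is not shown here.

def spread_out_schedule(p):
    """Return rounds[r][i] = destination rank of sender i in round r."""
    return [[(i + r) % p for i in range(p)] for r in range(p)]

rounds = spread_out_schedule(4)
# Every round is contention-free: no receiver gets two messages at
# once, which avoids hot spots and straggler rounds.
for dests in rounds:
    assert sorted(dests) == list(range(4))
print(rounds[1])  # [1, 2, 3, 0]
```

When only a subset of ranks are receivers (the I/O aggregators), a naive schedule concentrates senders on few targets in early rounds; spreading the targets evenly across rounds is what reduces the straggler effect the abstract mentions.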
Job Characteristics on Large-Scale Systems: Long-Term Analysis, Quantification, and Implications
Tirthak Patel, Zhengchun Liu, R. Kettimuthu, P. Rich, W. Allcock, Devesh Tiwari
DOI: 10.1109/SC41405.2020.00088 | Published: 2020-11-01
Abstract: HPC workload analysis and resource consumption characteristics are the key to driving better operation practices, system procurement decisions, and designing effective resource management techniques. Unfortunately, the HPC community does not have easy accessibility to long-term introspective workload analysis and characterization for production-scale HPC systems. This study bridges this gap by providing detailed long-term quantification, characterization, and analysis of job characteristics on two supercomputers: Intrepid and Mira. This study is one of the largest of its kind, covering trends and characteristics for over three billion compute hours, 750 thousand jobs, and spanning a decade. We confirm several pieces of long-held conventional wisdom, and identify many previously undiscovered trends and their implications. We also introduce a learning-based technique to predict the resource requirement of future jobs with high accuracy, using features available prior to the job submission and without requiring any application-specific tracing or application-intrusive instrumentation.
Citations: 35

FatPaths: Routing in Supercomputers and Data Centers when Shortest Paths Fall Short
Maciej Besta, Marcel Schneider, Marek Konieczny, Karolina Cynk, Erik Henriksson, S. D. Girolamo, Ankit Singla, T. Hoefler
DOI: 10.1109/SC41405.2020.00031 | Published: 2020-11-01
Abstract: We introduce FatPaths: a simple, generic, and robust routing architecture that enables state-of-the-art low-diameter topologies such as Slim Fly to achieve unprecedented performance. FatPaths targets Ethernet stacks in both HPC supercomputers as well as cloud data centers and clusters. FatPaths exposes and exploits the rich (“fat”) diversity of both minimal and non-minimal paths for high-performance multi-pathing. Moreover, FatPaths uses a redesigned “purified” transport layer that removes virtually all TCP performance issues (e.g., the slow start), and incorporates flowlet switching, a technique used to prevent packet reordering in TCP networks, to enable very simple and effective load balancing. Our design enables recent low-diameter topologies to outperform powerful Clos designs, achieving 15% higher net throughput at 2x lower latency for comparable cost. FatPaths will significantly accelerate Ethernet clusters that form more than 50% of the Top500 list and it may become a standard routing scheme for modern topologies. Extended paper version: https://arxiv.org/abs/1906.10885
Citations: 14

Alita: Comprehensive Performance Isolation through Bias Resource Management for Public Clouds
Quan Chen, Shuai Xue, Shang Zhao, Shanpei Chen, Yihao Wu, Yu Xu, Zhuo Song, Tao Ma, Yong Yang, M. Guo
DOI: 10.1109/SC41405.2020.00036 | Published: 2020-11-01
Abstract: The tenants of public cloud platforms share hardware resources on the same node, resulting in the potential for performance interference (or malicious attacks). A tenant is able to degrade the performance of its neighbors on the same node significantly through overuse of the shared memory bus, last level cache (LLC)/memory bandwidth, and power. To eliminate such unfairness, we propose Alita, a runtime system consisting of an online interference identifier and adaptive interference eliminator. The interference identifier monitors hardware and system-level event statistics to identify resource polluters. The eliminator improves the performance of normal applications by throttling only the resource usage of polluters. Specifically, Alita adopts bus lock sparsification, bias LLC/bandwidth isolation, and selective power throttling to throttle the resource usage of polluters. Results for an experimental platform and an in-production cloud platform with 30,000 nodes demonstrate that Alita significantly improves the performance of co-located virtual machines in the presence of resource polluters based on system-level knowledge.
Citations: 16

Cost-Aware Prediction of Uncorrected DRAM Errors in the Field
Isaac Boixaderas, D. Zivanovic, Sergi Moré, Javier Bartolome, David Vicente, Marc Casas, P. Carpenter, Petar Radojkovic, E. Ayguadé
DOI: 10.1109/SC41405.2020.00065 | Published: 2020-11-01
Abstract: This paper presents and evaluates a method to predict DRAM uncorrected errors, a leading cause of hardware failures in large-scale HPC clusters. The method uses a random forest classifier, which was trained and evaluated using error logs from two years of production of the MareNostrum 3 supercomputer. By enabling the system to take measures to mitigate node failures, our method reduces lost compute time by up to 57%, a net saving of 21,000 node-hours per year. We release all source code as open source. We also discuss and clarify aspects of methodology that are essential for a DRAM prediction method to be useful in practice. We explain why standard evaluation metrics, such as precision and recall, are insufficient, and base the evaluation on a cost-benefit analysis. This methodology can help ensure that any DRAM error predictor is free from training bias and has a clear cost-benefit calculation.
Citations: 15
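The cost-benefit framing the abstract argues for can be made concrete with a small calculation: a predictor pays off only if the node-hours saved by mitigating true positives exceed the node-hours spent acting on every positive prediction, false alarms included. The sketch below shows that arithmetic with invented numbers; it is not the paper's model or its measured values.

```python
# Sketch of cost-benefit evaluation for a failure predictor: net
# node-hours saved after paying the mitigation cost for every
# positive prediction (true and false alike). All numbers are
# hypothetical, not the paper's measurements.

def net_saving(true_pos, false_pos, hours_lost_per_failure,
               mitigation_cost_hours):
    """Node-hours saved minus node-hours spent on mitigations."""
    saved = true_pos * hours_lost_per_failure
    spent = (true_pos + false_pos) * mitigation_cost_hours
    return saved - spent

# 40 correctly predicted failures at 100 node-hours each, against
# 200 false alarms at 5 node-hours of mitigation apiece.
print(net_saving(40, 200, 100, 5))  # 2800
```

This is why precision and recall alone are insufficient: the same precision can yield a positive or negative net saving depending on the relative costs of a missed failure and a needless mitigation.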