Reliability of Large Scale GPU Clusters for Deep Learning Workloads

Companion Proceedings of the Web Conference 2021, Pub Date: 2021-04-19, DOI: 10.1145/3442442.3452056

Junjie Qian, Taeyoon Kim, Myeongjae Jeon

Citations: 2

Abstract

Recent advances in deep learning have made GPU clusters popular as training platforms. In this paper, we study their reliability, focusing on training job failures, by analyzing logs collected from deep learning workloads running on a large-scale production GPU cluster. We group these failures into two broad categories by source, infrastructure and user, and find that diverse causes underlie them. Using insights from the failure analysis, we suggest several ways to improve the stability of shared GPU clusters designed for DL training and to improve user experience by reducing failure occurrences.


