Problems and Opportunities in Training Deep Learning Software Systems: An Analysis of Variance

2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE). Pub Date: 2020-09-01. DOI: 10.1145/3324884.3416545

H. Pham, Shangshu Qian, Jiannan Wang, Thibaud Lutellier, Jonathan Rosenthal, Lin Tan, Yaoliang Yu, Nachiappan Nagappan


Citations: 92

Abstract

Deep learning (DL) training algorithms utilize nondeterminism to improve models' accuracy and training efficiency. Hence, multiple identical training runs (e.g., identical training data, algorithm, and network) produce different models with different accuracies and training times. In addition to these algorithmic factors, DL libraries (e.g., TensorFlow and cuDNN) introduce additional variance (referred to as implementation-level variance) due to parallelism, optimization, and floating-point computation. This work is the first to study the variance of DL systems and the awareness of this variance among researchers and practitioners. Our experiments on three datasets with six popular networks show large overall accuracy differences among identical training runs. Even after excluding weak models, the accuracy difference is 10.8%. In addition, implementation-level factors alone cause the accuracy difference across identical training runs to be up to 2.9%, the per-class accuracy difference to be up to 52.4%, and the training time difference to be up to 145.3%. All core libraries (TensorFlow, CNTK, and Theano) and low-level libraries (e.g., cuDNN) exhibit implementation-level variance across all evaluated versions. Our researcher and practitioner survey shows that 83.8% of the 901 participants are unaware of or unsure about any implementation-level variance. In addition, our literature survey shows that only 19.5±3% of papers in recent top software engineering (SE), artificial intelligence (AI), and systems conferences use multiple identical training runs to quantify the variance of their DL approaches. This paper raises awareness of DL variance and directs SE researchers to challenging tasks such as creating deterministic DL implementations to facilitate debugging and improving the reproducibility of DL software and results.
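The two ideas in the abstract, measuring the spread across identical training runs and floating-point computation as one source of implementation-level variance, can be illustrated with a minimal Python sketch. It is not the authors' tooling: the function names `summarize_runs` and `sum_in_order` are hypothetical, and the accuracy values are made-up placeholders rather than results from the paper.

```python
import statistics


def summarize_runs(accuracies):
    """Summarize accuracies from repeated, identically configured training runs."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to measure variance")
    return {
        "runs": len(accuracies),
        "mean": statistics.mean(accuracies),
        "stdev": statistics.stdev(accuracies),  # sample standard deviation
        "max_diff": max(accuracies) - min(accuracies),  # largest gap between runs
    }


def sum_in_order(values):
    """Add values strictly left to right, mimicking one fixed reduction order."""
    total = 0.0
    for v in values:
        total += v
    return total


# Floating-point addition is not associative, so the order in which a
# parallel reduction combines partial sums can change the result slightly.
left_to_right = sum_in_order([0.1, 0.2, 0.3])
right_to_left = sum_in_order([0.3, 0.2, 0.1])

# Hypothetical accuracies from three identical runs (placeholder values).
example_summary = summarize_runs([0.90, 0.92, 0.91])

if __name__ == "__main__":
    print("orders agree:", left_to_right == right_to_left)
    print(example_summary)
```

Reporting `max_diff` and `stdev` over several identical runs, instead of a single accuracy figure, is the kind of variance quantification the literature survey in the abstract counts.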

