Distributed MLPerf ResNet50 Training on Intel Xeon Architectures with TensorFlow

The International Conference on High Performance Computing in Asia-Pacific Region Companion, Pub Date: 2021-01-20, DOI: 10.1145/3440722.3440880

Wei Wang, N. Hasabnis

Abstract

MLPerf benchmarks, which measure the training and inference performance of ML hardware and software, have published three sets of ML training results so far. In all of them, ResNet50v1.5 was used as a standard benchmark to showcase the latest developments in image recognition. The latest MLPerf training round (v0.7) featured Intel's submission with TensorFlow. In this paper, we describe the recent optimization work that enabled this submission. In particular, we enabled the BFloat16 data type in the ResNet50v1.5 model as well as in Intel-optimized TensorFlow to exploit the full potential of 3rd generation Intel Xeon Scalable processors, which have built-in BFloat16 support. We also describe the performance optimizations and the state-of-the-art accuracy/convergence results of the ResNet50v1.5 model, achieved with large-scale distributed training (with up to 256 MPI workers) using Horovod. These results lay a strong foundation for future MLPerf training submissions on large-scale Intel Xeon clusters.
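For readers unfamiliar with the format: BFloat16 keeps float32's sign bit and 8-bit exponent but only 7 explicit mantissa bits, so it is the upper half of a float32 encoding. That is why it preserves float32's dynamic range while halving storage. The plain-Python sketch below illustrates narrowing a float32 value to BFloat16 with round-to-nearest-even and widening it back. It illustrates the number format only, assuming round-to-nearest-even. It is not the TensorFlow or processor-level BFloat16 implementation described in the paper, and the helper names are hypothetical.

```python
import math
import struct


def float32_to_bits(x: float) -> int:
    """Return the IEEE 754 single-precision bit pattern of x as an unsigned int."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_float32(bits: int) -> float:
    """Interpret a 32-bit unsigned int as an IEEE 754 single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def to_bfloat16_bits(x: float) -> int:
    """Narrow x (via float32) to a 16-bit BFloat16 pattern, rounding to nearest even.

    BFloat16 = float32 sign bit + 8 exponent bits + top 7 mantissa bits,
    i.e. the upper 16 bits of the float32 encoding.
    """
    bits = float32_to_bits(x)
    if math.isnan(x):
        # Keep NaN a NaN: set a quiet-NaN mantissa bit so truncation cannot yield Inf.
        return (bits >> 16) | 0x0040
    # Round to nearest, ties to even: add 0x7FFF plus the lowest kept bit, then truncate.
    lowest_kept_bit = (bits >> 16) & 1
    rounded = bits + 0x7FFF + lowest_kept_bit
    return (rounded >> 16) & 0xFFFF


def from_bfloat16_bits(b: int) -> float:
    """Widen a BFloat16 pattern back to float32 by appending 16 zero bits."""
    return bits_to_float32((b & 0xFFFF) << 16)


if __name__ == "__main__":
    # pi loses precision (only 8 significant bits survive) ...
    print(from_bfloat16_bits(to_bfloat16_bits(math.pi)))  # 3.140625
    # ... but very large magnitudes stay finite, unlike IEEE float16 (max 65504).
    print(from_bfloat16_bits(to_bfloat16_bits(1e38)))
```

The design trade-off this shows is the one BFloat16 makes deliberately. Precision drops to roughly two to three significant decimal digits, but the exponent range matches float32, so values that would overflow IEEE float16 remain representable without loss scaling.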