Efficient Dual Batch Size Deep Learning for Distributed Parameter Server Systems

Kuan-Wei Lu, Pangfeng Liu, Ding-Yong Hong, Jan-Jan Wu
{"title":"Efficient Dual Batch Size Deep Learning for Distributed Parameter Server Systems","authors":"Kuan-Wei Lu, Pangfeng Liu, Ding-Yong Hong, Jan-Jan Wu","doi":"10.1109/COMPSAC54236.2022.00110","DOIUrl":null,"url":null,"abstract":"Distributed machine learning is essential for applying deep learning models with many data and parameters. Current researches on distributed machine learning focus on using more hardware devices powerful computing units for fast training. Consequently, the model training prefers a larger batch size to accelerate the training speed. However, the large batch training often suffers from poor accuracy due to poor generalization ability. Researchers have come up with many sophisticated methods to address this accuracy issue due to large batch sizes. These methods usually have complex mechanisms, thus making training more difficult. In addition, powerful training hardware for large batch sizes is expensive, and not all researchers can afford it. We propose a dual batch size learning scheme to address the batch size issue. We use the maximum batch size of our hardware for maximum training efficiency we can afford. In addition, we introduce a smaller batch size during the training to improve the model generalization ability. Using two different batch sizes in the same training simultaneously will reduce the testing loss and obtain a good generalization ability, with only a slight increase in the training time. We implement our dual batch size learning scheme and conduct experiments. By increasing 5% of the training time, we can reduce the loss from 1.429 to 1.246 in some cases. In addition, by appropriately adjusting the percentage of large and small batch sizes, we can increase the accuracy by 2.8% in some cases. With the additional 10% increase in training time, we can reduce the loss from 1.429 to 1.193. And after moderately adjusting the number of large batches and small batches used by GPUs, the accuracy can increase by 2.9%. Using two different batch sizes in the same training introduces two complications. First, the data processing speeds for two different batch sizes are different, so we must assign the data proportionally to maximize the overall processing speed. In addition, since the smaller batches will see fewer data due to the overall processing speed consideration, we proportionally adjust their contribution towards the global weight update in the parameter server. We use the ratio of data between the small and large batches to adjust the contribution. Experimental results indicate that this contribution adjustment increases the final accuracy by another 0.9%.","PeriodicalId":330838,"journal":{"name":"2022 IEEE 46th Annual Computers, Software, and Applications Conference (COMPSAC)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE 46th Annual Computers, Software, and Applications Conference (COMPSAC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/COMPSAC54236.2022.00110","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Distributed machine learning is essential for training deep learning models with large amounts of data and many parameters. Current research on distributed machine learning focuses on using more hardware devices with powerful computing units for fast training. Consequently, model training prefers a larger batch size to accelerate training. However, large-batch training often suffers from poor accuracy due to poor generalization. Researchers have proposed many sophisticated methods to address this accuracy issue caused by large batch sizes. These methods usually rely on complex mechanisms, which make training more difficult. In addition, powerful training hardware for large batch sizes is expensive, and not all researchers can afford it. We propose a dual batch size learning scheme to address the batch size issue. We use the maximum batch size our hardware supports for the highest training efficiency we can afford. In addition, we introduce a smaller batch size during training to improve the model's generalization ability. Using two different batch sizes simultaneously in the same training run reduces the test loss and achieves good generalization, with only a slight increase in training time. We implement our dual batch size learning scheme and conduct experiments. By increasing the training time by 5%, we can reduce the loss from 1.429 to 1.246 in some cases. In addition, by appropriately adjusting the percentage of large and small batch sizes, we can increase the accuracy by 2.8% in some cases. With a 10% increase in training time, we can reduce the loss from 1.429 to 1.193, and after moderately adjusting the number of large and small batches used by the GPUs, the accuracy increases by 2.9%. Using two different batch sizes in the same training run introduces two complications. First, the data processing speeds of the two batch sizes differ, so we must assign data proportionally to maximize the overall processing speed. Second, since the smaller batches see less data due to this speed-based assignment, we proportionally adjust their contribution to the global weight update in the parameter server, using the ratio of data between the small and large batches. Experimental results indicate that this contribution adjustment increases the final accuracy by another 0.9%.
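As a rough illustration of the two complications above, the sketch below shows how a parameter server might (a) split an epoch's data across workers in proportion to their measured throughput and (b) scale each worker's gradient by the ratio of data it processed before folding it into the global update. The worker names, throughput numbers, and the exact scaling rule are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

# Hypothetical per-worker throughput (samples/sec): w0 and w1 run the large
# batch size, w2 runs the small batch size. All numbers are illustrative.
speeds = {"w0": 1200.0, "w1": 1200.0, "w2": 800.0}
total_speed = sum(speeds.values())

# Complication 1: assign each worker a share of the epoch's data proportional
# to its throughput, so all workers finish their shards at about the same time.
epoch_samples = 50_000
shares = {w: int(epoch_samples * s / total_speed) for w, s in speeds.items()}

# Complication 2: the small-batch worker sees fewer samples, so its gradient's
# contribution to the global update is scaled by the ratio of its data share
# to a large-batch worker's share (one plausible reading of the abstract).
def aggregate(gradients, shares, large_share):
    update = np.zeros_like(next(iter(gradients.values())))
    for worker, grad in gradients.items():
        scale = shares[worker] / large_share  # 1.0 for large-batch workers
        update += scale * grad
    return update / len(gradients)

# Toy parameter-server step with 1-D "gradients" standing in for real ones.
rng = np.random.default_rng(0)
grads = {w: rng.normal(size=4) for w in speeds}
weights = np.zeros(4)
weights -= 0.1 * aggregate(grads, shares, large_share=shares["w0"])
print(shares)
print(weights)
```

In the paper's setting the adjustment happens inside the parameter server's weight update; the sketch only aims to show where the data ratio between small and large batches enters the computation.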