Iterative machine learning (IterML) for effective parameter pruning and tuning in accelerators

Proceedings of the 16th ACM International Conference on Computing Frontiers. Published: 2019-04-30. DOI: 10.1145/3310273.3321563

Xuewen Cui, Wu-chun Feng


Citations: 5

Abstract

With the rise of accelerators (e.g., GPUs, FPGAs, and APUs) in computing systems, the parallel computing community needs better tools and mechanisms with which to productively extract performance. While modern compilers provide flags that activate different optimizations, the effectiveness of such automated optimization depends on the algorithm and its mapping to the underlying accelerator architecture. Currently, extracting the best performance from an algorithm on an accelerator requires significant expertise and manual effort to exploit both spatial and temporal sharing of computing resources. In particular, maximizing the performance of an algorithm on an accelerator requires extensive hyperparameter (e.g., thread-block size) selection and tuning. Given the many hyperparameter dimensions to optimize across, the search space of optimizations is generally extremely large, making it infeasible to exhaustively evaluate every optimization configuration. This paper proposes an approach that uses statistical analysis with iterative machine learning (IterML) to prune and tune hyperparameters to achieve better performance. During each iteration, we leverage machine-learning (ML) models to provide pruning and tuning guidance for the subsequent iterations. We evaluate our IterML approach on the selection of the GPU thread-block size across many benchmarks running on an NVIDIA P100 or V100 GPU. The experimental results show that our IterML approach can significantly reduce the search effort, by 40% to 80%.
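The general shape of an iterative prune-and-tune loop over thread-block sizes can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the models or pruning rule, so the sketch swaps in a 1-nearest-neighbor surrogate and a fixed keep fraction purely for illustration. The names `synthetic_runtime`, `predict`, and `iterative_prune_and_tune`, and the parameters `samples_per_iter` and `keep_fraction`, are all hypothetical, and the runtime function is a synthetic stand-in for timing a real kernel.

```python
import random


def synthetic_runtime(block_size):
    """Hypothetical stand-in for timing a kernel; fastest at 256 threads."""
    return abs(block_size - 256) / 256 + 1.0


def predict(block_size, measured):
    """1-nearest-neighbor surrogate (an assumption, not the paper's model):
    reuse the runtime of the closest block size measured so far."""
    nearest = min(measured, key=lambda m: abs(m - block_size))
    return measured[nearest]


def iterative_prune_and_tune(candidates, measure, samples_per_iter=4,
                             keep_fraction=0.5, seed=0):
    """Measure a few configurations per iteration, then use the surrogate
    to prune the candidates predicted to be slowest."""
    rng = random.Random(seed)
    remaining = sorted(candidates)
    measured = {}
    while True:
        unmeasured = [c for c in remaining if c not in measured]
        if not unmeasured:
            break  # every surviving candidate has a real measurement
        # Tuning step: spend a small measurement budget on this iteration.
        batch = rng.sample(unmeasured, min(samples_per_iter, len(unmeasured)))
        for c in batch:
            measured[c] = measure(c)
        # Pruning step: rank survivors by predicted runtime, keep the best.
        ranked = sorted(remaining, key=lambda c: predict(c, measured))
        keep = max(1, int(len(ranked) * keep_fraction))
        remaining = sorted(ranked[:keep])
    best = min(measured, key=measured.get)
    return best, len(measured)


# Usage: 32 candidate block sizes (multiples of the 32-thread warp size).
candidates = list(range(32, 1025, 32))
best, evaluations = iterative_prune_and_tune(candidates, synthetic_runtime)
print(f"best block size: {best}, evaluations: {evaluations} of {len(candidates)}")
```

With these settings the loop measures at most 16 of the 32 candidates, which shows how pruning bounds the search effort. The sketch gives no guarantee of finding the true optimum, since a weak surrogate can prune it away, and the savings here say nothing about the 40% to 80% reported in the paper.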

