Efficient Convex Optimization on GPUs for Embedded Model Predictive Control

Leiming Yu, A. Goldsmith, S. D. Cairano
{"title":"Efficient Convex Optimization on GPUs for Embedded Model Predictive Control","authors":"Leiming Yu, A. Goldsmith, S. D. Cairano","doi":"10.1145/3038228.3038234","DOIUrl":null,"url":null,"abstract":"GPU applications have traditionally run on PCs or in larger scale systems. With the introduction of the Tegra line of mobile processors, NVIDIA expanded the types of systems that can exploit the massive parallelism offered by GPU computing architectures. In this paper, we evaluate the suitability of the Tegra X1 processor as a platform for embedded model predictive control. MPC relies on the real time solution of a convex optimization problem to compute the control input(s) to a system. Relative to traditional control techniques such as PID, MPC is very computationally demanding. Quadratic programming algorithms for the solution of convex optimization problems generally lend themselves to parallelization. However, until the introduction of the Tegra, there has never been an off-the-shelf embedded processor that would enable a massively parallel embedded implementation. We investigate two different gradient based algorithms, ADMM and PQP, for solving the QP that occurs in a large class of MPC problems. The performance of these algorithms is dominated by the performance of matrix-matrix and matrix-vector products. Our work focuses on maximizing the performance of these operations for relatively small matrices of 100 to 1000 elements per dimension, which are common in the MPC control implementations found in automotive and factory automation applications. Modern BLAS libraries for CPUs and GPUs are quantitatively evaluated. We create SGEMV kernels that can outperform the state-of-the-art cuBLAS by 2.3x on TX1. Different kernel fusion schemes utilizing concurrent kernel execution and zero copy mechanisms are investigated. For ADMM, our implementation achieves 46.6x speedup over the single threaded CPU version and 2.7x speedup over the optimized OpenBLAS version. For PQP, we achieve 41.2x speedup over the single threaded CPU version and 4.2x speedup over the OpenBLAS version.","PeriodicalId":108772,"journal":{"name":"Proceedings of the General Purpose GPUs","volume":"57 18","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"20","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the General Purpose GPUs","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3038228.3038234","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 20

Abstract

GPU applications have traditionally run on PCs or in larger-scale systems. With the introduction of the Tegra line of mobile processors, NVIDIA expanded the range of systems that can exploit the massive parallelism offered by GPU computing architectures. In this paper, we evaluate the suitability of the Tegra X1 processor as a platform for embedded model predictive control (MPC). MPC relies on the real-time solution of a convex optimization problem to compute the control input(s) to a system; relative to traditional control techniques such as PID, it is very computationally demanding. Quadratic programming (QP) algorithms for the solution of convex optimization problems generally lend themselves to parallelization. However, until the introduction of the Tegra, there was no off-the-shelf embedded processor that would enable a massively parallel embedded implementation. We investigate two gradient-based algorithms, ADMM and PQP, for solving the QPs that arise in a large class of MPC problems. The performance of these algorithms is dominated by the performance of matrix-matrix and matrix-vector products. Our work focuses on maximizing the performance of these operations for relatively small matrices of 100 to 1000 elements per dimension, which are common in the MPC implementations found in automotive and factory-automation applications. Modern BLAS libraries for CPUs and GPUs are quantitatively evaluated. We create SGEMV kernels that can outperform the state-of-the-art cuBLAS by 2.3x on the TX1. Different kernel-fusion schemes utilizing concurrent kernel execution and zero-copy mechanisms are investigated. For ADMM, our implementation achieves a 46.6x speedup over the single-threaded CPU version and a 2.7x speedup over the optimized OpenBLAS version. For PQP, we achieve a 41.2x speedup over the single-threaded CPU version and a 4.2x speedup over the OpenBLAS version.
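
For context, the QP that MPC solves at each sampling instant, and the kind of first-order iteration that ADMM applies to it, can be sketched as follows. This is the textbook box-constrained formulation with a fixed penalty parameter ρ; the paper's exact condensing, constraint set, and the PQP update rule may differ.

```latex
% Condensed MPC QP over the stacked input sequence z:
%   minimize    (1/2) z' Q z + q' z
%   subject to  z \in C   (e.g., elementwise bounds z_min <= z <= z_max)
%
% A standard ADMM splitting (consensus form z = v, scaled dual u) iterates:
\begin{align*}
  z^{k+1} &= (Q + \rho I)^{-1}\bigl(\rho\,(v^{k} - u^{k}) - q\bigr) \\
  v^{k+1} &= \Pi_{C}\!\bigl(z^{k+1} + u^{k}\bigr) \\
  u^{k+1} &= u^{k} + z^{k+1} - v^{k+1}
\end{align*}
```

With the matrix (Q + ρI)^{-1} computed once offline, each iteration reduces to a dense matrix-vector product plus an elementwise projection, which is why SGEMV throughput on small matrices dominates the solver's runtime.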
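
The sketch below illustrates the two GPU-side ingredients the abstract highlights: an SGEMV specialized for small matrices (one warp per output row, combined with a warp-shuffle reduction) and operands allocated as zero-copy mapped host memory so the integrated Tegra GPU reads them directly from system DRAM. The kernel name, launch configuration, and problem size are assumptions for illustration, not the authors' implementation.

```cuda
// Illustrative SGEMV (y = A * x) for a small row-major matrix, with all
// operands in zero-copy (mapped) host memory.
#include <cstdio>
#include <cuda_runtime.h>

// One warp computes one row of y; lanes stride across the columns and the
// partial sums are combined with a warp shuffle reduction.
__global__ void sgemv_small(const float* __restrict__ A,
                            const float* __restrict__ x,
                            float* __restrict__ y,
                            int m, int n) {
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane   = threadIdx.x % 32;
    if (warpId >= m) return;

    float sum = 0.0f;
    for (int j = lane; j < n; j += 32)
        sum += A[warpId * n + j] * x[j];

    for (int offset = 16; offset > 0; offset /= 2)   // tree reduction in warp
        sum += __shfl_down_sync(0xffffffff, sum, offset);

    if (lane == 0) y[warpId] = sum;
}

int main() {
    const int m = 256, n = 256;            // "small" MPC-sized problem (assumed)
    float *A, *x, *y;                      // host pointers, mapped into GPU space

    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaHostAlloc(&A, m * n * sizeof(float), cudaHostAllocMapped);
    cudaHostAlloc(&x, n * sizeof(float),     cudaHostAllocMapped);
    cudaHostAlloc(&y, m * sizeof(float),     cudaHostAllocMapped);

    for (int i = 0; i < m * n; ++i) A[i] = 1.0f;
    for (int j = 0; j < n; ++j)     x[j] = 1.0f;

    float *dA, *dx, *dy;                   // device views of the same memory
    cudaHostGetDevicePointer(&dA, A, 0);
    cudaHostGetDevicePointer(&dx, x, 0);
    cudaHostGetDevicePointer(&dy, y, 0);

    int threads = 128;                     // 4 warps per block -> 4 rows per block
    int blocks  = (m * 32 + threads - 1) / threads;
    sgemv_small<<<blocks, threads>>>(dA, dx, dy, m, n);
    cudaDeviceSynchronize();               // zero-copy result is visible on the host

    printf("y[0] = %f (expected %d)\n", y[0], n);
    return 0;
}
```

On a discrete GPU, zero-copy accesses cross PCIe on every read; on the TX1, where CPU and GPU share the same physical DRAM, mapped allocations simply avoid redundant copies between the host and device views of the data.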