Modeling GPU-CPU workloads and systems

GPGPU-3 · Pub Date: 2010-03-14 · DOI: 10.1145/1735688.1735696
Andrew Kerr, G. Diamos, S. Yalamanchili
{"title":"Modeling GPU-CPU workloads and systems","authors":"Andrew Kerr, G. Diamos, S. Yalamanchili","doi":"10.1145/1735688.1735696","DOIUrl":null,"url":null,"abstract":"Heterogeneous systems, systems with multiple processors tailored for specialized tasks, are challenging programming environments. While it may be possible for domain experts to optimize a high performance application for a very specific and well documented system, it may not perform as well or even function on a different system. Developers who have less experience with either the application domain or the system architecture may devote a significant effort to writing a program that merely functions correctly. We believe that a comprehensive analysis and modeling frame-work is necessary to ease application development and automate program optimization on heterogeneous platforms.\n This paper reports on an empirical evaluation of 25 CUDA applications on four GPUs and three CPUs, leveraging the Ocelot dynamic compiler infrastructure which can execute and instrument the same CUDA applications on either target. Using a combination of instrumentation and statistical analysis, we record 37 different metrics for each application and use them to derive relationships between program behavior and performance on heterogeneous processors. These relationships are then fed into a modeling frame-work that attempts to predict the performance of similar classes of applications on different processors. Most significantly, this study identifies several non-intuitive relationships between program characteristics and demonstrates that it is possible to accurately model CUDA kernel performance using only metrics that are available before a kernel is executed.","PeriodicalId":381071,"journal":{"name":"GPGPU-3","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-03-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"109","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"GPGPU-3","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1735688.1735696","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 109

Abstract

Heterogeneous systems, systems with multiple processors tailored for specialized tasks, are challenging programming environments. While it may be possible for domain experts to optimize a high performance application for a very specific and well documented system, it may not perform as well or even function on a different system. Developers who have less experience with either the application domain or the system architecture may devote a significant effort to writing a program that merely functions correctly. We believe that a comprehensive analysis and modeling framework is necessary to ease application development and automate program optimization on heterogeneous platforms. This paper reports on an empirical evaluation of 25 CUDA applications on four GPUs and three CPUs, leveraging the Ocelot dynamic compiler infrastructure which can execute and instrument the same CUDA applications on either target. Using a combination of instrumentation and statistical analysis, we record 37 different metrics for each application and use them to derive relationships between program behavior and performance on heterogeneous processors. These relationships are then fed into a modeling framework that attempts to predict the performance of similar classes of applications on different processors. Most significantly, this study identifies several non-intuitive relationships between program characteristics and demonstrates that it is possible to accurately model CUDA kernel performance using only metrics that are available before a kernel is executed.
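
To illustrate the kind of modeling the abstract describes, a minimal sketch is given below. It is not the paper's implementation: the metric names, the numbers, and the choice of an ordinary least-squares fit are assumptions, standing in for the 37 instrumented metrics and the statistical framework used in the study. It only shows the shape of the idea: pre-execution kernel metrics in, predicted runtime out.

```python
# Minimal sketch (hypothetical data, not from the paper): predict CUDA kernel
# runtime from metrics that are available before the kernel executes.
import numpy as np

# Each row holds pre-execution metrics for one kernel, e.g. gathered by an
# Ocelot-style static pass over PTX: [instruction count, memory ops,
# control-flow ops, bias term]. Values here are made up for illustration.
X = np.array([
    [12000,  3000,  400, 1.0],
    [45000, 15000, 1200, 1.0],
    [ 8000,  1000,  150, 1.0],
    [60000, 22000, 3000, 1.0],
], dtype=float)

# Measured runtimes (ms) of the same kernels on one target processor (made up).
y = np.array([0.8, 3.1, 0.5, 4.6])

# Fit a linear model runtime ~ X @ w by ordinary least squares.
w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)

# Predict the runtime of an unseen kernel from its pre-execution metrics alone.
new_kernel = np.array([30000, 9000, 800, 1.0], dtype=float)
predicted_ms = new_kernel @ w
print(f"predicted runtime: {predicted_ms:.2f} ms")
```

In the study itself, the metrics are collected by instrumenting the same CUDA applications through Ocelot on both GPU and CPU targets, and the resulting model is evaluated across four GPUs and three CPUs; the sketch above stands in for that pipeline under much simpler assumptions.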