Automatic performance programming

Markus Püschel
{"title":"Automatic performance programming","authors":"Markus Püschel","doi":"10.1145/2048237.2048239","DOIUrl":null,"url":null,"abstract":"It has become extraordinarily difficult to write software that performs close to optimally on complex modern microarchitectures. Particularly plagued are domains that are data intensive and require complex mathematical computations such as information retrieval, scientific simulations, graphics, communication, control, and multimedia processing. In these domains, performance-critical components are usually written in C (with possible extensions) and often even in assembly, carefully \"tuned\" to the platform's architecture and microarchitecture. Specifically, the tuning includes optimization for the memory hierarchy and for different forms of parallelism. The result is usually long, rather unreadable code that needs to be re-written or re-tuned with every platform upgrade. On the other hand, the performance penalty for relying on straightforward, non-tuned, more elegant implementations is typically a factor of 10, 100, or even more. The reasons for this large gap are some (likely) inherent limitations of compilers including the lack of domain knowledge, and the lack of an efficient mechanism to explore the usually large set of transformation choices. The recent end of CPU frequency scaling, and thus the end of free software speed-up, and the advent of mainstream parallelism with its increasing diversity of platforms further aggravate the problem.\n No promising general solution (besides extensive and expensive hand-coding) to this problem is on the horizon. One approach that has emerged from the numerical computing and compiler community in the last decade is called automatic performance tuning, or autotuning [2, 3, 7--10, 15]. In its most common form it involves the consideration or enumeration of alternative implementations, usually controlled by parameters, coupled with algorithms for search to find the fastest. However, the search space still has to be identified manually, it may be very different even for related functionality, it is not clear how to handle parallelism, and a new platform may require a complete redesign of the autotuning framework.\n On the other hand, since the overall problem is one of productivity, maintainability, and quality (namely performance) it falls squarely into the domain of software engineering. However, even though a large set of sophisticated software engineering theory and tools exist, it appears that to date this community has not focused much on mathematical computations nor performance in the detailed, close-to-optimal sense above. The reason for the latter may be that performance, unlike various aspects of correctness, is not syntactic in nature (and in reality is often even unpredictable and, well, messy).\n The aim of this talk is to draw attention to the performance/productivity problem for mathematical applications and to make the case for a more interdisciplinary attack. As a set of thoughts in this direction we offer some of the lessons we have learned in the last decade in our own research on Spiral [1, 11, 12]. Spiral can be viewed as an automatic performance programming framework for a small, but important class of functions called linear transforms. 
Key techniques used in Spiral include staged declarative domain-specific languages to express algorithm knowledge and algorithm transformations, the use of platform-cognizant rewriting systems for parallelism and locality optimizations, and the use of search and machine learning techniques to navigate possible spaces of choices [4--6, 13, 14, 16]. Experimental results show that the code generated by Spiral competes with, and sometimes outperforms, the best available human-written code. Spiral has been used to generate part of Intel's commercial libraries IPP and MKL.","PeriodicalId":168332,"journal":{"name":"SIGPLAN symposium on New ideas, new paradigms, and reflections on programming and software","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2011-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"SIGPLAN symposium on New ideas, new paradigms, and reflections on programming and software","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2048237.2048239","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

It has become extraordinarily difficult to write software that performs close to optimally on complex modern microarchitectures. Particularly plagued are domains that are data intensive and require complex mathematical computations, such as information retrieval, scientific simulations, graphics, communication, control, and multimedia processing. In these domains, performance-critical components are usually written in C (with possible extensions) and often even in assembly, carefully "tuned" to the platform's architecture and microarchitecture. Specifically, the tuning includes optimization for the memory hierarchy and for different forms of parallelism. The result is usually long, rather unreadable code that needs to be rewritten or re-tuned with every platform upgrade. On the other hand, the performance penalty for relying on straightforward, non-tuned, more elegant implementations is typically a factor of 10, 100, or even more. The reasons for this large gap are some (likely) inherent limitations of compilers, including the lack of domain knowledge and the lack of an efficient mechanism to explore the usually large set of transformation choices. The recent end of CPU frequency scaling, and thus the end of free software speed-up, together with the advent of mainstream parallelism and its increasing diversity of platforms, further aggravates the problem.

No promising general solution (besides extensive and expensive hand-coding) to this problem is on the horizon. One approach that has emerged from the numerical computing and compiler community in the last decade is called automatic performance tuning, or autotuning [2, 3, 7--10, 15]. In its most common form, it involves the consideration or enumeration of alternative implementations, usually controlled by parameters, coupled with search algorithms to find the fastest (a minimal sketch of such a search loop appears at the end of this abstract). However, the search space still has to be identified manually, it may be very different even for related functionality, it is not clear how to handle parallelism, and a new platform may require a complete redesign of the autotuning framework.

On the other hand, since the overall problem is one of productivity, maintainability, and quality (namely performance), it falls squarely into the domain of software engineering. However, even though a large body of sophisticated software engineering theory and tools exists, it appears that to date this community has not focused much on mathematical computations or on performance in the detailed, close-to-optimal sense above. The reason for the latter may be that performance, unlike various aspects of correctness, is not syntactic in nature (and in reality is often even unpredictable and, well, messy).

The aim of this talk is to draw attention to the performance/productivity problem for mathematical applications and to make the case for a more interdisciplinary attack. As a set of thoughts in this direction, we offer some of the lessons we have learned in the last decade in our own research on Spiral [1, 11, 12]. Spiral can be viewed as an automatic performance programming framework for a small but important class of functions called linear transforms. Key techniques used in Spiral include staged declarative domain-specific languages to express algorithm knowledge and algorithm transformations, platform-cognizant rewriting systems for parallelism and locality optimizations, and search and machine learning techniques to navigate the possible spaces of choices [4--6, 13, 14, 16].
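To give a flavor of the rewriting idea, here is a toy sketch (illustrative only, not Spiral's actual SPL machinery; every name in it is invented for this sketch). A single declarative breakdown rule, the Cooley-Tukey factorization DFT_{km} = (DFT_k ⊗ I_m) · T^{km}_m · (I_k ⊗ DFT_m) · L^{km}_k, applied recursively, spans a space of algorithm variants that all compute the same transform but differ in locality and parallelization potential:

    # Toy illustration only -- not Spiral's code; all names are invented.
    # One declarative breakdown rule (Cooley-Tukey, in SPL-like notation):
    #   DFT_{km} = (DFT_k ⊗ I_m) · T^{km}_m · (I_k ⊗ DFT_m) · L^{km}_k
    # Applying it recursively for every factorization spans the algorithm space.

    def algorithms(n):
        """Yield fully expanded DFT_n formulas (as strings) under this rule."""
        if n == 2:
            yield "F2"  # base case: the 2-point butterfly
            return
        divisors = [k for k in range(2, n) if n % k == 0]
        if not divisors:  # prime sizes stay unexpanded in this sketch
            yield f"DFT{n}"
            return
        for k in divisors:  # each factorization n = k * m is one rule application
            m = n // k
            for left in algorithms(k):
                for right in algorithms(m):
                    yield (f"({left} ⊗ I{m}) · T({n},{m}) · "
                           f"(I{k} ⊗ {right}) · L({n},{k})")

    variants = list(algorithms(8))
    print(len(variants))  # -> 2 full expansions of DFT_8 under this rule
    print(variants[0])

For size 8 this single rule already admits two full expansions; for realistic sizes the space of rule trees grows combinatorially, which is exactly where the search and machine learning techniques come in.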
Experimental results show that the code generated by Spiral competes with, and sometimes outperforms, the best available human-written code. Spiral has been used to generate part of Intel's commercial libraries IPP and MKL.
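Returning to the generic autotuning loop described earlier: the following minimal sketch (an illustration of the general pattern, not the API of any real autotuner; the kernel and its blocking parameter are invented stand-ins for the much richer parameter spaces tuned by systems such as ATLAS or FFTW) enumerates parameterized candidate implementations, times each, and keeps the fastest.

    # A minimal autotuning loop: generate candidates from a parameter,
    # time each, keep the fastest. Illustrative sketch only; the kernel
    # and parameter names below are invented for this example.
    import timeit

    def make_blocked_sum(block):
        """Return a summation kernel that processes `block` elements per
        inner step (a stand-in for real parameterized kernels such as a
        blocked matrix multiplication)."""
        def kernel(data):
            total = 0.0
            for i in range(0, len(data), block):
                total += sum(data[i:i + block])
            return total
        return kernel

    def autotune(param_space, data, repeats=5):
        """Enumerate the parameter space, time each candidate, return the best."""
        best_param, best_time = None, float("inf")
        for p in param_space:
            kernel = make_blocked_sum(p)
            t = min(timeit.repeat(lambda: kernel(data), number=10, repeat=repeats))
            if t < best_time:
                best_param, best_time = p, t
        return best_param, best_time

    data = [1.0] * (1 << 16)
    best, t = autotune([16, 64, 256, 1024, 4096], data)
    print(f"fastest block size: {best} ({t:.4f} s for 10 runs)")

The difficulties noted above are precisely what this sketch hides: identifying param_space by hand, deciding that the block size is the parameter worth tuning, and redoing both for every new platform or form of parallelism.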