IOOpt: automatic derivation of I/O complexity bounds for affine programs

Auguste Olivry, Guillaume Iooss, Nicolas Tollenaere, A. Rountev, P. Sadayappan, F. Rastello
{"title":"IOOpt: automatic derivation of I/O complexity bounds for affine programs","authors":"Auguste Olivry, Guillaume Iooss, Nicolas Tollenaere, A. Rountev, P. Sadayappan, F. Rastello","doi":"10.1145/3453483.3454103","DOIUrl":null,"url":null,"abstract":"Evaluating the complexity of an algorithm is an important step when developing applications, as it impacts both its time and energy performance. Computational complexity, which is the number of dynamic operations regardless of the execution order, is easy to characterize for affine programs. Data movement (or, I/O) complexity is more complex to evaluate as it refers, when considering all possible valid schedules, to the minimum required number of I/O between a slow (e.g. main memory) and a fast (e.g. local scratchpad) storage location. This paper presents IOOpt, a fully automated tool that automatically bounds the data movement of an affine (tilable) program. Given a tilable program described in a DSL, it automatically computes: 1. a lower bound of the I/O complexity as a symbolic expression of the cache size and program parameters; 2. an upper bound that allows one to assess the tightness of the lower bound; 3. a tiling recommendation (loop permutation and tile sizes) that matches the upper bound. For the lower bound algorithm which can be applied to any affine program, a substantial effort has been made to provide bounds that are as tight as possible for neural networks: In particular, it extends the previous work of Olivry et al. to handle multi-dimensional reductions and expose the constraints associated with small dimensions that are present in convolutions. For the upper bound algorithm that reasons on the tile band of the program (e.g. output of a polyhedral compiler such as PluTo), the algebraic computations involved have been tuned to behave well on tensor computations such as direct tensor contractions or direct convolutions. As a bonus, the upper bound algorithm that has been extended to multi-level cache can provide the programmer with a useful tiling recommendation. We demonstrate the effectiveness of our tool by deriving the symbolic lower and upper bounds for several tensor contraction and convolution kernels. Then we evaluate numerically the tightness of our bound using the convolution layers of Yolo9000 and representative tensor contractions from the TCCG benchmark suite. Finally, we show the pertinence of our I/O complexity model by reporting the running time of the recommended tiled code for the convolution layers of Yolo9000.","PeriodicalId":20557,"journal":{"name":"Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation","volume":"1 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2021-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3453483.3454103","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7

Abstract

Evaluating the complexity of an algorithm is an important step when developing applications, as it impacts both its time and energy performance. Computational complexity, which is the number of dynamic operations regardless of the execution order, is easy to characterize for affine programs. Data movement (or, I/O) complexity is more complex to evaluate as it refers, when considering all possible valid schedules, to the minimum required number of I/O between a slow (e.g. main memory) and a fast (e.g. local scratchpad) storage location. This paper presents IOOpt, a fully automated tool that automatically bounds the data movement of an affine (tilable) program. Given a tilable program described in a DSL, it automatically computes: 1. a lower bound of the I/O complexity as a symbolic expression of the cache size and program parameters; 2. an upper bound that allows one to assess the tightness of the lower bound; 3. a tiling recommendation (loop permutation and tile sizes) that matches the upper bound. For the lower bound algorithm which can be applied to any affine program, a substantial effort has been made to provide bounds that are as tight as possible for neural networks: In particular, it extends the previous work of Olivry et al. to handle multi-dimensional reductions and expose the constraints associated with small dimensions that are present in convolutions. For the upper bound algorithm that reasons on the tile band of the program (e.g. output of a polyhedral compiler such as PluTo), the algebraic computations involved have been tuned to behave well on tensor computations such as direct tensor contractions or direct convolutions. As a bonus, the upper bound algorithm that has been extended to multi-level cache can provide the programmer with a useful tiling recommendation. We demonstrate the effectiveness of our tool by deriving the symbolic lower and upper bounds for several tensor contraction and convolution kernels. Then we evaluate numerically the tightness of our bound using the convolution layers of Yolo9000 and representative tensor contractions from the TCCG benchmark suite. Finally, we show the pertinence of our I/O complexity model by reporting the running time of the recommended tiled code for the convolution layers of Yolo9000.
IOOpt:为仿射程序自动派生I/O复杂度界限
在开发应用程序时,评估算法的复杂性是一个重要步骤,因为它会影响其时间和能量性能。计算复杂性,即动态操作的数量,与执行顺序无关,很容易表征仿射程序。当考虑所有可能的有效调度时,数据移动(或I/O)复杂性的评估更加复杂,因为它指的是慢速(例如主存)和快速(例如本地刮擦板)存储位置之间所需的最小I/O数量。本文介绍了IOOpt,一个全自动工具,可以自动限制仿射(可调)程序的数据移动。给定一个用DSL描述的可编程程序,它会自动计算:I/O复杂度的下界,作为缓存大小和程序参数的符号表达式;2. 上界:允许人们评估下界的严密性的上界;3.匹配上界的平铺建议(循环排列和平铺大小)。对于可以应用于任何仿射程序的下界算法,已经做出了大量的努力来为神经网络提供尽可能紧密的边界:特别是,它扩展了Olivry等人的先前工作,以处理多维约简并暴露与卷积中存在的小维相关的约束。对于在程序的条带上进行推理的上界算法(例如,PluTo等多面体编译器的输出),所涉及的代数计算已被调整为在张量计算(如直接张量收缩或直接卷积)上表现良好。作为奖励,上界算法已经扩展到多级缓存,可以为程序员提供有用的平铺建议。我们通过推导几个张量收缩和卷积核的符号下界和上界来证明我们的工具的有效性。然后,我们使用Yolo9000的卷积层和TCCG基准套件的代表性张量收缩来数值评估边界的紧密性。最后,我们通过报告Yolo9000的卷积层的推荐平铺代码的运行时间来展示我们的I/O复杂性模型的相关性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信