{"title":"VLAG: A very fast locality approximation model for GPU kernels with regular access patterns","authors":"Mohsen Kiani, Amir Rajabzadeh","doi":"10.1109/ICCKE.2017.8167887","DOIUrl":null,"url":null,"abstract":"Performance modeling plays an important role for optimal hardware design and optimized application implementation. This paper presents a very low overhead performance model, called VLAG, to approximate the data localities exploited by GPU kernels. VLAG receives source code-level information to estimate per memory-access instruction, per data array, and per kernel localities within GPU kernels. VLAG is only applicable to kernels with regular memory access patterns. VLAG was experimentally evaluated using an NVIDIA Maxwell GPU. For two different Matrix Multiplication kernels, the average errors of 7.68% and 6.29%, was resulted, respectively. The slowdown of VLAG for MM was measured 1.4X which, comparing with other approaches such as trace-driven simulation, is negligible.","PeriodicalId":151934,"journal":{"name":"2017 7th International Conference on Computer and Knowledge Engineering (ICCKE)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 7th International Conference on Computer and Knowledge Engineering (ICCKE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCKE.2017.8167887","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6
Abstract
Performance modeling plays an important role for optimal hardware design and optimized application implementation. This paper presents a very low overhead performance model, called VLAG, to approximate the data localities exploited by GPU kernels. VLAG receives source code-level information to estimate per memory-access instruction, per data array, and per kernel localities within GPU kernels. VLAG is only applicable to kernels with regular memory access patterns. VLAG was experimentally evaluated using an NVIDIA Maxwell GPU. For two different Matrix Multiplication kernels, the average errors of 7.68% and 6.29%, was resulted, respectively. The slowdown of VLAG for MM was measured 1.4X which, comparing with other approaches such as trace-driven simulation, is negligible.