Energy-Efficient Sparse Matrix Autotuning with CSX -- A Trade-off Study

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum. Pub Date: 2013-05-20. DOI: 10.1109/IPDPSW.2013.219

J. Meyer, J. M. Cebrian, L. Natvig, V. Karakasis, D. Siakavaras, K. Nikas


Citations: 3

Abstract

In this paper, we apply a method for extracting a running power estimate of applications from hardware performance counters, producing power/time curves which can be integrated over particular intervals to estimate the energy consumption of individual application stages. We use this method to instrument executions of a conjugate gradient solver, to examine the energy and performance impacts of applying the Compressed Sparse eXtended (CSX) and classic Compressed Sparse Row (CSR) matrix compression methods to sparse linear systems from different application areas. The CSX format requires a preprocessing stage which identifies and exploits a range of matrix substructures, incurring a one-time cost which can facilitate more effective sparse matrix-vector multiplication (SpMV). As this numerical kernel is the primary performance bottleneck of conjugate gradient solvers, we take the approach of isolating the energy cost of preprocessing from a short sample of application iterations, obtaining measurements which inform the choice of which compression scheme is more appropriate to the input data. We examine the impact that varying degrees of parallelism, processor clock frequency, and Hyper-Threading have on this trade-off. Our results include comparisons of empirical measurements from all combinations of up to 8 threads on 4 hyper-threaded cores, 3 clock frequencies, and 5 sample application matrices. We assess program-hardware interactions with a view to structural properties of the data and hardware architectural features, and evaluate the approach with respect to integrating the energy instrumentation with present automatic performance tuning. Results show that our method is sufficiently precise to identify non-trivial trade-offs in the parameter space, and may become suitable for a run-time automatic tuning scheme by applying a faster preprocessing mode of CSX.
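
The two ideas in the abstract, integrating a sampled power/time curve over a stage interval and weighing a one-time preprocessing cost against a per-iteration saving, can be sketched in a few lines of Python. This is only an illustration under stated assumptions and not the authors' instrumentation: the function names `energy_joules` and `break_even_iterations`, the trapezoidal rule, and the linear interpolation at interval boundaries are all choices made here, and the paper may do each of these differently.

```python
import math
from bisect import bisect_left, bisect_right


def energy_joules(times, watts, start, end):
    """Trapezoidal integral of a sampled power curve over [start, end].

    times: strictly increasing sample timestamps in seconds (hypothetical input)
    watts: estimated power at each timestamp
    Power at the interval boundaries is linearly interpolated (an assumption).
    """
    if not (times[0] <= start <= end <= times[-1]):
        raise ValueError("interval must lie inside the sampled range")

    def power_at(t):
        # Linear interpolation between the two neighbouring samples.
        i = bisect_right(times, t)
        if i == len(times):
            return watts[-1]
        t0, t1 = times[i - 1], times[i]
        p0, p1 = watts[i - 1], watts[i]
        return p0 + (p1 - p0) * (t - t0) / (t1 - t0)

    # Interval endpoints plus every sample strictly inside the interval.
    lo = bisect_right(times, start)
    hi = bisect_left(times, end)
    ts = [start] + list(times[lo:hi]) + [end]
    ps = [power_at(start)] + list(watts[lo:hi]) + [power_at(end)]
    return sum((t1 - t0) * (p0 + p1) / 2
               for t0, t1, p0, p1 in zip(ts, ts[1:], ps, ps[1:]))


def break_even_iterations(e_preprocess, e_iter_csr, e_iter_csx):
    """Iterations needed before CSX preprocessing energy is repaid.

    All arguments are energies in joules. Returns None when CSX saves no
    energy per iteration, in which case preprocessing never pays off.
    """
    saving = e_iter_csr - e_iter_csx
    if saving <= 0:
        return None
    return math.ceil(e_preprocess / saving)
```

With per-stage energies obtained from `energy_joules`, a solver expected to run more iterations than `break_even_iterations` returns would favor CSX on energy alone under this simple model; fewer iterations would favor CSR.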


