Multicore-based vector coprocessor sharing for performance and energy gains

ACM Trans. Embed. Comput. Syst. Pub Date : 2013-09-01 DOI:10.1145/2514641.2514644

S. F. Beldianu, Sotirios G. Ziavras


Citations: 21

Abstract

In most applications that use a dedicated vector coprocessor, the coprocessor's resources are underutilized because sustained data parallelism is lacking, often as a result of vector-length variations in dynamic environments. Our work is motivated by: (a) the mandate for multicore designs to use on-chip resources efficiently for low power and high performance; (b) the omnipresence of vector operations in high-performance scientific and emerging embedded applications; (c) the frequent need to handle a variety of vector sizes; and (d) the diverse computation needs of vector kernels in application suites. We present a robust design framework for vector coprocessor sharing in multicore environments that maximizes vector unit utilization and performance at substantially reduced energy costs. For our adaptive vector unit, which is attached to multiple cores, we propose three basic sharing policies that enforce coarse-grain, fine-grain, and vector-lane sharing. We benchmark these policies on a dual-core system and evaluate them in terms of floating-point performance, resource utilization, and power/energy consumption. Benchmarks for FIR filtering, FFT, matrix multiplication, and LU factorization show that the sharing policies yield high utilization and performance at low energy cost. The proposed policies provide speedups of 1.2x to 2x and reduce energy needs by about 50% compared to a system with a single core and an attached vector coprocessor. With performance expressed in clock cycles, the sharing policies demonstrate speedups of 3.62x to 7.92x over optimized Xeon runs. We also introduce performance and empirical power models that the runtime system can use to estimate the effectiveness of each policy in a hybrid system that can simultaneously implement this suite of shared coprocessor policies.
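The abstract attributes low coprocessor utilization partly to vector-length variations. The sketch below is a toy illustration of that one effect and is not the performance model from the paper. It assumes each lane processes one element per cycle, so an instruction over `vector_length` elements occupies all lanes for `ceil(vector_length / lanes)` cycles. The function names `lane_utilization` and `mean_utilization` and the lane count of 8 are hypothetical choices made for this example.

```python
from math import ceil
from typing import Iterable


def lane_utilization(vector_length: int, lanes: int) -> float:
    """Fraction of lane-cycles doing useful work for one vector instruction.

    Toy assumption: one element per lane per cycle, and the instruction
    holds every lane for ceil(vector_length / lanes) cycles.
    """
    if vector_length <= 0 or lanes <= 0:
        raise ValueError("vector_length and lanes must be positive")
    cycles = ceil(vector_length / lanes)
    return vector_length / (cycles * lanes)


def mean_utilization(vector_lengths: Iterable[int], lanes: int) -> float:
    """Useful lane-cycles divided by occupied lane-cycles over a whole mix."""
    useful = 0
    occupied = 0
    for vl in vector_lengths:
        useful += vl
        occupied += ceil(vl / lanes) * lanes
    if occupied == 0:
        raise ValueError("vector_lengths must not be empty")
    return useful / occupied


if __name__ == "__main__":
    lanes = 8  # hypothetical lane count, chosen only for illustration
    for vl in (4, 8, 9, 64):
        print(f"VL={vl:3d}  utilization={lane_utilization(vl, lanes):.4f}")
    print(f"mixed workload: {mean_utilization([8, 9, 4], lanes):.4f}")
```

Under these assumptions a vector of length 9 on 8 lanes needs two cycles and uses 9 of the 16 available lane-cycles. Idle slots of this kind are one kind of slack that a coprocessor shared with a second core could, in principle, put to use.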



