Symphony: Orchestrating Sparse and Dense Tensors with Hierarchical Heterogeneous Processing

IF 1.8 | Zone 4, Computer Science | Q2 COMPUTER SCIENCE, THEORY & METHODS

ACM Transactions on Computer Systems Pub Date : 2023-10-27 DOI:10.1145/3630007

Michael Pellauer, Jason Clemons, Vignesh Balaji, Neal Crago, Aamer Jaleel, Donghyuk Lee, Mike O’Connor, Angshuman Parashar, Sean Treichler, Po-An Tsai, Stephen W. Keckler, Joel S. Emer


Citations: 0

Abstract

Sparse tensor algorithms are becoming widespread, particularly in the domains of deep learning, graph and data analytics, and scientific computing. Current high-performance broad-domain architectures, such as GPUs, often suffer memory system inefficiencies by moving too much data or moving it too far through the memory hierarchy. To increase performance and efficiency, proposed domain-specific accelerators tailor their architectures to the data needs of a narrow application domain, but as a result cannot be applied to a wide range of algorithms or applications that contain a mix of sparse and dense algorithms. This paper proposes Symphony, a hybrid programmable/specialized architecture which focuses on the orchestration of data throughout the memory hierarchy to simultaneously reduce the movement of unnecessary data and data movement distances. Key elements of the Symphony architecture include (1) specialized reconfigurable units aimed not only at roofline floating-point computations, but at supporting data orchestration features such as address generation, data filtering, and sparse metadata processing; and (2) distribution of computation resources (both programmable and specialized) throughout the on-chip memory hierarchy. We demonstrate that Symphony can match non-programmable ASIC performance on sparse tensor algebra, and provide 31× improved runtime and 44× improved energy over a comparably provisioned GPU for these applications.
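To make the terms "sparse metadata processing", "address generation", and "data filtering" concrete, here is a minimal software sketch of a sparse matrix-vector product over a compressed sparse row (CSR) layout. It is an illustration only and does not describe Symphony's hardware or programming interface; the function names `to_csr` and `spmv_csr` and the read counter are hypothetical, chosen for this example.

```python
from typing import List, Tuple


def to_csr(dense: List[List[float]]) -> Tuple[List[float], List[int], List[int]]:
    """Filter out zeros and keep only nonzero values plus their metadata."""
    values: List[float] = []   # nonzero payload
    col_idx: List[int] = []    # metadata: column of each nonzero
    row_ptr: List[int] = [0]   # metadata: start offset of each row in `values`
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:          # data filtering: zeros are never stored
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def spmv_csr(values: List[float], col_idx: List[int], row_ptr: List[int],
             x: List[float]) -> Tuple[List[float], int]:
    """Compute y = A @ x and count how many matrix values were fetched."""
    y: List[float] = []
    value_reads = 0
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        # Address generation: row_ptr gives the range of nonzeros for row i.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Metadata processing: col_idx[k] selects the matching x element.
            acc += values[k] * x[col_idx[k]]
            value_reads += 1
        y.append(acc)
    return y, value_reads


dense = [
    [0, 2, 0, 0],
    [0, 0, 0, 0],
    [1, 0, 0, 3],
    [0, 0, 4, 0],
]
x = [1.0, 2.0, 3.0, 4.0]

values, col_idx, row_ptr = to_csr(dense)
y, value_reads = spmv_csr(values, col_idx, row_ptr, x)
dense_reads = len(dense) * len(dense[0])  # a dense kernel touches every entry
print(y, value_reads, dense_reads)
```

In this toy case the CSR kernel fetches 4 matrix values where a dense kernel would fetch 16. That gap is the kind of unnecessary data movement the abstract refers to, though the numbers here say nothing about the reported 31× and 44× results.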

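Source journal

ACM Transactions on Computer Systems (Engineering & Technology: Computer Science, Theory & Methods)

CiteScore: 4.00

Self-citation rate: 0.00%

Articles published: (not listed)

Review time: 1 month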

About the journal: ACM Transactions on Computer Systems (TOCS) presents research and development results on the design, implementation, analysis, evaluation, and use of computer systems and systems software. The term "computer systems" is interpreted broadly and includes operating systems, systems architecture and hardware, distributed systems, optimizing compilers, and the interaction between systems and computer networks. Articles appearing in TOCS will tend either to present new techniques and concepts, or to report on experiences and experiments with actual systems. Insights useful to system designers, builders, and users will be emphasized. TOCS publishes research and technical papers, both short and long. It includes technical correspondence to permit commentary on technical topics and on previously published papers.