Rigel: An Architecture and Scalable Programming Interface for a 1000-Core Accelerator

Proceedings. International Symposium on Computer Architecture. Pub Date: 2009-06-15. DOI: 10.1145/1555754.1555774

J. H. Kelm, Daniel R. Johnson, Matthew R. Johnson, N. Crago, W. Tuohy, Aqeel Mahesri, S. Lumetta, M. Frank, Sanjay J. Patel

Volume 63 1, pages 140-151

Citations: 160

Abstract

This paper considers Rigel, a programmable accelerator architecture for a broad class of data- and task-parallel computation. Rigel comprises 1000+ hierarchically organized cores that use a fine-grained, dynamically scheduled single-program, multiple-data (SPMD) execution model. Rigel's low-level programming interface adopts a single global address space model where parallel work is expressed in a task-centric, bulk-synchronized manner using minimal hardware support. Compared to existing accelerators, which contain domain-specific hardware, specialized memories, and/or restrictive programming models, Rigel is more flexible and provides a straightforward target for a broader set of applications.

We perform a design analysis of Rigel to quantify the compute density and power efficiency of our initial design. We find that Rigel can achieve a density of over 8 single-precision GFLOPS/mm² in 45 nm, which is comparable to high-end GPUs scaled to 45 nm. We perform experimental analysis on several applications ported to the Rigel low-level programming interface. We examine scalability issues related to work distribution, synchronization, and load balancing for 1000-core accelerators using software techniques and minimal specialized hardware support. We find that while it is important to support fast task distribution and barrier operations, these operations can be implemented without specialized hardware using flexible hardware primitives.
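The task-centric, bulk-synchronized style described in the abstract can be illustrated with a small sketch. This is not Rigel's programming interface: the names `FetchAddCounter`, `run_interval`, and `square` are hypothetical, and a lock-protected counter stands in for whatever atomic primitive real hardware would provide. The sketch only shows the general idea that dynamic task distribution and a barrier can be built from one generic primitive (fetch-and-add) plus ordinary software.

```python
import threading


class FetchAddCounter:
    """Hypothetical stand-in for a hardware atomic fetch-and-add.

    A lock emulates atomicity here; this is an assumption made for
    illustration, not a description of Rigel's hardware.
    """

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_add(self, amount):
        # Return the old value and advance the counter in one atomic step.
        with self._lock:
            old = self._value
            self._value += amount
            return old


def run_interval(num_workers, tasks, kernel):
    """Run one bulk-synchronous interval over `tasks`.

    Every worker executes the same code (SPMD), claims task indices
    dynamically through the shared counter, and then waits at a barrier
    so that no worker moves on before all tasks are finished.
    """
    next_task = FetchAddCounter()
    results = [None] * len(tasks)
    barrier = threading.Barrier(num_workers)

    def worker():
        while True:
            index = next_task.fetch_add(1)  # claim the next unprocessed task
            if index >= len(tasks):
                break  # no work left in this interval
            results[index] = kernel(tasks[index])
        barrier.wait()  # end of the interval: all workers synchronize

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def square(x):
    """Trivial example kernel applied to each task."""
    return x * x


if __name__ == "__main__":
    print(run_interval(4, list(range(8)), square))
```

Because each worker pulls the next index only when it is free, uneven task costs balance themselves across workers, which is the load-balancing concern the abstract raises. A single shared counter would likely become a bottleneck at 1000 cores, so a hierarchical arrangement of queues is a natural extension, though that is outside this sketch.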


Source journal

Proceedings. International Symposium on Computer Architecture

Self-citation rate

0.00%

Number of articles published