{"title":"用于紧耦合多核集群的超低延迟轻量级DMA","authors":"D. Rossi, Igor Loi, Germain Haugou, L. Benini","doi":"10.1145/2597917.2597922","DOIUrl":null,"url":null,"abstract":"The evolution of multi- and many-core platforms is rapidly increasing the available on-chip computational capabilities of embedded computing devices, while memory access is dominated by on-chip and off-chip interconnect delays which do not scale well. For this reason, the bottleneck of many applications is rapidly moving from computation to communication. More precisely, performance is often bound by the huge latency of direct memory accesses. In this scenario the challenge is to provide embedded multi and many-core systems with a powerful, low-latency, energy efficient and flexible way to move data through the memory hierarchy level. In this paper, a DMA engine optimized for clustered tightly coupled many-core systems is presented. The IP features a simple micro-coded programming interface and lock-free per-core command queues to improve flexibility while reducing the programming latency. Moreover it dramatically reduces the area and improves the energy efficiency with respect to conventional DMAs exploiting the cluster shared memory as local repository for data buffers. The proposed DMA engine improves the access and programming latency by one order of magnitude, it reduces IP area by 4x and power by 5x, with respect to a conventional DMA, while providing full bandwidth to 16 independent logical channels.","PeriodicalId":194910,"journal":{"name":"Proceedings of the 11th ACM Conference on Computing Frontiers","volume":"58 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"36","resultStr":"{\"title\":\"Ultra-low-latency lightweight DMA for tightly coupled multi-core clusters\",\"authors\":\"D. Rossi, Igor Loi, Germain Haugou, L. Benini\",\"doi\":\"10.1145/2597917.2597922\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The evolution of multi- and many-core platforms is rapidly increasing the available on-chip computational capabilities of embedded computing devices, while memory access is dominated by on-chip and off-chip interconnect delays which do not scale well. For this reason, the bottleneck of many applications is rapidly moving from computation to communication. More precisely, performance is often bound by the huge latency of direct memory accesses. In this scenario the challenge is to provide embedded multi and many-core systems with a powerful, low-latency, energy efficient and flexible way to move data through the memory hierarchy level. In this paper, a DMA engine optimized for clustered tightly coupled many-core systems is presented. The IP features a simple micro-coded programming interface and lock-free per-core command queues to improve flexibility while reducing the programming latency. Moreover it dramatically reduces the area and improves the energy efficiency with respect to conventional DMAs exploiting the cluster shared memory as local repository for data buffers. 
The proposed DMA engine improves the access and programming latency by one order of magnitude, it reduces IP area by 4x and power by 5x, with respect to a conventional DMA, while providing full bandwidth to 16 independent logical channels.\",\"PeriodicalId\":194910,\"journal\":{\"name\":\"Proceedings of the 11th ACM Conference on Computing Frontiers\",\"volume\":\"58 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-05-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"36\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 11th ACM Conference on Computing Frontiers\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2597917.2597922\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 11th ACM Conference on Computing Frontiers","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2597917.2597922","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Ultra-low-latency lightweight DMA for tightly coupled multi-core clusters
The evolution of multi- and many-core platforms is rapidly increasing the on-chip computational capabilities of embedded computing devices, while memory access is dominated by on-chip and off-chip interconnect delays that do not scale well. For this reason, the bottleneck of many applications is rapidly moving from computation to communication. More precisely, performance is often bound by the large latency of direct memory accesses. In this scenario, the challenge is to provide embedded multi- and many-core systems with a powerful, low-latency, energy-efficient, and flexible way to move data through the memory hierarchy. In this paper, a DMA engine optimized for clustered, tightly coupled many-core systems is presented. The IP features a simple micro-coded programming interface and lock-free per-core command queues to improve flexibility while reducing programming latency. Moreover, by exploiting the cluster shared memory as a local repository for data buffers, it dramatically reduces area and improves energy efficiency with respect to conventional DMAs. The proposed DMA engine improves access and programming latency by one order of magnitude, reduces IP area by 4x and power by 5x with respect to a conventional DMA, and provides full bandwidth to 16 independent logical channels.
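The abstract names the mechanism behind the low programming latency (lock-free per-core command queues) but does not describe the IP's actual command format or register map. The C sketch below is an illustrative assumption, not the paper's interface: it shows how a per-core, single-producer/single-consumer command ring lets each core enqueue a DMA transfer descriptor without locks, because each queue has exactly one producer (the core) and one consumer (the DMA engine). All identifiers, fields, and sizes (dma_cmd_t, dma_push, DMA_QUEUE_DEPTH, NUM_CORES) are hypothetical.

/* Illustrative sketch only: the real register map and descriptor format of the
 * proposed IP are not given in the abstract. Names and sizes are hypothetical. */

#include <stdint.h>
#include <stdatomic.h>
#include <stdbool.h>

#define NUM_CORES       16u   /* hypothetical; the abstract reports 16 logical channels */
#define DMA_QUEUE_DEPTH 16u   /* hypothetical per-core queue depth */

typedef struct {
    uint32_t src;             /* source address (e.g. L2 or external memory) */
    uint32_t dst;             /* destination address (e.g. cluster shared L1) */
    uint32_t len;             /* transfer length in bytes */
} dma_cmd_t;

typedef struct {
    dma_cmd_t        slot[DMA_QUEUE_DEPTH];
    _Atomic uint32_t head;    /* advanced only by the producing core */
    _Atomic uint32_t tail;    /* advanced only by the DMA engine */
} dma_queue_t;

/* One queue per core: exactly one producer (the core) and one consumer
 * (the DMA engine), so no lock or atomic read-modify-write is needed. */
static dma_queue_t dma_queue[NUM_CORES];

/* Enqueue a transfer from the calling core; returns false if its queue is full. */
static bool dma_push(uint32_t core_id, uint32_t src, uint32_t dst, uint32_t len)
{
    dma_queue_t *q = &dma_queue[core_id];
    uint32_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);

    if (head - tail == DMA_QUEUE_DEPTH)
        return false;         /* queue full: caller may retry or wait */

    dma_cmd_t *c = &q->slot[head % DMA_QUEUE_DEPTH];
    c->src = src;
    c->dst = dst;
    c->len = len;

    /* Publish the command; this release pairs with the consumer's acquire. */
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}

The point of the per-core layout is that cores never contend on a shared doorbell or command register when programming a transfer, which is one plausible way to obtain the order-of-magnitude reduction in programming latency that the abstract reports.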