Region based cache coherence for tiled MPSoCs

2017 30th IEEE International System-on-Chip Conference (SOCC). Pub Date: 2017-09-01. DOI: 10.1109/SOCC.2017.8226059

A. Srivatsa, Sven Rheindt, Thomas Wild, A. Herkersdorf


Citations: 6

Abstract


The need for faster and more energy-efficient computing has led us to the multicore era with distributed shared memory hierarchies. The primary goal is to distribute parallel tasks onto multiple processing elements to collectively achieve shorter execution times at lower frequencies and supply voltages when compared to a single-core architecture. Major challenges of this approach are how to achieve local, low-latency memory accesses and low overheads for coherence and synchronization management. We believe that enabling global coherence in tiled many-core architectures does not scale in a cost-efficient manner and isn't even required for applications with limited degrees of parallelism. In this paper, we propose a novel region based cache coherence scheme, where coherence is provided by hardware directories within a flexibly sized but confined set of compute and memory tiles. We also show that data placement and task mapping have a huge impact on the application performance, and hence should be considered in conjunction with region based coherence. The approach is evaluated by means of a high-level simulation model using workloads from PARSEC. Experiments demonstrate that our region based approach with multiple compute tiles increases performance by a factor of up to 2.5 compared to a single-tile structure with nominally identical computing and memory resources. Thus, the benefits of independent local memory accesses, which effectively increase the memory bandwidth, usually outweigh the penalties of inter-tile remote memory accesses. Our approach also reduces the directory structures significantly compared to traditional schemes (e.g., by 41.4% for a 16-tile system with 4 tiles per region), making it scalable for large MPSoCs. Our investigations further show that data-to-task placement can lead to performance variations of up to a factor of 12.7.
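To make the directory-size argument concrete, here is a minimal sketch of how limiting coherence to a region can shrink a directory entry. It assumes a full-map (bit-vector) directory with one presence bit per tracked tile plus a fixed number of tag and state bits per entry. The abstract does not describe the directory organization, so this model, the function names, and the `overhead_bits` value are all illustrative assumptions, and the sketch does not attempt to reproduce the 41.4% figure reported above.

```python
def directory_entry_bits(tracked_tiles: int, overhead_bits: int) -> int:
    """Bits in one directory entry under the assumed full-map model.

    One presence bit per tracked tile, plus a fixed tag/state overhead
    (overhead_bits is a hypothetical parameter, not taken from the paper).
    """
    return tracked_tiles + overhead_bits


def directory_reduction(total_tiles: int, tiles_per_region: int,
                        overhead_bits: int) -> float:
    """Fractional per-entry saving when sharers are tracked only within
    a region instead of across the whole chip."""
    if not 0 < tiles_per_region <= total_tiles:
        raise ValueError("tiles_per_region must be in 1..total_tiles")
    global_bits = directory_entry_bits(total_tiles, overhead_bits)
    region_bits = directory_entry_bits(tiles_per_region, overhead_bits)
    return (global_bits - region_bits) / global_bits


if __name__ == "__main__":
    # 16 tiles, 4 tiles per region, and an assumed 20 bits of tag/state.
    saving = directory_reduction(16, 4, overhead_bits=20)
    print(f"per-entry directory saving: {saving:.1%}")
```

Under this simple model the saving depends only on the ratio of removed presence bits to the total entry width, so it grows with the tile count and shrinks as the assumed tag and state overhead grows.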
