Design tradeoffs for simplicity and efficient verification in the Execution Migration Machine

Keun Sup Shim, Mieszko Lis, Myong Hyon Cho, Ilia A. Lebedev, S. Devadas
{"title":"Design tradeoffs for simplicity and efficient verification in the Execution Migration Machine","authors":"Keun Sup Shim, Mieszko Lis, Myong Hyon Cho, Ilia A. Lebedev, S. Devadas","doi":"10.1109/ICCD.2013.6657037","DOIUrl":null,"url":null,"abstract":"As transistor technology continues to scale, the architecture community has experienced exponential growth in design complexity and significantly increasing implementation and verification costs. Moreover, Moore's law has led to a ubiquitous trend of an increasing number of cores on a single chip. Often, these large-core-count chips provide a shared memory abstraction via directories and coherence protocols, which have become notoriously error-prone and difficult to verify because of subtle data races and state space explosion. Although a very simple hardware shared memory implementation can be achieved by simply not allowing ad-hoc data replication and relying on remote accesses for remotely cached data (i.e., requiring no directories or coherence protocols), such remote-access-based directoryless architectures cannot take advantage of any data locality, and therefore suffer in both performance and energy. Our recently taped-out 110-core shared-memory processor, the Execution Migration Machine (EM2), establishes a new design point. On the one hand, EM2 supports shared memory but does not automatically replicate data, and thus preserves the simplicity of directoryless architectures. On the other hand, it significantly improves performance and energy over remote-access-only designs by exploiting data locality at remote cores via fast hardware-level thread migration. In this paper, we describe the design choices made in the EM2 chip as well as our choice of design methodology, and discuss how they combine to achieve design simplicity and verification efficiency. Even though EM2 is a fairly large design-110 cores using a total of 357 million transistors-the entire chip design and implementation process (RTL, verification, physical design, tapeout) took only 18 man-months.","PeriodicalId":398811,"journal":{"name":"2013 IEEE 31st International Conference on Computer Design (ICCD)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 IEEE 31st International Conference on Computer Design (ICCD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCD.2013.6657037","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 9

Abstract

As transistor technology continues to scale, the architecture community has experienced exponential growth in design complexity and significantly increasing implementation and verification costs. Moreover, Moore's law has led to a ubiquitous trend of an increasing number of cores on a single chip. Often, these large-core-count chips provide a shared memory abstraction via directories and coherence protocols, which have become notoriously error-prone and difficult to verify because of subtle data races and state space explosion. Although a very simple hardware shared memory implementation can be achieved by simply not allowing ad-hoc data replication and relying on remote accesses for remotely cached data (i.e., requiring no directories or coherence protocols), such remote-access-based directoryless architectures cannot take advantage of any data locality, and therefore suffer in both performance and energy. Our recently taped-out 110-core shared-memory processor, the Execution Migration Machine (EM2), establishes a new design point. On the one hand, EM2 supports shared memory but does not automatically replicate data, and thus preserves the simplicity of directoryless architectures. On the other hand, it significantly improves performance and energy over remote-access-only designs by exploiting data locality at remote cores via fast hardware-level thread migration. In this paper, we describe the design choices made in the EM2 chip as well as our choice of design methodology, and discuss how they combine to achieve design simplicity and verification efficiency. Even though EM2 is a fairly large design-110 cores using a total of 357 million transistors-the entire chip design and implementation process (RTL, verification, physical design, tapeout) took only 18 man-months.
在执行迁移机中为简单性和有效验证进行设计权衡
随着晶体管技术的不断扩展,架构社区经历了设计复杂性的指数级增长,并且显著增加了实现和验证成本。此外,摩尔定律导致了一个无处不在的趋势,即单个芯片上的核心数量不断增加。通常,这些大核数芯片通过目录和一致性协议提供共享内存抽象,由于微妙的数据竞争和状态空间爆炸,这变得非常容易出错且难以验证。虽然一个非常简单的硬件共享内存实现可以通过不允许临时数据复制和依赖远程缓存数据的远程访问(即,不需要目录或一致性协议)来实现,但是这种基于远程访问的无目录架构不能利用任何数据局域性,因此在性能和能量上都受到影响。我们最近推出的110核共享内存处理器,即执行迁移机(EM2),建立了一个新的设计点。一方面,EM2支持共享内存,但不自动复制数据,因此保留了无目录架构的简单性。另一方面,它通过快速硬件级线程迁移来利用远程核心的数据局部性,从而大大提高了仅远程访问设计的性能和能耗。在本文中,我们描述了在EM2芯片中所做的设计选择以及我们所选择的设计方法,并讨论了如何将它们结合起来以实现设计的简单性和验证效率。尽管EM2是一个相当大的设计——110个内核,总共使用了3.57亿个晶体管——整个芯片设计和实现过程(RTL、验证、物理设计、条带化)只花了18个月的时间。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信