The performance impact of flexibility in the Stanford FLASH multiprocessor

ASPLOS VI. Pub Date: 1994-11-01. DOI: 10.1145/195473.195569
M. Heinrich, J. Kuskin, D. Ofelt, J. Heinlein, J. Baxter, J. Singh, R. Simoni, K. Gharachorloo, D. Nakahira, M. Horowitz, Anoop Gupta, M. Rosenblum, J. Hennessy
Citations: 136

Abstract

A flexible communication mechanism is a desirable feature in multiprocessors because it allows support for multiple communication protocols, expands performance monitoring capabilities, and leads to a simpler design and debug process. In the Stanford FLASH multiprocessor, flexibility is obtained by requiring all transactions in a node to pass through a programmable node controller, called MAGIC. In this paper, we evaluate the performance costs of flexibility by comparing the performance of FLASH to that of an idealized hardwired machine on representative parallel applications and a multiprogramming workload. To measure the performance of FLASH, we use a detailed simulator of the FLASH and MAGIC designs, together with the code sequences that implement the cache-coherence protocol. We find that for a range of optimized parallel applications the performance differences between the idealized machine and FLASH are small. For these programs, either the miss rates are small or the latency of the programmable protocol can be hidden behind the memory access time. For applications that incur a large number of remote misses or exhibit substantial hot-spotting, performance is poor for both machines, though the increased remote access latencies or the occupancy of MAGIC lead to lower performance for the flexible design. In most cases, however, FLASH is only 2%–12% slower than the idealized machine.
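
The abstract's point that protocol latency "can be hidden behind the memory access time" can be illustrated with a toy latency model. The sketch below is not taken from the paper or its simulator: the function names, the overlap rule, and all cycle counts are hypothetical and chosen only to show the arithmetic.

```python
def miss_latency(memory_cycles, handler_cycles, overlapped):
    """Latency of one cache miss in a toy model (all inputs are hypothetical)."""
    if overlapped:
        # Assumption: the protocol handler runs in parallel with the
        # memory access, so only the longer of the two is visible.
        return max(memory_cycles, handler_cycles)
    # Assumption: the handler must finish before memory is accessed,
    # so the two costs add up.
    return memory_cycles + handler_cycles


def slowdown(compute_cycles, misses, flexible_latency, hardwired_latency):
    """Fractional slowdown of the flexible design relative to the hardwired one."""
    flexible = compute_cycles + misses * flexible_latency
    hardwired = compute_cycles + misses * hardwired_latency
    return flexible / hardwired - 1.0


if __name__ == "__main__":
    memory, handler = 20, 12  # made-up cycle counts, not measurements
    hidden = miss_latency(memory, handler, overlapped=True)
    exposed = miss_latency(memory, handler, overlapped=False)
    # Hardwired baseline is assumed to pay only the memory access time.
    print("handler hidden: ", slowdown(1000, 10, hidden, memory))
    print("handler exposed:", slowdown(1000, 10, exposed, memory))
```

In this model the flexible design matches the hardwired baseline whenever the handler finishes within the memory access time, and the gap grows with the miss count once the handler latency is exposed. This matches the qualitative behavior the abstract describes for applications with many remote misses.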