The performance impact of flexibility in the Stanford FLASH multiprocessor

ASPLOS VI | Pub Date: 1994-11-01 | DOI: 10.1145/195473.195569

M. Heinrich, J. Kuskin, D. Ofelt, J. Heinlein, J. Baxter, J. Singh, R. Simoni, K. Gharachorloo, D. Nakahira, M. Horowitz, Anoop Gupta, M. Rosenblum, J. Hennessy


Citations: 136

Abstract


A flexible communication mechanism is a desirable feature in multiprocessors because it allows support for multiple communication protocols, expands performance monitoring capabilities, and leads to a simpler design and debug process. In the Stanford FLASH multiprocessor, flexibility is obtained by requiring all transactions in a node to pass through a programmable node controller, called MAGIC. In this paper, we evaluate the performance costs of flexibility by comparing the performance of FLASH to that of an idealized hardwired machine on representative parallel applications and a multiprogramming workload. To measure the performance of FLASH, we use a detailed simulator of the FLASH and MAGIC designs, together with the code sequences that implement the cache-coherence protocol. We find that for a range of optimized parallel applications the performance differences between the idealized machine and FLASH are small. For these programs, either the miss rates are small or the latency of the programmable protocol can be hidden behind the memory access time. For applications that incur a large number of remote misses or exhibit substantial hot-spotting, performance is poor for both machines, though the increased remote access latencies or the occupancy of MAGIC lead to lower performance for the flexible design. In most cases, however, FLASH is only 2%–12% slower than the idealized machine.
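The two conditions the abstract names (a low miss rate, or protocol latency hidden behind the memory access) can be illustrated with a toy cost model. This is only a sketch under stated assumptions: the function names, the cycle counts, and the simple additive formula are hypothetical and are not taken from the paper or from the FLASH/MAGIC simulator.

```python
# Toy model only: all names and numbers here are illustrative assumptions,
# not measurements or formulas from the FLASH paper.

def miss_latency(memory_cycles: int, handler_cycles: int, overlapped: bool = True) -> int:
    """Cycles to service one miss at a node controller.

    If the protocol handler runs in parallel with the memory access,
    only the longer of the two is visible; otherwise they add up.
    """
    if overlapped:
        return max(memory_cycles, handler_cycles)
    return memory_cycles + handler_cycles


def relative_slowdown(base_cpi: float, misses_per_instr: float,
                      flexible_latency: int, ideal_latency: int) -> float:
    """Fractional slowdown of the flexible design versus the idealized one."""
    flexible = base_cpi + misses_per_instr * flexible_latency
    ideal = base_cpi + misses_per_instr * ideal_latency
    return flexible / ideal - 1.0


if __name__ == "__main__":
    memory, handler = 20, 12  # hypothetical cycle counts

    # Case 1: handler fully hidden behind the memory access, so no extra cost.
    hidden = miss_latency(memory, handler, overlapped=True)
    print("hidden handler:", relative_slowdown(1.0, 0.05, hidden, memory))

    # Case 2: handler not hidden, but the miss rate is low.
    exposed = miss_latency(memory, handler, overlapped=False)
    print("low miss rate: ", relative_slowdown(1.0, 0.001, exposed, memory))

    # Case 3: handler not hidden and misses are frequent.
    print("high miss rate:", relative_slowdown(1.0, 0.05, exposed, memory))
```

In this sketch the slowdown vanishes when the handler is overlapped with memory and stays small when misses are rare; it grows only when both conditions fail, which mirrors the qualitative argument of the abstract rather than its measured 2%–12% figure.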
