Evaluation of the Raw microprocessor: an exposed-wire-delay architecture for ILP and streams

Proceedings. 31st Annual International Symposium on Computer Architecture, 2004. Pub Date: 2004-06-19. DOI: 10.1145/1028176.1006733

M. Taylor, Walter Lee, Jason E. Miller, D. Wentzlaff, Ian Bratt, B. Greenwald, H. Hoffmann, Paul R. Johnson, J. Kim, James Psota, A. Saraf, N. Shnidman, V. Strumpen, M. Frank, Saman P. Amarasinghe, A. Agarwal


Citations: 451



Evaluation of the Raw microprocessor: an exposed-wire-delay architecture for ILP and streams

This paper evaluates the Raw microprocessor. Raw addresses the challenge of building a general-purpose architecture that performs well on a larger class of stream and embedded computing applications than existing microprocessors, while still running existing ILP-based sequential programs with reasonable performance in the face of increasing wire delays. Raw approaches this challenge by implementing plenty of on-chip resources - including logic, wires, and pins - in a tiled arrangement, and exposing them through a new ISA, so that the software can take advantage of these resources for parallel applications. Raw supports both ILP and streams by routing operands between architecturally-exposed functional units over a point-to-point scalar operand network. This network offers low latency for scalar data transport. Raw manages the effect of wire delays by exposing the interconnect and using software to orchestrate both scalar and stream data transport. We have implemented a prototype Raw microprocessor in IBM's 180 nm, 6-layer copper, CMOS 7SF standard-cell ASIC process. We have also implemented ILP and stream compilers. Our evaluation attempts to determine the extent to which Raw succeeds in meeting its goal of serving as a more versatile, general-purpose processor. Central to achieving this goal is Raw's ability to exploit all forms of parallelism, including ILP, DLP, TLP, and Stream parallelism. Specifically, we evaluate the performance of Raw on a diverse set of codes including traditional sequential programs, streaming applications, server workloads and bit-level embedded computation. Our experimental methodology makes use of a cycle-accurate simulator validated against our real hardware. Compared to a 180 nm Pentium-III, using commodity PC memory system components, Raw performs within a factor of 2× for sequential applications with a very low degree of ILP, about 2× to 9× better for higher levels of ILP, and 10×-100× better when highly parallel applications are coded in a stream language or optimized by hand. The paper also proposes a new versatility metric and uses it to discuss the generality of Raw.

