Multi-processor Performance on the Tera MTA

A. Snavely, L. Carter, J. Boisseau, A. Majumdar, K. Gatlin, N. Mitchell, J. Feo, Brian D. Koblenz
{"title":"Tera MTA的多处理器性能","authors":"A. Snavely, L. Carter, J. Boisseau, A. Majumdar, K. Gatlin, N. Mitchell, J. Feo, Brian D. Koblenz","doi":"10.1109/SC.1998.10049","DOIUrl":null,"url":null,"abstract":"The Tera MTA is a revolutionary commercial computer based on a multithreaded processor architecture. In contrast to many other parallel architectures, the Tera MTA can effectively use high amounts of parallelism on a single processor. By running multiple threads on a single processor, it can tolerate memory latency and to keep the processor saturated. If the computation is sufficiently large, it can benefit from running on multiple processors. A primary architectural goal of the MTA is that it provide scalable performance over multiple processors. This paper is a preliminary investigation of the first multi-processor Tera MTA. In a previous paper [1] we reported that on the kernel NAS 2 benchmarks [2], a single-processor MTA system running at the architected clock speed would be similar in performance to a single processor of the Cray T90. We found that the compilers of both machines were able to find the necessary threads or vector operations, after making standard changes to the random number generator. In this paper we update the single-processor results in two ways: we use only actual clock speeds, and we report improvements given by further tuning of the MTA codes. We then investigate the performance of the best single-processor codes when run on a two-processor MTA, making no further tuning effort. The parallel efficiency of the codes range from 77% to 99%. An analysis shows that the \"serial bottlenecks\" -- unparallelized code sections and the cost of allocating and freeing the parallel hardware resources -- account for less than a percent of the runtimes. Thus, Amdahl's Law needn't take effect on the NAS benchmarks until there are hundreds of processors running thousands of threads. Instead, the major source of inefficiency appears to be an imperfect network connecting the processors to the memory. Ideally, the network can support one memory reference per instruction. The current hardware has defects that reduce the throughput to about 85% of this rate. Except for the EP benchmark, the tuned codes issue memory references at nearly the peak rate of one per instruction. Consequently, the network can support the memory references issued by one, but not two, processors. As a result, the parallel efficiency of EP is near- perfect, but the others are reduced accordingly. Another reason for imperfect speedup pertains to the compiler. While the definition of a thread in a single processor or multi-processor mode is essentially the same, there is a different implementation and an associated overhead with running on multiple processors. We characterize the overhead of running \"frays\" (a collection of threads running on a single processor) and \"crews\" (a collection of frays, one per processor.)","PeriodicalId":113978,"journal":{"name":"Proceedings of the IEEE/ACM SC98 Conference","volume":"9 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1998-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"71","resultStr":"{\"title\":\"Multi-processor Performance on the Tera MTA\",\"authors\":\"A. Snavely, L. Carter, J. Boisseau, A. Majumdar, K. Gatlin, N. Mitchell, J. Feo, Brian D. 
Koblenz\",\"doi\":\"10.1109/SC.1998.10049\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The Tera MTA is a revolutionary commercial computer based on a multithreaded processor architecture. In contrast to many other parallel architectures, the Tera MTA can effectively use high amounts of parallelism on a single processor. By running multiple threads on a single processor, it can tolerate memory latency and to keep the processor saturated. If the computation is sufficiently large, it can benefit from running on multiple processors. A primary architectural goal of the MTA is that it provide scalable performance over multiple processors. This paper is a preliminary investigation of the first multi-processor Tera MTA. In a previous paper [1] we reported that on the kernel NAS 2 benchmarks [2], a single-processor MTA system running at the architected clock speed would be similar in performance to a single processor of the Cray T90. We found that the compilers of both machines were able to find the necessary threads or vector operations, after making standard changes to the random number generator. In this paper we update the single-processor results in two ways: we use only actual clock speeds, and we report improvements given by further tuning of the MTA codes. We then investigate the performance of the best single-processor codes when run on a two-processor MTA, making no further tuning effort. The parallel efficiency of the codes range from 77% to 99%. An analysis shows that the \\\"serial bottlenecks\\\" -- unparallelized code sections and the cost of allocating and freeing the parallel hardware resources -- account for less than a percent of the runtimes. Thus, Amdahl's Law needn't take effect on the NAS benchmarks until there are hundreds of processors running thousands of threads. Instead, the major source of inefficiency appears to be an imperfect network connecting the processors to the memory. Ideally, the network can support one memory reference per instruction. The current hardware has defects that reduce the throughput to about 85% of this rate. Except for the EP benchmark, the tuned codes issue memory references at nearly the peak rate of one per instruction. Consequently, the network can support the memory references issued by one, but not two, processors. As a result, the parallel efficiency of EP is near- perfect, but the others are reduced accordingly. Another reason for imperfect speedup pertains to the compiler. While the definition of a thread in a single processor or multi-processor mode is essentially the same, there is a different implementation and an associated overhead with running on multiple processors. 
We characterize the overhead of running \\\"frays\\\" (a collection of threads running on a single processor) and \\\"crews\\\" (a collection of frays, one per processor.)\",\"PeriodicalId\":113978,\"journal\":{\"name\":\"Proceedings of the IEEE/ACM SC98 Conference\",\"volume\":\"9 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1998-11-07\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"71\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the IEEE/ACM SC98 Conference\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SC.1998.10049\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the IEEE/ACM SC98 Conference","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SC.1998.10049","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Cited by: 71

Abstract

The Tera MTA is a revolutionary commercial computer based on a multithreaded processor architecture. In contrast to many other parallel architectures, the Tera MTA can effectively exploit high degrees of parallelism on a single processor. By running multiple threads on a single processor, it can tolerate memory latency and keep the processor saturated. If the computation is sufficiently large, it can benefit from running on multiple processors. A primary architectural goal of the MTA is to provide scalable performance across multiple processors. This paper is a preliminary investigation of the first multi-processor Tera MTA. In a previous paper [1] we reported that on the kernel NAS 2 benchmarks [2], a single-processor MTA system running at the architected clock speed would be similar in performance to a single processor of the Cray T90. We found that the compilers of both machines were able to find the necessary threads or vector operations, after standard changes were made to the random number generator. In this paper we update the single-processor results in two ways: we use only actual clock speeds, and we report improvements obtained by further tuning of the MTA codes. We then investigate the performance of the best single-processor codes when run on a two-processor MTA, with no further tuning effort. The parallel efficiency of the codes ranges from 77% to 99%. An analysis shows that the "serial bottlenecks" -- unparallelized code sections and the cost of allocating and freeing the parallel hardware resources -- account for less than one percent of the runtime. Thus, Amdahl's Law need not take effect on the NAS benchmarks until there are hundreds of processors running thousands of threads. Instead, the major source of inefficiency appears to be an imperfect network connecting the processors to the memory. Ideally, the network can support one memory reference per instruction. The current hardware has defects that reduce the throughput to about 85% of this rate. Except for the EP benchmark, the tuned codes issue memory references at nearly the peak rate of one per instruction. Consequently, the network can support the memory references issued by one, but not two, processors. As a result, the parallel efficiency of EP is near-perfect, but the others are reduced accordingly. Another reason for imperfect speedup pertains to the compiler. While the definition of a thread is essentially the same in single-processor and multi-processor modes, running on multiple processors involves a different implementation and an associated overhead. We characterize the overhead of running "frays" (a collection of threads running on a single processor) and "crews" (a collection of frays, one per processor).
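To make the Amdahl's Law argument above concrete, the following is a minimal illustrative sketch (not from the paper) that computes the classical Amdahl bound S(p) = 1 / (f + (1 - f) / p) and the corresponding parallel efficiency S(p) / p. The serial fraction f = 0.01 and the processor counts are assumptions chosen only to reflect the abstract's claim that serial bottlenecks account for less than a percent of the runtime.

```python
# Illustrative sketch (assumptions, not measurements from the paper):
# Amdahl's Law with a serial fraction of about 1%, the upper bound implied
# by the abstract's statement that "serial bottlenecks" take under 1% of runtime.

def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Upper bound on speedup when a fixed fraction of the work is serial."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

def parallel_efficiency(serial_fraction: float, processors: int) -> float:
    """Speedup divided by the number of processors."""
    return amdahl_speedup(serial_fraction, processors) / processors

if __name__ == "__main__":
    f = 0.01  # assumed serial fraction (~1%), per the abstract's bound
    for p in (2, 16, 128, 1024):
        print(f"p={p:5d}  speedup={amdahl_speedup(f, p):7.2f}  "
              f"efficiency={parallel_efficiency(f, p):.1%}")
```

With a serial fraction of at most about 1%, the bound at two processors is roughly 99% efficiency, so the measured 77% to 99% efficiencies must come mostly from other sources of loss, consistent with the abstract's attribution of the shortfall to the processor-to-memory network rather than to serial code.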