大规模SMP集群上矢量运算的运行时优化

2008 International Conference on Parallel Architectures and Compilation Techniques (PACT) Pub Date : 2008-10-25 DOI:10.1145/1454115.1454134

Costin Iancu, S. Hofmeyr

{"title":"大规模SMP集群上矢量运算的运行时优化","authors":"Costin Iancu, S. Hofmeyr","doi":"10.1145/1454115.1454134","DOIUrl":null,"url":null,"abstract":"“Vector” style communication operations transfer multiple disjoint memory regions within one logical step. These operations are widely used in applications, they do improve application performance, and their behavior has been studied and optimized using different implementation techniques across a large variety of systems. In this paper we present a methodology for the selection of the best performing implementation of a vector operation from multiple alternative implementations. Our approach is designed to work for systems with wide SMP nodes where we believe that most published studies fail to correctly predict performance. Due to the emergence of multi-core processors we believe that techniques similar to ours will be incorporated for performance reasons in communication libraries or language runtimes. The methodology relies on the exploration of the application space and a classification of the regions within this space where a particular implementation method performs best. We use micro-benchmarks to measure the performance of an implementation for a given point in the application space and then compose profiles that compare the performance of two given implementations. These profiles capture an empirical upper bound for the performance degradation of a given protocol under heavy node load. At runtime, the application selects the implementation according to these performance profiles. Our approach provides performance portability and using our dynamic multi-protocol selection we have been able to improve the performance of a NAS Parallel Benchmarks workload by 22% on an IBM large scale cluster. Very positive results have also been obtained on large scale InfiniBand and Cray XT systems. This work indicates that perhaps the most important factor for application performance on wide SMP systems is the successful management of load on the Network Interface Cards.","PeriodicalId":186773,"journal":{"name":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"91 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2008-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Runtime optimization of vector operations on large scale SMP clusters\",\"authors\":\"Costin Iancu, S. Hofmeyr\",\"doi\":\"10.1145/1454115.1454134\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"“Vector” style communication operations transfer multiple disjoint memory regions within one logical step. These operations are widely used in applications, they do improve application performance, and their behavior has been studied and optimized using different implementation techniques across a large variety of systems. In this paper we present a methodology for the selection of the best performing implementation of a vector operation from multiple alternative implementations. Our approach is designed to work for systems with wide SMP nodes where we believe that most published studies fail to correctly predict performance. Due to the emergence of multi-core processors we believe that techniques similar to ours will be incorporated for performance reasons in communication libraries or language runtimes. The methodology relies on the exploration of the application space and a classification of the regions within this space where a particular implementation method performs best. We use micro-benchmarks to measure the performance of an implementation for a given point in the application space and then compose profiles that compare the performance of two given implementations. These profiles capture an empirical upper bound for the performance degradation of a given protocol under heavy node load. At runtime, the application selects the implementation according to these performance profiles. Our approach provides performance portability and using our dynamic multi-protocol selection we have been able to improve the performance of a NAS Parallel Benchmarks workload by 22% on an IBM large scale cluster. Very positive results have also been obtained on large scale InfiniBand and Cray XT systems. This work indicates that perhaps the most important factor for application performance on wide SMP systems is the successful management of load on the Network Interface Cards.\",\"PeriodicalId\":186773,\"journal\":{\"name\":\"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)\",\"volume\":\"91 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2008-10-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1454115.1454134\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1454115.1454134","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

摘要

“矢量”式通信操作在一个逻辑步骤内传输多个不相交的存储区域。这些操作在应用程序中被广泛使用，它们确实提高了应用程序的性能，并且它们的行为已经在各种各样的系统中使用不同的实现技术进行了研究和优化。在本文中，我们提出了一种从多个备选实现中选择性能最佳的矢量操作实现的方法。我们的方法设计用于具有宽SMP节点的系统，我们认为大多数已发表的研究都无法正确预测性能。由于多核处理器的出现，我们相信出于性能原因，类似于我们的技术将被纳入通信库或语言运行时。该方法依赖于对应用程序空间的探索，以及对该空间中特定实现方法表现最佳的区域进行分类。我们使用微基准测试来衡量应用程序空间中给定点的实现性能，然后编写概要文件来比较两个给定实现的性能。这些概要文件捕获了给定协议在高节点负载下性能下降的经验上限。在运行时，应用程序根据这些性能配置文件选择实现。我们的方法提供了性能可移植性，并且使用我们的动态多协议选择，我们已经能够在IBM大型集群上将NAS并行基准工作负载的性能提高22%。在大规模InfiniBand和Cray XT系统上也取得了非常积极的结果。这项工作表明，在广泛的SMP系统上，影响应用程序性能的最重要因素可能是成功地管理网络接口卡上的负载。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Runtime optimization of vector operations on large scale SMP clusters

“Vector” style communication operations transfer multiple disjoint memory regions within one logical step. These operations are widely used in applications, they do improve application performance, and their behavior has been studied and optimized using different implementation techniques across a large variety of systems. In this paper we present a methodology for the selection of the best performing implementation of a vector operation from multiple alternative implementations. Our approach is designed to work for systems with wide SMP nodes where we believe that most published studies fail to correctly predict performance. Due to the emergence of multi-core processors we believe that techniques similar to ours will be incorporated for performance reasons in communication libraries or language runtimes. The methodology relies on the exploration of the application space and a classification of the regions within this space where a particular implementation method performs best. We use micro-benchmarks to measure the performance of an implementation for a given point in the application space and then compose profiles that compare the performance of two given implementations. These profiles capture an empirical upper bound for the performance degradation of a given protocol under heavy node load. At runtime, the application selects the implementation according to these performance profiles. Our approach provides performance portability and using our dynamic multi-protocol selection we have been able to improve the performance of a NAS Parallel Benchmarks workload by 22% on an IBM large scale cluster. Very positive results have also been obtained on large scale InfiniBand and Cray XT systems. This work indicates that perhaps the most important factor for application performance on wide SMP systems is the successful management of load on the Network Interface Cards.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2008 International Conference on Parallel Architectures and Compilation Techniques (PACT)

自引率

0.00%

发文量