{"title":"基于迭代人工优化的多核平台代码部署方法","authors":"Stuart McCool, P. Milligan, P. Sage","doi":"10.1109/IPDPSW.2012.178","DOIUrl":null,"url":null,"abstract":"In recent years, there has been what can only be described as an explosion in the types of processing devices one can expect to find within a given computer system. These include the multi-core CPU, the General Purpose Graphics Processing Unit (GPGPU) and the Accelerated Processing Unit (APU), to name but a few. The widespread uptake of these systems presents would-be users with at least two problems. Firstly, each device exposes a complex underlying architecture which must be appreciated in order to attain optimal performance. This is coupled with the fact that a single system can support an arbitrary number of such devices. Consequently, fully leveraging the performance capabilities of such a system must come at a cost -- increasingly prolonged development times. Adhering to a methodology will have the significant industrial impact of reducing these development times. This paper describes the continued formulation of such a novel methodology. Two real world scientific programs are optimized for execution on the CUDA platform. Double precision accuracy and optimized speedups (which include PCI-E transfer times) of 15x and 17x are achieved.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"36 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Deriving a Methodology for Code Deployment on Multi-Core Platforms via Iterative Manual Optimizations\",\"authors\":\"Stuart McCool, P. Milligan, P. Sage\",\"doi\":\"10.1109/IPDPSW.2012.178\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In recent years, there has been what can only be described as an explosion in the types of processing devices one can expect to find within a given computer system. These include the multi-core CPU, the General Purpose Graphics Processing Unit (GPGPU) and the Accelerated Processing Unit (APU), to name but a few. The widespread uptake of these systems presents would-be users with at least two problems. Firstly, each device exposes a complex underlying architecture which must be appreciated in order to attain optimal performance. This is coupled with the fact that a single system can support an arbitrary number of such devices. Consequently, fully leveraging the performance capabilities of such a system must come at a cost -- increasingly prolonged development times. Adhering to a methodology will have the significant industrial impact of reducing these development times. This paper describes the continued formulation of such a novel methodology. Two real world scientific programs are optimized for execution on the CUDA platform. 
Double precision accuracy and optimized speedups (which include PCI-E transfer times) of 15x and 17x are achieved.\",\"PeriodicalId\":378335,\"journal\":{\"name\":\"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum\",\"volume\":\"36 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2012-05-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPDPSW.2012.178\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPSW.2012.178","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Deriving a Methodology for Code Deployment on Multi-Core Platforms via Iterative Manual Optimizations
In recent years there has been what can only be described as an explosion in the types of processing devices one can expect to find within a given computer system. These include the multi-core CPU, the General Purpose Graphics Processing Unit (GPGPU) and the Accelerated Processing Unit (APU), to name but a few. The widespread uptake of these systems presents would-be users with at least two problems. Firstly, each device exposes a complex underlying architecture which must be understood in order to attain optimal performance. Secondly, a single system can support an arbitrary number of such devices. Consequently, fully leveraging the performance capabilities of such a system comes at a cost: increasingly prolonged development times. Adhering to a methodology would have the significant industrial impact of reducing these development times. This paper describes the continued formulation of such a methodology. Two real-world scientific programs are optimized for execution on the CUDA platform. Double-precision accuracy is maintained, and optimized speedups (which include PCI-E transfer times) of 15x and 17x are achieved.
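The abstract states that the reported speedups include PCI-E transfer times. As an illustration only (not the authors' code), the following minimal CUDA sketch shows one common way to take such a measurement: the timed region spans the host-to-device copies, the kernel launch, and the device-to-host copy, so the elapsed time reflects the full deployment cost rather than the kernel alone. The daxpy kernel, problem size, and launch configuration are hypothetical stand-ins for the paper's scientific codes.

    // Hypothetical sketch: end-to-end GPU timing that includes PCI-E transfers.
    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Simple double-precision kernel used as a stand-in for a scientific code.
    __global__ void daxpy(int n, double a, const double *x, double *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }

    int main() {
        const int n = 1 << 20;                 // illustrative problem size
        const size_t bytes = n * sizeof(double);

        double *hx = (double *)malloc(bytes);
        double *hy = (double *)malloc(bytes);
        for (int i = 0; i < n; ++i) { hx[i] = 1.0; hy[i] = 2.0; }

        double *dx, *dy;
        cudaMalloc(&dx, bytes);
        cudaMalloc(&dy, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        // The timed region deliberately includes PCI-E transfers in both directions.
        cudaEventRecord(start);
        cudaMemcpy(dx, hx, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy, bytes, cudaMemcpyHostToDevice);
        daxpy<<<(n + 255) / 256, 256>>>(n, 3.0, dx, dy);
        cudaMemcpy(hy, dy, bytes, cudaMemcpyDeviceToHost);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("End-to-end GPU time (incl. PCI-E transfers): %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(dx); cudaFree(dy);
        free(hx); free(hy);
        return 0;
    }

Under this measurement convention, the reported speedup would be the runtime of the CPU reference divided by the measured end-to-end GPU time for the same workload.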