探索NUFFT数据翻译的并行化策略

International Conference on Embedded Software Pub Date : 2009-10-12 DOI:10.1145/1629335.1629361

Yuanrui Zhang, M. Kandemir, N. Pitsianis, Xiaobai Sun

{"title":"探索NUFFT数据翻译的并行化策略","authors":"Yuanrui Zhang, M. Kandemir, N. Pitsianis, Xiaobai Sun","doi":"10.1145/1629335.1629361","DOIUrl":null,"url":null,"abstract":"This paper introduces parallelization strategies for the Non-Uniform FFT (NUFFT) data translation on multicore architectures. The NUFFT enables the use of the celebrated FFT with un-equally spaced data in numerous situations in signal and image processing as well as in scientific computing. The critical extension lies at the translation of non-equally spaced or non-uniformly sampled data onto an equally spaced Cartesian grid or vice versa. The data translation can be made sufficiently accurate, with the arithmetic complexity linearly proportional to the size of the data ensemble. For large NUFFTs, however, the data translation is found substantially dominant in computation time on modern computers while it is expected to be dominated by the FFT. In order to match the FFT performance achieved by FFTW, data locality and parallelism in the data translation must be explored and exploited as well. We are concerned with two fundamental issues. First, the data translation can be described as a matrix-vector multiplication with a matrix of irregular sparsity. This is beyond the effective scope of the conventional tiling and parallelization schemes applied by a compiler for performance improvement on computation with dense matrices. Secondly, multicore processors exist and emerge in many different configurations, and are expected to evolve further in architectural variety. This may mean the end of performance tuning on a single type of architecture. In this paper, we introduce an automation tool that takes two specifications as input, one on an application-specific data translation algorithm, the other on a target multicore processor architecture. The tool generates a parallel code that explores the data locality and parallelism by utilizing both geometric structures in data translation and the processor-memory configurations in the target architecture. We present preliminary experimental results on both a simulator and a commercial multicore machine. The results show that our parallelization strategy brings significant performance improvement for the NUFFT data translation by efficiently exploiting the data locality and concurrency in the application.","PeriodicalId":143573,"journal":{"name":"International Conference on Embedded Software","volume":"76 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-10-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":"{\"title\":\"Exploring parallelization strategies for NUFFT data translation\",\"authors\":\"Yuanrui Zhang, M. Kandemir, N. Pitsianis, Xiaobai Sun\",\"doi\":\"10.1145/1629335.1629361\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper introduces parallelization strategies for the Non-Uniform FFT (NUFFT) data translation on multicore architectures. The NUFFT enables the use of the celebrated FFT with un-equally spaced data in numerous situations in signal and image processing as well as in scientific computing. The critical extension lies at the translation of non-equally spaced or non-uniformly sampled data onto an equally spaced Cartesian grid or vice versa. The data translation can be made sufficiently accurate, with the arithmetic complexity linearly proportional to the size of the data ensemble. For large NUFFTs, however, the data translation is found substantially dominant in computation time on modern computers while it is expected to be dominated by the FFT. In order to match the FFT performance achieved by FFTW, data locality and parallelism in the data translation must be explored and exploited as well. We are concerned with two fundamental issues. First, the data translation can be described as a matrix-vector multiplication with a matrix of irregular sparsity. This is beyond the effective scope of the conventional tiling and parallelization schemes applied by a compiler for performance improvement on computation with dense matrices. Secondly, multicore processors exist and emerge in many different configurations, and are expected to evolve further in architectural variety. This may mean the end of performance tuning on a single type of architecture. In this paper, we introduce an automation tool that takes two specifications as input, one on an application-specific data translation algorithm, the other on a target multicore processor architecture. The tool generates a parallel code that explores the data locality and parallelism by utilizing both geometric structures in data translation and the processor-memory configurations in the target architecture. We present preliminary experimental results on both a simulator and a commercial multicore machine. The results show that our parallelization strategy brings significant performance improvement for the NUFFT data translation by efficiently exploiting the data locality and concurrency in the application.\",\"PeriodicalId\":143573,\"journal\":{\"name\":\"International Conference on Embedded Software\",\"volume\":\"76 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-10-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"7\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Conference on Embedded Software\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1629335.1629361\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Embedded Software","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1629335.1629361","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 7

摘要

本文介绍了多核结构下非均匀FFT (NUFFT)数据转换的并行化策略。在信号和图像处理以及科学计算的许多情况下，NUFFT可以使用著名的FFT处理不等间隔的数据。关键扩展在于将非等间距或非均匀采样数据转换为等间距笛卡尔网格，反之亦然。数据转换可以足够精确，算法复杂度与数据集合的大小成线性比例。然而，对于大型的FFT，在现代计算机上，数据转换在计算时间上占主导地位，而它预计将由FFT主导。为了达到FFTW所达到的FFT性能，必须探索和利用数据转换中的数据局部性和并行性。我们关心的是两个基本问题。首先，数据转换可以描述为不规则稀疏矩阵的矩阵-向量乘法。这超出了编译器用于提高密集矩阵计算性能的传统平铺和并行化方案的有效范围。其次，多核处理器存在并出现在许多不同的配置中，并且有望在架构多样性方面进一步发展。这可能意味着对单一类型的体系结构进行性能调优的结束。在本文中，我们介绍了一个自动化工具，它以两个规范作为输入，一个是特定于应用程序的数据转换算法，另一个是目标多核处理器体系结构。该工具生成一个并行代码，通过利用数据转换中的几何结构和目标体系结构中的处理器-内存配置来探索数据的局部性和并行性。我们介绍了在模拟器和商用多核机上的初步实验结果。结果表明，我们的并行化策略通过有效地利用应用程序中的数据局部性和并发性，显著提高了NUFFT数据翻译的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Exploring parallelization strategies for NUFFT data translation

This paper introduces parallelization strategies for the Non-Uniform FFT (NUFFT) data translation on multicore architectures. The NUFFT enables the use of the celebrated FFT with un-equally spaced data in numerous situations in signal and image processing as well as in scientific computing. The critical extension lies at the translation of non-equally spaced or non-uniformly sampled data onto an equally spaced Cartesian grid or vice versa. The data translation can be made sufficiently accurate, with the arithmetic complexity linearly proportional to the size of the data ensemble. For large NUFFTs, however, the data translation is found substantially dominant in computation time on modern computers while it is expected to be dominated by the FFT. In order to match the FFT performance achieved by FFTW, data locality and parallelism in the data translation must be explored and exploited as well. We are concerned with two fundamental issues. First, the data translation can be described as a matrix-vector multiplication with a matrix of irregular sparsity. This is beyond the effective scope of the conventional tiling and parallelization schemes applied by a compiler for performance improvement on computation with dense matrices. Secondly, multicore processors exist and emerge in many different configurations, and are expected to evolve further in architectural variety. This may mean the end of performance tuning on a single type of architecture. In this paper, we introduce an automation tool that takes two specifications as input, one on an application-specific data translation algorithm, the other on a target multicore processor architecture. The tool generates a parallel code that explores the data locality and parallelism by utilizing both geometric structures in data translation and the processor-memory configurations in the target architecture. We present preliminary experimental results on both a simulator and a commercial multicore machine. The results show that our parallelization strategy brings significant performance improvement for the NUFFT data translation by efficiently exploiting the data locality and concurrency in the application.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

International Conference on Embedded Software

自引率

0.00%

发文量