Porting existing cache-oblivious linear algebra HPC modules to larrabee architecture

Proceedings of the 7th ACM international conference on Computing frontiers Pub Date : 2010-05-17 DOI:10.1145/1787275.1787298

A. Heinecke, C. Trinitis, J. Weidendorfer


Citations: 3

Abstract

Cache-obliviousness is an important but relatively new concept in cache optimization. Because cache-oblivious algorithms perform well on architectures with arbitrary cache configurations, the programming effort required to port and optimize code for future architectures can be significantly reduced. Fast parallel cache-oblivious linear algebra modules have been presented in [8] and [9]. The underlying matrix storage schemes are based on space-filling curves. For matrix multiplication, all cache misses can be avoided, whereas for the LU decomposition algorithm the number of cache misses is minimized. The resulting codes have been shown to work very well on several kinds of systems, ranging from laptops to supercomputers. In this paper, we show that the runtime characteristics of our existing cache-oblivious codes can be preserved on newer Intel processors. Special emphasis is put on the first many-core processor architecture with complete hardware-based cache coherency: the Larrabee architecture. As Larrabee is expected to be available as a PCIe card connected to the host system, porting had to take into account the transfer of data structures between different memory address spaces. Unfortunately, Larrabee was canceled as a graphics device for 2010, but Intel is expected to outline further steps regarding Larrabee during 2010.
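To illustrate what a space-filling-curve storage scheme combined with a cache-oblivious multiplication can look like, here is a minimal sketch. It is not the authors' implementation: the abstract does not say which curve or kernels the modules in [8] and [9] use, so the sketch substitutes Morton (Z-order) indexing because it is easy to state. All function names (`morton_index`, `to_morton`, `from_morton`, `multiply`) are hypothetical, and the matrix size is assumed to be a power of two.

```python
def morton_index(row, col, bits=16):
    """Interleave the bits of row and col into a Z-order (Morton) index.

    Assumption for this sketch: each row bit sits directly above the
    matching col bit, so the four quadrants of a block are stored
    contiguously in the order (0,0), (0,1), (1,0), (1,1).
    """
    index = 0
    for b in range(bits):
        index |= ((row >> b) & 1) << (2 * b + 1)
        index |= ((col >> b) & 1) << (2 * b)
    return index


def to_morton(matrix):
    """Copy a square list-of-lists matrix into a flat Z-order list."""
    n = len(matrix)
    flat = [0] * (n * n)
    for r in range(n):
        for c in range(n):
            flat[morton_index(r, c)] = matrix[r][c]
    return flat


def from_morton(flat, n):
    """Copy a flat Z-order list back into a list-of-lists matrix."""
    return [[flat[morton_index(r, c)] for c in range(n)] for r in range(n)]


def _multiply_add(a, b, c, a_off, b_off, c_off, n):
    """Recursively compute C += A * B on n x n blocks in Z-order storage.

    The recursion never refers to a cache size: every block, at every
    level, occupies one contiguous range of the flat list.
    """
    if n == 1:
        c[c_off] += a[a_off] * b[b_off]
        return
    half = n // 2
    quad = half * half  # number of elements in one quadrant
    for i in (0, 1):
        for j in (0, 1):
            for k in (0, 1):
                _multiply_add(
                    a, b, c,
                    a_off + (2 * i + k) * quad,  # quadrant A[i][k]
                    b_off + (2 * k + j) * quad,  # quadrant B[k][j]
                    c_off + (2 * i + j) * quad,  # quadrant C[i][j]
                    half,
                )


def multiply(a_rows, b_rows):
    """Multiply two square matrices whose size is a power of two."""
    n = len(a_rows)
    if n == 0 or n & (n - 1) != 0:
        raise ValueError("matrix size must be a power of two")
    a, b = to_morton(a_rows), to_morton(b_rows)
    c = [0] * (n * n)
    _multiply_add(a, b, c, 0, 0, 0, n)
    return from_morton(c, n)
```

Calling `multiply(a_rows, b_rows)` on two 4 x 4 lists of lists returns the ordinary matrix product. A practical code would stop the recursion at a small block size and hand that block to a tuned kernel; the sketch recurses down to single elements only to keep the structure visible.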

