Memory access scheduling to reduce thread migrations

Proceedings of the 31st ACM SIGPLAN International Conference on Compiler Construction. Pub Date: 2022-03-18. DOI: 10.1145/3497776.3517768

S. Damani, Prithayan Barua, Vivek Sarkar



Abstract

It has been widely observed that data movement is emerging as the primary bottleneck to scalability and energy efficiency in future hardware, especially for applications and algorithms that are not cache-friendly and achieve below 1% of peak performance on today’s systems. The idea of “moving compute to data” has been suggested as one approach to address this challenge. While there are approaches that can achieve this migration in software, hardware support is a promising direction from the perspectives of lower overheads and programmer productivity. Migratory thread architectures migrate lightweight hardware thread contexts to the location of the data instead of transferring data to the requesting processor. However, while transporting thread contexts is cheaper than moving data, thread migrations still incur energy and bandwidth overheads and can be particularly expensive if threads frequently migrate in a ping-pong manner between processors due to poor locality of access. In this paper, we propose Memory Access Scheduling, a new compiler optimization that aims to reduce the number of overall thread migrations when executing a program on migratory thread architectures. Our experiments show performance improvements with a geometric mean speedup of 1.23× for a set of 7 explicitly-parallelized kernels, and of 1.10× for a set of 15 automatically-parallelized kernels. We believe that memory access scheduling will also be an important optimization for other locality-centric architectures that benefit from software thread migrations, such as multi-threaded NUMA architectures.
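The ping-pong effect described in the abstract can be illustrated with a toy model. The sketch below is not the paper's algorithm. It assumes a hypothetical two-node machine where an address's home node is `addr % 2`, and it assumes that all accesses in the sequence are mutually independent, so they can be reordered freely. The names `count_migrations`, `group_by_home`, and `home` are invented for this illustration. A real compiler pass would also have to respect data dependences.

```python
from typing import Callable, Dict, List, Sequence


def count_migrations(addresses: Sequence[int],
                     home: Callable[[int], int],
                     start_node: int = 0) -> int:
    """Count node hops a migrating thread makes while following the accesses in order."""
    node = start_node
    migrations = 0
    for addr in addresses:
        target = home(addr)
        if target != node:
            # The thread context moves to the node that owns the data.
            migrations += 1
            node = target
    return migrations


def group_by_home(addresses: Sequence[int],
                  home: Callable[[int], int]) -> List[int]:
    """Reorder independent accesses so that accesses to the same node are adjacent.

    Nodes are visited in first-seen order, and the original order is kept
    within each node (dicts preserve insertion order in Python 3.7+).
    """
    buckets: Dict[int, List[int]] = {}
    for addr in addresses:
        buckets.setdefault(home(addr), []).append(addr)
    return [addr for bucket in buckets.values() for addr in bucket]


def home(addr: int) -> int:
    """Hypothetical data layout: addresses interleaved across two nodes."""
    return addr % 2


original = [0, 1, 2, 3, 4, 5]
scheduled = group_by_home(original, home)

print(count_migrations(original, home))   # 5: the thread hops on nearly every access
print(scheduled)                          # [0, 2, 4, 1, 3, 5]
print(count_migrations(scheduled, home))  # 1: a single hop from node 0 to node 1
```

In this toy setting, grouping accesses by home node cuts the migration count from 5 to 1 without changing which addresses are touched, which is the kind of saving the abstract attributes to scheduling memory accesses.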

