SPAMeR: Speculative Push for Anticipated Message Requests in Multi-Core Systems

Proceedings of the 51st International Conference on Parallel Processing. Pub Date: 2022-08-29. DOI: 10.1145/3545008.3545044

Qinzhe Wu, Ashen Ekanayake, Ruihao Li, J. Beard, L. John

Citations: 0

Abstract

With increasing core counts and multiple levels of cache memory, scaling multi-threaded and task-level parallel workloads is becoming ever more challenging. A key obstacle to scaling the number of communicating tasks (or threads) is the rate at which existing communication mechanisms scale in terms of latency and bandwidth. Architectures with hardware-accelerated queuing operations have the potential to reduce the latency and improve the scalability of moving data between processing elements, reducing synchronization penalties and thereby improving the performance of task-level parallel workloads. While hardware queues reduce synchronization penalties, they cannot fully hide load-to-use latency, i.e., perfect pipelines often are not realized. There is potential, however, for better overlap: if the inter-processor communication latency is equal to or less than the time the consumer spends processing a message, all of that latency can be overlapped with the consumer's processing. We exploit this property to speed up parallel applications beyond what existing hardware queues achieve. In this paper, we present SPAMeR, a speculation mechanism built on top of a state-of-the-art hardware-driven message queue architecture. SPAMeR can speculatively push messages in anticipation of consumer message requests. Unlike prefetch approaches, which predict what addresses to fetch next, with a queue we know exactly what data is needed next but not when it is needed; SPAMeR adds algorithms that attempt to predict this timing. We evaluate the effectiveness of SPAMeR with a set of diverse task-parallel benchmarks using the gem5 full-system simulator, and observe a 1.33× average speedup.
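
The overlap property stated in the abstract can be illustrated with a toy timing model. This is only a sketch under simplifying assumptions, not the mechanism or the evaluation from the paper: every message has the same communication latency and processing time, at most one message is in flight, and the push is timed perfectly. The function names and the numbers are hypothetical and chosen only for illustration.

```python
def pull_total_time(n_messages: int, comm_latency: int, process_time: int) -> int:
    """Consumer requests each message on demand.

    The transfer only starts once the consumer asks, so the full
    communication latency is exposed for every message.
    """
    return n_messages * (comm_latency + process_time)


def push_total_time(n_messages: int, comm_latency: int, process_time: int) -> int:
    """Idealized speculative push (assumed perfect timing).

    The next message travels while the current one is being processed.
    Only the first transfer is fully exposed; each later step costs
    whichever is longer, the transfer or the processing.
    """
    if n_messages == 0:
        return 0
    step = max(comm_latency, process_time)
    return comm_latency + process_time + (n_messages - 1) * step


if __name__ == "__main__":
    # Hypothetical cycle counts, not measurements from the paper.
    n, latency, work = 100, 40, 50
    pull = pull_total_time(n, latency, work)
    push = push_total_time(n, latency, work)
    print(f"pull: {pull} cycles, push: {push} cycles")
    # When latency <= work, the push case costs latency + n * work:
    # only the first transfer is visible to the consumer.
    print(f"exposed latency with push: {push - n * work} cycles")
```

In this model, once the communication latency exceeds the processing time, each step is bounded by the transfer rather than the work, so the latency can no longer be fully hidden. This matches the condition the abstract states for full overlap.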
