Using Multiple Threads to Accelerate Single Thread Performance

2014 IEEE 28th International Parallel and Distributed Processing Symposium. Pub Date: 2014-05-19. DOI: 10.1109/IPDPS.2014.104

Zehra Sura, K. O'Brien, J. Brunheroto


Citations: 3

Abstract

Computing systems are being designed with an increasing number of hardware cores. To effectively use these cores, applications need to maximize the amount of parallel processing and minimize the time spent in sequential execution. In this work, we aim to exploit fine-grained parallelism beyond the parallelism already encoded in an application. We define an execution model using a primary core and some number of secondary cores that collaborate to speed up the execution of sequential code regions. This execution model relies on cores that are physically close to each other and have fast communication paths between them. For this purpose, we introduce dedicated hardware queues for low-latency transfer of values between cores, and define special "enque" and "deque" instructions to use the queues. Further, we develop compiler analyses and transformations to automatically derive fine-grained parallel code from sequential code regions. We implemented this model for exploiting fine-grained parallelization in the IBM XL compiler framework and in a simulator for the Blue Gene/Q system. We also studied the Sequoia benchmarks to determine code sections where our techniques are applicable. We evaluated our work using these code sections, and observed an average speedup of 1.32 on 2 cores, and an average speedup of 2.05 on 4 cores. Since these code sections are otherwise sequentially executed, we conclude that our approach is useful for accelerating single thread performance.
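The execution model described in the abstract can be illustrated with a small software sketch. This is not the authors' implementation: the paper's "enque" and "deque" are hardware instructions on Blue Gene/Q cores, and the parallel code is derived automatically by the compiler. Here the split is done by hand, the queue is an ordinary Python `queue.Queue`, and the names `CoreQueue`, `f`, `g`, `primary`, and `secondary` are hypothetical, introduced only for illustration. The sketch shows the communication pattern only; Python threads will not reproduce the reported speedups.

```python
import queue
import threading


def f(v):
    # Hypothetical first half of a sequential loop body.
    return v * v


def g(v):
    # Hypothetical second half, independent of f within one iteration.
    return 3 * v + 1


def sequential(xs):
    # Original sequential region: both halves run on one thread.
    return [f(v) + g(v) for v in xs]


class CoreQueue:
    """Software stand-in (an assumption) for one hardware queue between cores."""

    def __init__(self, depth=8):
        self._q = queue.Queue(maxsize=depth)

    def enque(self, value):
        # Blocks while the queue is full.
        self._q.put(value)

    def deque(self):
        # Blocks while the queue is empty.
        return self._q.get()


def secondary(xs, to_primary):
    # Secondary thread: computes its share of each iteration and
    # sends the value to the primary thread in iteration order.
    for v in xs:
        to_primary.enque(g(v))


def primary(xs):
    # Primary thread: computes the other share, then receives the
    # secondary's value for the same iteration and combines the two.
    to_primary = CoreQueue()
    worker = threading.Thread(target=secondary, args=(xs, to_primary))
    worker.start()
    out = []
    for v in xs:
        a = f(v)
        b = to_primary.deque()
        out.append(a + b)
    worker.join()
    return out


if __name__ == "__main__":
    data = list(range(10))
    # The split version must produce the same values as the sequential one.
    assert primary(data) == sequential(data)
```

Because there is one producer and one consumer on a first-in, first-out queue, the value dequeued in iteration i is always the one the secondary thread computed for iteration i, so no extra synchronization is needed in this sketch. The bounded queue depth of 8 is an arbitrary choice for the example.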
