批处理DOACROSS循环并行化算法

2015 International Conference on High Performance Computing & Simulation (HPCS) Pub Date : 2015-07-20 DOI:10.1109/HPCSim.2015.7237079

D. C. S. Lucas, G. Araújo

{"title":"批处理DOACROSS循环并行化算法","authors":"D. C. S. Lucas, G. Araújo","doi":"10.1109/HPCSim.2015.7237079","DOIUrl":null,"url":null,"abstract":"Parallelizing loops containing loop-carried dependencies has been considered a very difficult task, mainly due to the overhead imposed by communicating dependencies between iterations. Despite the huge effort to devise effective parallelization techniques for such loops, the problem is still far from solved. For many loops, old (DOACROSS), and new (DSWP) techniques have not been able to offer a solution to this problem. This paper does a qualitative and quantitative analysis of synchronization costs of these two loop parallelization algorithms, on two modern computer architectures (ARM A9 MPCore and Intel Ivy Bridge). Our results show that at least 30% of the execution time of the programs we parallelized are spent on synchronization/data communication. We also show that, besides the problem being hard, these architectures are on opposite endpoints along the axis of commonly accepted requisites for efficient loop parallelization. As a consequence, both techniques struggle to effectively speed up several programs. Moreover, this paper presents a novel algorithm, called Batched DOACROSS (BDX), that capitalizes on the advantages of DSWP and DOACROSS, while minimizing their deficiencies. BDX does not require new hardware mechanisms (as DSWP does) and makes use of thread local buffers to reduce DOACROSS synchronization overheads.","PeriodicalId":134009,"journal":{"name":"2015 International Conference on High Performance Computing & Simulation (HPCS)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-07-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":"{\"title\":\"The Batched DOACROSS loop parallelization algorithm\",\"authors\":\"D. C. S. Lucas, G. Araújo\",\"doi\":\"10.1109/HPCSim.2015.7237079\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Parallelizing loops containing loop-carried dependencies has been considered a very difficult task, mainly due to the overhead imposed by communicating dependencies between iterations. Despite the huge effort to devise effective parallelization techniques for such loops, the problem is still far from solved. For many loops, old (DOACROSS), and new (DSWP) techniques have not been able to offer a solution to this problem. This paper does a qualitative and quantitative analysis of synchronization costs of these two loop parallelization algorithms, on two modern computer architectures (ARM A9 MPCore and Intel Ivy Bridge). Our results show that at least 30% of the execution time of the programs we parallelized are spent on synchronization/data communication. We also show that, besides the problem being hard, these architectures are on opposite endpoints along the axis of commonly accepted requisites for efficient loop parallelization. As a consequence, both techniques struggle to effectively speed up several programs. Moreover, this paper presents a novel algorithm, called Batched DOACROSS (BDX), that capitalizes on the advantages of DSWP and DOACROSS, while minimizing their deficiencies. BDX does not require new hardware mechanisms (as DSWP does) and makes use of thread local buffers to reduce DOACROSS synchronization overheads.\",\"PeriodicalId\":134009,\"journal\":{\"name\":\"2015 International Conference on High Performance Computing & Simulation (HPCS)\",\"volume\":\"4 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-07-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"8\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 International Conference on High Performance Computing & Simulation (HPCS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/HPCSim.2015.7237079\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 International Conference on High Performance Computing & Simulation (HPCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/HPCSim.2015.7237079","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 8

摘要

并行化包含循环携带的依赖项的循环被认为是一项非常困难的任务，主要是由于在迭代之间通信依赖项所带来的开销。尽管为这种循环设计有效的并行化技术付出了巨大的努力，但这个问题仍然远远没有解决。对于许多循环，旧的(DOACROSS)和新的(DSWP)技术都不能解决这个问题。本文在两种现代计算机体系结构(ARM A9 MPCore和Intel Ivy Bridge)上对这两种循环并行化算法的同步成本进行了定性和定量分析。我们的结果表明，我们并行化的程序至少有30%的执行时间花在同步/数据通信上。我们还表明，除了困难的问题之外，这些体系结构沿着普遍接受的有效循环并行化必要条件轴的相反端点。因此，这两种技术都难以有效地提高几个程序的速度。此外，本文提出了一种新的算法，称为Batched DOACROSS (BDX)，它利用了DSWP和DOACROSS的优点，同时最大限度地减少了它们的不足。BDX不需要新的硬件机制(与DSWP不同)，并且利用线程本地缓冲区来减少DOACROSS同步开销。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

The Batched DOACROSS loop parallelization algorithm

Parallelizing loops containing loop-carried dependencies has been considered a very difficult task, mainly due to the overhead imposed by communicating dependencies between iterations. Despite the huge effort to devise effective parallelization techniques for such loops, the problem is still far from solved. For many loops, old (DOACROSS), and new (DSWP) techniques have not been able to offer a solution to this problem. This paper does a qualitative and quantitative analysis of synchronization costs of these two loop parallelization algorithms, on two modern computer architectures (ARM A9 MPCore and Intel Ivy Bridge). Our results show that at least 30% of the execution time of the programs we parallelized are spent on synchronization/data communication. We also show that, besides the problem being hard, these architectures are on opposite endpoints along the axis of commonly accepted requisites for efficient loop parallelization. As a consequence, both techniques struggle to effectively speed up several programs. Moreover, this paper presents a novel algorithm, called Batched DOACROSS (BDX), that capitalizes on the advantages of DSWP and DOACROSS, while minimizing their deficiencies. BDX does not require new hardware mechanisms (as DSWP does) and makes use of thread local buffers to reduce DOACROSS synchronization overheads.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2015 International Conference on High Performance Computing & Simulation (HPCS)

自引率

0.00%

发文量