Reducing Coherence Overhead of Barrier Synchronization in Software DSMs

Proceedings of the IEEE/ACM SC98 Conference. Pub Date: 1998-11-07. DOI: 10.1109/SC.1998.10029

Jae Bum Lee, C. Jhon


Citations: 10

Abstract



Software Distributed Shared Memory (SDSM) systems usually have a large coherence granularity, imposed by the underlying virtual memory page size. To reduce coherence overheads, such as the network traffic needed to preserve coherence and the page misses caused by false sharing, relaxed memory models are widely adopted in SDSM systems. Under these models, when a shared page is modified, invalidation requests to the other copies are deferred until a synchronization point and are then sent only to the processor acquiring the synchronization variable. At a barrier, however, the invalidation requests must be sent to all processors participating in the barrier. Barriers therefore tend to induce heavy network traffic and may also cause useless page misses through false sharing. In this paper, we propose a method to reduce the coherence overhead of barrier synchronization in shared-memory parallel programs. The method performs static analysis to examine data dependencies between processors across global barriers and then inserts special primitives into the program to exploit the dependency information at run time. The static analysis identifies code regions in which a processor modifies data that will be used by only some of the other processors. At run time, the inserted primitives ensure that coherence messages for that data are sent only to those processors. In particular, if the modified data will not be used by any other processor, the primitives ensure that the coherence messages are delivered only to the master processor, when the parallel execution of the program finishes. We evaluated the method on a 16-node software DSM system supporting the AURC protocol, using program-driven simulation with five benchmark programs: Jacobi, Red-black SOR, Expl, LU, and Water-nsquared. For these applications, the experimental results show that our method reduces coherence messages by up to about 98% and improves execution time by up to about 26%.
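The difference between broadcasting invalidations at a barrier and directing them by dependency information can be illustrated with a small sketch. This is not the authors' implementation or the AURC protocol: the function names, the `consumers` map (standing in for the output of the static analysis), and the one-notice-per-page-per-destination message count are all assumptions made for illustration.

```python
from typing import Dict, Optional, Set, Tuple

Notices = Dict[int, Set[int]]  # destination processor -> pages to invalidate


def broadcast_notices(writer: int, num_procs: int, dirty_pages: Set[int]) -> Notices:
    """Baseline barrier: every other processor receives every invalidation."""
    return {p: set(dirty_pages) for p in range(num_procs) if p != writer}


def directed_notices(
    writer: int,
    num_procs: int,
    dirty_pages: Set[int],
    consumers: Dict[int, Optional[Set[int]]],
) -> Tuple[Notices, Set[int]]:
    """Hypothetical dependency-directed barrier.

    consumers[page] is the set of processors that the (assumed) static
    analysis says will use the page after the barrier. An empty set means
    "no other processor uses it"; a missing entry or None means "unknown",
    in which case we conservatively fall back to broadcasting.
    Returns (notices sent at the barrier, pages deferred to the master
    processor until parallel execution finishes).
    """
    at_barrier: Notices = {}
    deferred: Set[int] = set()
    everyone_else = {p for p in range(num_procs) if p != writer}
    for page in sorted(dirty_pages):
        known = consumers.get(page)
        readers = everyone_else if known is None else (known - {writer})
        if not readers:
            deferred.add(page)  # nobody needs it before the program ends
            continue
        for p in readers:
            at_barrier.setdefault(p, set()).add(page)
    return at_barrier, deferred


def count_messages(notices: Notices) -> int:
    """Count one coherence message per (destination, page) pair (an assumption)."""
    return sum(len(pages) for pages in notices.values())


if __name__ == "__main__":
    # Illustrative nearest-neighbour pattern: processor 5 of 16 modifies three
    # pages; page 100 is used by processor 4, page 101 by processor 6, and
    # page 102 by no other processor.
    dirty = {100, 101, 102}
    deps = {100: {4}, 101: {6}, 102: set()}
    baseline = broadcast_notices(5, 16, dirty)
    directed, deferred = directed_notices(5, 16, dirty, deps)
    print(count_messages(baseline), count_messages(directed), sorted(deferred))
```

In this made-up nearest-neighbour case the broadcast barrier sends a notice to each of the 15 other processors for each page, while the directed variant sends notices only to the two neighbours and defers the unshared page. The numbers are purely illustrative and are not the paper's measurements.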
