Multicore Cache Coherence Control by a Parallelizing Compiler

H. Kasahara, K. Kimura, B. Adhi, Yuhei Hosokawa, Yohei Kishimoto, M. Mase
Published in: 2017 IEEE 41st Annual Computer Software and Applications Conference (COMPSAC), pp. 492-497. Publication date: 2017-09-07. DOI: 10.1109/COMPSAC.2017.174 (https://doi.org/10.1109/COMPSAC.2017.174)
Citations: 6

Abstract

Recent advances in multicore technology have made processors with hundreds or thousands of cores feasible. On such processors, however, an efficient hardware cache coherence scheme becomes very complex and expensive to develop. This paper proposes a parallelizing-compiler-directed software coherence scheme for shared-memory multicore systems without hardware cache coherence control. The general idea is that an automatic parallelizing compiler analyzes the control dependencies and data dependencies among the coarse-grain tasks in a program. Based on this information, it performs task parallelization, false sharing detection, and data restructuring to prevent false sharing. The compiler then inserts cache control code to handle the stale data problem. The proposed method is built on the OSCAR automatic parallelizing compiler and evaluated on the Renesas RP2, a processor with 8 SH-4A cores. Hardware cache coherence on the RP2 is available only for up to 4 cores, and it can be turned off completely to run in non-coherent cache mode. Performance is evaluated using 10 benchmark programs from SPEC2000, SPEC2006, the NAS Parallel Benchmarks (NPB), and MediaBench II. The proposed method performs as well as or better than the hardware cache coherence scheme. For example, relative to 1 core, 4 cores with the hardware coherence mechanism gave speedups of 2.52 times for SPEC2000 "equake", 2.9 times for SPEC2006 "lbm", 3.34 times for NPB "cg", and 3.17 times for the MediaBench II MPEG2 Encoder. The proposed software cache coherence control gave speedups of 2.63 times on 4 cores and 4.37 times on 8 cores for "equake", 3.28 times on 4 cores and 4.76 times on 8 cores for "lbm", and 3.71 times on 4 cores and 4.92 times on 8 cores for the MPEG2 Encoder.
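The two compiler actions named in the abstract, data restructuring against false sharing and cache control code against stale data, can be illustrated with a minimal C sketch. This is not the code OSCAR generates. The 32-byte line size, the type `padded_sum_t`, and the functions `cache_writeback`, `cache_invalidate`, `producer_task`, and `consumer_task` are all hypothetical names and values chosen for illustration. The cache operations are stubs that only count calls, so the sketch runs on any host; on a non-coherent target they would map to the platform's cache-control instructions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ASSUMPTION: cache line size in bytes; the abstract does not state it. */
#define CACHE_LINE 32

/* HYPOTHETICAL stand-ins for platform cache operations. They only count
 * calls; real non-coherent hardware would write back or invalidate lines. */
int writebacks = 0;
int invalidates = 0;

void cache_writeback(const void *addr, size_t len) {
    (void)addr; (void)len;
    writebacks++;
}

void cache_invalidate(const void *addr, size_t len) {
    (void)addr; (void)len;
    invalidates++;
}

/* Data restructuring: each core's result is aligned and padded to a full
 * cache line, so two cores never write into the same line. */
typedef struct {
    _Alignas(CACHE_LINE) double sum;
} padded_sum_t;

/* Index of the cache line that holds address p. */
size_t line_index(const void *p) {
    return (size_t)((uintptr_t)p / CACHE_LINE);
}

/* Producer task (runs on one core): compute a partial sum, then write the
 * result back so that it reaches shared memory. */
void producer_task(padded_sum_t *slot, const double *data, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        s += data[i];
    }
    slot->sum = s;
    cache_writeback(slot, sizeof *slot);
}

/* Consumer task (runs on another core, after the producers): invalidate
 * possibly stale local copies before reading the producers' results. */
double consumer_task(const padded_sum_t *slots, size_t n_cores) {
    assert(n_cores > 0);
    cache_invalidate(slots, n_cores * sizeof *slots);
    double total = 0.0;
    for (size_t i = 0; i < n_cores; i++) {
        total += slots[i].sum;
    }
    return total;
}
```

In this sketch, a writeback follows each producer's last write and a single invalidate precedes the consumer's first read. That ordering is what a dependence analysis between tasks would justify. Synchronization between the tasks is omitted and would be needed on real hardware.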