预执行编译器算法的设计与评估

ASPLOS X Pub Date : 2002-10-01 DOI:10.1145/605397.605415

Dongkeun Kim, D. Yeung

{"title":"预执行编译器算法的设计与评估","authors":"Dongkeun Kim, D. Yeung","doi":"10.1145/605397.605415","DOIUrl":null,"url":null,"abstract":"Pre-execution is a promising latency tolerance technique that uses one or more helper threads running in spare hardware contexts ahead of the main computation to trigger long-latency memory operations early, hence absorbing their latency on behalf of the main computation. This paper investigates a source-to-source C compiler for extracting pre-execution thread code automatically, thus relieving the programmer or hardware from this onerous task. At the heart of our compiler are three algorithms. First, program slicing removes non-critical code for computing cache-missing memory references, reducing pre-execution overhead. Second, prefetch conversion replaces blocking memory references with non-blocking prefetch instructions to minimize pre-execution thread stalls. Finally, threading scheme selection chooses the best scheme for initiating pre-execution threads, speculatively parallelizing loops to generate thread-level parallelism when necessary for latency tolerance. We prototyped our algorithms using the Stanford University Intermediate Format (SUIF) framework and a publicly available program slicer, called Unravel [13], and we evaluated our compiler on a detailed architectural simulator of an SMT processor. Our results show compiler-based pre-execution improves the performance of 9 out of 13 applications, reducing execution time by 22.7%. Across all 13 applications, our technique delivers an average speedup of 17.0%. These performance gains are achieved fully automatically on conventional SMT hardware, with only minimal modifications to support pre-execution threads.","PeriodicalId":377379,"journal":{"name":"ASPLOS X","volume":"50 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2002-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"121","resultStr":"{\"title\":\"Design and evaluation of compiler algorithms for pre-execution\",\"authors\":\"Dongkeun Kim, D. Yeung\",\"doi\":\"10.1145/605397.605415\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Pre-execution is a promising latency tolerance technique that uses one or more helper threads running in spare hardware contexts ahead of the main computation to trigger long-latency memory operations early, hence absorbing their latency on behalf of the main computation. This paper investigates a source-to-source C compiler for extracting pre-execution thread code automatically, thus relieving the programmer or hardware from this onerous task. At the heart of our compiler are three algorithms. First, program slicing removes non-critical code for computing cache-missing memory references, reducing pre-execution overhead. Second, prefetch conversion replaces blocking memory references with non-blocking prefetch instructions to minimize pre-execution thread stalls. Finally, threading scheme selection chooses the best scheme for initiating pre-execution threads, speculatively parallelizing loops to generate thread-level parallelism when necessary for latency tolerance. We prototyped our algorithms using the Stanford University Intermediate Format (SUIF) framework and a publicly available program slicer, called Unravel [13], and we evaluated our compiler on a detailed architectural simulator of an SMT processor. Our results show compiler-based pre-execution improves the performance of 9 out of 13 applications, reducing execution time by 22.7%. Across all 13 applications, our technique delivers an average speedup of 17.0%. These performance gains are achieved fully automatically on conventional SMT hardware, with only minimal modifications to support pre-execution threads.\",\"PeriodicalId\":377379,\"journal\":{\"name\":\"ASPLOS X\",\"volume\":\"50 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2002-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"121\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ASPLOS X\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/605397.605415\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ASPLOS X","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/605397.605415","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 121

摘要

预执行是一种很有前途的延迟容忍技术，它使用在主计算之前在备用硬件上下文中运行的一个或多个辅助线程来提前触发长延迟内存操作，从而代表主计算吸收它们的延迟。本文研究了一个源到源的C编译器，用于自动提取预执行线程代码，从而将程序员或硬件从繁重的任务中解脱出来。编译器的核心是三个算法。首先，程序切片删除了用于计算缺少缓存的内存引用的非关键代码，减少了预执行开销。其次，预取转换用非阻塞的预取指令替换阻塞的内存引用，以最小化预执行线程的停顿。最后，线程方案选择为启动预执行线程选择最佳方案，在需要时推测并行循环以生成线程级别的并行性，以实现延迟容忍。我们使用斯坦福大学中间格式(SUIF)框架和一个公开可用的程序切片器(称为Unravel)对算法进行了原型设计[13]，并在SMT处理器的详细架构模拟器上评估了我们的编译器。我们的结果显示，基于编译器的预执行提高了13个应用程序中的9个的性能，减少了22.7%的执行时间。在所有13个应用程序中，我们的技术提供了17.0%的平均加速。这些性能提升是在传统的SMT硬件上完全自动实现的，只需要很少的修改就可以支持预执行线程。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Design and evaluation of compiler algorithms for pre-execution

Pre-execution is a promising latency tolerance technique that uses one or more helper threads running in spare hardware contexts ahead of the main computation to trigger long-latency memory operations early, hence absorbing their latency on behalf of the main computation. This paper investigates a source-to-source C compiler for extracting pre-execution thread code automatically, thus relieving the programmer or hardware from this onerous task. At the heart of our compiler are three algorithms. First, program slicing removes non-critical code for computing cache-missing memory references, reducing pre-execution overhead. Second, prefetch conversion replaces blocking memory references with non-blocking prefetch instructions to minimize pre-execution thread stalls. Finally, threading scheme selection chooses the best scheme for initiating pre-execution threads, speculatively parallelizing loops to generate thread-level parallelism when necessary for latency tolerance. We prototyped our algorithms using the Stanford University Intermediate Format (SUIF) framework and a publicly available program slicer, called Unravel [13], and we evaluated our compiler on a detailed architectural simulator of an SMT processor. Our results show compiler-based pre-execution improves the performance of 9 out of 13 applications, reducing execution time by 22.7%. Across all 13 applications, our technique delivers an average speedup of 17.0%. These performance gains are achieved fully automatically on conventional SMT hardware, with only minimal modifications to support pre-execution threads.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

ASPLOS X

自引率

0.00%

发文量