{"title":"Design and evaluation of compiler algorithms for pre-execution","authors":"Dongkeun Kim, D. Yeung","doi":"10.1145/605397.605415","DOIUrl":null,"url":null,"abstract":"Pre-execution is a promising latency tolerance technique that uses one or more helper threads running in spare hardware contexts ahead of the main computation to trigger long-latency memory operations early, hence absorbing their latency on behalf of the main computation. This paper investigates a source-to-source C compiler for extracting pre-execution thread code automatically, thus relieving the programmer or hardware from this onerous task. At the heart of our compiler are three algorithms. First, program slicing removes non-critical code for computing cache-missing memory references, reducing pre-execution overhead. Second, prefetch conversion replaces blocking memory references with non-blocking prefetch instructions to minimize pre-execution thread stalls. Finally, threading scheme selection chooses the best scheme for initiating pre-execution threads, speculatively parallelizing loops to generate thread-level parallelism when necessary for latency tolerance. We prototyped our algorithms using the Stanford University Intermediate Format (SUIF) framework and a publicly available program slicer, called Unravel [13], and we evaluated our compiler on a detailed architectural simulator of an SMT processor. Our results show compiler-based pre-execution improves the performance of 9 out of 13 applications, reducing execution time by 22.7%. Across all 13 applications, our technique delivers an average speedup of 17.0%. These performance gains are achieved fully automatically on conventional SMT hardware, with only minimal modifications to support pre-execution threads.","PeriodicalId":377379,"journal":{"name":"ASPLOS X","volume":"50 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2002-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"121","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ASPLOS X","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/605397.605415","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 121
Abstract
Pre-execution is a promising latency tolerance technique that uses one or more helper threads running in spare hardware contexts ahead of the main computation to trigger long-latency memory operations early, hence absorbing their latency on behalf of the main computation. This paper investigates a source-to-source C compiler for extracting pre-execution thread code automatically, thus relieving the programmer or hardware from this onerous task. At the heart of our compiler are three algorithms. First, program slicing removes non-critical code for computing cache-missing memory references, reducing pre-execution overhead. Second, prefetch conversion replaces blocking memory references with non-blocking prefetch instructions to minimize pre-execution thread stalls. Finally, threading scheme selection chooses the best scheme for initiating pre-execution threads, speculatively parallelizing loops to generate thread-level parallelism when necessary for latency tolerance. We prototyped our algorithms using the Stanford University Intermediate Format (SUIF) framework and a publicly available program slicer, called Unravel [13], and we evaluated our compiler on a detailed architectural simulator of an SMT processor. Our results show compiler-based pre-execution improves the performance of 9 out of 13 applications, reducing execution time by 22.7%. Across all 13 applications, our technique delivers an average speedup of 17.0%. These performance gains are achieved fully automatically on conventional SMT hardware, with only minimal modifications to support pre-execution threads.