{"title":"Tyche: An Efficient and General Prefetcher for Indirect Memory Accesses","authors":"Feng Xue, Chenji Han, Xinyu Li, Junliang Wu, Tingting Zhang, Tianyi Liu, Yifan Hao, Zidong Du, Qi Guo, Fuxin Zhang","doi":"10.1145/3641853","DOIUrl":null,"url":null,"abstract":"<p>Indirect memory accesses (IMAs, i.e., <i>A</i>[<i>f</i>(<i>B</i>[<i>i</i>])]) are typical memory access patterns in applications such as graph analysis, machine learning, and database. IMAs are composed of producer-consumer pairs, where the consumers’ memory addresses are derived from the producers’ memory data. Due to the built-in value-dependent feature, IMAs exhibit poor locality, making prefetching ineffective. Hindered by the challenges of recording the potentially complex graphs of instruction dependencies among IMA producers and consumers, current state-of-the-art hardware prefetchers either <b>(a)</b> exhibit inadequate IMA identification abilities or <b>(b)</b> rely on the run-ahead mechanism to prefetch IMAs intermittently and insufficiently. </p><p>To solve this problem, we propose Tyche<sup>1</sup>, an efficient and general hardware prefetcher to enhance IMA performance. Tyche adopts a bilateral propagation mechanism to precisely excavate the instruction dependencies in simple chains with moderate length (rather than complex graphs). Based on the exact instruction dependencies, Tyche can accurately identify various IMA patterns, including nonlinear ones, and generate accurate prefetching requests continuously. Evaluated on broad benchmarks, Tyche achieves an average performance speedup of 16.2% over the state-of-the-art spatial prefetcher Berti. More importantly, Tyche outperforms the state-of-the-art IMA prefetchers IMP, Gretch, and Vector Runahead, by 15.9%, 12.8%, and 10.7%, respectively, with a lower storage overhead of only 0.57KB.</p>","PeriodicalId":50920,"journal":{"name":"ACM Transactions on Architecture and Code Optimization","volume":"7 1","pages":""},"PeriodicalIF":1.5000,"publicationDate":"2024-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Architecture and Code Optimization","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1145/3641853","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}
Abstract
Indirect memory accesses (IMAs, i.e., A[f(B[i])]) are typical memory access patterns in applications such as graph analysis, machine learning, and databases. IMAs are composed of producer-consumer pairs, where the consumers’ memory addresses are derived from the producers’ memory data. Because of this inherent value dependence, IMAs exhibit poor locality, which renders conventional prefetching ineffective. Hindered by the challenge of recording the potentially complex graphs of instruction dependencies among IMA producers and consumers, current state-of-the-art hardware prefetchers either (a) exhibit inadequate IMA identification abilities or (b) rely on the run-ahead mechanism to prefetch IMAs intermittently and insufficiently.
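The C snippet below is an illustrative example (not taken from the paper) of the A[f(B[i])] pattern described above; the array names and the transform f() are hypothetical. It shows why the consumer load is hard to prefetch: its address is only known once the producer's data arrives.

```c
#include <stddef.h>

/* Hypothetical address-forming transform applied to the produced value. */
static inline size_t f(size_t v) { return v * 2; }

/* A typical indirect memory access (IMA): the producer load B[i] feeds the
 * address of the consumer load A[f(B[i])]. */
double sum_indirect(const double *A, const size_t *B, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        size_t idx = B[i];   /* producer: sequential, easy to prefetch        */
        sum += A[f(idx)];    /* consumer: value-dependent, poor locality      */
    }
    return sum;
}
```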
To solve this problem, we propose Tyche, an efficient and general hardware prefetcher that enhances IMA performance. Tyche adopts a bilateral propagation mechanism to precisely extract instruction dependencies as simple chains of moderate length (rather than complex graphs). Based on these exact instruction dependencies, Tyche accurately identifies various IMA patterns, including nonlinear ones, and continuously generates accurate prefetch requests. Evaluated on a broad set of benchmarks, Tyche achieves an average speedup of 16.2% over the state-of-the-art spatial prefetcher Berti. More importantly, Tyche outperforms the state-of-the-art IMA prefetchers IMP, Gretch, and Vector Runahead by 15.9%, 12.8%, and 10.7%, respectively, with a lower storage overhead of only 0.57 KB.
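As a rough software analogue of the idea above (not Tyche's actual hardware design), the sketch below assumes the recorded producer-to-consumer dependency chain has been reduced to a simple affine form, base + (value << shift), which covers the common A[B[i]] case; the struct and field names are hypothetical. It only illustrates how a dependency-aware prefetcher turns a producer load's data into a consumer prefetch address once the chain is known.

```c
#include <stdint.h>

/* Learned summary of one producer->consumer dependency chain (illustrative). */
typedef struct {
    uint64_t consumer_base;  /* base address of the consumer array            */
    unsigned shift;          /* log2 of the consumer element size             */
} ima_chain_t;

/* Invoked when the producer load's data becomes available: compute the
 * consumer address and issue a prefetch for it. */
static inline uint64_t ima_prefetch_addr(const ima_chain_t *c, uint64_t producer_value) {
    return c->consumer_base + (producer_value << c->shift);
}
```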
Journal description:
ACM Transactions on Architecture and Code Optimization (TACO) focuses on hardware, software, and system research spanning the fields of computer architecture and code optimization. Articles that appear in TACO will either present new techniques and concepts or report on experiences and experiments with actual systems. Insights useful to architects, hardware or software developers, designers, builders, and users will be emphasized.