Using Runahead Execution to Hide Memory Latency in High Level Synthesis

2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). Publication date: 2017-04-01. DOI: 10.1109/FCCM.2017.33

Shane T. Fleming, David B. Thomas


Citations: 6

Abstract

Reads and writes to global data in off-chip RAM can limit the performance achieved with high-level synthesis (HLS) tools, as each access takes multiple cycles and usually blocks progress in the application state machine. This can be mitigated with data prefetchers, which hide access latency by predicting the next memory access and loading it into a cache before it is required. Unfortunately, current prefetchers are only useful for memory accesses with known regular patterns, such as walking arrays, and are ineffective for accesses that follow irregular patterns over application-specific data structures. In this work, we demonstrate prefetchers that are tailor-made for applications, even those with irregular memory accesses. This is achieved through program slicing, a static analysis technique that extracts the memory structure of the input code and automatically constructs an application-specific prefetcher. Both our analysis and tool are fully automated and implemented as a new compiler flag in LegUp, an open-source HLS tool. We also create a theoretical model showing that speedup must be between 1x and 2x, and we evaluate five benchmarks, achieving an average speedup of 1.38x with an average resource overhead of 1.15x.
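To make the idea of slicing for prefetch more concrete, here is a minimal hand-written sketch in C. It is not the LegUp pass or the authors' generated hardware: the node type, the function names (`sum_squares`, `address_slice`, `overlap_speedup`) and the list helpers are all hypothetical, chosen only for illustration. `sum_squares` is an ordinary kernel over a pointer-linked list, an access pattern a stride prefetcher cannot predict. `address_slice` keeps only the statements that decide which addresses are visited, which is roughly what a slice aimed at memory accesses would retain. `overlap_speedup` encodes one idealized assumption of ours (compute and memory time overlap perfectly) that happens to give a speedup between 1x and 2x; the paper's own theoretical model may be formulated differently.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical node type: an irregular, pointer-linked data structure. */
typedef struct node {
    int value;
    struct node *next;
} node_t;

/* Release every node of a list built by build_list. */
void free_list(node_t *head) {
    while (head != NULL) {
        node_t *next = head->next;
        free(head);
        head = next;
    }
}

/* Build a list holding values[0..count-1] in order; returns NULL on
 * allocation failure or when count is 0. */
node_t *build_list(const int *values, size_t count) {
    node_t *head = NULL;
    for (size_t i = count; i > 0; i--) {
        node_t *n = malloc(sizeof *n);
        if (n == NULL) {
            free_list(head);
            return NULL;
        }
        n->value = values[i - 1];
        n->next = head;
        head = n;
    }
    return head;
}

/* Original kernel: address generation (n = n->next) is interleaved with
 * the payload computation, so each load stalls the arithmetic. */
long sum_squares(const node_t *head) {
    long total = 0;
    for (const node_t *n = head; n != NULL; n = n->next) {
        total += (long)n->value * n->value;
    }
    return total;
}

/* Hand-written stand-in for a slice: only the statements that determine
 * which addresses are touched survive; the arithmetic is dropped.
 * Returns the number of nodes whose address would be requested. */
size_t address_slice(const node_t *head) {
    size_t touched = 0;
    for (const node_t *n = head; n != NULL; n = n->next) {
        touched++; /* placeholder for issuing a prefetch of *n */
    }
    return touched;
}

/* Idealized bound (our assumption, not the paper's model): a blocking
 * design takes compute + memory time, a perfectly overlapped one takes
 * max(compute, memory), so the ratio lies in [1, 2]. */
double overlap_speedup(double compute, double memory) {
    double slower = compute > memory ? compute : memory;
    if (slower <= 0.0) {
        return 1.0; /* no work at all: nothing to speed up */
    }
    return (compute + memory) / slower;
}
```

Under this assumption the best case is when compute and memory time are equal, giving exactly 2x, and the benefit shrinks towards 1x as either side dominates. In the slice, the loop carries no arithmetic, so a unit running it can move ahead of the main kernel and request nodes before they are needed.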


