Attention, Distillation, and Tabularization: Towards Practical Neural Network-Based Prefetching

Pengmiao Zhang, Neelesh Gupta, Rajgopal Kannan, Viktor K. Prasanna
{"title":"Attention, Distillation, and Tabularization: Towards Practical Neural Network-Based Prefetching","authors":"Pengmiao Zhang, Neelesh Gupta, Rajgopal Kannan, Viktor K. Prasanna","doi":"arxiv-2401.06362","DOIUrl":null,"url":null,"abstract":"Attention-based Neural Networks (NN) have demonstrated their effectiveness in\naccurate memory access prediction, an essential step in data prefetching.\nHowever, the substantial computational overheads associated with these models\nresult in high inference latency, limiting their feasibility as practical\nprefetchers. To close the gap, we propose a new approach based on\ntabularization that significantly reduces model complexity and inference\nlatency without sacrificing prediction accuracy. Our novel tabularization\nmethodology takes as input a distilled, yet highly accurate attention-based\nmodel for memory access prediction and efficiently converts its expensive\nmatrix multiplications into a hierarchy of fast table lookups. As an exemplar\nof the above approach, we develop DART, a prefetcher comprised of a simple\nhierarchy of tables. With a modest 0.09 drop in F1-score, DART reduces 99.99%\nof arithmetic operations from the large attention-based model and 91.83% from\nthe distilled model. DART accelerates the large model inference by 170x and the\ndistilled model by 9.4x. DART has comparable latency and storage costs as\nstate-of-the-art rule-based prefetcher BO but surpasses it by 6.1% in IPC\nimprovement, resulting in a 37.6% speed-up. DART outperforms state-of-the-art\nNN-based prefetchers TransFetch by 33.1% and Voyager by 37.2% in terms of IPC\nimprovement, primarily due to its low prefetching latency.","PeriodicalId":501333,"journal":{"name":"arXiv - CS - Operating Systems","volume":"27 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2023-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Operating Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2401.06362","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Attention-based Neural Networks (NN) have demonstrated their effectiveness in accurate memory access prediction, an essential step in data prefetching. However, the substantial computational overheads associated with these models result in high inference latency, limiting their feasibility as practical prefetchers. To close the gap, we propose a new approach based on tabularization that significantly reduces model complexity and inference latency without sacrificing prediction accuracy. Our novel tabularization methodology takes as input a distilled, yet highly accurate, attention-based model for memory access prediction and efficiently converts its expensive matrix multiplications into a hierarchy of fast table lookups. As an exemplar of this approach, we develop DART, a prefetcher comprising a simple hierarchy of tables. With a modest 0.09 drop in F1-score, DART eliminates 99.99% of the arithmetic operations of the large attention-based model and 91.83% of those of the distilled model. DART accelerates inference by 170x over the large model and by 9.4x over the distilled model. DART has latency and storage costs comparable to the state-of-the-art rule-based prefetcher BO but surpasses it by 6.1% in IPC improvement, resulting in a 37.6% speed-up. DART outperforms the state-of-the-art NN-based prefetchers TransFetch and Voyager by 33.1% and 37.2%, respectively, in IPC improvement, primarily due to its low prefetching latency.
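The central idea in the abstract is replacing the attention model's matrix multiplications with fast table lookups. As a rough illustration of how such a substitution can work in general, the sketch below approximates a single linear layer with a product-quantization-style codebook and precomputed partial-product tables. The layer sizes, the codebook construction, and the nearest-prototype encoder are all illustrative assumptions, not DART's actual design, which is defined in the paper itself.

```python
# Hedged sketch (not the DART implementation): a generic product-quantization-style
# illustration of converting a matrix multiplication y = W @ x into table lookups.
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_OUT = 64, 32      # hypothetical layer dimensions
N_SUB = 8                 # number of input subspaces
N_CODES = 16              # prototypes (codewords) per subspace
SUB_DIM = D_IN // N_SUB

W = rng.standard_normal((D_OUT, D_IN))
W_sub = W.reshape(D_OUT, N_SUB, SUB_DIM)   # view W as per-subspace column blocks

# Stand-in codebook: prototype sub-vectors for each subspace. A real system
# would learn these offline from representative inputs (e.g., with k-means);
# random prototypes are used here only so the sketch runs end to end.
prototypes = rng.standard_normal((N_SUB, N_CODES, SUB_DIM))

# Offline precomputation: tables[s, c] = W[:, block s] @ prototypes[s, c]
tables = np.einsum("ond,ncd->nco", W_sub, prototypes)   # (N_SUB, N_CODES, D_OUT)

def tabularized_matvec(x):
    """Approximate W @ x using nearest-prototype encoding plus table lookups."""
    x_sub = x.reshape(N_SUB, SUB_DIM)
    y = np.zeros(D_OUT)
    for s in range(N_SUB):
        # Encode: pick the closest prototype for this input block. Practical
        # tabularization schemes use cheaper encoders (e.g., hashing or trees)
        # so that even this step avoids multiplications.
        code = np.argmin(np.linalg.norm(prototypes[s] - x_sub[s], axis=1))
        # Look up and accumulate the precomputed partial result.
        y += tables[s, code]
    return y

x = rng.standard_normal(D_IN)
exact = W @ x
approx = tabularized_matvec(x)
print("relative error:", np.linalg.norm(exact - approx) / np.linalg.norm(exact))
```

The point of the offline precomputation is that the online cost per input block collapses to one encoding step plus one table read and accumulate, which is what allows a hierarchy of tables to stand in for dense arithmetic at inference time.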