MIREncoder: Multi-modal IR-based Pretrained Embeddings for Performance Optimizations

Akash Dutta, Ali Jannesari
{"title":"MIREncoder:基于预训练嵌入的多模态红外编码器的性能优化","authors":"Akash Dutta, Ali Jannesari","doi":"arxiv-2407.02238","DOIUrl":null,"url":null,"abstract":"One of the primary areas of interest in High Performance Computing is the\nimprovement of performance of parallel workloads. Nowadays, compilable source\ncode-based optimization tasks that employ deep learning often exploit LLVM\nIntermediate Representations (IRs) for extracting features from source code.\nMost such works target specific tasks, or are designed with a pre-defined set\nof heuristics. So far, pre-trained models are rare in this domain, but the\npossibilities have been widely discussed. Especially approaches mimicking\nlarge-language models (LLMs) have been proposed. But these have prohibitively\nlarge training costs. In this paper, we propose MIREncoder, a M}ulti-modal\nIR-based Auto-Encoder that can be pre-trained to generate a learned embedding\nspace to be used for downstream tasks by machine learning-based approaches. A\nmulti-modal approach enables us to better extract features from compilable\nprograms. It allows us to better model code syntax, semantics and structure.\nFor code-based performance optimizations, these features are very important\nwhile making optimization decisions. A pre-trained model/embedding implicitly\nenables the usage of transfer learning, and helps move away from task-specific\ntrained models. Additionally, a pre-trained model used for downstream\nperformance optimization should itself have reduced overhead, and be easily\nusable. These considerations have led us to propose a modeling approach that i)\nunderstands code semantics and structure, ii) enables use of transfer learning,\nand iii) is small and simple enough to be easily re-purposed or reused even\nwith low resource availability. Our evaluations will show that our proposed\napproach can outperform the state of the art while reducing overhead.","PeriodicalId":501291,"journal":{"name":"arXiv - CS - Performance","volume":"29 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MIREncoder: Multi-modal IR-based Pretrained Embeddings for Performance Optimizations\",\"authors\":\"Akash Dutta, Ali Jannesari\",\"doi\":\"arxiv-2407.02238\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"One of the primary areas of interest in High Performance Computing is the\\nimprovement of performance of parallel workloads. Nowadays, compilable source\\ncode-based optimization tasks that employ deep learning often exploit LLVM\\nIntermediate Representations (IRs) for extracting features from source code.\\nMost such works target specific tasks, or are designed with a pre-defined set\\nof heuristics. So far, pre-trained models are rare in this domain, but the\\npossibilities have been widely discussed. Especially approaches mimicking\\nlarge-language models (LLMs) have been proposed. But these have prohibitively\\nlarge training costs. In this paper, we propose MIREncoder, a M}ulti-modal\\nIR-based Auto-Encoder that can be pre-trained to generate a learned embedding\\nspace to be used for downstream tasks by machine learning-based approaches. A\\nmulti-modal approach enables us to better extract features from compilable\\nprograms. It allows us to better model code syntax, semantics and structure.\\nFor code-based performance optimizations, these features are very important\\nwhile making optimization decisions. 
A pre-trained model/embedding implicitly\\nenables the usage of transfer learning, and helps move away from task-specific\\ntrained models. Additionally, a pre-trained model used for downstream\\nperformance optimization should itself have reduced overhead, and be easily\\nusable. These considerations have led us to propose a modeling approach that i)\\nunderstands code semantics and structure, ii) enables use of transfer learning,\\nand iii) is small and simple enough to be easily re-purposed or reused even\\nwith low resource availability. Our evaluations will show that our proposed\\napproach can outperform the state of the art while reducing overhead.\",\"PeriodicalId\":501291,\"journal\":{\"name\":\"arXiv - CS - Performance\",\"volume\":\"29 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-07-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Performance\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2407.02238\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Performance","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.02238","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

One of the primary areas of interest in High Performance Computing is improving the performance of parallel workloads. Nowadays, source code-based optimization tasks that employ deep learning often exploit LLVM Intermediate Representations (IRs) to extract features from compilable source code. Most such works target specific tasks or are designed with a pre-defined set of heuristics. So far, pre-trained models are rare in this domain, but the possibilities have been widely discussed; in particular, approaches mimicking large language models (LLMs) have been proposed, but these have prohibitively large training costs. In this paper, we propose MIREncoder, a Multi-modal IR-based Auto-Encoder that can be pre-trained to generate a learned embedding space to be used for downstream tasks by machine learning-based approaches. A multi-modal approach enables us to better extract features from compilable programs and to better model code syntax, semantics, and structure; for code-based performance optimizations, these features are very important when making optimization decisions. A pre-trained model/embedding implicitly enables the use of transfer learning and helps move away from task-specific trained models. Additionally, a pre-trained model used for downstream performance optimization should itself have low overhead and be easy to use. These considerations have led us to propose a modeling approach that i) understands code semantics and structure, ii) enables the use of transfer learning, and iii) is small and simple enough to be easily re-purposed or reused even with low resource availability. Our evaluations show that the proposed approach can outperform the state of the art while reducing overhead.
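As a rough illustration of the workflow the abstract describes, the sketch below lowers a source file to LLVM IR with clang, encodes it with a pretrained model, and feeds the resulting embedding to a small downstream predictor, i.e., the transfer-learning pattern of reusing one learned embedding space across optimization tasks. The encoder loader (`load_mirencoder`), checkpoint name, `encode` method, and the `DownstreamHead` dimensions are hypothetical placeholders, not MIREncoder's actual API; only the clang invocation is standard.

```python
# Hypothetical sketch: source code -> LLVM IR -> pretrained embedding -> downstream predictor.
# The encoder interface and checkpoint below are illustrative assumptions, not the paper's API.
import subprocess
import torch
import torch.nn as nn


def lower_to_llvm_ir(src_path: str, ir_path: str) -> str:
    """Emit textual LLVM IR for a C/C++ source file using clang."""
    subprocess.run(
        ["clang", "-S", "-emit-llvm", "-O1", src_path, "-o", ir_path],
        check=True,
    )
    with open(ir_path) as f:
        return f.read()


class DownstreamHead(nn.Module):
    """Small task-specific head trained on top of frozen pretrained embeddings."""

    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(emb)


# Usage (placeholder encoder, kept as comments because the real loader is not specified here):
# encoder = load_mirencoder("mirencoder.ckpt")        # hypothetical pretrained model
# ir_text = lower_to_llvm_ir("kernel.c", "kernel.ll")
# emb = encoder.encode(ir_text)                       # assumed to return a 1-D float tensor
# head = DownstreamHead(embed_dim=emb.shape[-1], num_classes=4)
# logits = head(emb)                                  # e.g., scores over candidate thread counts
```

Freezing the pretrained encoder and training only a small head per task is one common way to realize the transfer-learning benefit the abstract points to; the paper's actual downstream setup may differ.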