An empirical study on divergence of differently-sourced LLVM IRs

IF 1.7 · CAS Tier 3 (Computer Science) · JCR Q3, COMPUTER SCIENCE, SOFTWARE ENGINEERING
Zhenzhou Tian, Yuchen Gong, Chenhao Chang, Jiaze Sun, Yanping Chen, Lingwei Chen
{"title":"An empirical study on divergence of differently-sourced LLVM IRs","authors":"Zhenzhou Tian ,&nbsp;Yuchen Gong ,&nbsp;Chenhao Chang ,&nbsp;Jiaze Sun ,&nbsp;Yanping Chen ,&nbsp;Lingwei Chen","doi":"10.1016/j.cola.2024.101289","DOIUrl":null,"url":null,"abstract":"<div><p>In solving binary code similarity detection, many approaches choose to operate on certain unified intermediate representations (IRs), such as Low Level Virtual Machine (LLVM) IR, to overcome the cross-architecture analysis challenge induced by the significant morphological and syntactic gaps across the diverse instruction set architectures (ISAs). However, the LLVM IRs of the same program can be affected by diverse factors, such as the acquisition source, i.e., compiled from source code or disassembled and lifted from binary code. While the impact of compilation settings on binary code has been explored, the specific differences between LLVM IRs from varied sources remain underexamined. To this end, we pioneer an in-depth empirical study to assess the discrepancies in LLVM IRs derived from different sources. Correspondingly, an extensive dataset containing nearly 98 million LLVM IR instructions distributed in 808,431 functions is curated with respect to these potential IR-influential factors. On this basis, three types of code metrics detailing the syntactic, structural, and semantic aspects of the IR samples are devised and leveraged to assess the divergence of the IRs across different origins. The findings offer insights into how and to what extent the various factors affect the IRs, providing valuable guidance for assembling a training corpus aimed at developing robust LLVM IR-oriented pre-training models, as well as facilitating relevant program analysis studies that operate on the LLVM IRs.</p></div>","PeriodicalId":48552,"journal":{"name":"Journal of Computer Languages","volume":"81 ","pages":"Article 101289"},"PeriodicalIF":1.7000,"publicationDate":"2024-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Computer Languages","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2590118424000327","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}
Citations: 0

Abstract

In solving binary code similarity detection, many approaches choose to operate on certain unified intermediate representations (IRs), such as Low Level Virtual Machine (LLVM) IR, to overcome the cross-architecture analysis challenge induced by the significant morphological and syntactic gaps across the diverse instruction set architectures (ISAs). However, the LLVM IRs of the same program can be affected by diverse factors, such as the acquisition source, i.e., compiled from source code or disassembled and lifted from binary code. While the impact of compilation settings on binary code has been explored, the specific differences between LLVM IRs from varied sources remain underexamined. To this end, we pioneer an in-depth empirical study to assess the discrepancies in LLVM IRs derived from different sources. Correspondingly, an extensive dataset containing nearly 98 million LLVM IR instructions distributed in 808,431 functions is curated with respect to these potential IR-influential factors. On this basis, three types of code metrics detailing the syntactic, structural, and semantic aspects of the IR samples are devised and leveraged to assess the divergence of the IRs across different origins. The findings offer insights into how and to what extent the various factors affect the IRs, providing valuable guidance for assembling a training corpus aimed at developing robust LLVM IR-oriented pre-training models, as well as facilitating relevant program analysis studies that operate on the LLVM IRs.
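The abstract contrasts two acquisition routes for LLVM IR of the same program: compiling the source directly, versus compiling to a binary and lifting it back to IR. The sketch below is a minimal illustration of that setup together with one toy syntactic comparison (an opcode histogram similarity); it is not the paper's pipeline or its actual metrics. The `clang` invocations are standard, but `binary-to-llvm-lifter` is a hypothetical placeholder for a real lifter such as RetDec or llvm-mctoll (whose actual command lines differ), and `example.c` is an assumed input file.

```python
# Minimal sketch (not the paper's tooling): obtain LLVM IR for the same program
# from two sources and compare them with a toy syntactic metric.
import math
import re
import subprocess
from collections import Counter
from pathlib import Path


def ir_from_source(c_file: str, out_ll: str = "from_source.ll") -> str:
    """Source-derived IR: compile C directly to textual LLVM IR."""
    subprocess.run(["clang", "-S", "-emit-llvm", "-O2", c_file, "-o", out_ll], check=True)
    return out_ll


def ir_from_binary(c_file: str, out_ll: str = "from_binary.ll") -> str:
    """Binary-derived IR: compile to a binary first, then lift it back to LLVM IR.
    `binary-to-llvm-lifter` is a hypothetical placeholder for a real lifter
    (e.g., RetDec or llvm-mctoll); the real tools use different command lines."""
    subprocess.run(["clang", "-O2", c_file, "-o", "a.out"], check=True)
    subprocess.run(["binary-to-llvm-lifter", "a.out", "-o", out_ll], check=True)
    return out_ll


def opcode_histogram(ll_path: str) -> Counter:
    """Toy syntactic metric: count a few common LLVM opcodes in the textual IR."""
    opcodes = ("load", "store", "br", "call", "icmp", "add", "sub", "mul",
               "getelementptr", "phi", "ret", "alloca")
    pattern = re.compile(r"\b(" + "|".join(opcodes) + r")\b")
    hist = Counter()
    for line in Path(ll_path).read_text().splitlines():
        match = pattern.search(line)
        if match:
            hist[match.group(1)] += 1
    return hist


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two opcode histograms; 1.0 means identical profiles."""
    keys = set(a) | set(b)
    dot = sum(a[k] * b[k] for k in keys)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    src_ir = ir_from_source("example.c")   # assumed input file
    bin_ir = ir_from_binary("example.c")
    sim = cosine_similarity(opcode_histogram(src_ir), opcode_histogram(bin_ir))
    print(f"Opcode-level similarity between source- and binary-derived IR: {sim:.3f}")
```

Even such a crude opcode-level comparison would be expected to reveal some divergence between the two IRs; the paper quantifies this gap far more systematically with its syntactic, structural, and semantic metrics over the curated dataset.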

Source journal
Journal of Computer Languages (Computer Science - Computer Networks and Communications)
CiteScore: 5.00 · Self-citation rate: 13.60% · Annual article count: 36