通过图形对齐改进使用表示学习的跨平台二进制分析

Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis Pub Date : 2022-07-18 DOI:10.1145/3533767.3534383

Geunwoo Kim, Sanghyun Hong, M. Franz, Dokyung Song

{"title":"通过图形对齐改进使用表示学习的跨平台二进制分析","authors":"Geunwoo Kim, Sanghyun Hong, M. Franz, Dokyung Song","doi":"10.1145/3533767.3534383","DOIUrl":null,"url":null,"abstract":"Cross-platform binary analysis requires a common representation of binaries across platforms, on which a specific analysis can be performed. Recent work proposed to learn low-dimensional, numeric vector representations (i.e., embeddings) of disassembled binary code, and perform binary analysis in the embedding space. Unfortunately, however, existing techniques fall short in that they are either (i) specific to a single platform producing embeddings not aligned across platforms, or (ii) not designed to capture the rich contextual information available in a disassembled binary. We present a novel deep learning-based method, XBA, which addresses the aforementioned problems. To this end, we first abstract binaries as typed graphs, dubbed binary disassembly graphs (BDGs), which encode control-flow and other rich contextual information of different entities found in a disassembled binary, including basic blocks, external functions called, and string literals referenced. We then formulate binary code representation learning as a graph alignment problem, i.e., finding the node correspondences between BDGs extracted from two binaries compiled for different platforms. XBA uses graph convolutional networks to learn the semantics of each node, (i) using its rich contextual information encoded in the BDG, and (ii) aligning its embeddings across platforms. Our formulation allows XBA to learn semantic alignments between two BDGs in a semi-supervised manner, requiring only a limited number of node pairs be aligned across platforms for training. Our evaluation shows that XBA can learn semantically-rich embeddings of binaries aligned across platforms without apriori platform-specific knowledge. By training our model only with 50% of the oracle alignments, XBA was able to predict, on average, 75% of the rest. Our case studies further show that the learned embeddings encode knowledge useful for cross-platform binary analysis.","PeriodicalId":412271,"journal":{"name":"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"349 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Improving cross-platform binary analysis using representation learning via graph alignment\",\"authors\":\"Geunwoo Kim, Sanghyun Hong, M. Franz, Dokyung Song\",\"doi\":\"10.1145/3533767.3534383\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Cross-platform binary analysis requires a common representation of binaries across platforms, on which a specific analysis can be performed. Recent work proposed to learn low-dimensional, numeric vector representations (i.e., embeddings) of disassembled binary code, and perform binary analysis in the embedding space. Unfortunately, however, existing techniques fall short in that they are either (i) specific to a single platform producing embeddings not aligned across platforms, or (ii) not designed to capture the rich contextual information available in a disassembled binary. We present a novel deep learning-based method, XBA, which addresses the aforementioned problems. To this end, we first abstract binaries as typed graphs, dubbed binary disassembly graphs (BDGs), which encode control-flow and other rich contextual information of different entities found in a disassembled binary, including basic blocks, external functions called, and string literals referenced. We then formulate binary code representation learning as a graph alignment problem, i.e., finding the node correspondences between BDGs extracted from two binaries compiled for different platforms. XBA uses graph convolutional networks to learn the semantics of each node, (i) using its rich contextual information encoded in the BDG, and (ii) aligning its embeddings across platforms. Our formulation allows XBA to learn semantic alignments between two BDGs in a semi-supervised manner, requiring only a limited number of node pairs be aligned across platforms for training. Our evaluation shows that XBA can learn semantically-rich embeddings of binaries aligned across platforms without apriori platform-specific knowledge. By training our model only with 50% of the oracle alignments, XBA was able to predict, on average, 75% of the rest. Our case studies further show that the learned embeddings encode knowledge useful for cross-platform binary analysis.\",\"PeriodicalId\":412271,\"journal\":{\"name\":\"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis\",\"volume\":\"349 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-07-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3533767.3534383\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3533767.3534383","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 5

摘要

跨平台二进制分析需要跨平台二进制文件的通用表示，可以在其上执行特定的分析。最近的研究提出学习反汇编二进制代码的低维数字向量表示(即嵌入)，并在嵌入空间中进行二进制分析。然而，不幸的是，现有的技术存在不足，因为它们要么(i)特定于单个平台，产生的嵌入没有跨平台对齐，要么(ii)没有设计用于捕获反汇编二进制文件中可用的丰富上下文信息。我们提出了一种新的基于深度学习的方法，XBA，它解决了上述问题。为此，我们首先将二进制文件抽象为类型化图，称为二进制反汇编图(bdg)，它对反汇编二进制文件中发现的不同实体的控制流和其他丰富的上下文信息进行编码，包括基本块、调用的外部函数和引用的字符串字面量。然后，我们将二进制代码表示学习表述为一个图对齐问题，即从为不同平台编译的两个二进制文件中提取bdg之间的节点对应关系。XBA使用图卷积网络来学习每个节点的语义，(i)使用BDG中编码的丰富上下文信息，以及(ii)跨平台对齐其嵌入。我们的公式允许XBA以半监督的方式学习两个bdg之间的语义对齐，只需要跨平台对齐有限数量的节点对进行训练。我们的评估表明，XBA可以学习语义丰富的跨平台二进制文件嵌入，而无需先验的特定于平台的知识。通过只训练我们的模型50%的oracle对齐，XBA平均能够预测其余75%的对齐。我们的案例研究进一步表明，学习到的嵌入编码知识对跨平台二进制分析有用。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Improving cross-platform binary analysis using representation learning via graph alignment

Cross-platform binary analysis requires a common representation of binaries across platforms, on which a specific analysis can be performed. Recent work proposed to learn low-dimensional, numeric vector representations (i.e., embeddings) of disassembled binary code, and perform binary analysis in the embedding space. Unfortunately, however, existing techniques fall short in that they are either (i) specific to a single platform producing embeddings not aligned across platforms, or (ii) not designed to capture the rich contextual information available in a disassembled binary. We present a novel deep learning-based method, XBA, which addresses the aforementioned problems. To this end, we first abstract binaries as typed graphs, dubbed binary disassembly graphs (BDGs), which encode control-flow and other rich contextual information of different entities found in a disassembled binary, including basic blocks, external functions called, and string literals referenced. We then formulate binary code representation learning as a graph alignment problem, i.e., finding the node correspondences between BDGs extracted from two binaries compiled for different platforms. XBA uses graph convolutional networks to learn the semantics of each node, (i) using its rich contextual information encoded in the BDG, and (ii) aligning its embeddings across platforms. Our formulation allows XBA to learn semantic alignments between two BDGs in a semi-supervised manner, requiring only a limited number of node pairs be aligned across platforms for training. Our evaluation shows that XBA can learn semantically-rich embeddings of binaries aligned across platforms without apriori platform-specific knowledge. By training our model only with 50% of the oracle alignments, XBA was able to predict, on average, 75% of the rest. Our case studies further show that the learned embeddings encode knowledge useful for cross-platform binary analysis.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis

自引率

0.00%

发文量