Distributed Representation for Assembly Code

IF: 2.6 | JCR: Q2 | Category: Computer Science, Interdisciplinary Applications

Journal: Computers | Pub Date: 2023-11-01 | DOI: 10.3390/computers12110222

Kazuki Yoshida, Kaiyu Suzuki, Tomofumi Matsuzawa



Abstract

In recent years, the number of similar software products with many common parts has been increasing due to the reuse and plagiarism of source code in the software development process. Pattern matching, an existing method for detecting similarity, cannot detect the similarities between these software products and other programs. It is necessary, for example, to detect similarities based on commonalities in both functionality and control structures. At the same time, detailed software analysis requires manual reverse engineering. Therefore, technologies that automatically identify, in advance, similarities among the large amounts of code present in software products can reduce these loads. In this paper, we propose a representation learning model that extracts feature expressions from assembly code, obtained by statically analyzing such code, to determine the similarity between software products. We use assembly code to eliminate the dependence on the availability of source code and on differences in development language. The proposed approach makes use of Asm2Vec, an existing method that can generate a vector representation capturing the semantics of assembly code. The proposed method also incorporates information on the program control structure. The control structure can be represented as graph data. Thus, we use graph embedding, a method for representing graphs as vectors, to generate a representation vector that reflects both the semantics and the control structure of the assembly code. In our experiments, we generated representation vectors from multiple programs and used clustering to verify how accurately the approach classifies similar programs into the same cluster. The proposed method outperforms existing methods that consider only semantics, in both accuracy and execution time.
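The idea of combining per-block semantic vectors with control-flow structure can be sketched roughly as follows. This is an illustrative sketch only, not the authors' model. It assumes each basic block already has a semantic vector (hand-written toy values stand in for Asm2Vec output here), and it swaps in a simple successor-averaging step followed by mean pooling as the graph embedding. The names `propagate`, `embed_program`, and `cosine` are hypothetical.

```python
import math


def propagate(block_vecs, cfg_edges, rounds=2):
    """Blend each block's vector with the mean of its CFG successors' vectors.

    block_vecs: dict mapping block name -> list of floats (assumed semantic vector)
    cfg_edges:  dict mapping block name -> list of successor block names
    """
    vecs = {name: list(vec) for name, vec in block_vecs.items()}
    for _ in range(rounds):
        updated = {}
        for name, vec in vecs.items():
            succs = cfg_edges.get(name, [])
            if not succs:
                # Blocks without successors keep their vector unchanged.
                updated[name] = list(vec)
                continue
            dim = len(vec)
            neigh = [sum(vecs[s][i] for s in succs) / len(succs) for i in range(dim)]
            updated[name] = [0.5 * a + 0.5 * b for a, b in zip(vec, neigh)]
        vecs = updated
    return vecs


def embed_program(block_vecs, cfg_edges, rounds=2):
    """Mean-pool the propagated block vectors into one program-level vector."""
    vecs = propagate(block_vecs, cfg_edges, rounds)
    dim = len(next(iter(vecs.values())))
    return [sum(v[i] for v in vecs.values()) / len(vecs) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy data: identical block semantics, different control flow (all values assumed).
blocks = {"entry": [1.0, 0.0], "loop": [0.0, 1.0], "exit": [1.0, 1.0]}
cfg_loop = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}
cfg_straight = {"entry": ["exit"], "loop": [], "exit": []}

vec_loop = embed_program(blocks, cfg_loop)
vec_straight = embed_program(blocks, cfg_straight)
similarity = cosine(vec_loop, vec_straight)
```

Because the two toy programs share block vectors but differ in their edges, their program-level vectors differ, which is the property a semantics-only representation would miss. Pairwise similarities like `similarity` could then be passed to any standard clustering routine.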


