Distributed Representation for Assembly Code

IF 2.6 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS
Kazuki Yoshida, Kaiyu Suzuki, Tomofumi Matsuzawa
Computers, published 2023-11-01. DOI: https://doi.org/10.3390/computers12110222

Abstract

In recent years, the number of similar software products that share many common parts has been increasing because source code is reused and plagiarized during software development. Pattern matching, an existing method for detecting similarity, cannot detect the similarities between these software products and other programs. It is necessary, for example, to detect similarities based on commonalities in both functionality and control structures. At the same time, detailed software analysis requires manual reverse engineering. Therefore, technologies that automatically identify, in advance, similarities among the large amounts of code present in software products can reduce these loads. In this paper, we propose a representation learning model that extracts feature expressions from assembly code obtained by statically analyzing such code, in order to determine the similarity between software products. We use assembly code to eliminate the dependence on the existence of source code and on differences in development language. The proposed approach makes use of Asm2Vec, an existing method capable of generating a vector representation that captures the semantics of assembly code. The proposed method also incorporates information on the program control structure. The control structure can be represented by graph data. Thus, we use graph embedding, a graph vector representation method, to generate a representation vector that reflects both the semantics and the control structure of the assembly code. In our experiments, we generated expression vectors from multiple programs and used clustering to verify the accuracy of the approach in classifying similar programs into the same cluster. The proposed method outperforms existing methods that consider only semantics, in both accuracy and execution time.
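
The pipeline outlined in the abstract (a semantic vector, a control-structure vector, then clustering) can be illustrated with a toy sketch. This is not the authors' implementation. The opcode-frequency vector below is only a stand-in for Asm2Vec, the out-degree histogram is only a stand-in for a graph embedding, and the greedy threshold grouping is only a stand-in for the clustering step. All function names, the opcode list, the threshold, and the sample programs are hypothetical and chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical opcode vocabulary for the toy semantic vector.
OPCODES = ["mov", "add", "sub", "cmp", "jmp", "jne", "call", "ret"]

def block_vector(block):
    """Toy semantic vector of one basic block: normalized opcode counts.
    (Stand-in for a learned embedding such as Asm2Vec.)"""
    counts = Counter(ins.split()[0] for ins in block)
    total = sum(counts.values()) or 1
    return [counts[op] / total for op in OPCODES]

def structure_vector(edges, n_blocks, max_degree=3):
    """Toy structural vector: histogram of out-degrees in the control-flow
    graph. (Stand-in for a graph embedding.)"""
    out_deg = Counter(src for src, _ in edges)
    hist = [0.0] * (max_degree + 1)
    for b in range(n_blocks):
        hist[min(out_deg[b], max_degree)] += 1.0 / n_blocks
    return hist

def program_vector(blocks, edges):
    """Concatenate the mean block vector with the structural vector."""
    sem = [block_vector(b) for b in blocks]
    mean_sem = [sum(col) / len(sem) for col in zip(*sem)]
    return mean_sem + structure_vector(edges, len(blocks))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster(vectors, threshold=0.9):
    """Greedy grouping: join the first cluster whose first member is at
    least `threshold` similar, otherwise start a new cluster."""
    clusters = []
    for name, vec in vectors.items():
        for members in clusters:
            if cosine(vectors[members[0]], vec) >= threshold:
                members.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Two loops that differ only in registers and constants, plus straight-line code.
loop_edges = [(0, 1), (1, 1), (1, 2)]
programs = {
    "loop_a": ([["mov eax, 0"], ["add eax, 1", "cmp eax, 10", "jne B1"], ["ret"]], loop_edges),
    "loop_b": ([["mov ecx, 0"], ["add ecx, 2", "cmp ecx, 20", "jne B1"], ["ret"]], loop_edges),
    "straight": ([["mov eax, 1", "add eax, 2", "ret"]], []),
}
vectors = {name: program_vector(blocks, edges) for name, (blocks, edges) in programs.items()}
clusters = cluster(vectors)
print(clusters)
```

In this sketch the two loop programs land in the same group because they share both their opcode mix and their control-flow shape, while the straight-line program is separated. A real system would replace each stand-in with a trained model.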
Source journal: Computers (Computer Science, Interdisciplinary Applications). CiteScore: 5.40. Self-citation rate: 3.60%. Articles published: 153. Review time: 11 weeks.