Authorship Identification of Binary and Disassembled Codes Using NLP Methods

Inf. Comput. Pub Date : 2023-06-25 DOI:10.3390/info14070361
Aleksandr Romanov, A. Kurtukova, A. Fedotova, A. Shelupanov
{"title":"Authorship Identification of Binary and Disassembled Codes Using NLP Methods","authors":"Aleksandr Romanov, A. Kurtukova, A. Fedotova, A. Shelupanov","doi":"10.3390/info14070361","DOIUrl":null,"url":null,"abstract":"This article is part of a series aimed at determining the authorship of source codes. Analyzing binary code is a crucial aspect of cybersecurity, software development, and computer forensics, particularly in identifying malware authors. Any program is machine code, which can be disassembled using specialized tools and analyzed for authorship identification, similar to natural language text using Natural Language Processing methods. We propose an ensemble of fastText, support vector machine (SVM), and the authors’ hybrid neural network developed in previous works in this research. The improved methodology was evaluated using a dataset of source codes written in C and C++ languages collected from GitHub and Google Code Jam. The collected source codes were compiled into executable programs and then disassembled using reverse engineering tools. The average accuracy of author identification for disassembled codes using the improved methodology exceeds 0.90. Additionally, the methodology was tested on the source codes, achieving an average accuracy of 0.96 in simple cases and over 0.85 in complex cases. These results validate the effectiveness of the developed methodology and its applicability to solving cybersecurity challenges.","PeriodicalId":13622,"journal":{"name":"Inf. Comput.","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2023-06-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Inf. Comput.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/info14070361","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

This article is part of a series aimed at determining the authorship of source codes. Analyzing binary code is a crucial aspect of cybersecurity, software development, and computer forensics, particularly in identifying malware authors. Any program is machine code, which can be disassembled using specialized tools and analyzed for authorship identification, similar to natural language text using Natural Language Processing methods. We propose an ensemble of fastText, support vector machine (SVM), and the authors’ hybrid neural network developed in previous works in this research. The improved methodology was evaluated using a dataset of source codes written in C and C++ languages collected from GitHub and Google Code Jam. The collected source codes were compiled into executable programs and then disassembled using reverse engineering tools. The average accuracy of author identification for disassembled codes using the improved methodology exceeds 0.90. Additionally, the methodology was tested on the source codes, achieving an average accuracy of 0.96 in simple cases and over 0.85 in complex cases. These results validate the effectiveness of the developed methodology and its applicability to solving cybersecurity challenges.
基于NLP方法的二进制和反汇编码作者识别
本文是旨在确定源代码作者身份的系列文章的一部分。分析二进制代码是网络安全、软件开发和计算机取证的一个重要方面,特别是在识别恶意软件作者方面。任何程序都是机器代码,可以使用专门的工具对其进行反汇编,并分析其作者身份,类似于使用自然语言处理方法的自然语言文本。在本研究中,我们提出了一种集成fastText、支持向量机(SVM)和作者在先前工作中开发的混合神经网络的方法。改进后的方法使用从GitHub和Google Code Jam收集的C和c++语言编写的源代码数据集进行评估。收集到的源代码被编译成可执行程序,然后使用逆向工程工具进行反汇编。采用改进的方法对反汇编代码进行作者识别的平均准确率超过0.90。此外,该方法在源代码上进行了测试,在简单情况下达到0.96的平均精度,在复杂情况下达到0.85以上。这些结果验证了所开发方法的有效性及其在解决网络安全挑战方面的适用性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信