BContext2Name: Naming Functions in Stripped Binaries with Multi-Label Learning and Neural Networks
Bing Xia, Yunxiang Ge, Ruinan Yang, Jiabin Yin, Jianmin Pang, Chongjun Tang
DOI: 10.1109/CSCloud-EdgeCom58631.2023.00037
Journal of Cloud Computing-Advances Systems and Applications, vol. 22, no. 1, pp. 167-172, July 2023 (JCR Q2, Computer Science, Information Systems)
Citations: 0
Abstract
Naming binary functions helps reverse engineers understand the internal workings of code and analyze malicious code without access to the source. However, because stripped binaries lose their debugging information, too little high-level semantic information remains to describe functions when naming them. Meanwhile, existing binary function naming schemes assign only one function label per sample; the resulting long-tail distribution of labels challenges machine-learning-based prediction models. To obtain correlated function labels and improve the propensity scores of uncommon tail labels, we propose BContext2Name, a multi-label learning-based binary function naming model. With the help of the PfastreXML model, it automatically generates relevant labels for binary function naming from function context information. Experimental results show that BContext2Name enriches function labels and alleviates the long-tail effect of single-label samples. To capture the high-level semantics of binary functions, we align pseudocode with basic blocks using disassembly and decompilation, identify concrete or abstract values of API parameters through variable tracking, and construct API-enhanced control flow graphs. Finally, we build a seq2seq neural translation model with an attention mechanism between the function multi-labels and the enhanced control flow graphs. Experiments on our dataset show that BContext2Name improves F1 by 3.55% and 15.23% over the state-of-the-art XFL and Nero models, respectively. This indicates that function multi-label learning can provide accurate labels for binary functions and can help reverse engineers understand the inner workings of binary code. Code and data for this evaluation are available at https://github.com/CSecurityZhongYuan/BContext2Name.
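To make the "API-enhanced control flow graph" step more concrete, here is a minimal sketch of annotating basic blocks with API-call information, marking each argument as concrete (a literal recovered by tracking) or abstract. The block contents, API names, and the literal-detection rule are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: attach API-call annotations to CFG basic blocks.
# The API set and the concrete/abstract test below are assumptions.
CONCRETE_APIS = {"fopen", "memcpy"}

def enhance_cfg(blocks, edges):
    """Build an API-enhanced CFG.

    blocks: {block_id: [instruction strings]}
    edges:  [(src_id, dst_id)] control-flow edges
    Returns {block_id: {"instrs": [...], "apis": [(name, arg, kind)], "succs": [...]}}
    """
    cfg = {}
    for bid, instrs in blocks.items():
        apis = []
        for ins in instrs:
            for api in CONCRETE_APIS:
                if api in ins:
                    # crude stand-in for variable tracking: a string or
                    # numeric literal counts as a concrete value
                    arg = ins.split("(", 1)[1].rstrip(")") if "(" in ins else ""
                    kind = "concrete" if arg.startswith('"') or arg.isdigit() else "abstract"
                    apis.append((api, arg, kind))
        cfg[bid] = {"instrs": instrs, "apis": apis, "succs": []}
    for src, dst in edges:
        cfg[src]["succs"].append(dst)
    return cfg

g = enhance_cfg(
    {0: ['call fopen("log.txt")'], 1: ["call memcpy(dst)"]},
    [(0, 1)],
)
```

A downstream model can then consume the `apis` annotations alongside the block instructions, rather than the raw disassembly alone.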
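The attention mechanism the abstract mentions can be illustrated with a toy scaled dot-product attention over encoder states. This is a generic sketch with hand-picked vectors, not the paper's trained seq2seq architecture.

```python
import math

def attention(query, keys, values):
    """Softmax over query-key similarities, then a weighted sum of values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # context vector: attention-weighted average of the encoder states
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return weights, context

# Two identical keys get equal weight, so the context is the mean of the values.
weights, context = attention([1.0, 0.0],
                             [[1.0, 0.0], [1.0, 0.0]],
                             [[2.0, 0.0], [4.0, 0.0]])
```

In a seq2seq naming model, the keys/values would be encoder states over the enhanced CFG and the query a decoder state; here they are plain lists so the arithmetic is visible.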
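On the long-tail side, extreme multi-label systems such as PfastreXML are typically evaluated with propensity-scored precision@k, where hits on rare (low-propensity) tail labels count more. The sketch below shows the metric's shape; label names and propensity values are invented for illustration.

```python
def psp_at_k(scores, true_labels, propensity, k=2):
    """Propensity-scored precision@k: each correct top-k label is
    weighted by the inverse of its propensity, so rare tail labels
    contribute more than common head labels."""
    topk = sorted(scores, key=scores.get, reverse=True)[:k]
    return sum(1.0 / propensity[label] for label in topk if label in true_labels) / k

# Illustrative data: "crypto_verify" is a rare tail label (low propensity).
scores = {"alloc": 0.9, "init": 0.8, "crypto_verify": 0.7}
true_labels = {"alloc", "crypto_verify"}
propensity = {"alloc": 1.0, "init": 0.5, "crypto_verify": 0.2}
```

With k=2 only the common hit "alloc" is counted; raising k to 3 pulls in the tail hit, whose 1/0.2 weight dominates the score, which is exactly the incentive that counters the long-tail effect.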
Journal description:
The Journal of Cloud Computing: Advances, Systems and Applications (JoCCASA) publishes research articles on all aspects of Cloud Computing. Principally, articles address topics that are core to Cloud Computing, focusing on Cloud applications, Cloud systems, and the advances that will lead to the Clouds of the future. Comprehensive review and survey articles that offer new insights and lay the foundations for further exploratory and experimental work are also relevant.