BLens: Contrastive Captioning of Binary Functions using Ensemble Embedding
Tristan Benoit, Yunru Wang, Moritz Dannehl, Johannes Kinder
arXiv:2409.07889 [cs.LG], 12 September 2024, https://arxiv.org/abs/2409.07889
Abstract
Function names can greatly aid human reverse engineers, which has spurred the development of machine learning-based approaches to predicting function names in stripped binaries. Much current work in this area uses transformers, applying a metaphor of machine translation from code to function names. Still, function naming models face challenges in generalizing to projects completely unrelated to the training set. In this paper, we take a new approach by transferring advances in automated image captioning to the domain of binary reverse engineering, such that different parts of a binary function can be associated with parts of its name. We propose BLens, which combines multiple binary function embeddings into a new ensemble representation, aligns it with the latent space of name representations via a contrastive learning approach, and generates function names with a transformer architecture tailored to function names. Our experiments demonstrate that BLens significantly outperforms the state of the art. In the usual setting, where the dataset is split per binary, we achieve an $F_1$ score of 0.77, compared to 0.67. Moreover, in the cross-project setting, which emphasizes generalizability, we achieve an $F_1$ score of 0.46, compared to 0.29.
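This page contains only the abstract, so the sketch below is not the authors' implementation of BLens. It merely illustrates the general recipe the abstract describes: project several binary function embeddings into one ensemble representation and align it with name embeddings in a shared latent space using a symmetric contrastive (CLIP-style InfoNCE) loss. All class names, dimensions, and the averaging-based ensemble combination are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveAligner(nn.Module):
    """Hypothetical CLIP-style aligner: projects an ensemble of binary-function
    embeddings and a function-name embedding into a shared latent space and
    trains them with a symmetric InfoNCE loss. All dimensions are illustrative,
    not taken from the paper."""

    def __init__(self, func_dims=(128, 256, 768), name_dim=768, latent_dim=256):
        super().__init__()
        # Combine several per-function embeddings (e.g. from different
        # pretrained binary-code models) by projecting each to a common width
        # and averaging -- one simple way to form an "ensemble" representation.
        self.func_projs = nn.ModuleList(nn.Linear(d, latent_dim) for d in func_dims)
        self.name_proj = nn.Linear(name_dim, latent_dim)
        self.log_temp = nn.Parameter(torch.tensor(0.0))  # learnable temperature

    def embed_function(self, func_embs):
        # func_embs: list of tensors, each of shape (batch, d_i)
        z = torch.stack([p(e) for p, e in zip(self.func_projs, func_embs)]).mean(0)
        return F.normalize(z, dim=-1)

    def embed_name(self, name_emb):
        return F.normalize(self.name_proj(name_emb), dim=-1)

    def forward(self, func_embs, name_emb):
        zf = self.embed_function(func_embs)          # (batch, latent)
        zn = self.embed_name(name_emb)               # (batch, latent)
        logits = zf @ zn.t() * self.log_temp.exp()   # pairwise similarities
        labels = torch.arange(zf.size(0), device=zf.device)
        # Symmetric InfoNCE: match each function to its name and vice versa.
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.t(), labels))

# Toy usage with random stand-ins for real function and name embeddings.
batch = 8
model = ContrastiveAligner()
funcs = [torch.randn(batch, d) for d in (128, 256, 768)]
names = torch.randn(batch, 768)
loss = model(funcs, names)
loss.backward()
print(f"contrastive loss: {loss.item():.4f}")
```

Normalizing both projections and scaling similarities by a learnable temperature follows standard contrastive-pretraining practice; a decoder that generates the name tokens from the aligned function representation, as the abstract describes, would sit on top of such an aligner.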