Authors: Qi Song, Yongzheng Zhang, Linshu Ouyang, Yige Chen
Venue: 2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)
Published: 2022-03-01
DOI: 10.48550/arXiv.2203.04472 (https://doi.org/10.48550/arXiv.2203.04472)
Citations: 2
BinMLM: Binary Authorship Verification with Flow-aware Mixture-of-Shared Language Model
Binary authorship analysis is a significant problem in many software engineering applications. In this paper, we formulate a binary authorship verification task that accurately reflects the real-world working process of software forensic experts. The task is to determine, from a small set of support samples, whether an anonymous binary was developed by a specific programmer; the actual developer may not belong to the known candidate set and may instead come from the wild. We propose an effective binary authorship verification framework, BinMLM. BinMLM trains an RNN language model on consecutive opcode traces extracted from the control-flow graph (CFG) to characterize the candidate developers' programming styles. We build a mixture-of-shared architecture with multiple shared encoders and author-specific gate layers, which learns how each developer combines universal programming patterns and alleviates the problem of scarce training data. Through an optimization pipeline of external pre-training, joint training, and fine-tuning, our framework can eliminate additional noise and accurately distill developers' unique styles. Extensive experiments show that BinMLM achieves promising results on the Google Code Jam (GCJ) and Codeforces datasets with different numbers of programmers and support samples. It significantly outperforms baselines built on the state-of-the-art feature set (4.73% to 19.46% improvement) and remains robust in multi-author collaboration scenarios. Furthermore, BinMLM can perform organization-level verification on a real-world APT malware dataset, providing valuable auxiliary information for investigating the group behind an APT attack.
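To make two ideas from the abstract more concrete, the following is a minimal, illustrative Python sketch and not the authors' implementation. It assumes a toy CFG whose blocks and opcodes are invented here, enumerates acyclic entry-to-exit paths as a stand-in for "consecutive opcode traces", and models an author-specific gate as a fixed softmax weighting over the outputs of shared encoders. The names `CFG`, `opcode_traces`, and `mix_shared` are hypothetical, and the real gate layers are presumably learned and may depend on the input.

```python
import math
from typing import Dict, List, Tuple

# Toy CFG (invented for illustration): block id -> (opcodes, successor ids).
CFG: Dict[int, Tuple[List[str], List[int]]] = {
    0: (["push", "mov", "cmp", "jle"], [1, 2]),
    1: (["add", "jmp"], [3]),
    2: (["sub"], [3]),
    3: (["pop", "ret"], []),
}


def opcode_traces(cfg: Dict[int, Tuple[List[str], List[int]]],
                  entry: int = 0) -> List[List[str]]:
    """Follow control flow from the entry block and concatenate the opcodes
    of each acyclic path into one trace (assumed simplification)."""
    traces: List[List[str]] = []

    def walk(block: int, visited: frozenset, prefix: List[str]) -> None:
        opcodes, successors = cfg[block]
        trace = prefix + opcodes
        # Skip successors already on this path so loops terminate.
        unvisited = [s for s in successors if s not in visited]
        if not unvisited:
            traces.append(trace)
            return
        for nxt in unvisited:
            walk(nxt, visited | {nxt}, trace)

    walk(entry, frozenset({entry}), [])
    return traces


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mix_shared(encoder_outputs: List[List[float]],
               gate_logits: List[float]) -> List[float]:
    """Combine shared-encoder outputs with one author's gate weights.
    Each author would own a separate gate_logits vector (hypothetical)."""
    weights = softmax(gate_logits)
    dim = len(encoder_outputs[0])
    return [
        sum(w * out[i] for w, out in zip(weights, encoder_outputs))
        for i in range(dim)
    ]
```

In this sketch every author shares the same encoders and differs only in the gate vector, which is one way to read the claim that the architecture captures each developer's preferred combination of universal patterns while needing few author-specific parameters.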