Comprehensive comparisons of embedding approaches for cryptographic API completion: poster

Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings. Pub Date: 2022-05-21. DOI: 10.1145/3510454.3528645

Ya Xiao, Salman Ahmed, Xinyang Ge, Bimal Viswanath, Na Meng, D. Yao


Citation count: 0

Abstract

In this paper, we conduct a measurement study that comprehensively compares the accuracy of cryptographic API completion under multiple API embedding options. Embedding is the process of automatically learning to represent program elements as low-dimensional vectors. Our measurements aim to uncover how program analysis, token-level embedding, and sequence-level embedding affect cryptographic API completion accuracy. Our findings show that program analysis is necessary even under advanced embedding. Accuracy improves by 36.10% on average when program analysis preprocessing is applied to transform bytecode sequences into API dependence paths. The best accuracy (93.52%) is achieved on API dependence paths with embedding techniques. In contrast, the purely data-driven approach without program analysis achieves only a low accuracy (around 57.60%), even after the powerful sequence-level embedding is applied. Although sequence-level embedding shows a slight accuracy advantage (0.55% on average) over token-level embedding in our basic data split setting, we do not recommend it in that setting because of its expensive training cost. A more pronounced accuracy improvement (5.10%) from sequence-level embedding is observed in the cross-project learning scenario, where task data is insufficient. Hence, we recommend applying sequence-level embedding for cross-project learning with limited task-specific data.
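
To make the notion of "representing program elements as low-dimensional vectors" concrete, the following is a minimal, illustrative sketch of token-level embedding used for API completion. It is not the authors' model: the vocabulary, the dimension `DIM`, the helper names `embed_context` and `rank_next_api`, and the random vectors (standing in for learned weights) are all assumptions made purely for illustration.

```python
import random

# Hypothetical vocabulary of Java cryptographic API calls (illustrative only).
VOCAB = [
    "KeyGenerator.getInstance",
    "KeyGenerator.init",
    "KeyGenerator.generateKey",
    "Cipher.getInstance",
    "Cipher.init",
    "Cipher.doFinal",
]
DIM = 8  # assumed embedding dimension

rng = random.Random(0)
# Token-level embedding table: one vector per API token.
# Random values stand in for weights that would normally be learned.
EMBEDDING = {tok: [rng.uniform(-1, 1) for _ in range(DIM)] for tok in VOCAB}


def embed_context(tokens):
    """Average the token vectors of the preceding API calls."""
    vectors = [EMBEDDING[t] for t in tokens]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def rank_next_api(context_tokens):
    """Rank every vocabulary entry by dot product with the context vector."""
    ctx = embed_context(context_tokens)
    scores = {
        tok: sum(a * b for a, b in zip(ctx, vec))
        for tok, vec in EMBEDDING.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


ranking = rank_next_api(["Cipher.getInstance", "Cipher.init"])
print(ranking[0])  # top-ranked candidate under the random stand-in weights
```

Because the vectors here are random, the ranking carries no meaning; in a trained model the embedding table and the scoring function would be fitted to API sequences or, as the abstract recommends, to API dependence paths produced by program analysis.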

