Comprehensive comparisons of embedding approaches for cryptographic API completion: poster

Ya Xiao, Salman Ahmed, Xinyang Ge, Bimal Viswanath, Na Meng, D. Yao
{"title":"Comprehensive comparisons of embedding approaches for cryptographic API completion: poster","authors":"Ya Xiao, Salman Ahmed, Xinyang Ge, Bimal Viswanath, Na Meng, D. Yao","doi":"10.1145/3510454.3528645","DOIUrl":null,"url":null,"abstract":"In this paper, we conduct a measurement study to comprehensively compare the accuracy of Cryptographic API completion tasks trained with multiple API embedding options. Embedding is the process of automatically learning to represent program elements as low-dimensional vectors. Our measurement aims to uncover the impacts of applying program analysis, token-level embedding, and sequence-level embedding on the Cryptographic API completion accuracies. Our findings show that program analysis is necessary even under advanced embedding. The results show 36.10% accuracy improvement on average when program analysis preprocessing is applied to transfer byte code sequences into API dependence paths. The best accuracy (93.52%) is achieved on API dependence paths with embedding techniques. On the contrary, the pure data-driven approach without program analysis only achieves a low accuracy (around 57.60%), even after the powerful sequence-level embedding is applied. Although sequence-level embedding shows slight accuracy advantages (0.55% on average) over token-level embedding in our basic data split setting, it is not recommended considering its expensive training cost. A more obvious accuracy improvement (5.10%) from sequence-level embedding is observed under the cross-project learning scenario when task data is insufficient. Hence, we recommend applying sequence-level embedding for cross-project learning with limited task-specific data.","PeriodicalId":326006,"journal":{"name":"Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2022-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM/IEEE 44th International Conference on Software Engineering: Companion Proceedings","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3510454.3528645","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

In this paper, we conduct a measurement study to comprehensively compare the accuracy of Cryptographic API completion tasks trained with multiple API embedding options. Embedding is the process of automatically learning to represent program elements as low-dimensional vectors. Our measurement aims to uncover the impacts of applying program analysis, token-level embedding, and sequence-level embedding on the Cryptographic API completion accuracies. Our findings show that program analysis is necessary even under advanced embedding. The results show 36.10% accuracy improvement on average when program analysis preprocessing is applied to transfer byte code sequences into API dependence paths. The best accuracy (93.52%) is achieved on API dependence paths with embedding techniques. On the contrary, the pure data-driven approach without program analysis only achieves a low accuracy (around 57.60%), even after the powerful sequence-level embedding is applied. Although sequence-level embedding shows slight accuracy advantages (0.55% on average) over token-level embedding in our basic data split setting, it is not recommended considering its expensive training cost. A more obvious accuracy improvement (5.10%) from sequence-level embedding is observed under the cross-project learning scenario when task data is insufficient. Hence, we recommend applying sequence-level embedding for cross-project learning with limited task-specific data.
加密API完成的嵌入方法的综合比较:海报
在本文中,我们进行了一项测量研究,以全面比较使用多个API嵌入选项训练的加密API完成任务的准确性。嵌入是自动学习将程序元素表示为低维向量的过程。我们的测量旨在揭示应用程序分析、令牌级嵌入和序列级嵌入对加密API补全精度的影响。我们的研究结果表明,即使在高级嵌入下,程序分析也是必要的。结果表明,应用程序分析预处理将字节码序列转换为API依赖路径,平均精度提高36.10%。采用嵌入技术在API依赖路径上获得了最高的准确率(93.52%)。相反,没有程序分析的纯数据驱动方法即使在强大的序列级嵌入之后,也只能达到较低的精度(约57.60%)。尽管在我们的基本数据分割设置中,序列级嵌入比令牌级嵌入显示出轻微的准确性优势(平均0.55%),但考虑到其昂贵的训练成本,不建议使用它。在任务数据不足的跨项目学习场景下,序列级嵌入的准确率提高更为明显(5.10%)。因此,我们建议将序列级嵌入应用于具有有限任务特定数据的跨项目学习。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信