A Statistical Method for API Usage Learning and API Misuse Violation Finding

Deepak Panda, Piyush Basia, Kushal Nallavolu, Xin Zhong, Harvey P. Siy, Myoungkyu Song

2023 IEEE/ACIS 21st International Conference on Software Engineering Research, Management and Applications (SERA)
Published: 2023-05-23
DOI: 10.1109/SERA57763.2023.10197708
Citations: 0
Abstract
A large corpus of software repositories offers an opportunity to use machine learning (ML) approaches to create new software engineering tools. In this paper, we propose a novel technique that leverages ML approaches to automate software engineering tasks and thereby improve software quality. Our concrete goals are to (1) explore the abundance of predictable, repetitive regularities in such a massive codebase, (2) develop an ML approach for training a statistical model to identify common patterns in software corpora, and (3) use these patterns to statistically detect anomalous, likely buggy program behavior that deviates significantly from the typical patterns. These internal regularities and repetitive properties of software can be captured as patterns, and deviations from them flagged as violations. Such violations can critically affect program behavior, manifesting as bugs, security vulnerabilities, or even program crashes. Our approach focuses on usage patterns of application programming interfaces (APIs). API usage patterns are commonly recurring, representative examples of how real-world applications use APIs in software corpora. These desirable API usage patterns can be learned and used to validate or improve developers' implementations. This paper presents preliminary results in which we use standard cross-entropy and perplexity to measure how surprising a test application is to a statistical model estimated from a software corpus. We are continuing to develop our approach and evaluate its effectiveness, guided by the following research questions: Can our ML models be effectively trained on large code corpora to learn desirable API usage patterns? How does the performance of our ML-based approach compare to state-of-the-art language models for software when learning API usage to detect API misuse violations?
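To illustrate the measurement the abstract describes, the following is a minimal sketch (not the authors' implementation) of scoring a sequence of API-call tokens with cross-entropy and perplexity under a simple add-one-smoothed bigram model; all function names, the bigram choice, and the example token sequences are illustrative assumptions.

```python
import math
from collections import Counter

def train_bigram_model(corpus_tokens):
    """Estimate smoothed bigram probabilities from a training token sequence.

    Returns a function prob(prev, tok) giving P(tok | prev) with
    add-one (Laplace) smoothing over the training vocabulary.
    """
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    vocab_size = len(unigrams)

    def prob(prev, tok):
        return (bigrams[(prev, tok)] + 1) / (unigrams[prev] + vocab_size)

    return prob

def cross_entropy(prob, test_tokens):
    """Average negative log2 probability of the test sequence under the model."""
    pairs = list(zip(test_tokens, test_tokens[1:]))
    return sum(-math.log2(prob(p, t)) for p, t in pairs) / len(pairs)

def perplexity(prob, test_tokens):
    """2 raised to the cross-entropy: how 'surprised' the model is on average."""
    return 2 ** cross_entropy(prob, test_tokens)

# Hypothetical example: a corpus where the API is always used as
# open -> read -> close, versus a test sequence that violates that order.
corpus = ["open", "read", "close"] * 50
model = train_bigram_model(corpus)
typical = ["open", "read", "close", "open", "read", "close"]
anomalous = ["open", "close", "read", "open", "close", "read"]
```

Under this sketch, the anomalous sequence yields a higher perplexity than the typical one, which is the signal a misuse detector would threshold on; the paper's actual models are trained on far larger corpora than a toy bigram model.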