A Statistical Method for API Usage Learning and API Misuse Violation Finding

Deepak Panda, Piyush Basia, Kushal Nallavolu, Xin Zhong, Harvey P. Siy, Myoungkyu Song

2023 IEEE/ACIS 21st International Conference on Software Engineering Research, Management and Applications (SERA)
Published: 2023-05-23
DOI: 10.1109/SERA57763.2023.10197708
Citations: 0
Abstract
A large corpus of software repositories offers an opportunity to use machine learning (ML) approaches to create new software engineering tools. In this paper, we propose a novel technique that leverages ML approaches to automate software engineering tasks and thereby improve software quality. Our concrete goals are to (1) explore the abundance of predictable, repetitive regularities in such a massive codebase, (2) develop an ML approach for training a statistical model to identify common patterns in software corpora, and (3) use these patterns to statistically detect anomalous, likely buggy program behavior that deviates significantly from the typical patterns. These internal regularities and repetitive properties of software can be captured as patterns, and deviations from them flagged as violations. Such violations can critically affect program behavior, manifesting as bugs, security vulnerabilities, or even program crashes. Our approach focuses on usage patterns of application programming interfaces (APIs). API usage patterns are commonly recurring, representative examples of how real-world applications use APIs in software corpora. These desirable API usage patterns can be learned and used to validate or improve developers' implementations. This paper presents preliminary results in which we use standard cross-entropy and perplexity to measure how surprising a test application is to a statistical model estimated from a software corpus. We are continuing to develop our approach and evaluate its effectiveness, guided by the following research questions: Can our ML models be effectively trained on large code corpora to learn desirable API usage patterns? How does the performance of our ML-based approach compare to state-of-the-art language models for software when learning API usage to detect API misuse violations?
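To illustrate the measurement the abstract describes, the following is a minimal sketch (not the authors' implementation) of scoring a sequence of API-call tokens with cross-entropy and perplexity under a simple add-one-smoothed bigram model; all function names, the bigram choice, and the example token sequences are illustrative assumptions.

```python
import math
from collections import Counter

def train_bigram_model(corpus_tokens):
    """Estimate smoothed bigram probabilities from a training token sequence.

    Returns a function prob(prev, tok) giving P(tok | prev) with
    add-one (Laplace) smoothing over the training vocabulary.
    """
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip(corpus_tokens, corpus_tokens[1:]))
    vocab_size = len(unigrams)

    def prob(prev, tok):
        return (bigrams[(prev, tok)] + 1) / (unigrams[prev] + vocab_size)

    return prob

def cross_entropy(prob, test_tokens):
    """Average negative log2 probability of the test sequence under the model."""
    pairs = list(zip(test_tokens, test_tokens[1:]))
    return sum(-math.log2(prob(p, t)) for p, t in pairs) / len(pairs)

def perplexity(prob, test_tokens):
    """2 raised to the cross-entropy: how 'surprised' the model is on average."""
    return 2 ** cross_entropy(prob, test_tokens)

# Hypothetical example: a corpus where the API is always used as
# open -> read -> close, versus a test sequence that violates that order.
corpus = ["open", "read", "close"] * 50
model = train_bigram_model(corpus)
typical = ["open", "read", "close", "open", "read", "close"]
anomalous = ["open", "close", "read", "open", "close", "read"]
```

Under this sketch, the anomalous sequence yields a higher perplexity than the typical one, which is the signal a misuse detector would threshold on; the paper's actual models are trained on far larger corpora than a toy bigram model.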