EAGLE: Creating Equivalent Graphs to Test Deep Learning Libraries
Jiannan Wang, Thibaud Lutellier, Shangshu Qian, H. Pham, Lin Tan
2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), May 2022. DOI: 10.1145/3510003.3510165

Abstract: Testing deep learning (DL) software is crucial and challenging. Recent approaches use differential testing to cross-check pairs of implementations of the same functionality across different libraries. Such approaches require two DL libraries that implement the same functionality, which is often unavailable. In addition, they rely on a high-level library, Keras, to implement the missing functionality in all supported DL libraries, which is prohibitively expensive and thus no longer maintained. To address this issue, we propose EAGLE, a new technique that applies differential testing in a different dimension, using equivalent graphs to test a single DL implementation (e.g., a single DL library). Equivalent graphs use different Application Programming Interfaces (APIs), data types, or optimizations to achieve the same functionality. The rationale is that two equivalent graphs executed on a single DL implementation should produce identical output given the same input. Specifically, we design 16 new DL equivalence rules and propose a technique, EAGLE, that (1) uses these equivalence rules to build concrete pairs of equivalent graphs and (2) cross-checks the output of these equivalent graphs to detect inconsistency bugs in a DL library. Our evaluation on two widely used DL libraries, TensorFlow and PyTorch, shows that EAGLE detects 25 bugs (18 in TensorFlow and 7 in PyTorch), including 13 previously unknown bugs.
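EAGLE's actual equivalence rules operate on DL computation graphs; as a minimal stand-in, the cross-checking idea can be sketched in plain Python with two mathematically equivalent softmax implementations (direct and max-shifted) acting as the "equivalent graphs", and a differential oracle that flags any disagreement. The function names and tolerance here are illustrative, not taken from the paper.

```python
import math

def softmax_direct(xs):
    """Naive softmax: exp(x_i) / sum_j exp(x_j)."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_shifted(xs):
    """Equivalent formulation: shift by max(xs) first (numerically stable)."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_check(xs, rel_tol=1e-9):
    """Differential oracle: two equivalent implementations run on the
    same input must agree; a mismatch signals an inconsistency bug."""
    a, b = softmax_direct(xs), softmax_shifted(xs)
    return all(math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b))
```

In EAGLE the two sides would be, e.g., the same layer built via different APIs or data types of one library; the oracle is the same: identical input in, identical output expected.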
REFTY: Refinement Types for Valid Deep Learning Models
Yanjie Gao, Zhengxi Li, Haoxiang Lin, Hongyu Zhang, Ming Wu, Mao Yang
2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), May 2022. DOI: 10.1145/3510003.3510077

Abstract: Deep learning has been increasingly adopted in many application areas. To construct valid deep learning models, developers must conform to certain computational constraints by carefully selecting appropriate neural architectures and hyperparameter values. For example, the kernel size hyperparameter of the 2D convolution operator must not be too large, so that the height and width of the output tensor remain positive. Because model construction is largely manual and lacks the necessary tooling support, it is possible to violate those constraints and introduce type errors in deep learning models, causing either runtime exceptions or wrong output results. In this paper, we propose Refty, a refinement type-based tool for statically checking the validity of deep learning models ahead of job execution. Refty refines each type of deep learning operator with framework-independent logical formulae that describe the computational constraints on both tensors and hyperparameters. Given the neural architecture and hyperparameter domains of a model, Refty visits every operator, generates a set of constraints that the model should satisfy, and utilizes an SMT solver to solve the constraints. We have evaluated Refty on both individual operators and representative real-world models with various hyperparameter values under PyTorch and TensorFlow, and compared it with an existing shape-checking tool. The experimental results show that Refty finds all the type errors and achieves 100% precision and recall, demonstrating its effectiveness.
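The abstract's own example, an overlarge 2D-convolution kernel, comes down to the standard output-shape formula out = floor((in + 2*padding - kernel) / stride) + 1 having to stay positive. Refty encodes such constraints as logical formulae for an SMT solver; the sketch below checks the same constraint directly in Python (function names are illustrative, and the square-kernel simplification is ours).

```python
def conv2d_output_hw(h, w, kernel, stride=1, padding=0):
    """Output spatial size of a 2D convolution (square kernel,
    same stride and padding on both axes, for simplicity)."""
    oh = (h + 2 * padding - kernel) // stride + 1
    ow = (w + 2 * padding - kernel) // stride + 1
    return oh, ow

def conv2d_is_valid(h, w, kernel, stride=1, padding=0):
    """Refinement-style validity check: hyperparameters are well-formed
    and the output height/width remain positive."""
    if kernel <= 0 or stride <= 0 or padding < 0:
        return False
    oh, ow = conv2d_output_hw(h, w, kernel, stride, padding)
    return oh > 0 and ow > 0
```

A 3x3 kernel on a 28x28 input is fine; a 31x31 kernel on the same unpadded input yields a non-positive output dimension and is rejected, which is exactly the class of error Refty flags statically.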
A Universal Data Augmentation Approach for Fault Localization
Huan Xie, Yan Lei, Meng Yan, Yue Yu, Xin Xia, Xiaoguang Mao
2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), May 2022. DOI: 10.1145/3510003.3510136

Abstract: Data is the fuel of models, and fault localization (FL) is no exception. Many existing elaborate FL techniques take the code coverage matrix and failure vector as inputs, expecting to find the correlation between program entities and failures. However, the input data is high-dimensional and extremely imbalanced, since real-world programs are large and failing test cases are far fewer than passing ones, which poses a severe threat to the effectiveness of FL techniques. To overcome these limitations, we propose Aeneas, a universal data augmentation approach that generates synthesized failing test cases from a reduced feature space for more precise fault localization. Specifically, to improve the effectiveness of data augmentation, Aeneas first applies a revised principal component analysis (PCA) to generate a reduced feature space for a more concise representation of the original coverage matrix, which also gains efficiency for data synthesis. Then, Aeneas handles the imbalanced-data issue by generating synthesized failing test cases from the reduced feature space through a conditional variational autoencoder (CVAE). To evaluate the effectiveness of Aeneas, we conduct large-scale experiments on 458 versions of 10 programs (from ManyBugs, SIR, and Defects4J) with six state-of-the-art FL techniques. The experimental results clearly show that Aeneas is statistically more effective than the baselines; e.g., our approach can improve the six original methods by 89% on average under Top-1 accuracy.
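Aeneas's pipeline (revised PCA, then a CVAE) is well beyond a few lines, but the rebalancing idea it serves can be sketched with a deliberately crude stand-in: clone failing coverage vectors with a small random perturbation until the two classes are balanced. Everything here (the bit-flip perturbation, the label strings) is our illustrative simplification, not the paper's method.

```python
import random

def rebalance(coverage, labels, rng=random.Random(0)):
    """Oversample failing test rows until classes are balanced.
    Aeneas synthesizes failing cases with PCA + a conditional VAE;
    this stand-in just clones failing coverage vectors and flips
    one random covered-entity bit in each clone."""
    rows = list(zip(coverage, labels))
    fails = [r for r, l in rows if l == "fail"]
    passes = [r for r, l in rows if l == "pass"]
    synthesized = []
    while len(fails) + len(synthesized) < len(passes):
        base = list(rng.choice(fails))
        i = rng.randrange(len(base))
        base[i] ^= 1  # perturb one 0/1 coverage entry
        synthesized.append(tuple(base))
    return rows + [(s, "fail") for s in synthesized]
```

The point of the reduced feature space in Aeneas is that synthesis like this happens in far fewer dimensions than the raw coverage matrix, which is both cheaper and less noisy.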
Automated Patching for Unreproducible Builds
Zhilei Ren, Shi-lei Sun, J. Xuan, Xiaochen Li, Zhide Zhou, He Jiang
2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), May 2022. DOI: 10.1145/3510003.3510102

Abstract: Software reproducibility plays an essential role in establishing trust between source code and the built artifacts, by comparing compilation outputs acquired from independent users. Although testing for unreproducible builds can be automated, fixing unreproducible build issues poses a set of challenges within the reproducible-builds practice, among which we consider localization granularity and historical-knowledge utilization the most significant. To tackle these challenges, we propose RepFix, a novel approach that combines tracing-based fine-grained localization with history-based patch generation. On the one hand, to tackle the localization-granularity challenge, we adopt system-level dynamic tracing to capture both system call traces and user-space function call information. By integrating kernel probes and user-space probes, we can determine the location of each executed build command more accurately. On the other hand, to tackle the historical-knowledge challenge, we design a similarity-based mechanism for retrieving relevant patches, and generate patches by applying the edit operations of existing patches. With the abundant patches accumulated by the reproducible-builds practice, we can generate patches that fix unreproducible builds automatically. To evaluate the usefulness of RepFix, we conduct extensive experiments on a dataset of 116 real-world packages. With RepFix, we successfully fix the unreproducible build issues of 64 packages. Moreover, we apply RepFix to Arch Linux packages and successfully fix four of them. Two patches have been accepted by the repository, and for one package the patch was pushed to and accepted by its upstream repository, so the fix can also benefit other downstream repositories.
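A classic source of unreproducibility, and a common shape for the historical patches RepFix reuses, is an embedded build timestamp. The reproducible-builds community's conventional fix is to honour the SOURCE_DATE_EPOCH environment variable instead of the wall clock; the sketch below shows that fix pattern in Python (the function name is ours, but SOURCE_DATE_EPOCH itself is the real, documented convention).

```python
import os
import time

def build_stamp():
    """Timestamp to embed in a build artifact.
    Reproducible-builds convention: if SOURCE_DATE_EPOCH is set,
    use it instead of the current time, so two independent builds
    of the same source emit byte-identical stamps."""
    epoch = os.environ.get("SOURCE_DATE_EPOCH")
    t = int(epoch) if epoch is not None else int(time.time())
    return time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(t))
```

With SOURCE_DATE_EPOCH exported (distributions typically derive it from the latest changelog entry), the stamp is a pure function of the source, which is exactly what a reproducibility comparison needs.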
Fault Localization via Efficient Probabilistic Modeling of Program Semantics
Muhan Zeng, Yiqian Wu, Zhen Ye, Yingfei Xiong, Xin Zhang, Lu Zhang
2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), May 2022. DOI: 10.1145/3510003.3510073

Abstract: Testing-based fault localization has been a significant topic in software engineering over the past decades. It localizes a faulty program element based on a set of passing and failing test executions. Since whether a fault can be triggered and detected by a test is related to program semantics, it is crucial to model program semantics in fault localization approaches. Existing approaches either consider the full semantics of the program (e.g., mutation-based fault localization and angelic debugging), leading to scalability issues, or ignore the semantics of the program (e.g., spectrum-based fault localization), leading to imprecise localization results. Our key idea is that by modeling only the correctness of program values, but not their full semantics, a balance can be struck between effectiveness and scalability. To realize this idea, we introduce a probabilistic approach to model program semantics, utilizing information from static analysis and dynamic execution traces. Our approach, SmartFL (SeMantics bAsed pRobabilisTic Fault Localization), is evaluated on a real-world dataset, Defects4J. The top-1 statement-level accuracy of our approach is 21%, the best among state-of-the-art methods. The average time cost is 210 seconds per fault, while existing methods that capture full semantics are often 10x or more slower.
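For contrast with SmartFL's probabilistic model, the spectrum-based baseline it outperforms is tiny: a statement's suspiciousness is a pure function of how many passing/failing tests cover it, with no semantics at all. The widely used Ochiai formula, shown below as a sketch, makes that concrete (parameter names are the standard spectrum abbreviations, not SmartFL's).

```python
import math

def ochiai(ef, ep, nf):
    """Ochiai suspiciousness of one statement.
    ef: failing tests that execute it
    ep: passing tests that execute it
    nf: failing tests that do NOT execute it
    score = ef / sqrt((ef + nf) * (ef + ep))"""
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0
```

A statement covered by every failing test and no passing test scores 1.0; semantics-aware approaches like SmartFL exist precisely because coverage counts alone often rank the true fault below such coincidentally-correlated statements.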
CLEAR: Contrastive Learning for API Recommendation
Moshi Wei, Nima Shiri Harzevili, Yuchao Huang, Junjie Wang, Song Wang
2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), May 2022. DOI: 10.1145/3510003.3510159

Abstract: Automatic API recommendation has been studied for years. There are two orthogonal lines of approaches to this task: information-retrieval-based (IR-based) and neural-based methods. Although these approaches have been reported to achieve remarkable performance, our observation shows that existing approaches can fail for two reasons: 1) most IR-based approaches treat task queries as bags of words and use word embeddings to represent queries, which cannot capture sequential semantic information; 2) both IR-based and neural-based approaches are weak at distinguishing the semantic differences among lexically similar queries. In this paper, we propose CLEAR, which leverages BERT sentence embedding and contrastive learning to tackle these two issues. Specifically, CLEAR embeds the whole sentence of queries and Stack Overflow (SO) posts with a BERT-based model rather than a bag-of-words-based word embedding model, which preserves semantics-related sequential information. In addition, CLEAR uses contrastive learning to train the BERT-based embedding model to learn precise semantic representations of programming terminologies regardless of their lexical form. CLEAR also builds a BERT-based re-ranking model to optimize its recommendation results. Given a query, CLEAR first selects a set of candidate SO posts via BERT sentence-embedding-based similarity to reduce the search space. It then leverages the re-ranking model to rank the candidate SO posts and recommends the APIs from the top-ranked posts. Our experimental results on three different test datasets confirm the effectiveness of CLEAR for both method-level and class-level API recommendation. Compared to state-of-the-art API recommendation approaches, CLEAR improves MAP by 25%-187% at method level and 10%-100% at class level.
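CLEAR's candidate-selection step, take the query embedding, score every SO post embedding by similarity, keep the top k for re-ranking, reduces to cosine similarity over vectors. The sketch below shows that retrieval step with hand-made 2-D vectors standing in for BERT sentence embeddings (the vectors and names are illustrative; real embeddings have hundreds of dimensions).

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def top_candidates(query_vec, post_vecs, k=2):
    """Rank posts by embedding similarity to the query; keep the top k.
    In CLEAR these survivors would then go to the BERT re-ranker."""
    ranked = sorted(post_vecs.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [post for post, _ in ranked[:k]]
```

The two-stage design (cheap similarity filter, then an expensive re-ranker over the few survivors) is a standard retrieve-then-rerank pattern; the contrastive training is what makes lexically similar but semantically different queries land far apart in this space.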
PROPR: Property-Based Automatic Program Repair
Matthías Páll Gissurarson, Leonhard Applis, Annibale Panichella, A. Deursen, David Sands
2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), May 2022. DOI: 10.1145/3510003.3510620

Abstract: Automatic program repair (APR) regularly faces the challenge of overfitting patches: patches that pass the test suite but do not actually address the problem when evaluated manually. Currently, overfit detection requires manual inspection or an oracle, making quality control of APR an expensive task. With this work, we introduce properties, in addition to unit tests, to address the problem of overfitting in APR. To that end, we design and implement PROPR, a program repair tool for Haskell that leverages both property-based testing (via QuickCheck) and the rich type system and synthesis offered by the Haskell compiler. We compare the repair ratio, time-to-first-patch, and overfitting ratio when using unit tests, property-based tests, and their combination. Our results show that properties lead to quicker results and a lower overfit ratio than unit tests. The overfit patches that are created provide valuable insight into the underlying problems of the program under repair (e.g., in terms of fault localization or test quality). We consider this step towards fitter, or at least insightful, patches a critical contribution to bringing APR into developer workflows.
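PROPR itself targets Haskell and QuickCheck, but the way a property catches an overfit patch can be sketched in a few lines of Python (our stand-in language): a hypothetical "patch" hard-codes the one failing unit test, passes it, and is then exposed by a QuickCheck-style random search over a general property.

```python
import random

def patched_sort(xs):
    """Hypothetical overfit patch: hard-codes the single failing
    unit-test input and otherwise changes nothing."""
    if xs == [3, 1, 2]:
        return [1, 2, 3]
    return xs

def prop_sorted(xs):
    """Property: the patched function's output must be ordered."""
    out = patched_sort(xs)
    return all(a <= b for a, b in zip(out, out[1:]))

def find_counterexample(prop, trials=200, rng=random.Random(1)):
    """Minimal QuickCheck-style driver: random inputs until the
    property fails; returns the counterexample or None."""
    for _ in range(trials):
        xs = [rng.randint(0, 9) for _ in range(rng.randrange(2, 6))]
        if not prop(xs):
            return xs
    return None
```

The unit test `patched_sort([3, 1, 2]) == [1, 2, 3]` passes, so test-suite-only APR would accept the patch; the property driver finds an unsorted output almost immediately, which is exactly the overfit-filtering effect the paper measures.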
Diversity-Driven Automated Formal Verification
E. First, Yuriy Brun
2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), May 2022. DOI: 10.1145/3510003.3510138

Abstract: Formally verified correctness is one of the most desirable properties of software systems. But despite great progress made via interactive theorem provers, such as Coq, writing proof scripts for verification remains one of the most effort-intensive (and often prohibitively difficult) software development activities. Recent work has created tools that automatically synthesize proofs or proof scripts. For example, CoqHammer can prove 26.6% of theorems completely automatically by reasoning over precomputed facts, while TacTok and ASTactic, which use machine learning to model proof scripts and then perform biased search through the proof-script space, can prove 12.9% and 12.3% of the theorems, respectively. Further, these three tools are highly complementary; together, they can prove 30.4% of the theorems fully automatically. Our key insight is that control over the learning process can produce a diverse set of models, and that, due to the unique nature of proof synthesis (the existence of the theorem prover, an oracle that infallibly judges a proof's correctness), this diversity can significantly improve these tools' proving power. Accordingly, we develop Diva, which uses a diverse set of models with TacTok's and ASTactic's search mechanism to prove 21.7% of the theorems. That is, Diva proves 68% more theorems than TacTok and 77% more than ASTactic. Complementary to CoqHammer, Diva proves 781 theorems (27% added value) that CoqHammer does not, and 364 theorems that no existing tool has proved automatically. Together with CoqHammer, Diva proves 33.8% of the theorems, the largest fraction to date. We explore nine dimensions for learning diverse models and identify which dimensions lead to the most useful diversity. Further, we develop an optimization that speeds up Diva's execution by 40x. Our study introduces a completely new idea for using diversity in machine learning to improve the power of state-of-the-art proof-script synthesis techniques, and empirically demonstrates that the improvement is significant on a dataset of 68K theorems from 122 open-source software projects.
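The complementarity numbers the abstract reports ("781 theorems, 27% added value") are set arithmetic over which theorems each tool proves; the theorem prover's infallible oracle makes unioning tool outputs sound. A minimal sketch of that bookkeeping, using a toy universe of theorem ids rather than the paper's data:

```python
def added_value(ours, baseline):
    """Theorems we prove that the baseline misses, as a count and as a
    fraction of the baseline's total (the paper's 'added value')."""
    extra = ours - baseline
    return len(extra), len(extra) / len(baseline)

# Toy theorem-id sets, purely illustrative:
coqhammer = {1, 2, 3, 4}
diva = {3, 4, 5}
```

Because a checked proof is correct by construction, the combined portfolio `coqhammer | diva` can only gain coverage, which is why diverse-but-individually-weaker models still raise the overall fraction proved.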
Characterizing and Detecting Bugs in WeChat Mini-Programs
Tao Wang, Qi Xu, Xiaoning Chang, Wensheng Dou, Jiaxin Zhu, Jun Wei, Tao Huang
2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), May 2022. DOI: 10.1145/3510003.3510114

Abstract: Built on the WeChat social platform, WeChat Mini-Programs are used by more than 400 million users every day. Consequently, the reliability of Mini-Programs is particularly crucial. However, WeChat Mini-Programs suffer from various bugs related to the execution environment, lifecycle management, the asynchronous mechanism, etc. These bugs have seriously affected the user experience and caused severe impacts. In this paper, we conduct the first empirical study on 83 WeChat Mini-Program bugs and perform an in-depth analysis of their root causes, impacts, and fixes. From this study, we obtain many interesting findings that can open up new research directions for combating WeChat Mini-Program bugs. Based on the bug patterns found in our study, we further develop WeDetector to detect WeChat Mini-Program bugs. Our evaluation on 25 real-world Mini-Programs has found 11 previously unknown bugs, 7 of which have been confirmed by developers.
Domain-Specific Analysis of Mobile App Reviews Using Keyword-Assisted Topic Models
Miroslav Tushev, Fahime Ebrahimi, Anas Mahmoud
2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE), May 2022. DOI: 10.1145/3510003.3510201

Abstract: Mobile application (app) reviews contain valuable information for app developers. A plethora of supervised and unsupervised techniques have been proposed in the literature to synthesize useful user feedback from app reviews. However, traditional supervised classification algorithms require extensive manual effort to label ground-truth data, while unsupervised text mining techniques, such as topic models, often produce suboptimal results due to the sparsity of useful information in the reviews. To overcome these limitations, in this paper, we propose a fully automatic and unsupervised approach for extracting useful information from mobile app reviews. The proposed approach is based on keyATM, a keyword-assisted approach for generating topic models. keyATM overcomes the problem of data sparsity by using seeding keywords extracted directly from the review corpus. These keywords are then used to generate meaningful domain-specific topics. Our approach is evaluated on two datasets of mobile app reviews sampled from the domains of Investing and Food Delivery apps. The results show that our approach produces significantly more coherent topics than traditional topic modeling techniques.
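keyATM's seeding idea, bias topics toward domain keywords instead of hoping they emerge from sparse review text, can be illustrated with a radically simplified scorer: each topic is a set of seed keywords, and a review is assigned the topic whose seeds it mentions most. The seed sets and topic names below are hypothetical examples, not the paper's; keyATM itself fits a full probabilistic topic model around such seeds.

```python
from collections import Counter

# Hypothetical seed topics for a food-delivery review corpus:
SEED_KEYWORDS = {
    "payments": {"card", "charge", "refund", "pay"},
    "delivery": {"driver", "late", "order", "deliver"},
}

def assign_topic(review):
    """Score each seeded topic by how many of its seed keywords
    appear in the review; return the best topic, or None if no
    seed keyword matches at all."""
    words = Counter(review.lower().split())
    scores = {t: sum(words[w] for w in kws)
              for t, kws in SEED_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

The appeal over plain LDA on short reviews is visible even here: the seeds anchor each topic to a developer-meaningful concern, so sparse, noisy reviews still land on interpretable categories.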