{"title":"The Unit Test Quality of Deep Learning Libraries: A Mutation Analysis","authors":"Li Jia, Hao Zhong, Linpeng Huang","doi":"10.26226/morressier.613b5418842293c031b5b5cb","DOIUrl":null,"url":null,"abstract":"In recent years, with the flourish of deep learning techniques, deep learning libraries have been used by many smart applications. As smart applications are used in critical scenarios, their bugs become a concern, and bugs in deep learning libraries have far-reaching impacts on their built-on applications. Although programmers write many test cases for deep learning libraries, to the best of our knowledge, no prior study has ever explored to what degree such test cases are sufficient. As a result, some fundamental questions about these test cases are still open. For example, to what degree can existing test cases detect bugs in deep libraries? How to improve such test cases? To help programmers improve their test cases and to shed light on the detection techniques of deep learning bugs, there is a strong need for a study on the test quality of deep learning libraries. To meet the strong need, in this paper, we conduct the first empirical study on this issue. Our basic idea is to inject bugs into deep learning libraries, and to check to what degree existing test cases can detect our injected bugs. With a mutation tool, we constructed 1,545 buggy versions (i.e., mutants). By comparing the testing results between clean and buggy versions, our study leads to 11 findings, and we summarize them into the answers to three research questions. For example, we find that although existing test cases detected 60% of our injected bugs, only 30% of such bugs were detected by the assertions of these test cases. As another example, we find that some exceptions were thrown only in specific learning phases. Furthermore, we interpret our results from the perspectives of researchers, library developers, and application programmers.","PeriodicalId":205629,"journal":{"name":"2021 IEEE International Conference on Software Maintenance and Evolution (ICSME)","volume":"171 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE International Conference on Software Maintenance and Evolution (ICSME)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.26226/morressier.613b5418842293c031b5b5cb","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Cited by: 8
Abstract
In recent years, with the flourishing of deep learning techniques, deep learning libraries have been used by many smart applications. As smart applications are used in critical scenarios, their bugs become a concern, and bugs in deep learning libraries have far-reaching impacts on the applications built on top of them. Although programmers write many test cases for deep learning libraries, to the best of our knowledge, no prior study has explored to what degree such test cases are sufficient. As a result, some fundamental questions about these test cases are still open. For example, to what degree can existing test cases detect bugs in deep learning libraries? How can such test cases be improved? To help programmers improve their test cases and to shed light on techniques for detecting deep learning bugs, there is a strong need for a study on the test quality of deep learning libraries. To meet this need, in this paper we conduct the first empirical study on this issue. Our basic idea is to inject bugs into deep learning libraries and to check to what degree existing test cases can detect the injected bugs. With a mutation tool, we constructed 1,545 buggy versions (i.e., mutants). By comparing the testing results of the clean and buggy versions, our study leads to 11 findings, which we summarize into answers to three research questions. For example, we find that although existing test cases detected 60% of our injected bugs, only 30% of such bugs were detected by the assertions of these test cases. As another example, we find that some exceptions were thrown only in specific learning phases. Furthermore, we interpret our results from the perspectives of researchers, library developers, and application programmers.
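To make the basic idea concrete, below is a minimal, self-contained Python sketch of mutation analysis. The function dense_forward, the three mutants, and both test cases are hypothetical toy stand-ins invented for illustration; they are not the paper's mutation tool, its actual mutants, or any deep learning library's API. The sketch only shows the classification the abstract alludes to: a mutant can be killed by an assertion, killed by a crash or exception, or survive the test suite.

```python
"""Toy sketch of mutation analysis (illustration only, under assumed names)."""


def dense_forward(x, w, b):
    """Reference ("clean") implementation: y = x * w + b."""
    return x * w + b


# Injected bugs (mutants): each variant changes one operator or drops one
# term, mimicking the buggy versions constructed in the study.
MUTANTS = {
    "mul_to_add": lambda x, w, b: x + w + b,   # wrong operator
    "drop_bias": lambda x, w, b: x * w,        # missing term
    "div_by_bias": lambda x, w, b: x * w / b,  # crashes when b == 0
}


def assertion_test(forward):
    """Test case that checks the output with an explicit assertion."""
    assert forward(2.0, 3.0, 0.0) == 6.0


def smoke_test(forward):
    """Test case without assertions: it only exercises a short
    forward/backward/update loop and relies on crashes to signal bugs."""
    w, b = 0.5, 0.1
    for _ in range(3):
        y = forward(1.0, w, b)                  # forward phase
        grad = 2.0 * (y - 1.0)                  # backward phase (toy gradient)
        w, b = w - 0.1 * grad, b - 0.1 * grad   # update phase


def run_mutation_analysis():
    """Run every test against every mutant and classify the outcome."""
    results = {}
    for name, mutant in MUTANTS.items():
        try:
            assertion_test(mutant)
            smoke_test(mutant)
            results[name] = "survived"
        except AssertionError:
            results[name] = "killed by an assertion"
        except Exception:
            results[name] = "killed by a crash/exception"
    return results


if __name__ == "__main__":
    for name, outcome in run_mutation_analysis().items():
        print(f"{name}: {outcome}")
    # Expected classification of this toy example:
    #   mul_to_add  -> killed by an assertion
    #   drop_bias   -> survived (the assertion's inputs cannot tell it apart)
    #   div_by_bias -> killed by a crash/exception (ZeroDivisionError)
```

Separating kills by assertion from kills by crash mirrors the paper's observation that many injected bugs are exposed only by exceptions rather than by explicit assertions, and surviving mutants point to inputs or behaviors the test suite never checks.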