Evaluating Spectrum-Based Fault Localization on Deep Learning Libraries

IF 6.5 | Zone 1, Computer Science | JCR Q1, COMPUTER SCIENCE, SOFTWARE ENGINEERING

IEEE Transactions on Software Engineering Pub Date : 2025-03-18 DOI:10.1109/TSE.2025.3552622

Ming Yan;Junjie Chen;Tianjie Jiang;Jiajun Jiang;Zan Wang

Volume 51, Issue 5, Pages 1399-1414 | Journal Article | https://ieeexplore.ieee.org/document/10930847/

Citations: 0

Abstract

Deep learning (DL) libraries have become increasingly popular and their quality assurance is also gaining significant attention. Although many fault detection techniques have been proposed, effective fault localization techniques tailored to DL libraries are scarce. Due to the unique characteristics of DL libraries (e.g., complicated code architecture supporting DL model training and inference with extensive multidimensional tensor calculations), the effectiveness of existing fault localization techniques for traditional software is also unknown on DL library faults. To bridge this gap, we conducted the first empirical study to investigate the effectiveness of fault localization on DL libraries. Specifically, we evaluated spectrum-based fault localization (SBFL) due to its high generalizability and affordable overhead on such complicated libraries. Based on the key aspects in SBFL, our study investigated the effectiveness of SBFL with different sources of passing test cases (including human-written, fuzzer-generated, and mutation-based test cases) and various suspicious value calculation methods. In particular, mutation-based test cases are produced by our designed rule-based mutation technique and LLM-based mutation technique tailored to DL library faults. To enable our extensive study, we built the first benchmark (Defects4DLL), which contains 120 real-world faults in PyTorch and TensorFlow with easy-to-use experimental environments. Our study delivered a series of useful findings. For example, the rule-based approach is effective in localizing crash faults in DL libraries, successfully localizing 44.44% of crash faults within Top-10 functions and 74.07% of crash faults within Top-10 files, while the passing test cases from DL library fuzzers perform poorly on this task. 
Furthermore, based on our findings on the complementarity of different sources, we designed a hybrid technique by effectively integrating human-written, LLM-mutated, and rule-based mutated test cases, which further achieves 31.48%$\sim$61.36% improvements over each single source in terms of the number of detected faults within Top-5 files.
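To make the SBFL setting in the abstract concrete, the following is a minimal sketch of how a suspiciousness ranking and a Top-N check can be computed from one failing test plus a set of passing tests. It uses the Ochiai formula, a widely used SBFL formula chosen here purely for illustration; the abstract does not say which formulas the study evaluated. The file names, test names, and the helpers `ochiai`, `rank_elements`, and `in_top_n` are hypothetical and are not part of Defects4DLL or of the authors' tooling.

```python
import math
from collections import defaultdict


def ochiai(ef, ep, nf):
    """Ochiai suspiciousness: ef / sqrt((ef + nf) * (ef + ep)).

    ef: failing tests that cover the element
    ep: passing tests that cover the element
    nf: failing tests that do NOT cover the element
    """
    denom = math.sqrt((ef + nf) * (ef + ep))
    return ef / denom if denom else 0.0


def rank_elements(coverage, outcomes):
    """Rank program elements (e.g. files or functions) by suspiciousness.

    coverage: test name -> set of covered elements
    outcomes: test name -> True if the test passed, False if it failed
    """
    total_failed = sum(1 for passed in outcomes.values() if not passed)
    ef = defaultdict(int)
    ep = defaultdict(int)
    for test, elements in coverage.items():
        for element in elements:
            if outcomes[test]:
                ep[element] += 1
            else:
                ef[element] += 1
    scores = {
        element: ochiai(ef[element], ep[element], total_failed - ef[element])
        for element in set(ef) | set(ep)
    }
    # Highest score first; ties broken by name so the order is deterministic.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def in_top_n(ranking, faulty_element, n):
    """True if the known faulty element appears among the first n entries."""
    return any(element == faulty_element for element, _ in ranking[:n])


# Hypothetical spectrum: one failing test and two passing tests.
coverage = {
    "t_fail": {"conv.cpp", "tensor.cpp"},
    "t_pass1": {"tensor.cpp"},
    "t_pass2": {"tensor.cpp", "pool.cpp"},
}
outcomes = {"t_fail": False, "t_pass1": True, "t_pass2": True}

ranking = rank_elements(coverage, outcomes)
for element, score in ranking:
    print(f"{element}: {score:.3f}")
```

In this toy spectrum, `conv.cpp` is covered only by the failing test and therefore ranks first, while `tensor.cpp` is demoted because passing tests also cover it. This is why the source of passing tests matters: passing tests that exercise code near the fault help separate the faulty element from the rest, which is the aspect the study varies (human-written, fuzzer-generated, and mutation-based passing tests).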


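Source journal

IEEE Transactions on Software Engineering (Engineering & Technology - Engineering: Electrical & Electronic)

CiteScore: 9.70
Self-citation rate: 10.80%
Articles published: 724
Review time: 6 months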

Journal description: IEEE Transactions on Software Engineering seeks contributions comprising well-defined theoretical results and empirical studies with potential impacts on software construction, analysis, or management. The scope of this Transactions extends from fundamental mechanisms to the development of principles and their application in specific environments. Specific topic areas include: a) Development and maintenance methods and models: Techniques and principles for specifying, designing, and implementing software systems, encompassing notations and process models. b) Assessment methods: Software tests, validation, reliability models, test and diagnosis procedures, software redundancy, design for error control, and measurements and evaluation of process and product aspects. c) Software project management: Productivity factors, cost models, schedule and organizational issues, and standards. d) Tools and environments: Specific tools, integrated tool environments, associated architectures, databases, and parallel and distributed processing issues. e) System issues: Hardware-software trade-offs. f) State-of-the-art surveys: Syntheses and comprehensive reviews of the historical development within specific areas of interest.