{"title":"运行中自动模型选择的动态稳健性评估","authors":"Jin Zhang , Jingyue Li , Zhirong Yang","doi":"10.1016/j.infsof.2024.107603","DOIUrl":null,"url":null,"abstract":"<div><h3>Context:</h3><div>The increasing use of artificial neural network (ANN) classifiers in systems, especially safety-critical systems (SCSs), requires ensuring their robustness against out-of-distribution (OOD) shifts in operation, which are changes in the underlying data distribution from the data training the classifier. However, measuring the robustness of classifiers in operation with only unlabeled data is challenging. Additionally, machine learning engineers may need to compare different models or versions of the same model and switch to an optimal version based on their robustness.</div></div><div><h3>Objective:</h3><div>This paper explores the problem of dynamic robustness evaluation for automated model selection. We aim to find efficient and effective metrics for evaluating and comparing the robustness of multiple ANN classifiers using unlabeled operational data.</div></div><div><h3>Methods:</h3><div>To quantitatively measure the differences between the model outputs and assess robustness under OOD shifts using unlabeled data, we choose distance-based metrics. An empirical comparison of five such metrics, suitable for higher-dimensional data like images, is performed. The selected metrics include Wasserstein distance (WD), maximum mean discrepancy (MMD), Hellinger distance (HL), Kolmogorov–Smirnov statistic (KS), and Kullback–Leibler divergence (KL), known for their efficacy in quantifying distribution differences. We evaluate these metrics on 20 state-of-the-art models (ten CIFAR10-based models, five CIFAR100-based models, and five ImageNet-based models) from a widely used robustness benchmark (<strong>RobustBench</strong>) using data perturbed with various types and magnitudes of corruptions to mimic real-world OOD shifts.</div></div><div><h3>Results:</h3><div>Our findings reveal that the WD metric outperforms others when ranking multiple ANN models for CIFAR10- and CIFAR100-based models, while the KS metric demonstrates superior performance for ImageNet-based models. MMD can be used as a reliable second option for both datasets.</div></div><div><h3>Conclusion:</h3><div>This study highlights the effectiveness of distance-based metrics in ranking models’ robustness for automated model selection. It also emphasizes the significance of advancing research in dynamic robustness evaluation.</div></div>","PeriodicalId":54983,"journal":{"name":"Information and Software Technology","volume":"178 ","pages":"Article 107603"},"PeriodicalIF":3.8000,"publicationDate":"2024-10-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Dynamic robustness evaluation for automated model selection in operation\",\"authors\":\"Jin Zhang , Jingyue Li , Zhirong Yang\",\"doi\":\"10.1016/j.infsof.2024.107603\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><h3>Context:</h3><div>The increasing use of artificial neural network (ANN) classifiers in systems, especially safety-critical systems (SCSs), requires ensuring their robustness against out-of-distribution (OOD) shifts in operation, which are changes in the underlying data distribution from the data training the classifier. However, measuring the robustness of classifiers in operation with only unlabeled data is challenging. 
Additionally, machine learning engineers may need to compare different models or versions of the same model and switch to an optimal version based on their robustness.</div></div><div><h3>Objective:</h3><div>This paper explores the problem of dynamic robustness evaluation for automated model selection. We aim to find efficient and effective metrics for evaluating and comparing the robustness of multiple ANN classifiers using unlabeled operational data.</div></div><div><h3>Methods:</h3><div>To quantitatively measure the differences between the model outputs and assess robustness under OOD shifts using unlabeled data, we choose distance-based metrics. An empirical comparison of five such metrics, suitable for higher-dimensional data like images, is performed. The selected metrics include Wasserstein distance (WD), maximum mean discrepancy (MMD), Hellinger distance (HL), Kolmogorov–Smirnov statistic (KS), and Kullback–Leibler divergence (KL), known for their efficacy in quantifying distribution differences. We evaluate these metrics on 20 state-of-the-art models (ten CIFAR10-based models, five CIFAR100-based models, and five ImageNet-based models) from a widely used robustness benchmark (<strong>RobustBench</strong>) using data perturbed with various types and magnitudes of corruptions to mimic real-world OOD shifts.</div></div><div><h3>Results:</h3><div>Our findings reveal that the WD metric outperforms others when ranking multiple ANN models for CIFAR10- and CIFAR100-based models, while the KS metric demonstrates superior performance for ImageNet-based models. MMD can be used as a reliable second option for both datasets.</div></div><div><h3>Conclusion:</h3><div>This study highlights the effectiveness of distance-based metrics in ranking models’ robustness for automated model selection. It also emphasizes the significance of advancing research in dynamic robustness evaluation.</div></div>\",\"PeriodicalId\":54983,\"journal\":{\"name\":\"Information and Software Technology\",\"volume\":\"178 \",\"pages\":\"Article 107603\"},\"PeriodicalIF\":3.8000,\"publicationDate\":\"2024-10-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information and Software Technology\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0950584924002088\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information and Software Technology","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950584924002088","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Dynamic robustness evaluation for automated model selection in operation
Context:
The increasing use of artificial neural network (ANN) classifiers in systems, especially safety-critical systems (SCSs), requires ensuring their robustness against out-of-distribution (OOD) shifts in operation, that is, changes in the underlying data distribution relative to the data used to train the classifier. However, measuring the robustness of classifiers in operation when only unlabeled data is available is challenging. Additionally, machine learning engineers may need to compare different models, or versions of the same model, and switch to the optimal version based on its robustness.
Objective:
This paper explores the problem of dynamic robustness evaluation for automated model selection. We aim to find efficient and effective metrics for evaluating and comparing the robustness of multiple ANN classifiers using unlabeled operational data.
Methods:
To quantitatively measure the differences between model outputs and assess robustness under OOD shifts using unlabeled data, we choose distance-based metrics. We empirically compare five such metrics that are suitable for higher-dimensional data such as images: Wasserstein distance (WD), maximum mean discrepancy (MMD), Hellinger distance (HL), the Kolmogorov–Smirnov statistic (KS), and Kullback–Leibler divergence (KL), all known for their efficacy in quantifying distribution differences. We evaluate these metrics on 20 state-of-the-art models (ten CIFAR10-based, five CIFAR100-based, and five ImageNet-based) from the widely used robustness benchmark RobustBench, using data perturbed with corruptions of various types and magnitudes to mimic real-world OOD shifts.
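As a concrete illustration (a minimal sketch, not the authors' implementation), the following Python code compares a classifier's output-confidence distribution on reference in-distribution data against unlabeled operational data using three of the metrics above. The scikit-learn-style predict_proba API, the helper names, and the use of maximum-softmax scores as the compared distribution are assumptions made for this example.

# A minimal sketch, assuming a scikit-learn-style model with predict_proba();
# it compares the model's maximum-softmax scores on reference (in-distribution)
# data with those on unlabeled operational data using three distance metrics.
import numpy as np
from scipy.stats import wasserstein_distance, ks_2samp, entropy

def max_softmax_scores(model, x):
    # Hypothetical helper: highest predicted class probability per input.
    return model.predict_proba(x).max(axis=1)

def output_shift_metrics(model, x_reference, x_operational):
    # Distances between output-confidence distributions on two unlabeled sets.
    ref = max_softmax_scores(model, x_reference)
    ops = max_softmax_scores(model, x_operational)

    wd = wasserstein_distance(ref, ops)        # Wasserstein distance (WD)
    ks = ks_2samp(ref, ops).statistic          # Kolmogorov-Smirnov statistic (KS)

    # Kullback-Leibler divergence (KL) between histogram estimates.
    bins = np.linspace(0.0, 1.0, 21)
    p, _ = np.histogram(ref, bins=bins, density=True)
    q, _ = np.histogram(ops, bins=bins, density=True)
    eps = 1e-12                                # avoid division by zero
    kl = entropy(p + eps, q + eps)

    return {"WD": wd, "KS": ks, "KL": kl}

In this framing, larger values indicate a larger shift between the two output distributions, and no operational labels are required.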
Results:
Our findings reveal that the WD metric outperforms the others when ranking the CIFAR10- and CIFAR100-based ANN models, while the KS metric performs best for the ImageNet-based models. MMD can be used as a reliable second option in both cases.
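To illustrate what ranking models by such a metric could look like in practice, the sketch below uses placeholder numbers (hypothetical values, not results from the paper) to check how well a WD-based ranking agrees with a ground-truth accuracy ranking via Spearman correlation, and then selects the model with the smallest output shift.

from scipy.stats import spearmanr

# Placeholder per-model values (not results from the paper): a WD score computed
# on corrupted data, and the accuracy measured on the same corrupted data.
wd_scores  = {"model_a": 0.04, "model_b": 0.11, "model_c": 0.07}
accuracies = {"model_a": 0.81, "model_b": 0.62, "model_c": 0.74}

models = sorted(wd_scores)                     # fixed model order
rho, _ = spearmanr([wd_scores[m] for m in models],
                   [accuracies[m] for m in models])
print(f"Spearman correlation between WD and accuracy: {rho:.2f}")

# A strongly negative correlation means that ranking models by ascending WD
# recovers the ranking by operational accuracy, so the model with the smallest
# output shift can be selected automatically.
best = min(wd_scores, key=wd_scores.get)
print(f"Selected model: {best}")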
Conclusion:
This study highlights the effectiveness of distance-based metrics in ranking models’ robustness for automated model selection. It also emphasizes the significance of advancing research in dynamic robustness evaluation.
About the journal:
Information and Software Technology is the international archival journal focusing on research and experience that contributes to the improvement of software development practices. The journal's scope includes methods and techniques to better engineer software and manage its development. Articles submitted for review should have a clear component of software engineering or address ways to improve the engineering and management of software development. Areas covered by the journal include:
• Software management, quality and metrics
• Software processes
• Software architecture, modelling, specification, design and programming
• Functional and non-functional software requirements
• Software testing and verification & validation
• Empirical studies of all aspects of engineering and managing software development
Short Communications is a new section dedicated to short papers addressing new ideas, controversial opinions, "Negative" results and much more. Read the Guide for authors for more information.
The journal encourages and welcomes submissions of systematic literature studies (reviews and maps) within the scope of the journal. Information and Software Technology is the premier outlet for systematic literature studies in software engineering.