HybridRepair:面向深度学习模型的标注高效修复

Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis Pub Date : 2022-07-18 DOI:10.1145/3533767.3534408

Yu LI, Mu-Hwa Chen, Qiang Xu

{"title":"HybridRepair:面向深度学习模型的标注高效修复","authors":"Yu LI, Mu-Hwa Chen, Qiang Xu","doi":"10.1145/3533767.3534408","DOIUrl":null,"url":null,"abstract":"A well-trained deep learning (DL) model often cannot achieve expected performance after deployment due to the mismatch between the distributions of the training data and the field data in the operational environment. Therefore, repairing DL models is critical, especially when deployed on increasingly larger tasks with shifted distributions. Generally speaking, it is easy to obtain a large amount of field data. Existing solutions develop various techniques to select a subset for annotation and then fine-tune the model for repair. While effective, achieving a higher repair rate is inevitably associated with more expensive labeling costs. To mitigate this problem, we propose a novel annotation-efficient repair solution for DL models, namely HybridRepair, wherein we take a holistic approach that coordinates the use of a small amount of annotated data and a large amount of unlabeled data for repair. Our key insight is that accurate yet sufficient training data is needed to repair the corresponding failure region in the data distribution. Under a given labeling budget, we selectively annotate some data in failure regions and propagate their labels to the neighboring data on the one hand. On the other hand, we take advantage of the semi-supervised learning (SSL) techniques to further boost the training data density. However, different from existing SSL solutions that try to use all the unlabeled data, we only use a selected part of them considering the impact of distribution shift on SSL solutions. Experimental results show that HybridRepair outperforms both state-of-the-art DL model repair solutions and semi-supervised techniques for model improvements, especially when there is a distribution shift between the training data and the field data. Our code is available at: https://github.com/cure-lab/HybridRepair.","PeriodicalId":412271,"journal":{"name":"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis","volume":"98 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"HybridRepair: towards annotation-efficient repair for deep learning models\",\"authors\":\"Yu LI, Mu-Hwa Chen, Qiang Xu\",\"doi\":\"10.1145/3533767.3534408\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"A well-trained deep learning (DL) model often cannot achieve expected performance after deployment due to the mismatch between the distributions of the training data and the field data in the operational environment. Therefore, repairing DL models is critical, especially when deployed on increasingly larger tasks with shifted distributions. Generally speaking, it is easy to obtain a large amount of field data. Existing solutions develop various techniques to select a subset for annotation and then fine-tune the model for repair. While effective, achieving a higher repair rate is inevitably associated with more expensive labeling costs. To mitigate this problem, we propose a novel annotation-efficient repair solution for DL models, namely HybridRepair, wherein we take a holistic approach that coordinates the use of a small amount of annotated data and a large amount of unlabeled data for repair. Our key insight is that accurate yet sufficient training data is needed to repair the corresponding failure region in the data distribution. Under a given labeling budget, we selectively annotate some data in failure regions and propagate their labels to the neighboring data on the one hand. On the other hand, we take advantage of the semi-supervised learning (SSL) techniques to further boost the training data density. However, different from existing SSL solutions that try to use all the unlabeled data, we only use a selected part of them considering the impact of distribution shift on SSL solutions. Experimental results show that HybridRepair outperforms both state-of-the-art DL model repair solutions and semi-supervised techniques for model improvements, especially when there is a distribution shift between the training data and the field data. Our code is available at: https://github.com/cure-lab/HybridRepair.\",\"PeriodicalId\":412271,\"journal\":{\"name\":\"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis\",\"volume\":\"98 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-07-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3533767.3534408\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3533767.3534408","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 3

摘要

由于训练数据的分布与实际操作环境中的现场数据不匹配，训练有素的深度学习(DL)模型在部署后往往无法达到预期的性能。因此，修复DL模型是至关重要的，特别是当部署在具有移位分布的日益庞大的任务上时。一般来说，很容易获得大量的现场数据。现有的解决方案开发了各种技术来选择一个子集进行注释，然后对模型进行微调以进行修复。虽然有效，但实现更高的修复率不可避免地与更昂贵的标签成本相关。为了缓解这个问题，我们为深度学习模型提出了一种新的注释高效修复解决方案，即HybridRepair，其中我们采用了一种整体方法，协调使用少量注释数据和大量未标记数据进行修复。我们的关键见解是，需要准确而足够的训练数据来修复数据分布中相应的故障区域。在给定的标注预算下，我们一方面有选择地标注失效区域的一些数据，并将它们的标注传播到邻近的数据中。另一方面，我们利用半监督学习(SSL)技术进一步提高训练数据密度。然而，与尝试使用所有未标记数据的现有SSL解决方案不同，考虑到分布转移对SSL解决方案的影响，我们只使用其中的一部分。实验结果表明，在模型改进方面，HybridRepair优于最先进的深度学习模型修复解决方案和半监督技术，特别是当训练数据和现场数据之间的分布发生变化时。我们的代码可在:https://github.com/cure-lab/HybridRepair。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

HybridRepair: towards annotation-efficient repair for deep learning models

A well-trained deep learning (DL) model often cannot achieve expected performance after deployment due to the mismatch between the distributions of the training data and the field data in the operational environment. Therefore, repairing DL models is critical, especially when deployed on increasingly larger tasks with shifted distributions. Generally speaking, it is easy to obtain a large amount of field data. Existing solutions develop various techniques to select a subset for annotation and then fine-tune the model for repair. While effective, achieving a higher repair rate is inevitably associated with more expensive labeling costs. To mitigate this problem, we propose a novel annotation-efficient repair solution for DL models, namely HybridRepair, wherein we take a holistic approach that coordinates the use of a small amount of annotated data and a large amount of unlabeled data for repair. Our key insight is that accurate yet sufficient training data is needed to repair the corresponding failure region in the data distribution. Under a given labeling budget, we selectively annotate some data in failure regions and propagate their labels to the neighboring data on the one hand. On the other hand, we take advantage of the semi-supervised learning (SSL) techniques to further boost the training data density. However, different from existing SSL solutions that try to use all the unlabeled data, we only use a selected part of them considering the impact of distribution shift on SSL solutions. Experimental results show that HybridRepair outperforms both state-of-the-art DL model repair solutions and semi-supervised techniques for model improvements, especially when there is a distribution shift between the training data and the field data. Our code is available at: https://github.com/cure-lab/HybridRepair.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis

自引率

0.00%

发文量