Risk-based data validation in machine learning-based software systems

Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation Pub Date : 2019-08-27 DOI:10.1145/3340482.3342743

Harald Foidl, M. Felderer

{"title":"Risk-based data validation in machine learning-based software systems","authors":"Harald Foidl, M. Felderer","doi":"10.1145/3340482.3342743","DOIUrl":null,"url":null,"abstract":"Data validation is an essential requirement to ensure the reliability and quality of Machine Learning-based Software Systems. However, an exhaustive validation of all data fed to these systems (i.e. up to several thousand features) is practically unfeasible. In addition, there has been little discussion about methods that support software engineers of such systems in determining how thorough to validate each feature (i.e. data validation rigor). Therefore, this paper presents a conceptual data validation approach that prioritizes features based on their estimated risk of poor data quality. The risk of poor data quality is determined by the probability that a feature is of low data quality and the impact of this low (data) quality feature on the result of the machine learning model. Three criteria are presented to estimate the probability of low data quality (Data Source Quality, Data Smells, Data Pipeline Quality). To determine the impact of low (data) quality features, the importance of features according to the performance of the machine learning model (i.e. Feature Importance) is utilized. The presented approach provides decision support (i.e. data validation prioritization and rigor) for software engineers during the implementation of data validation techniques in the course of deploying a trained machine learning model and its software stack.","PeriodicalId":254040,"journal":{"name":"Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation","volume":"97 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"24","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3340482.3342743","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 24

Abstract

Data validation is an essential requirement to ensure the reliability and quality of Machine Learning-based Software Systems. However, an exhaustive validation of all data fed to these systems (i.e. up to several thousand features) is practically unfeasible. In addition, there has been little discussion about methods that support software engineers of such systems in determining how thorough to validate each feature (i.e. data validation rigor). Therefore, this paper presents a conceptual data validation approach that prioritizes features based on their estimated risk of poor data quality. The risk of poor data quality is determined by the probability that a feature is of low data quality and the impact of this low (data) quality feature on the result of the machine learning model. Three criteria are presented to estimate the probability of low data quality (Data Source Quality, Data Smells, Data Pipeline Quality). To determine the impact of low (data) quality features, the importance of features according to the performance of the machine learning model (i.e. Feature Importance) is utilized. The presented approach provides decision support (i.e. data validation prioritization and rigor) for software engineers during the implementation of data validation techniques in the course of deploying a trained machine learning model and its software stack.

查看原文本刊更多论文

基于机器学习的软件系统中基于风险的数据验证

数据验证是保证基于机器学习的软件系统可靠性和质量的基本要求。然而，对提供给这些系统的所有数据(即多达数千个特征)进行详尽的验证实际上是不可行的。此外，关于支持此类系统的软件工程师确定如何彻底验证每个特性(即数据验证的严谨性)的方法的讨论很少。因此，本文提出了一种概念性数据验证方法，该方法根据数据质量差的估计风险对特征进行优先级排序。数据质量差的风险是由一个特征是低数据质量的概率和这个低(数据)质量特征对机器学习模型结果的影响决定的。提出了三个评估低数据质量概率的标准(数据源质量、数据气味、数据管道质量)。为了确定低(数据)质量特征的影响，根据机器学习模型的性能使用特征的重要性(即特征重要性)。提出的方法为软件工程师在部署训练有素的机器学习模型及其软件堆栈的过程中实现数据验证技术提供决策支持(即数据验证优先级和严谨性)。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation

自引率

0.00%

发文量