Risk-based data validation in machine learning-based software systems

Harald Foidl, M. Felderer
DOI: 10.1145/3340482.3342743
URL: https://doi.org/10.1145/3340482.3342743
Venue: Proceedings of the 3rd ACM SIGSOFT International Workshop on Machine Learning Techniques for Software Quality Evaluation
Published: 2019-08-27
Citations: 24

Abstract

Data validation is an essential requirement for ensuring the reliability and quality of Machine Learning-based Software Systems. However, exhaustively validating all data fed to these systems (i.e., up to several thousand features) is practically infeasible. In addition, there has been little discussion of methods that help the software engineers of such systems decide how thoroughly to validate each feature (i.e., data validation rigor). Therefore, this paper presents a conceptual data validation approach that prioritizes features based on their estimated risk of poor data quality. This risk is determined by two factors: the probability that a feature is of low data quality, and the impact of that low-quality feature on the result of the machine learning model. Three criteria are presented for estimating the probability of low data quality: Data Source Quality, Data Smells, and Data Pipeline Quality. The impact of a low-quality feature is determined from its importance to the performance of the machine learning model (i.e., Feature Importance). The presented approach provides decision support (i.e., data validation prioritization and rigor) for software engineers implementing data validation techniques while deploying a trained machine learning model and its software stack.
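The prioritization idea in the abstract can be illustrated with a minimal Python sketch. The abstract only states that risk is determined by probability and impact; combining them as a product, averaging the three criteria with equal weights, the 0-to-1 scoring scale, the rigor thresholds, and every name and number below are assumptions made for illustration, not the authors' method.

```python
from dataclasses import dataclass


@dataclass
class FeatureAssessment:
    """Hypothetical per-feature scores, each on a 0.0 .. 1.0 scale."""
    name: str
    data_source_quality: float    # 0.0 = trusted source, 1.0 = poor source
    data_smells: float            # 0.0 = no smells, 1.0 = many smells
    data_pipeline_quality: float  # 0.0 = robust pipeline, 1.0 = fragile
    feature_importance: float     # impact of the feature on the model

    def probability(self) -> float:
        # Assumption: unweighted mean of the three criteria.
        return (self.data_source_quality
                + self.data_smells
                + self.data_pipeline_quality) / 3.0

    def risk(self) -> float:
        # Assumption: risk = probability * impact.
        return self.probability() * self.feature_importance


def prioritize(features):
    """Return features ordered from highest to lowest estimated risk."""
    return sorted(features, key=lambda f: f.risk(), reverse=True)


def rigor(risk, low=0.1, high=0.3):
    """Map a risk score to a validation rigor level (thresholds are assumed)."""
    if risk >= high:
        return "thorough"
    if risk >= low:
        return "standard"
    return "minimal"


# Made-up example values, purely illustrative.
features = [
    FeatureAssessment("age", 0.2, 0.1, 0.3, 0.9),
    FeatureAssessment("zip_code", 0.8, 0.6, 0.7, 0.5),
    FeatureAssessment("session_id", 0.9, 0.9, 0.9, 0.05),
]
ranked = prioritize(features)
for f in ranked:
    print(f"{f.name}: risk={f.risk():.3f}, rigor={rigor(f.risk())}")
```

In this sketch a feature that is very likely to be of low quality but barely matters to the model (`session_id`) ends up with the lowest validation priority, while a moderately important feature from a poor source (`zip_code`) is validated most thoroughly.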