Data Sets and Data Quality in Software Engineering: Eight Years On

G. Liebchen, M. Shepperd
Published in: Proceedings of the 12th International Conference on Predictive Models and Data Analytics in Software Engineering
Publication date: 2016-09-09
DOI: 10.1145/2972958.2972967 (https://doi.org/10.1145/2972958.2972967)
Citation count: 13

Abstract

Context: We revisit our review of data quality within the context of empirical software engineering eight years on from our PROMISE 2008 article. Objective: To assess the extent and types of techniques used to manage quality within data sets. We consider this a particularly interesting question in the context of initiatives to promote sharing and secondary analysis of data sets. Method: We update the 2008 mapping study through four subsequently published reviews and a snowballing exercise. Results: The original study located only 23 articles explicitly considering data quality. This picture has changed substantially, as our updated review now finds 283 articles; however, we estimate that this still represents perhaps 1% of the total empirical software engineering literature. Conclusions: It appears the community is now taking the issue of data quality more seriously, and there is more work exploring techniques to automatically detect (and sometimes repair) noise problems. However, there is still little systematic work to evaluate the various data sets that are widely used for secondary analysis; addressing this would be of considerable benefit. It should also be a priority to work collaboratively with practitioners to add new, higher quality data to the existing corpora.