How Good is Good Enough?: Quantifying the Effects of Training Set Quality

B. Swan, M. Laverdiere, Hsiuhan Lexie Yang
Published in: Proceedings of the 2nd ACM SIGSPATIAL International Workshop on AI for Geographic Knowledge Discovery
Publication date: 2018-11-06
DOI: 10.1145/3281548.3281557 (https://doi.org/10.1145/3281548.3281557)
Citations: 6

Abstract

There is a general consensus in the neural network community that noise in training data has a negative impact on model output; however, efforts to quantify the impact of varying levels have been limited, particularly for semantic segmentation tasks. This is a question of particular importance for remote sensing applications where the cost of producing a large training set can lead to reliance on publicly available data with varying degrees of noise. This work explores the effects of different degrees and types of training label noise on a pre-trained building extraction deep learner. Quantitative and qualitative evaluations of these effects can help inform decisions about trade-offs between the cost of producing training data and the quality of model outputs. We found that, relative to the base model, models trained with small amounts of noise showed little change in precision but achieved considerable increases in recall. Conversely, as noise levels increased, both precision and recall decreased. Precision and recall both lagged behind a model trained with pristine data. These exploratory results indicate the importance of quality control for training and, more broadly, that the relationship between degrees and types of training data noise and model performance is more complex than trade-offs between precision and recall.
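As an illustration of the two metrics the abstract compares, the following minimal sketch computes pixel-wise precision and recall for binary building masks. It is not the authors' evaluation code. The function names, the flat-list mask representation, and the random label-flipping noise model are all assumptions made for this example; the abstract does not specify which noise types or levels were used.

```python
import random


def precision_recall(pred, truth):
    """Pixel-wise precision and recall for flat binary masks (1 = building)."""
    pairs = list(zip(pred, truth))
    tp = sum(1 for p, t in pairs if p == 1 and t == 1)  # building predicted and labeled
    fp = sum(1 for p, t in pairs if p == 1 and t == 0)  # building predicted, none labeled
    fn = sum(1 for p, t in pairs if p == 0 and t == 1)  # labeled building missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def flip_labels(mask, rate, seed=0):
    """Hypothetical noise model: flip each pixel label with probability `rate`."""
    rng = random.Random(seed)
    return [1 - v if rng.random() < rate else v for v in mask]


# Toy example: 8 pixels, 4 of them labeled as building.
truth = [1, 1, 1, 0, 0, 0, 0, 1]
pred = [1, 1, 0, 0, 1, 0, 0, 1]
print(precision_recall(pred, truth))  # (0.75, 0.75)

# A noisy copy of the labels, as might be used to train a degraded model.
noisy_truth = flip_labels(truth, rate=0.25)
```

Here precision is the fraction of predicted building pixels that are correct, and recall is the fraction of labeled building pixels that are found. This is why a model can gain recall without a matching change in precision, as the abstract reports for low noise levels.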