Predictive Modeling in a Big Data Distributed Setting: A Scalable Bias Correction Approach

Gianluca Bontempi, Y. Borgne
DOI: 10.1109/BigDataCongress.2016.17
Published in: 2016 IEEE International Congress on Big Data (BigData Congress), June 2016
Citations: 0

Abstract

Massive datasets are becoming pervasive in computational sciences. Though this opens new perspectives for discovery and an increasing number of processing and storage solutions is available, it is still an open issue how to transpose machine learning and statistical procedures to distributed settings. Big datasets are no guarantee for optimal modeling since they do not automatically solve the issues of model design, validation and selection. At the same time conventional techniques of cross-validation and model assessment are computationally prohibitive when the size of the dataset explodes. This paper claims that the main benefit of a massive dataset is not related to the size of the training set but to the possibility of assessing in an accurate and scalable manner the properties of the learner itself (e.g. bias and variance). Accordingly, the paper proposes a scalable implementation of a bias correction strategy to improve the accuracy of learning techniques for regression in a big data setting. An analytical derivation and an experimental study show the potential of the approach.
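The abstract argues that a massive dataset's main value lies in cheaply and accurately assessing a learner's own properties (bias, variance) rather than in sheer training-set size. The paper's exact estimator is not reproduced here; the sketch below only illustrates the general idea under stated assumptions: a large dataset is split into independent chunks (a naturally parallel step), each chunk estimates the learner's held-out prediction bias, and the averaged bias estimate is subtracted from the ensemble's predictions. All names (`fit_linear`, `predict_corrected`, the synthetic task) are illustrative, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression task: a nonlinear target, so a linear learner is
# biased (a stand-in for any misspecified model in the paper's setting).
def true_fn(x):
    return np.sin(3 * x)

def fit_linear(X, y):
    # Least-squares fit of y ~ a*x + b.
    A = np.column_stack([X, np.ones_like(X)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_linear(coef, X):
    return coef[0] * X + coef[1]

# A "massive" dataset processed as K disjoint chunks (map step).
N, K = 20000, 20
X = rng.uniform(-1, 1, N)
y = true_fn(X) + rng.normal(0.0, 0.1, N)
chunks = np.array_split(rng.permutation(N), K)

# On each chunk: fit on one half, record the mean held-out residual on the
# other half. Because chunks are independent, this bias assessment scales
# out horizontally -- the point the abstract makes about big data.
models, bias_estimates = [], []
for idx in chunks:
    half = len(idx) // 2
    tr, te = idx[:half], idx[half:]
    coef = fit_linear(X[tr], y[tr])
    models.append(coef)
    bias_estimates.append(np.mean(predict_linear(coef, X[te]) - y[te]))

global_bias = float(np.mean(bias_estimates))

# Bias-corrected ensemble prediction (reduce step): average the chunk
# models, then subtract the estimated bias.
def predict_corrected(x_new):
    preds = np.mean([predict_linear(c, x_new) for c in models], axis=0)
    return preds - global_bias
```

Note the design choice: bias is estimated on held-out halves, never on the data a chunk's model was fit to, since in-sample residuals of least squares average to zero by construction and would mask the bias.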