{"title":"大数据分布式环境下的预测建模:一种可扩展的偏差校正方法","authors":"Gianluca Bontempi, Y. Borgne","doi":"10.1109/BigDataCongress.2016.17","DOIUrl":null,"url":null,"abstract":"Massive datasets are becoming pervasive in computational sciences. Though this opens new perspectives for discovery and an increasing number of processing and storage solutions is available, it is still an open issue how to transpose machine learning and statistical procedures to distributed settings. Big datasets are no guarantee for optimal modeling since they do not automatically solve the issues of model design, validation and selection. At the same time conventional techniques of cross-validation and model assessment are computationally prohibitive when the size of the dataset explodes. This paper claims that the main benefit of a massive dataset is not related to the size of the training set but to the possibility of assessing in an accurate and scalable manner the properties of the learner itself (e.g. bias and variance). Accordingly, the paper proposes a scalable implementation of a bias correction strategy to improve the accuracy of learning techniques for regression in a big data setting. An analytical derivation and an experimental study show the potential of the approach.","PeriodicalId":407471,"journal":{"name":"2016 IEEE International Congress on Big Data (BigData Congress)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Predictive Modeling in a Big Data Distributed Setting: A Scalable Bias Correction Approach\",\"authors\":\"Gianluca Bontempi, Y. Borgne\",\"doi\":\"10.1109/BigDataCongress.2016.17\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Massive datasets are becoming pervasive in computational sciences. Though this opens new perspectives for discovery and an increasing number of processing and storage solutions is available, it is still an open issue how to transpose machine learning and statistical procedures to distributed settings. Big datasets are no guarantee for optimal modeling since they do not automatically solve the issues of model design, validation and selection. At the same time conventional techniques of cross-validation and model assessment are computationally prohibitive when the size of the dataset explodes. This paper claims that the main benefit of a massive dataset is not related to the size of the training set but to the possibility of assessing in an accurate and scalable manner the properties of the learner itself (e.g. bias and variance). Accordingly, the paper proposes a scalable implementation of a bias correction strategy to improve the accuracy of learning techniques for regression in a big data setting. 
An analytical derivation and an experimental study show the potential of the approach.\",\"PeriodicalId\":407471,\"journal\":{\"name\":\"2016 IEEE International Congress on Big Data (BigData Congress)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-06-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 IEEE International Congress on Big Data (BigData Congress)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/BigDataCongress.2016.17\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 IEEE International Congress on Big Data (BigData Congress)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/BigDataCongress.2016.17","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Massive datasets are becoming pervasive in the computational sciences. Although this opens new perspectives for discovery, and an increasing number of processing and storage solutions are available, how to transpose machine learning and statistical procedures to distributed settings remains an open issue. Big datasets are no guarantee of optimal modeling, since they do not automatically solve the problems of model design, validation, and selection. At the same time, conventional techniques for cross-validation and model assessment become computationally prohibitive when the size of the dataset explodes. This paper argues that the main benefit of a massive dataset is not the size of the training set but the possibility of assessing, in an accurate and scalable manner, the properties of the learner itself (e.g., its bias and variance). Accordingly, the paper proposes a scalable implementation of a bias correction strategy to improve the accuracy of regression learners in a big data setting. An analytical derivation and an experimental study show the potential of the approach.
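The abstract does not spell out the correction procedure itself, but the underlying idea, estimating a regression learner's bias from many partitions of a large dataset and subtracting that estimate from its predictions, can be illustrated with a small sketch. The Python snippet below is a minimal illustration and not the authors' algorithm: the synthetic data generator, the choice of scikit-learn's DecisionTreeRegressor, the number of partitions, and the use of a known noiseless target (possible only because the data is simulated) are all assumptions made for the example.

```python
# Minimal sketch (not the paper's algorithm): estimate the bias of a regression
# learner from many disjoint partitions of a "big" dataset and subtract it.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def make_partition(n):
    # Synthetic regression task: y = sin(x) + Gaussian noise.
    X = rng.uniform(-3, 3, size=(n, 1))
    y = np.sin(X[:, 0]) + rng.normal(scale=0.3, size=n)
    return X, y

# Pretend the large dataset is stored as many disjoint partitions,
# as a distributed file system would hold it.
n_partitions, n_per_partition = 50, 2000
partitions = [make_partition(n_per_partition) for _ in range(n_partitions)]

# Query points at which we study the learner's behaviour.
X_test = np.linspace(-3, 3, 200).reshape(-1, 1)
f_test = np.sin(X_test[:, 0])  # noiseless target, known only because the data is synthetic

# Fit one deliberately shallow (hence biased) learner per partition.
preds = np.empty((n_partitions, len(X_test)))
for i, (X, y) in enumerate(partitions):
    model = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
    preds[i] = model.predict(X_test)

avg_pred = preds.mean(axis=0)      # average prediction across partitions
bias_hat = avg_pred - f_test       # estimated bias of the learner at each query point
corrected = preds[0] - bias_hat    # bias-corrected prediction of one fitted model

print("MSE before correction:", np.mean((preds[0] - f_test) ** 2))
print("MSE after correction: ", np.mean((corrected - f_test) ** 2))
```

The per-partition fits and their aggregation are the part of this sketch that scales naturally to a distributed setting; the noiseless target is used here only to keep the illustration self-contained, whereas a data-driven bias estimate would have to be built from held-out observations.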