Joffrey L. Leevy, T. Khoshgoftaar, Richard A. Bauder, Naeem Seliya
Title: The Effect of Time on the Maintenance of a Predictive Model
DOI: 10.1109/ICMLA.2019.00304
Venue: 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA)
Publication date: 2019-12-01
Citations: 5
Abstract
Periodic updating of a machine learning model may become necessary because new data could have a distribution that has drifted significantly over time from the original data distribution, thus impacting the model's usefulness. The primary objective of this paper is to evaluate the influence of time on the maintenance of a predictive model. We investigate the impact of using training data from various year-groupings on a model designed to detect Medicare Part B billing fraud. Training datasets are obtained from year-groupings of 2015, 2014-2015, 2013-2015, and 2012-2015. The test dataset is represented by 2016 data. Our study utilizes five popular learners and five class ratios obtained by Random Undersampling. Using the Area Under the Receiver Operating Characteristic (ROC) Curve as the performance metric, our case study indicates that the Logistic Regression learner yields the highest overall value for the year-grouping of 2013-2015, with a majority-to-minority ratio of 90:10. For the problem of maintaining predictive models for Medicare fraud, we conclude that a sampled dataset should be chosen over the full dataset and that the largest training dataset (i.e., 2012-2015) does not always produce the best results. To the best of our knowledge, this is the first big data study that examines the influence of time on the maintenance of machine learning models.
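The Random Undersampling step described in the abstract can be sketched as follows. This is an illustrative implementation, not the authors' code: the function name, the synthetic claim counts, and the fixed seed are all assumptions; only the idea of discarding randomly chosen majority-class (non-fraud) instances until a target majority-to-minority ratio such as 90:10 is reached comes from the paper.

```python
import random

def random_undersample(majority, minority, ratio=(90, 10), seed=0):
    """Keep all minority-class instances and randomly sample the
    majority class down to the target majority:minority ratio.
    A sketch of Random Undersampling; names are illustrative."""
    maj_share, min_share = ratio
    # Majority count needed so kept_majority : minority == maj_share : min_share
    target = len(minority) * maj_share // min_share
    rng = random.Random(seed)
    kept_majority = rng.sample(majority, min(target, len(majority)))
    return kept_majority, list(minority)

# Hypothetical example: 100,000 non-fraud claims vs. 500 fraud claims,
# undersampled to the 90:10 ratio the paper found best.
majority = list(range(100_000))
minority = list(range(500))
kept_maj, kept_min = random_undersample(majority, minority, ratio=(90, 10))
print(len(kept_maj), len(kept_min))  # 4500 500
```

At a 90:10 ratio the 500 minority instances call for 500 × 90 / 10 = 4,500 majority instances, so the training set shrinks from 100,500 to 5,000 rows while every fraud example is retained.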