Sweeper
Nutthawut Thawanthaleunglit, K. Sripanidkulchai
Proceedings of the 2019 3rd International Conference on Software and e-Business
Published: 2019-12-09
DOI: 10.1145/3374549.3374574 (https://doi.org/10.1145/3374549.3374574)
Citation count: 1
Abstract
Data processing before model creation is an essential step in the data science workflow. Using erroneous data, such as data with missing values, class imbalance, or skew, may degrade model performance and classification outcomes. When models perform poorly, practitioners often focus on improving the models and overlook data quality processing, because it requires the involvement of data experts or data owners who have an intimate understanding of the data. Furthermore, data quality processing is challenging because no one-size-fits-all solution suits all types of data and models. Each dataset and model therefore requires its own combination of data quality processing methods, chosen before model generation, to improve the model's performance in terms of accuracy, precision, and recall. Finding the most effective way to prepare data requires many manual iterations of trial and error. In this paper, we design and develop Sweeper, a tool that automatically explores combinations of data quality processing methods and models, ranks them, and identifies the most suitable one for the given data. Sweeper is simple to use and can reduce the manual workload for practitioners in improving data classification performance.
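The idea of exploring and ranking combinations of data quality processing methods and models can be illustrated with a minimal sketch. This is not Sweeper's implementation, which the abstract does not describe. The toy dataset, the two imputation methods, the two scaling options, the two simple classifiers, the use of leave-one-out accuracy as the only ranking metric, and all function names (`make_imputer`, `rank_pipelines`, and so on) are assumptions chosen for illustration.

```python
from itertools import product
from statistics import mean, median

# Hypothetical toy data: two features per row, None marks a missing value.
X = [[1.0, 2.0], [1.2, None], [0.9, 2.1], [1.1, 1.9],
     [3.0, 4.0], [3.2, 4.1], [None, 3.9], [2.9, 4.2]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

def make_imputer(stat):
    """Return a fitter that fills missing values with a per-column statistic."""
    def fit(train):
        fills = [stat([v for v in col if v is not None]) for col in zip(*train)]
        return lambda rows: [[fills[j] if v is None else v
                              for j, v in enumerate(row)] for row in rows]
    return fit

def no_scaling(train):
    """Fitter that leaves the features unchanged."""
    return lambda rows: rows

def minmax_scaling(train):
    """Fitter that rescales each column to [0, 1] using the training range."""
    lo = [min(col) for col in zip(*train)]
    hi = [max(col) for col in zip(*train)]
    return lambda rows: [[(v - l) / (h - l) if h > l else 0.0
                          for v, l, h in zip(row, lo, hi)] for row in rows]

def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def nearest_centroid(X_train, y_train):
    """Classify a point by the closest class mean."""
    centroids = {c: [mean(col) for col in
                     zip(*[x for x, label in zip(X_train, y_train) if label == c])]
                 for c in set(y_train)}
    return lambda x: min(centroids, key=lambda c: sq_dist(x, centroids[c]))

def one_nearest_neighbor(X_train, y_train):
    """Classify a point by the label of its closest training point."""
    pairs = list(zip(X_train, y_train))
    return lambda x: min(pairs, key=lambda p: sq_dist(x, p[0]))[1]

# Candidate methods to combine (all hypothetical choices for this sketch).
IMPUTERS = {"mean": make_imputer(mean), "median": make_imputer(median)}
SCALERS = {"none": no_scaling, "minmax": minmax_scaling}
MODELS = {"centroid": nearest_centroid, "1nn": one_nearest_neighbor}

def loo_accuracy(X, y, imputer, scaler, model):
    """Leave-one-out accuracy; preprocessing is fitted on the training fold only."""
    hits = 0
    for i in range(len(X)):
        X_train, y_train = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        impute = imputer(X_train)
        X_filled = impute(X_train)
        scale = scaler(X_filled)
        predict = model(scale(X_filled), y_train)
        x_test = scale(impute([X[i]]))[0]
        hits += predict(x_test) == y[i]
    return hits / len(X)

def rank_pipelines(X, y):
    """Score every imputer/scaler/model combination and sort best first."""
    results = []
    for (i_name, imp), (s_name, sc), (m_name, mod) in product(
            IMPUTERS.items(), SCALERS.items(), MODELS.items()):
        results.append(((i_name, s_name, m_name), loo_accuracy(X, y, imp, sc, mod)))
    return sorted(results, key=lambda r: r[1], reverse=True)

ranking = rank_pipelines(X, y)
for names, score in ranking:
    print(names, round(score, 3))
```

Fitting each preprocessing step on the training fold only keeps information from the held-out row out of the score. A fuller version would also report precision and recall, the other two metrics the abstract names, and would cover methods for class imbalance and skew.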