Sweeper

Proceedings of the 2019 3rd International Conference on Software and e-Business. Pub Date: 2019-12-09. DOI: 10.1145/3374549.3374574

Nutthawut Thawanthaleunglit, K. Sripanidkulchai


Citations: 1

Abstract

Data processing prior to creating models is an essential step in the data science workflow. Using erroneous data, such as data with missing values, class imbalance, or skew, may degrade model performance and classification outcomes. When models perform poorly, practitioners often focus on improving the models but overlook data quality processing, because it requires the involvement of data experts or data owners who understand the data intimately. Furthermore, data quality processing is challenging because no one-size-fits-all solution suits all types of data and models. Each dataset and model therefore requires its own combination of data quality processing methods, identified before model generation, to improve the model's accuracy, precision, and recall. Finding the most effective way to prepare data requires many manual iterations of trial and error. In this paper, we design and develop Sweeper, a tool that automatically explores combinations of data quality processing methods and models, then ranks them to identify the most suitable combination for the given data. Sweeper is simple to use and can reduce the manual workload of practitioners seeking to improve data classification performance.
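The idea of exploring and ranking combinations can be illustrated with a minimal sketch. It is not Sweeper's implementation: the toy dataset, the two processing stages (missing-value handling and class balancing), the two toy models, and every function name below are assumptions made for illustration, and skew handling is omitted. The sketch tries every combination of the three stages, scores each on held-out data, and ranks the results by accuracy, breaking ties on recall.

```python
from itertools import product
from statistics import mean, median

# Hypothetical toy data: (feature, label) pairs; None marks a missing value.
TRAIN = [(1.0, 0), (1.2, 0), (None, 0), (0.8, 0), (1.1, 0), (0.9, 0),
         (3.0, 1), (None, 1), (3.2, 1)]
TEST = [(1.0, 0), (0.7, 0), (1.3, 0), (2.9, 1), (3.4, 1), (1.9, 1)]

# Missing-value handlers (illustrative choices, not taken from the paper).
def drop_missing(rows):
    return [(x, y) for x, y in rows if x is not None]

def impute_with(stat):
    def impute(rows):
        fill = stat([x for x, _ in rows if x is not None])
        return [(fill if x is None else x, y) for x, y in rows]
    return impute

# Class-imbalance handlers.
def keep_as_is(rows):
    return list(rows)

def oversample(rows):
    groups = {}
    for x, y in rows:
        groups.setdefault(y, []).append((x, y))
    size = max(len(g) for g in groups.values())
    # Repeat each class cyclically until all classes have `size` rows.
    return [g[i % len(g)] for g in groups.values() for i in range(size)]

# Toy models: each fit function returns a predict(x) callable.
def fit_majority(rows):
    labels = [y for _, y in rows]
    top = max(sorted(set(labels)), key=labels.count)
    return lambda x: top

def fit_nearest_centroid(rows):
    centroids = {c: mean([x for x, y in rows if y == c])
                 for c in sorted({y for _, y in rows})}
    return lambda x: min(centroids, key=lambda c: abs(x - centroids[c]))

CLEANERS = {"drop_missing": drop_missing,
            "impute_mean": impute_with(mean),
            "impute_median": impute_with(median)}
BALANCERS = {"none": keep_as_is, "oversample": oversample}
MODELS = {"majority": fit_majority, "nearest_centroid": fit_nearest_centroid}

def score(predict, rows, positive=1):
    pairs = [(predict(x), y) for x, y in rows]
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    return {"accuracy": sum(p == y for p, y in pairs) / len(pairs),
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0}

def sweep(train, test):
    results = []
    for (c, clean), (b, balance), (m, fit) in product(
            CLEANERS.items(), BALANCERS.items(), MODELS.items()):
        predict = fit(balance(clean(train)))
        results.append({"cleaner": c, "balancer": b, "model": m,
                        **score(predict, test)})
    # Best first: rank by accuracy, breaking ties on recall.
    return sorted(results, key=lambda r: (r["accuracy"], r["recall"]),
                  reverse=True)

ranking = sweep(TRAIN, TEST)
for row in ranking[:3]:
    print(row)
```

The number of combinations grows multiplicatively with each added stage or option, which is the manual trial-and-error cost that an automated sweep is meant to absorb.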
