Eva C. Serrano Balderas, Laure Berti-Équille, Maria Aurora Armienta Hernandez, C. Grac
{"title":"Principled Data Preprocessing: Application to Biological Aquatic Indicators of Water Pollution","authors":"Eva C. Serrano Balderas, Laure Berti-Équille, Maria Aurora Armienta Hernandez, C. Grac","doi":"10.1109/DEXA.2017.27","DOIUrl":null,"url":null,"abstract":"In many biological studies, statistical and data mining methods are extensively used to analyze the data and discover actionable knowledge. But, bad data quality causing incorrect analysis results and wrong interpretations may induce misleading conclusions and inadequate decisions. To ensure the validity of the results, avoid bias and data misuse, it is necessary to control not only the whole analytical pipeline, but most importantly the quality of the data with appropriate data preprocessing choices. Since various preprocessing techniques and alternative strategies may lead to dramatically different outputs, it is crucial to rely on a principled and rigorous method to select the optimal set of data preprocessing steps that depends both on the input data distributional characteristics and on the inherent characteristics of the targeted statistical or data mining methods. In this paper, we propose a method that selects, given a dataset, the optimal set of preprocessing tasks to apply to the data such that the overall data preprocessing output maximizes the quality of the analytical results for various techniques of clustering, regression, and classification. We present some promising results that validate our approach on biomonitoring data preparation.","PeriodicalId":127009,"journal":{"name":"2017 28th International Workshop on Database and Expert Systems Applications (DEXA)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 28th International Workshop on Database and Expert Systems Applications (DEXA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DEXA.2017.27","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7
Abstract
In many biological studies, statistical and data mining methods are extensively used to analyze the data and discover actionable knowledge. But, bad data quality causing incorrect analysis results and wrong interpretations may induce misleading conclusions and inadequate decisions. To ensure the validity of the results, avoid bias and data misuse, it is necessary to control not only the whole analytical pipeline, but most importantly the quality of the data with appropriate data preprocessing choices. Since various preprocessing techniques and alternative strategies may lead to dramatically different outputs, it is crucial to rely on a principled and rigorous method to select the optimal set of data preprocessing steps that depends both on the input data distributional characteristics and on the inherent characteristics of the targeted statistical or data mining methods. In this paper, we propose a method that selects, given a dataset, the optimal set of preprocessing tasks to apply to the data such that the overall data preprocessing output maximizes the quality of the analytical results for various techniques of clustering, regression, and classification. We present some promising results that validate our approach on biomonitoring data preparation.