{"title":"Data Cleansing with PDI for Improving Data Quality","authors":"Siti Aulia Noor, T. F. Kusumasari, M. A. Hasibuan","doi":"10.5220/0009868102560261","DOIUrl":null,"url":null,"abstract":": Technological developments that will quickly produce diverse data or information can improve the decision-making process. This causes the organization to require quality data so that it can be used as a basis for decision making that can truly be trusted. Data quality is an important supporting factor for processing data to produce valid information that can be beneficial to the company. Therefore, in this paper we will discuss data cleaning to improve data quality by using open source tools. As an open source tool used in this paper is Pentaho Data Integration (PDI). The cleaning data collection method in this paper includes data profiles, determine the processing algorithm for data cleansing, mapping algorithms of data collection to components in the PDI, and finally evaluating. Evaluation is done by comparing the results of research with existing data cleaning tools (OpenRefine and Talend). The results of the implementation of data cleansing show the character of data settings that form for Drug Circular Permit numbers with an accuracy of 0.0614. The advantage of the results of this study is that the data sources used can consist of databases with various considerations.","PeriodicalId":394577,"journal":{"name":"Proceedings of the International Conference on Creative Economics, Tourism and Information Management","volume":"148 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the International Conference on Creative Economics, Tourism and Information Management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.5220/0009868102560261","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Technological developments rapidly produce diverse data and information that can improve the decision-making process. As a result, organizations require quality data that can serve as a trustworthy basis for decision making. Data quality is an important supporting factor in processing data into valid information that benefits the company. This paper therefore discusses data cleansing to improve data quality using an open-source tool, Pentaho Data Integration (PDI). The data-cleansing method in this paper comprises data profiling, determining the processing algorithm for data cleansing, mapping the algorithm to components in PDI, and finally evaluating the result. Evaluation is performed by comparing the results of this research against existing data-cleansing tools (OpenRefine and Talend). The implementation results show the character-formatting rules applied to Drug Circular Permit numbers, with an accuracy of 0.0614. An advantage of this study is that the data sources used can consist of databases with various considerations.
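The abstract outlines the workflow (profile the data, define a cleansing rule, apply it, then evaluate) without reproducing the actual PDI transformation. The sketch below is a minimal, hypothetical illustration of that workflow in plain Python rather than PDI; the column values, the permit-number pattern, and the normalization steps are assumptions for illustration only, not the rules used in the paper.

```python
import re
from collections import Counter

# Assumed format for a drug circulation permit number: 3 letters followed
# by 12 digits. This pattern is an illustrative assumption; the paper does
# not publish its exact formatting rule.
PERMIT_PATTERN = re.compile(r"^[A-Z]{3}\d{12}$")

def profile(values):
    """Simple data profiling step: count values that match the expected format."""
    return dict(Counter(
        "valid" if PERMIT_PATTERN.match(v or "") else "invalid" for v in values
    ))

def cleanse(value):
    """Cleansing rule: trim, uppercase, and strip separators, then validate."""
    if value is None:
        return None
    cleaned = re.sub(r"[\s.\-]", "", value.strip().upper())
    return cleaned if PERMIT_PATTERN.match(cleaned) else None

if __name__ == "__main__":
    raw = ["dkl 1234567890 12", "DKL123456789012", "??", None]
    print("before:", profile(raw))                      # profile raw records
    cleaned = [cleanse(v) for v in raw]
    print("after: ", profile(cleaned))                  # profile cleansed records
```

In PDI itself, the equivalent steps would typically be built from Table Input, string-manipulation, and regex-evaluation components rather than hand-written code; the sketch only mirrors the profile-then-cleanse logic the abstract describes.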