Data Cleansing with PDI for Improving Data Quality

Siti Aulia Noor, T. F. Kusumasari, M. A. Hasibuan
Proceedings of the International Conference on Creative Economics, Tourism and Information Management. DOI: 10.5220/0009868102560261 (https://doi.org/10.5220/0009868102560261)
Citations: 2

Abstract

Technological developments that rapidly produce diverse data and information can improve decision making. Organizations therefore need quality data that can serve as a trustworthy basis for decisions. Data quality is an important supporting factor in processing data into valid information that benefits the company. This paper discusses data cleansing with open source tools to improve data quality. The open source tool used is Pentaho Data Integration (PDI). The data cleansing method consists of data profiling, determining the processing algorithm for data cleansing, mapping that algorithm to components in PDI, and evaluation. Evaluation is done by comparing the results with those of existing data cleaning tools (OpenRefine and Talend). The results of the data cleansing implementation show the character patterns of the data for Drug Circular Permit numbers, with an accuracy of 0.0614. An advantage of this study is that the data sources used can consist of databases with various considerations.
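The profiling and cleansing steps named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' PDI transformation. It assumes a common profiling technique (character-pattern profiling, where letters become A and digits become 9), and the function names, the normalization rule, and the sample permit numbers are all hypothetical.

```python
from collections import Counter


def char_pattern(value: str) -> str:
    """Map letters to 'A', digits to '9', and keep every other character."""
    out = []
    for ch in value:
        if ch.isalpha():
            out.append("A")
        elif ch.isdigit():
            out.append("9")
        else:
            out.append(ch)
    return "".join(out)


def profile_patterns(values):
    """Count how often each character pattern occurs in a column."""
    return Counter(char_pattern(v) for v in values)


def cleanse(values):
    """Trim whitespace and upper-case values, then split them by dominant pattern.

    Assumes a non-empty input; the normalization rule is an illustrative choice.
    """
    normalized = [v.strip().upper() for v in values]
    dominant, _ = profile_patterns(normalized).most_common(1)[0]
    conforming = [v for v in normalized if char_pattern(v) == dominant]
    flagged = [v for v in normalized if char_pattern(v) != dominant]
    return dominant, conforming, flagged


# Hypothetical sample values, invented for illustration only.
sample = ["DKL1234567890A1", " dkl9876543210b2 ", "DKL-123", "DKL5555555555C3"]
dominant, conforming, flagged = cleanse(sample)
```

In a PDI transformation, each function would roughly correspond to one or more steps chosen during the mapping stage. Which steps the authors actually used is not stated in the abstract.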