DATA CLEANING BEFORE UPLOADING TO STORAGE

Elvin Jafarov
{"title":"DATA CLEANING BEFORE UPLOADING TO STORAGE","authors":"Elvin Jafarov Elvin Jafarov","doi":"10.36962/etm13012023-117","DOIUrl":null,"url":null,"abstract":"The article considered the issue of cleaning big data before uploading it to storage. At this time, the errors made and the methods of eliminating these errors have been clarified.\nThe technology of creating a big data storage and analysis system is reviewed, as well as solutions for the implementation of the first stages of the Data Science process: data acquisition, cleaning and loading are described. The results of the research allow us to move towards the realization of future steps in the field of big data processing.\nIt was noted that Data cleansing is an essential step in working with big data, as any analysis based on inaccurate data can lead to erroneous results.\nAlso, it was noted that cleaning and consolidation of data can also be performed when the data is loaded into a distributed file system.\nThe methods of uploading data to the storage system have been tested. An assembly from Hortonworks was used as the implementation. The easiest way to upload is to use the web interface of the Ambari system or to use HDFS commands to upload to HDFS Hadoop from the local system.\nIt has been shown that the ETL process should be considered more broadly than just importing data from receivers, minimal transformations and loading procedures into the warehouse. Data cleaning should become a mandatory stage of work, because the cost of storage is determined not only by the amount of data, but also by the quality of the information collected. \nKeywords: Big Data, Data Cleaning, Storage System, ETL process, Loading methods.","PeriodicalId":246138,"journal":{"name":"ETM - Equipment, Technologies, Materials","volume":"41 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-02-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ETM - Equipment, Technologies, Materials","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.36962/etm13012023-117","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

The article addresses the problem of cleaning big data before it is uploaded to storage: it identifies common errors in the data and describes methods for eliminating them.

The technology for building a big data storage and analysis system is reviewed, and solutions for the first stages of the Data Science process are described: data acquisition, cleaning, and loading. The results of this research lay the groundwork for the subsequent stages of big data processing.

Data cleaning is an essential step in working with big data, since any analysis based on inaccurate data can produce erroneous results. Cleaning and consolidation can also be performed while the data is being loaded into a distributed file system.

Several methods of uploading data to the storage system were tested, using a Hortonworks distribution as the implementation platform. The simplest approaches are to upload files through the web interface of the Ambari system or to use HDFS shell commands to copy them from the local file system into Hadoop HDFS.

The article argues that the ETL process should be understood more broadly than merely importing data from source systems, applying minimal transformations, and loading the results into the warehouse. Data cleaning should become a mandatory stage of the work, because the cost of storage is determined not only by the volume of the data but also by the quality of the information collected.

Keywords: Big Data, Data Cleaning, Storage System, ETL process, Loading methods.
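To make the clean-then-load path concrete, the following is a minimal sketch, not the authors' implementation: the file names, column names, and HDFS target directory are hypothetical placeholders. It cleans a CSV file locally with pandas and then copies the result into HDFS with the standard hdfs dfs -put command, the same operation that the Ambari web interface performs through its Files view.

# Minimal sketch: clean a local CSV, then upload it to HDFS.
# File paths and column names ("id", "amount") are hypothetical placeholders.
import subprocess
import pandas as pd

df = pd.read_csv("raw_records.csv")

# Typical cleaning steps before loading into the warehouse:
df = df.drop_duplicates()                                     # remove exact duplicate rows
df = df.dropna(subset=["id"])                                 # drop rows missing the key field
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")   # coerce malformed numerics to NaN
df = df.dropna(subset=["amount"])                             # discard rows that failed conversion

df.to_csv("clean_records.csv", index=False)

# Upload the cleaned file to HDFS with the standard CLI
# (-f overwrites an existing file at the destination).
subprocess.run(
    ["hdfs", "dfs", "-put", "-f", "clean_records.csv", "/data/clean/"],
    check=True,
)

Running the cleaning step before the upload keeps malformed rows out of the warehouse, which reflects the article's point that storage cost is driven by the quality of the collected information as well as its volume.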