提高非结构化大数据仓库执行时间的模型

IF 3.7 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Marwa Salah Farhan, Amira Youssef, Laila Abdelhamid
{"title":"提高非结构化大数据仓库执行时间的模型","authors":"Marwa Salah Farhan, Amira Youssef, Laila Abdelhamid","doi":"10.3390/bdcc8020017","DOIUrl":null,"url":null,"abstract":"Traditional data warehouses (DWs) have played a key role in business intelligence and decision support systems. However, the rapid growth of the data generated by the current applications requires new data warehousing systems. In big data, it is important to adapt the existing warehouse systems to overcome new issues and limitations. The main drawbacks of traditional Extract–Transform–Load (ETL) are that a huge amount of data cannot be processed over ETL and that the execution time is very high when the data are unstructured. This paper focuses on a new model consisting of four layers: Extract–Clean–Load–Transform (ECLT), designed for processing unstructured big data, with specific emphasis on text. The model aims to reduce execution time through experimental procedures. ECLT is applied and tested using Spark, which is a framework employed in Python. Finally, this paper compares the execution time of ECLT with different models by applying two datasets. Experimental results showed that for a data size of 1 TB, the execution time of ECLT is 41.8 s. When the data size increases to 1 million articles, the execution time is 119.6 s. These findings demonstrate that ECLT outperforms ETL, ELT, DELT, ELTL, and ELTA in terms of execution time.","PeriodicalId":36397,"journal":{"name":"Big Data and Cognitive Computing","volume":null,"pages":null},"PeriodicalIF":3.7000,"publicationDate":"2024-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A Model for Enhancing Unstructured Big Data Warehouse Execution Time\",\"authors\":\"Marwa Salah Farhan, Amira Youssef, Laila Abdelhamid\",\"doi\":\"10.3390/bdcc8020017\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Traditional data warehouses (DWs) have played a key role in business intelligence and decision support systems. However, the rapid growth of the data generated by the current applications requires new data warehousing systems. In big data, it is important to adapt the existing warehouse systems to overcome new issues and limitations. The main drawbacks of traditional Extract–Transform–Load (ETL) are that a huge amount of data cannot be processed over ETL and that the execution time is very high when the data are unstructured. This paper focuses on a new model consisting of four layers: Extract–Clean–Load–Transform (ECLT), designed for processing unstructured big data, with specific emphasis on text. The model aims to reduce execution time through experimental procedures. ECLT is applied and tested using Spark, which is a framework employed in Python. Finally, this paper compares the execution time of ECLT with different models by applying two datasets. Experimental results showed that for a data size of 1 TB, the execution time of ECLT is 41.8 s. When the data size increases to 1 million articles, the execution time is 119.6 s. These findings demonstrate that ECLT outperforms ETL, ELT, DELT, ELTL, and ELTA in terms of execution time.\",\"PeriodicalId\":36397,\"journal\":{\"name\":\"Big Data and Cognitive Computing\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.7000,\"publicationDate\":\"2024-02-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Big Data and Cognitive Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3390/bdcc8020017\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Big Data and Cognitive Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/bdcc8020017","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

摘要

传统的数据仓库(DW)在商业智能和决策支持系统中发挥了关键作用。然而,当前应用所产生数据的快速增长需要新的数据仓库系统。在大数据中,必须对现有的仓库系统进行调整,以克服新的问题和限制。传统的提取-转换-加载(ETL)的主要缺点是无法通过 ETL 处理海量数据,而且当数据是非结构化数据时,执行时间非常长。本文重点介绍一种由四层组成的新模型:Extract-Clean-Load-Transform (ECLT),专为处理非结构化大数据而设计,特别强调文本。该模型旨在通过实验程序缩短执行时间。ECLT 使用 Spark 进行了应用和测试,Spark 是一个使用 Python 的框架。最后,本文通过应用两个数据集,比较了 ECLT 与不同模型的执行时间。实验结果表明,当数据量为 1 TB 时,ECLT 的执行时间为 41.8 秒;当数据量增加到 100 万条时,执行时间为 119.6 秒。这些结果表明,ECLT 在执行时间方面优于 ETL、ELT、DELT、ELTL 和 ELTA。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
A Model for Enhancing Unstructured Big Data Warehouse Execution Time
Traditional data warehouses (DWs) have played a key role in business intelligence and decision support systems. However, the rapid growth of the data generated by the current applications requires new data warehousing systems. In big data, it is important to adapt the existing warehouse systems to overcome new issues and limitations. The main drawbacks of traditional Extract–Transform–Load (ETL) are that a huge amount of data cannot be processed over ETL and that the execution time is very high when the data are unstructured. This paper focuses on a new model consisting of four layers: Extract–Clean–Load–Transform (ECLT), designed for processing unstructured big data, with specific emphasis on text. The model aims to reduce execution time through experimental procedures. ECLT is applied and tested using Spark, which is a framework employed in Python. Finally, this paper compares the execution time of ECLT with different models by applying two datasets. Experimental results showed that for a data size of 1 TB, the execution time of ECLT is 41.8 s. When the data size increases to 1 million articles, the execution time is 119.6 s. These findings demonstrate that ECLT outperforms ETL, ELT, DELT, ELTL, and ELTA in terms of execution time.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Big Data and Cognitive Computing
Big Data and Cognitive Computing Business, Management and Accounting-Management Information Systems
CiteScore
7.10
自引率
8.10%
发文量
128
审稿时长
11 weeks
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信