Case Study of an On-premise Data Warehouse Configuration

Bence Bogdandy, Adam Kovacs, Zsolt Tóth
{"title":"本地数据仓库配置的案例研究","authors":"Bence Bogdandy, Adam Kovacs, Zsolt Tóth","doi":"10.1109/CogInfoCom50765.2020.9237814","DOIUrl":null,"url":null,"abstract":"The development of machine learning over the years has facilitated the joint upsurge of complex cognitive infocommunication systems. Machine Learning methods are vital elements of modern cognitive infocommunications systems because they can be used in various ways such as behavior modeling or sentiment analysis. Machine Learning algorithms requires a reliable infrastructure and vast amount of data. Therefore building data warehouse systems is one of the essential steps of of building reliable cognitive infocommunication systems. Finding and preprocessing data streams of different origins are the first steps during the creation of a data warehouse. Unfortunately, online data streams are most often formatted uniquely. Therefore, the obtained data sets must be transformed into a unified data model. The modelling and conversion of data sources serves as a key step during the unification of heterogeneous data. Storage should be persistent, and optimized for the analytical processing of data. These requirements raise technological challenges that are not common during the design of data sources. This paper gives an overview of current data warehouse technologies and suggests an infrastructure implementation. Hive is used for accessing, modifying, and running complex analytics on the stored data sets. Economical data can often be unique to the product, or the industry it covers. Different data sources used unique data formats which were tailored for their application area or needs. Moreover, some of these data sources may change their format in time. Therefore, a flexible data transformation step is required which can be configured easily. The ETL processes of the data sources are implemented in Python, and Hive. The data is loaded in a Hive data warehouse which stores data in the distributed Hadoop File System.","PeriodicalId":236400,"journal":{"name":"2020 11th IEEE International Conference on Cognitive Infocommunications (CogInfoCom)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Case Study of an On-premise Data Warehouse Configuration\",\"authors\":\"Bence Bogdandy, Adam Kovacs, Zsolt Tóth\",\"doi\":\"10.1109/CogInfoCom50765.2020.9237814\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The development of machine learning over the years has facilitated the joint upsurge of complex cognitive infocommunication systems. Machine Learning methods are vital elements of modern cognitive infocommunications systems because they can be used in various ways such as behavior modeling or sentiment analysis. Machine Learning algorithms requires a reliable infrastructure and vast amount of data. Therefore building data warehouse systems is one of the essential steps of of building reliable cognitive infocommunication systems. Finding and preprocessing data streams of different origins are the first steps during the creation of a data warehouse. Unfortunately, online data streams are most often formatted uniquely. Therefore, the obtained data sets must be transformed into a unified data model. The modelling and conversion of data sources serves as a key step during the unification of heterogeneous data. 
Storage should be persistent, and optimized for the analytical processing of data. These requirements raise technological challenges that are not common during the design of data sources. This paper gives an overview of current data warehouse technologies and suggests an infrastructure implementation. Hive is used for accessing, modifying, and running complex analytics on the stored data sets. Economical data can often be unique to the product, or the industry it covers. Different data sources used unique data formats which were tailored for their application area or needs. Moreover, some of these data sources may change their format in time. Therefore, a flexible data transformation step is required which can be configured easily. The ETL processes of the data sources are implemented in Python, and Hive. The data is loaded in a Hive data warehouse which stores data in the distributed Hadoop File System.\",\"PeriodicalId\":236400,\"journal\":{\"name\":\"2020 11th IEEE International Conference on Cognitive Infocommunications (CogInfoCom)\",\"volume\":\"71 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-09-23\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 11th IEEE International Conference on Cognitive Infocommunications (CogInfoCom)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CogInfoCom50765.2020.9237814\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 11th IEEE International Conference on Cognitive Infocommunications (CogInfoCom)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CogInfoCom50765.2020.9237814","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

The development of machine learning over the years has driven a parallel rise in complex cognitive infocommunication systems. Machine learning methods are vital elements of modern cognitive infocommunication systems because they can be used in various ways, such as behavior modeling or sentiment analysis. Machine learning algorithms require a reliable infrastructure and vast amounts of data, so building data warehouse systems is one of the essential steps of building reliable cognitive infocommunication systems. Finding and preprocessing data streams of different origins are the first steps in the creation of a data warehouse. Unfortunately, online data streams are most often formatted uniquely, and the obtained data sets must therefore be transformed into a unified data model. The modelling and conversion of data sources is a key step in the unification of heterogeneous data. Storage should be persistent and optimized for the analytical processing of data. These requirements raise technological challenges that are not common during the design of data sources. This paper gives an overview of current data warehouse technologies and suggests an infrastructure implementation. Hive is used for accessing, modifying, and running complex analytics on the stored data sets. Economic data can often be unique to the product or the industry it covers, and different data sources use unique data formats tailored to their application area or needs. Moreover, some of these data sources may change their format over time, so a flexible, easily configurable data transformation step is required. The ETL processes of the data sources are implemented in Python and Hive, and the data is loaded into a Hive data warehouse that stores its data in the Hadoop Distributed File System (HDFS).
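
The abstract does not include implementation details, but the pipeline it describes (a configurable transformation of heterogeneous sources into a unified model, followed by loading into a Hive table backed by HDFS) can be illustrated with a minimal sketch. The Python snippet below is an assumption-laden illustration, not the authors' code: the source layouts, file paths, the unified schema (record_date, product, price), and the table name economic_data are all hypothetical.

"""
Illustrative ETL sketch (not the authors' pipeline): normalize two hypothetical
source formats into one unified data model and emit HiveQL to create the target
table. Paths, column names, and the unified schema are assumptions.
"""
import json
from pathlib import Path

import pandas as pd

# Unified data model assumed for this example: (record_date, product, price)
UNIFIED_COLUMNS = ["record_date", "product", "price"]


def extract_csv_source(path: Path) -> pd.DataFrame:
    """Source A: CSV with columns 'date', 'item', 'unit_price' (assumed layout)."""
    df = pd.read_csv(path)
    return df.rename(columns={"date": "record_date",
                              "item": "product",
                              "unit_price": "price"})[UNIFIED_COLUMNS]


def extract_json_source(path: Path) -> pd.DataFrame:
    """Source B: JSON lines with keys 'day', 'name', 'cost' (assumed layout)."""
    records = [json.loads(line) for line in path.read_text().splitlines() if line.strip()]
    df = pd.DataFrame.from_records(records)
    return df.rename(columns={"day": "record_date",
                              "name": "product",
                              "cost": "price"})[UNIFIED_COLUMNS]


def stage_for_hive(frames: list, staging_dir: Path) -> Path:
    """Write the unified data set as a single delimited file that Hive can load."""
    staging_dir.mkdir(parents=True, exist_ok=True)
    out = staging_dir / "economic_data.csv"
    pd.concat(frames, ignore_index=True).to_csv(out, index=False, header=False)
    return out


HIVE_DDL = """
CREATE TABLE IF NOT EXISTS economic_data (
    record_date STRING,
    product     STRING,
    price       DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
"""

if __name__ == "__main__":
    frames = [extract_csv_source(Path("sources/source_a.csv")),
              extract_json_source(Path("sources/source_b.jsonl"))]
    staged = stage_for_hive(frames, Path("staging"))
    # In a deployment the staged file would be copied to HDFS and loaded, e.g.:
    #   hdfs dfs -put staging/economic_data.csv /warehouse/staging/
    #   LOAD DATA INPATH '/warehouse/staging/economic_data.csv' INTO TABLE economic_data;
    print(HIVE_DDL)
    print(f"Staged unified data set at: {staged}")

In this sketch the per-source extract functions isolate the format-specific logic, which mirrors the flexibility requirement in the abstract: when a source changes its format, only its mapping into the unified columns needs to be reconfigured, while the staging step and the Hive table definition stay unchanged.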