Beatriz Fragnan P. de Oliveira, A. Valente, M. Victorino, E. Ribeiro, M. Holanda
"Analysis of the Influence of Modeling, Data Format and Processing Tool on the Performance of Hadoop-Hive Based Data Warehouse"
J. Inf. Data Manag., 2022-09-21. DOI: https://doi.org/10.5753/jidm.2022.2516
Abstract: With the emergence of Big Data and the continuous growth of massive data produced by web applications, smartphones, social networks, and other sources, organizations have begun to invest in alternative solutions that derive value from this data. In this context, this article evaluates three factors that can significantly influence the performance of Big Data Hive queries: data modeling, data format, and processing tool. The objective is to present a comparative analysis of Hive platform performance with the snowflake model and the fully denormalized one. Moreover, the influence of two table storage file formats (CSV and Parquet) and two data processing tools (Hadoop and Spark) was also comparatively analyzed. The data used for the analysis are the open data of the Brazilian Army in the Google Cloud environment. The analysis was performed for different data volumes in Hive and for several cluster configuration scenarios. The results show that the Parquet storage format always performed better than CSV, regardless of the model and processing tool selected for the test scenario.
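The snowflake-versus-denormalized comparison in the abstract above boils down to whether dimension lookups are resolved by joins at query time or flattened into the fact table in advance. A minimal sketch, with made-up table and column names (not taken from the article):

```python
# Hypothetical miniature of the comparison: a snowflake schema (fact table
# plus normalized dimension tables) versus one fully denormalized wide table.
fact_sales = [
    {"product_id": 1, "region_id": 10, "amount": 250.0},
    {"product_id": 2, "region_id": 10, "amount": 120.0},
]
dim_product = {1: {"product_name": "rations"}, 2: {"product_name": "fuel"}}
dim_region = {10: {"region_name": "Midwest"}}

def denormalize(facts, products, regions):
    """Join every dimension into the fact rows, producing one wide table."""
    wide = []
    for row in facts:
        flat = dict(row)
        flat.update(products[row["product_id"]])
        flat.update(regions[row["region_id"]])
        wide.append(flat)
    return wide

flat_table = denormalize(fact_sales, dim_product, dim_region)
# Queries over flat_table need no joins at all, trading storage volume for
# query speed -- one of the axes the article measures in Hive.
```

The article's finding that Parquet beats CSV is orthogonal to this choice: either table layout can be stored in either format.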
Lorenna Christ'na Nascimento, Rodolfo P. Chagas, Marcos Lage, Daniel de Oliveira
"Beyond Click-and-View: a Comparative Study of Data Management Approaches for Interactive Visualization"
J. Inf. Data Manag., 2022-09-21. DOI: https://doi.org/10.5753/jidm.2022.2513
Abstract: Visual analytics solutions have grown in popularity in recent years, not only for presenting final results but also for assisting interactive analysis and decision-making. Analysis of large amounts of data requires flexible exploration and visualizations. However, queries that span geographical regions over time slices are expensive to compute, which makes it challenging to achieve interactive speeds on huge data sets. Such systems require efficient data availability, so that response time does not interfere with the user's ability to observe and analyze. At the same time, research in the database domain has proposed solutions that can be used to support visualization systems. This article presents a comparative study of data management approaches to support interactive visualizations. The chosen data management solutions are (i) Apache Drill (a polystore system), (ii) Apache Spark (a big data framework), (iii) Elasticsearch (a search engine), (iv) MonetDB (a column-oriented DBMS), and (v) PostgreSQL (a relational DBMS). To evaluate the performance of each solution, we selected a list of spatiotemporal queries among the many queries submitted by users in a visual analytics system for rainfall data analysis named TEMPO. The results of this study show that Apache Spark and MonetDB present the best performance for the selected queries.
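The spatiotemporal queries the abstract mentions combine a spatial filter with a time slice before aggregating. A minimal stand-in for that query shape, with synthetic readings (the real TEMPO queries run against the evaluated back ends, not plain Python):

```python
from datetime import datetime

# Synthetic rainfall readings: position, timestamp, millimeters of rain.
readings = [
    {"lat": -22.9, "lon": -43.2, "ts": datetime(2022, 1, 5), "mm": 12.0},
    {"lat": -22.8, "lon": -43.3, "ts": datetime(2022, 1, 6), "mm": 30.0},
    {"lat": -10.0, "lon": -50.0, "ts": datetime(2022, 1, 5), "mm": 5.0},
]

def total_rainfall(rows, bbox, start, end):
    """Sum rainfall inside bbox = (min_lat, min_lon, max_lat, max_lon)
    and within the closed time interval [start, end]."""
    min_lat, min_lon, max_lat, max_lon = bbox
    return sum(
        r["mm"] for r in rows
        if min_lat <= r["lat"] <= max_lat
        and min_lon <= r["lon"] <= max_lon
        and start <= r["ts"] <= end
    )

# One region-and-month slice; the third reading falls outside the box.
region_total = total_rainfall(readings, (-23.0, -43.5, -22.7, -43.0),
                              datetime(2022, 1, 1), datetime(2022, 1, 31))
```

The cost of exactly this filter-then-aggregate pattern over large data is what makes the choice of back end (Spark, MonetDB, etc.) matter for interactivity.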
Jaqueline B. Correia, F. Rodrigues, Nicolau O. Santos, Mara Abel, K. Becker
"Data Management in Digital Twins for the Oil and Gas Industry: beyond the OSDU Data Platform"
J. Inf. Data Manag., 2022-09-21. DOI: https://doi.org/10.5753/jidm.2022.2506
Abstract: Competitiveness in the Oil and Gas (O&G) sector has required high technological investments for data-centric decisions. One of the trends is the adoption of Digital Twins (DTs), which use virtual spaces and advanced analytical services to monitor and improve physical spaces. Central to the interconnection of these systems is a Data Fusion Core (DFC) component, which provides data management capabilities. Although the literature has proposed data management functionality in the scope of specific O&G DT applications, several joint efforts towards standardization can be found to deal with data integration and interoperability in the industry. The Open Subsurface Data Universe (OSDU) data platform is an initiative by several partner members of The Open Group consortium, created to eliminate data silos in the O&G ecosystem and leverage innovation through a data-driven approach. In this article, we look at the convergence of this effort in providing data management functionalities for digital twins, highlighting strengths, gaps, and opportunities. We investigated the extent to which the OSDU data platform meets the needs of a DFC implementation, with a focus on interoperability, integration, governance, and data lineage. We also propose additional resources for data management in this context, namely data enrichment, workflows, and data lineage. Our main contributions are: (i) an analysis of possible data management capabilities for creating a working DFC for an O&G DT; and (ii) initial ideas on the complementary roles of OSDU data representation and ontologies, and how this semantic enrichment can be leveraged in the DFC of a DT.
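Data lineage, one of the capabilities the abstract proposes for a DFC, amounts to recording which datasets each derived dataset was produced from. A minimal, hypothetical sketch; the field names and dataset ids are illustrative and are not taken from the OSDU data platform:

```python
from dataclasses import dataclass

@dataclass
class LineageRecord:
    dataset_id: str
    derived_from: list   # ids of upstream datasets
    transformation: str  # description of the step that produced this dataset

lineage = {}

def register(dataset_id, derived_from, transformation):
    lineage[dataset_id] = LineageRecord(dataset_id, derived_from, transformation)

def upstream_sources(dataset_id):
    """Walk the lineage graph back to the raw sources of a dataset."""
    rec = lineage.get(dataset_id)
    if rec is None or not rec.derived_from:
        return {dataset_id}
    sources = set()
    for parent in rec.derived_from:
        sources |= upstream_sources(parent)
    return sources

# Illustrative chain: raw sensor data -> cleaned logs -> analytics output.
register("raw_well_logs", [], "ingested from downhole sensors")
register("cleaned_logs", ["raw_well_logs"], "outlier removal")
register("pressure_model", ["cleaned_logs"], "analytics service output")
```

With such records, a DFC can answer "which raw sources does this model output ultimately depend on?", which is the governance question lineage exists to serve.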
Danielle F. de Albuquerque, Luís Tarrataca, Diego N. Brandão, R. Coutinho
"A Genetic Algorithm with Flexible Fitness Function for Feature Selection in Educational Data: Comparative Evaluation"
J. Inf. Data Manag., 2022-09-21. DOI: https://doi.org/10.5753/jidm.2022.2480
Abstract: Educational Data Mining is an interdisciplinary field that helps understand educational phenomena through computational techniques. The databases of educational institutions are usually extensive, possessing many descriptive attributes that make the prediction process complex. In addition, the data can be sparse, redundant, irrelevant, and noisy, which can degrade the predictive quality of the models and affect computational performance. One way to simplify the problem is to identify the least important attributes and omit them from the modeling process. This can be done by employing attribute selection techniques. This work evaluates different feature selection techniques applied to open educational data and paired with a genetic algorithm with a flexible fitness function. The methods and results described herein extend a previously published paper by: (i) describing a larger set of computational experiments; (ii) performing a hypothesis test over different classifiers; and (iii) presenting a more in-depth literature review. The results obtained indicate an improvement in the classification process.
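Genetic feature selection of the kind the abstract describes encodes each candidate feature subset as a bit mask and evolves a population of masks. A toy sketch under stated assumptions: the scoring function below is synthetic (it pretends to know which features carry signal), whereas the paper scores masks with real classifiers; the "flexible" fitness is modeled as quality minus a tunable penalty on subset size:

```python
import random

random.seed(0)  # deterministic toy run

N_FEATURES = 8
USEFUL = {0, 3, 5}  # synthetic ground truth: only these features carry signal

def fitness(mask, alpha=0.1):
    """Blend of predictive quality and a penalty on subset size."""
    chosen = {i for i, bit in enumerate(mask) if bit}
    quality = len(USEFUL & chosen) / len(USEFUL)
    penalty = alpha * sum(mask) / N_FEATURES  # fewer features is better
    return quality - penalty

def evolve(pop_size=20, generations=40):
    pop = [[random.randint(0, 1) for _ in range(N_FEATURES)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # elitist truncation selection
        children = []
        for _ in range(pop_size - len(survivors)):
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, N_FEATURES)  # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.2:              # bit-flip mutation
                j = random.randrange(N_FEATURES)
                child[j] ^= 1
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best_mask = evolve()
selected = {i for i, bit in enumerate(best_mask) if bit}
```

Changing `alpha` shifts the trade-off between accuracy and compactness, which is the kind of flexibility the title refers to.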
Jefferson Amará, Victor Ströele, R. Braga, M. Dantas, Michael A. Bauer
"Integrating Heterogeneous Stream and Historical Data Sources using SQL"
J. Inf. Data Manag., 2022-09-12. DOI: https://doi.org/10.5753/jidm.2022.2488
Abstract: Applications capable of integrating data from historical and streaming sources enable more contextualized and enriched decision-making. However, the complexity of data integration over heterogeneous data sources makes querying in this context a hard task. Approaches that facilitate data integration, abstracting details and formats of the primary sources, can meet these needs. This work presents a framework that allows the integration of streaming and historical data in real time, abstracting syntactic aspects of queries through the use of SQL as a standard language for querying heterogeneous sources. The framework was evaluated through an experiment using relational datasets and real data produced by sensors. The results point to the feasibility of the approach.
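The core idea above, one SQL surface over a historical store and a stream buffer, can be miniaturized with an in-memory SQLite database standing in for both sources. This is an illustrative reduction, not the article's framework, which abstracts genuinely heterogeneous back ends:

```python
import sqlite3

# Both "sources" live in one in-memory database purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE historical (sensor TEXT, value REAL)")
conn.execute("CREATE TABLE stream_buffer (sensor TEXT, value REAL)")
conn.executemany("INSERT INTO historical VALUES (?, ?)",
                 [("temp", 21.5), ("temp", 22.0)])

def on_stream_event(sensor, value):
    """Append an arriving stream tuple to the queryable buffer."""
    conn.execute("INSERT INTO stream_buffer VALUES (?, ?)", (sensor, value))

on_stream_event("temp", 23.5)

# One SQL query spans both sources transparently.
row = conn.execute(
    "SELECT AVG(value) FROM "
    "(SELECT value FROM historical UNION ALL SELECT value FROM stream_buffer)"
).fetchone()
avg_temp = row[0]
```

The point of the pattern is that the querying application never needs to know which tuples came from the archive and which arrived seconds ago.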
Christian Schmitz, Jonathan Martins, Serigne K. Mbaye, Edimar Manica, Renata Galante
"ACERPI-Block: Applying Blocking Techniques to the ACERPI Approach"
J. Inf. Data Manag., 2022-09-12. DOI: https://doi.org/10.5753/jidm.2022.2509
Abstract: Ordinances are documents issued by federal institutions that contain, among other things, information regarding their staff. These documents are accessible through public repositories that usually do not allow any filter or advanced search on the documents' contents. This paper extends ACERPI (an approach to collect documents, extract information, and resolve entities from institutional ordinances), which identifies the people mentioned in ordinances to help users find the documents of interest. ACERPI-Block focuses on the Entity Resolution step of the approach, developing blocking strategies that scale to hundreds of thousands of records being resolved. Experiments show a 93.3% reduction in the number of similarity comparisons between records compared to the solution without blocking, with no decrease in efficacy.
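Blocking, the technique ACERPI-Block applies, avoids comparing every pair of records by only comparing records that share a blocking key. A small illustration with made-up names and a deliberately naive key (first token of the lowercased name); it shows where the reduction in comparisons comes from, not the paper's actual strategies:

```python
from collections import defaultdict
from itertools import combinations

records = ["Maria Silva", "maria silva", "Maria S. Silva",
           "Joao Souza", "joao souza", "Ana Lima"]

def blocking_key(name):
    """Naive illustrative key: first token of the normalized name."""
    return name.lower().split()[0]

def candidate_pairs(recs):
    """Group records by blocking key; compare only within each block."""
    blocks = defaultdict(list)
    for r in recs:
        blocks[blocking_key(r)].append(r)
    pairs = []
    for group in blocks.values():
        pairs.extend(combinations(group, 2))
    return pairs

all_pairs = len(list(combinations(records, 2)))  # exhaustive: 15 comparisons
blocked_pairs = len(candidate_pairs(records))    # with blocking: 4 comparisons
```

Even on six records the quadratic comparison count drops sharply; at the hundreds of thousands of records the paper targets, blocking is what makes resolution feasible at all.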
P. G. K. Bertella, Y. K. Lopes, Rafael Alves Paes de Oliveira, A. Carniel
"A Systematic Review of Spatial Approximations in Spatial Database Systems"
J. Inf. Data Manag., 2022-09-12. DOI: https://doi.org/10.5753/jidm.2022.2519
Abstract: Many applications rely on spatial information retrieval, which involves costly computational geometry algorithms to process spatial queries. Spatial approximations simplify the geometric shape of complex spatial objects, allowing faster spatial queries at the expense of result accuracy. In this sense, spatial approximations have been employed to efficiently reduce the number of objects under consideration, followed by a refinement step to restore accuracy. For instance, spatial index structures employ spatial approximations to organize spatial objects in hierarchical structures (e.g., the R-tree). This motivates the study of how spatial approximations can be efficiently employed to improve spatial query processing. This article presents a systematic review of this topic. We gathered relevant studies by running a search string against several digital libraries, and further expanded the studies under consideration with a single iteration of the snowballing approach, tracking the reference lists of selected papers. As a result, we provide an overview and comparison of existing approaches that propose, evaluate, or make use of spatial approximations to optimize the performance of spatial queries. The spatial approximations mentioned by the approaches are also summarized. Further, we characterize the approaches and discuss some future trends.
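The filter-and-refine pattern the abstract describes is easiest to see with the most common spatial approximation, the minimum bounding rectangle (MBR). A sketch with simplified geometry (polygons as point lists, and no actual refinement predicate, since exact polygon intersection is the costly part a real system implements):

```python
def mbr(points):
    """Minimum bounding rectangle (min_x, min_y, max_x, max_y) of a polygon."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def mbrs_intersect(a, b):
    """Cheap rectangle-overlap test used as the filter step."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

poly1 = [(0, 0), (2, 0), (1, 2)]
poly2 = [(1, 1), (3, 1), (2, 3)]
poly3 = [(10, 10), (12, 10), (11, 12)]

# Filter step: poly3 is discarded by the MBR test alone; only survivors
# would proceed to the costly exact-geometry refinement step.
candidates = [p for p in (poly2, poly3) if mbrs_intersect(mbr(poly1), mbr(p))]
```

R-trees organize exactly these MBRs hierarchically, so the filter step prunes whole subtrees rather than single objects.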
Marisa S. Franco, Simone Dominico, T. R. Kepe, L. Albini, E. Almeida, M. Alves
"Evaluation of Hash Join Operations Performance Executing on SDN Switches: A Cost Model Approach"
J. Inf. Data Manag., 2022-09-12. DOI: https://doi.org/10.5753/jidm.2022.2515
Abstract: Distributed database systems store and manipulate data on multiple machines. In these systems, the processing cost of query operations is mainly impacted by the data access latency between machines over the network. With recent technological advances in programmable network devices, network switches provide new opportunities for dynamically managing the network topology, enabling data processing on these devices at full network throughput. In this paper, we explore programmable network switches for query processing, evaluating with a cost model the performance of executing the hash join operation. We assume that the storage of the hash table built from the outer relation and the materialization of the join probing are done in switches, using advanced matching techniques similar to packet inspection enabled by Ternary Content-Addressable Memories (TCAM) or SRAM via hashing. Our results show that processing the hash join operation on network switches achieved the best results compared to traditional servers, with average time reductions of 91.82% (Query 10 from TPC-H) and 96.52% (Query 11 from TPC-H).
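The operation being offloaded to switches is the classic two-phase hash join: build a hash table on the outer (smaller) relation, then probe it with each tuple of the inner relation. In plain Python, with illustrative relations rather than TPC-H data:

```python
def hash_join(outer, inner, outer_key, inner_key):
    """Two-phase hash join on dict-shaped tuples."""
    table = {}
    for row in outer:                      # build phase
        table.setdefault(row[outer_key], []).append(row)
    result = []
    for row in inner:                      # probe phase
        for match in table.get(row[inner_key], []):
            result.append({**match, **row})
    return result

suppliers = [{"s_id": 1, "nation": "BR"}, {"s_id": 2, "nation": "CA"}]
parts = [{"p_id": 7, "s_id": 1}, {"p_id": 8, "s_id": 1}, {"p_id": 9, "s_id": 3}]

joined = hash_join(suppliers, parts, "s_id", "s_id")
```

The paper's insight is that the build-table lookup in the probe phase is structurally the same as the key matching a switch already performs on packet headers (via TCAM or hashed SRAM), so the join can run where the data is already flowing.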
M. Cazzolato, L. S. Rodrigues, M. X. Ribeiro, M. A. Gutierrez, C. Traina, A. J. Traina
"Sketch+ for Visual and Correlation-Based Exploratory Data Analysis: A Case Study with COVID-19 Databases"
J. Inf. Data Manag., 2022-09-12. DOI: https://doi.org/10.5753/jidm.2022.2484
Abstract: The amount of data generated daily by different sources grows exponentially and brings new challenges to information technology experts. The recorded data usually include heterogeneous attribute types, such as the traditional date, numerical, textual, and categorical information, as well as complex ones, such as images, videos, and multidimensional data. Simply posing similarity queries over such records can underestimate the semantics and potential usefulness of particular attributes. In this context, Exploratory Data Analysis (EDA) technology is well-suited to understand data and perform knowledge extraction and visualization of existing patterns. In this paper, we propose Sketch+, a technique and a corresponding supporting tool to compare electronic health records (provided by hospitals) by similarity, supporting correlation-based exploratory analysis over attributes of different types and allowing data preprocessing tasks for visualization and knowledge extraction. Sketch+ computes partial and overall data correlation considering distance spaces induced by the attributes. It employs both ANOVA and association rules with lift correlations to study relationships between variables, allowing extensive data analysis. Among the tools provided, a pixel-oriented one drives analysts to observe visual correlations among date, categorical, and numerical attributes. As a running case study, we employed three open databases of COVID-19 cases, showing that specialists can benefit from the inference modules of Sketch+ to analyze electronic records. The study highlights how Sketch+ can be employed to spot strong correlations among tuples and attributes, with statistically significant results. The exploratory analysis has been shown to be an essential complement to similarity search tasks, identifying and evaluating patterns from heterogeneous attributes.
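Lift, one of the correlation measures the abstract names, quantifies how much more often two items co-occur than independence would predict. A small worked example with synthetic transactions standing in for attribute-value pairs from health records:

```python
def lift(transactions, x, y):
    """Lift of the rule x -> y: P(x and y) / (P(x) * P(y)).
    A value above 1 means x and y co-occur more often than chance."""
    n = len(transactions)
    p_x = sum(x in t for t in transactions) / n
    p_y = sum(y in t for t in transactions) / n
    p_xy = sum(x in t and y in t for t in transactions) / n
    return p_xy / (p_x * p_y)

# Synthetic "records": sets of observed symptoms per case.
tx = [{"fever", "cough"}, {"fever", "cough"}, {"fever"},
      {"cough"}, {"headache"}]

# P(fever) = 3/5, P(cough) = 3/5, P(both) = 2/5, so lift = 0.4 / 0.36 = 10/9.
fever_cough_lift = lift(tx, "fever", "cough")
```

Sketch+ pairs measures like this with ANOVA so that both categorical co-occurrence and numerical group differences feed the same exploratory view.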
J. G. Jiménez, Luiz André Portes Paes Leme, Y. Izquierdo, A. B. Neves, M. Casanova
"A Framework to Compute Entity Relatedness in Large RDF Knowledge Bases"
J. Inf. Data Manag., 2022-09-12. DOI: https://doi.org/10.5753/jidm.2022.2435
Abstract: The entity relatedness problem refers to the question of exploring a knowledge base, represented as an RDF graph, to discover and understand how two entities are connected. This article addresses this problem by combining distributed RDF path search and ranking strategies in a framework called DCoEPinKB, which helps reduce overall execution time in large RDF graphs while maintaining adequate ranking accuracy. The framework allows the implementation of different strategies and enables their comparison. The article also reports experiments with data from DBpedia, which provide insights into the performance of the different strategies.
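At its simplest, the path-search half of the entity relatedness problem is finding connecting paths between two nodes of a triple graph. A minimal breadth-first sketch over made-up DBpedia-like triples; the distribution and path-ranking strategies that are DCoEPinKB's actual contribution are omitted:

```python
from collections import deque

# Illustrative (subject, predicate, object) triples, not real DBpedia data.
triples = [
    ("Alan_Turing", "field", "Computer_Science"),
    ("Ada_Lovelace", "field", "Computer_Science"),
    ("Alan_Turing", "bornIn", "London"),
]

def shortest_path(triples, source, target):
    """BFS for a shortest entity path connecting source and target."""
    adjacency = {}
    for s, p, o in triples:  # treat edges as undirected for connectivity
        adjacency.setdefault(s, []).append(o)
        adjacency.setdefault(o, []).append(s)
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in adjacency.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

path = shortest_path(triples, "Alan_Turing", "Ada_Lovelace")
```

On a real knowledge base many such paths exist between any two entities, which is why the framework's ranking strategies (deciding which connections are meaningful) matter as much as the search itself.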