Beatriz Fragnan P. de Oliveira, A. Valente, M. Victorino, E. Ribeiro, M. Holanda
"Analysis of the Influence of Modeling, Data Format and Processing Tool on the Performance of Hadoop-Hive Based Data Warehouse"
J. Inf. Data Manag., 2022-09-21. DOI: https://doi.org/10.5753/jidm.2022.2516
Abstract: With the emergence of Big Data and the continuous growth of massive data produced by web applications, smartphones, social networks, and other sources, organizations have begun to invest in alternative solutions that derive value from this data. In this context, this article evaluates three factors that can significantly influence the performance of Big Data Hive queries: data modeling, data format, and processing tool. The objective is to present a comparative analysis of Hive platform performance with the snowflake model and the fully denormalized one. Moreover, the influence of two table storage file formats (CSV and Parquet) and two data processing tools (Hadoop and Spark) was also comparatively analyzed. The data used for the analysis are the open data of the Brazilian Army in the Google Cloud environment. The analysis was performed for different data volumes in Hive and for several cluster configuration scenarios. The results show that the Parquet storage format always performed better than CSV, regardless of the model and processing tool selected for the test scenario.
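The snowflake-versus-denormalized comparison in the abstract above boils down to whether dimension lookups are resolved by joins at query time or flattened into the fact table in advance. A minimal sketch, with made-up table and column names (not taken from the article):

```python
# Hypothetical miniature of the comparison: a snowflake schema (fact table
# plus normalized dimension tables) versus one fully denormalized wide table.
fact_sales = [
    {"product_id": 1, "region_id": 10, "amount": 250.0},
    {"product_id": 2, "region_id": 10, "amount": 120.0},
]
dim_product = {1: {"product_name": "rations"}, 2: {"product_name": "fuel"}}
dim_region = {10: {"region_name": "Midwest"}}

def denormalize(facts, products, regions):
    """Join every dimension into the fact rows, producing one wide table."""
    wide = []
    for row in facts:
        flat = dict(row)
        flat.update(products[row["product_id"]])
        flat.update(regions[row["region_id"]])
        wide.append(flat)
    return wide

flat_table = denormalize(fact_sales, dim_product, dim_region)
# Queries over flat_table need no joins at all, trading storage volume for
# query speed -- one of the axes the article measures in Hive.
```

The article's finding that Parquet beats CSV is orthogonal to this choice: either table layout can be stored in either format.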
Lorenna Christ'na Nascimento, Rodolfo P. Chagas, Marcos Lage, Daniel de Oliveira
"Beyond Click-and-View: a Comparative Study of Data Management Approaches for Interactive Visualization"
J. Inf. Data Manag., 2022-09-21. DOI: https://doi.org/10.5753/jidm.2022.2513
Abstract: Visual analytics solutions have grown in popularity in recent years, not only for presenting final results but also for assisting interactive analysis and decision-making. Analysis of large amounts of data requires flexible exploration and visualizations. However, queries that span geographical regions over time slices are expensive to compute, which makes it challenging to achieve interactive speeds on huge data sets. Such systems require efficient data availability, so that response time does not interfere with the user's ability to observe and analyze. At the same time, research in the database domain has proposed solutions that can be used to support visualization systems. This article presents a comparative study of data management approaches to support interactive visualizations. The chosen data management solutions are (i) Apache Drill (a polystore system), (ii) Apache Spark (a big data framework), (iii) Elasticsearch (a search engine), (iv) MonetDB (a column-oriented DBMS), and (v) PostgreSQL (a relational DBMS). To evaluate the performance of each solution, we selected a list of spatiotemporal queries among the many queries submitted by users in a visual analytics system for rainfall data analysis named TEMPO. The results of this study show that Apache Spark and MonetDB present the best performance for the selected queries.
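The spatiotemporal queries the abstract mentions combine a spatial filter with a time slice before aggregating. A minimal stand-in for that query shape, with synthetic readings (the real TEMPO queries run against the evaluated back ends, not plain Python):

```python
from datetime import datetime

# Synthetic rainfall readings: position, timestamp, millimeters of rain.
readings = [
    {"lat": -22.9, "lon": -43.2, "ts": datetime(2022, 1, 5), "mm": 12.0},
    {"lat": -22.8, "lon": -43.3, "ts": datetime(2022, 1, 6), "mm": 30.0},
    {"lat": -10.0, "lon": -50.0, "ts": datetime(2022, 1, 5), "mm": 5.0},
]

def total_rainfall(rows, bbox, start, end):
    """Sum rainfall inside bbox = (min_lat, min_lon, max_lat, max_lon)
    and within the closed time interval [start, end]."""
    min_lat, min_lon, max_lat, max_lon = bbox
    return sum(
        r["mm"] for r in rows
        if min_lat <= r["lat"] <= max_lat
        and min_lon <= r["lon"] <= max_lon
        and start <= r["ts"] <= end
    )

# One region-and-month slice; the third reading falls outside the box.
region_total = total_rainfall(readings, (-23.0, -43.5, -22.7, -43.0),
                              datetime(2022, 1, 1), datetime(2022, 1, 31))
```

The cost of exactly this filter-then-aggregate pattern over large data is what makes the choice of back end (Spark, MonetDB, etc.) matter for interactivity.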
Jaqueline B. Correia, F. Rodrigues, Nicolau O. Santos, Mara Abel, K. Becker
"Data Management in Digital Twins for the Oil and Gas Industry: beyond the OSDU Data Platform"
J. Inf. Data Manag., 2022-09-21. DOI: https://doi.org/10.5753/jidm.2022.2506
Abstract: Competitiveness in the Oil and Gas (O&G) sector has required high technological investments for data-centric decisions. One of the trends is the adoption of Digital Twins (DTs), which use virtual spaces and advanced analytical services to monitor and improve physical spaces. Central to the interconnection of these systems is a Data Fusion Core (DFC) component, which provides data management capabilities. Although the literature has proposed data management functionality in the scope of specific O&G DT applications, several joint efforts towards standardization can be found to deal with data integration and interoperability in the industry. The Open Subsurface Data Universe (OSDU) data platform is an initiative by several partner members of The Open Group consortium, created to eliminate data silos in the O&G ecosystem and leverage innovation through a data-driven approach. In this article, we look at the convergence of this effort in providing data management functionalities for digital twins, highlighting strengths, gaps, and opportunities. We investigated the extent to which the OSDU data platform meets the needs of a DFC implementation, with a focus on interoperability, integration, governance, and data lineage. We also propose additional resources for data management in this context, namely data enrichment, workflows, and data lineage. Our main contributions are: (i) an analysis of possible data management capabilities for creating a working DFC for an O&G DT; and (ii) initial ideas on the complementary roles of OSDU data representation and ontologies, and how this semantic enrichment can be leveraged in the DFC of a DT.
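Data lineage, one of the capabilities the abstract proposes for a DFC, amounts to recording which datasets each derived dataset was produced from. A minimal, hypothetical sketch; the field names and dataset ids are illustrative and are not taken from the OSDU data platform:

```python
from dataclasses import dataclass

@dataclass
class LineageRecord:
    dataset_id: str
    derived_from: list   # ids of upstream datasets
    transformation: str  # description of the step that produced this dataset

lineage = {}

def register(dataset_id, derived_from, transformation):
    lineage[dataset_id] = LineageRecord(dataset_id, derived_from, transformation)

def upstream_sources(dataset_id):
    """Walk the lineage graph back to the raw sources of a dataset."""
    rec = lineage.get(dataset_id)
    if rec is None or not rec.derived_from:
        return {dataset_id}
    sources = set()
    for parent in rec.derived_from:
        sources |= upstream_sources(parent)
    return sources

# Illustrative chain: raw sensor data -> cleaned logs -> analytics output.
register("raw_well_logs", [], "ingested from downhole sensors")
register("cleaned_logs", ["raw_well_logs"], "outlier removal")
register("pressure_model", ["cleaned_logs"], "analytics service output")
```

With such records, a DFC can answer "which raw sources does this model output ultimately depend on?", which is the governance question lineage exists to serve.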
Danielle F. de Albuquerque, Luís Tarrataca, Diego N. Brandão, R. Coutinho
"A Genetic Algorithm with Flexible Fitness Function for Feature Selection in Educational Data: Comparative Evaluation"
J. Inf. Data Manag., 2022-09-21. DOI: https://doi.org/10.5753/jidm.2022.2480
Abstract: Educational Data Mining is an interdisciplinary field that helps understand educational phenomena through computational techniques. The databases of educational institutions are usually extensive, possessing many descriptive attributes that make the prediction process complex. In addition, the data can be sparse, redundant, irrelevant, and noisy, which can degrade the predictive quality of the models and affect computational performance. One way to simplify the problem is to identify the least important attributes and omit them from the modeling process. This can be done by employing attribute selection techniques. This work evaluates different feature selection techniques applied to open educational data and paired with a genetic algorithm with a flexible fitness function. The methods and results described herein extend a previously published paper by: (i) describing a larger set of computational experiments; (ii) performing a hypothesis test over different classifiers; and (iii) presenting a more in-depth literature review. The results obtained indicate an improvement in the classification process.
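Genetic feature selection of the kind the abstract describes encodes each candidate feature subset as a bit mask and evolves a population of masks. A toy sketch under stated assumptions: the scoring function below is synthetic (it pretends to know which features carry signal), whereas the paper scores masks with real classifiers; the "flexible" fitness is modeled as quality minus a tunable penalty on subset size:

```python
import random

random.seed(0)  # deterministic toy run

N_FEATURES = 8
USEFUL = {0, 3, 5}  # synthetic ground truth: only these features carry signal

def fitness(mask, alpha=0.1):
    """Blend of predictive quality and a penalty on subset size."""
    chosen = {i for i, bit in enumerate(mask) if bit}
    quality = len(USEFUL & chosen) / len(USEFUL)
    penalty = alpha * sum(mask) / N_FEATURES  # fewer features is better
    return quality - penalty

def evolve(pop_size=20, generations=40):
    pop = [[random.randint(0, 1) for _ in range(N_FEATURES)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # elitist truncation selection
        children = []
        for _ in range(pop_size - len(survivors)):
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, N_FEATURES)  # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < 0.2:              # bit-flip mutation
                j = random.randrange(N_FEATURES)
                child[j] ^= 1
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

best_mask = evolve()
selected = {i for i, bit in enumerate(best_mask) if bit}
```

Changing `alpha` shifts the trade-off between accuracy and compactness, which is the kind of flexibility the title refers to.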
Jefferson Amará, Victor Ströele, R. Braga, M. Dantas, Michael A. Bauer
"Integrating Heterogeneous Stream and Historical Data Sources using SQL"
J. Inf. Data Manag., 2022-09-12. DOI: https://doi.org/10.5753/jidm.2022.2488
Abstract: Applications capable of integrating data from historical and streaming sources enable more contextualized and enriched decision-making. However, the complexity of data integration over heterogeneous data sources makes querying in this context a hard task. Approaches that facilitate data integration, abstracting details and formats of the primary sources, can meet these needs. This work presents a framework that allows the integration of streaming and historical data in real time, abstracting syntactic aspects of queries through the use of SQL as a standard language for querying heterogeneous sources. The framework was evaluated through an experiment using relational datasets and real data produced by sensors. The results point to the feasibility of the approach.
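The core idea above, one SQL surface over a historical store and a stream buffer, can be miniaturized with an in-memory SQLite database standing in for both sources. This is an illustrative reduction, not the article's framework, which abstracts genuinely heterogeneous back ends:

```python
import sqlite3

# Both "sources" live in one in-memory database purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE historical (sensor TEXT, value REAL)")
conn.execute("CREATE TABLE stream_buffer (sensor TEXT, value REAL)")
conn.executemany("INSERT INTO historical VALUES (?, ?)",
                 [("temp", 21.5), ("temp", 22.0)])

def on_stream_event(sensor, value):
    """Append an arriving stream tuple to the queryable buffer."""
    conn.execute("INSERT INTO stream_buffer VALUES (?, ?)", (sensor, value))

on_stream_event("temp", 23.5)

# One SQL query spans both sources transparently.
row = conn.execute(
    "SELECT AVG(value) FROM "
    "(SELECT value FROM historical UNION ALL SELECT value FROM stream_buffer)"
).fetchone()
avg_temp = row[0]
```

The point of the pattern is that the querying application never needs to know which tuples came from the archive and which arrived seconds ago.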
Christian Schmitz, Jonathan Martins, Serigne K. Mbaye, Edimar Manica, Renata Galante
"ACERPI-Block: Applying Blocking Techniques to the ACERPI Approach"
J. Inf. Data Manag., 2022-09-12. DOI: https://doi.org/10.5753/jidm.2022.2509
Abstract: Ordinances are documents issued by federal institutions that contain, among other things, information regarding their staff. These documents are accessible through public repositories that usually do not allow any filter or advanced search on the documents' contents. This paper extends ACERPI (an approach to collect documents, extract information, and resolve entities from institutional ordinances), which identifies the people mentioned in ordinances to help users find the documents of interest. ACERPI-Block focuses on the Entity Resolution step of the approach, developing blocking strategies that scale to hundreds of thousands of records being resolved. Experiments show a 93.3% reduction in the number of similarity comparisons between records compared to the solution without blocking, with no decrease in efficacy.
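Blocking, the technique ACERPI-Block applies, avoids comparing every pair of records by only comparing records that share a blocking key. A small illustration with made-up names and a deliberately naive key (first token of the lowercased name); it shows where the reduction in comparisons comes from, not the paper's actual strategies:

```python
from collections import defaultdict
from itertools import combinations

records = ["Maria Silva", "maria silva", "Maria S. Silva",
           "Joao Souza", "joao souza", "Ana Lima"]

def blocking_key(name):
    """Naive illustrative key: first token of the normalized name."""
    return name.lower().split()[0]

def candidate_pairs(recs):
    """Group records by blocking key; compare only within each block."""
    blocks = defaultdict(list)
    for r in recs:
        blocks[blocking_key(r)].append(r)
    pairs = []
    for group in blocks.values():
        pairs.extend(combinations(group, 2))
    return pairs

all_pairs = len(list(combinations(records, 2)))  # exhaustive: 15 comparisons
blocked_pairs = len(candidate_pairs(records))    # with blocking: 4 comparisons
```

Even on six records the quadratic comparison count drops sharply; at the hundreds of thousands of records the paper targets, blocking is what makes resolution feasible at all.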
P. G. K. Bertella, Y. K. Lopes, Rafael Alves Paes de Oliveira, A. Carniel
"A Systematic Review of Spatial Approximations in Spatial Database Systems"
J. Inf. Data Manag., 2022-09-12. DOI: https://doi.org/10.5753/jidm.2022.2519
Abstract: Many applications rely on spatial information retrieval, which involves costly computational geometry algorithms to process spatial queries. Spatial approximations simplify the geometric shape of complex spatial objects, allowing faster spatial queries at the expense of result accuracy. In this sense, spatial approximations have been employed to efficiently reduce the number of objects under consideration, followed by a refinement step to restore accuracy. For instance, spatial index structures employ spatial approximations to organize spatial objects in hierarchical structures (e.g., the R-tree). This motivates the study of how spatial approximations can be efficiently employed to improve spatial query processing. This article presents a systematic review of this topic. We gathered relevant studies by running a search string against several digital libraries, and further expanded the studies under consideration with a single iteration of the snowballing approach, tracking the reference lists of selected papers. As a result, we provide an overview and comparison of existing approaches that propose, evaluate, or make use of spatial approximations to optimize the performance of spatial queries. The spatial approximations mentioned by the approaches are also summarized. Further, we characterize the approaches and discuss some future trends.
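The filter-and-refine pattern the abstract describes is easiest to see with the most common spatial approximation, the minimum bounding rectangle (MBR). A sketch with simplified geometry (polygons as point lists, and no actual refinement predicate, since exact polygon intersection is the costly part a real system implements):

```python
def mbr(points):
    """Minimum bounding rectangle (min_x, min_y, max_x, max_y) of a polygon."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))

def mbrs_intersect(a, b):
    """Cheap rectangle-overlap test used as the filter step."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    return ax1 <= bx2 and bx1 <= ax2 and ay1 <= by2 and by1 <= ay2

poly1 = [(0, 0), (2, 0), (1, 2)]
poly2 = [(1, 1), (3, 1), (2, 3)]
poly3 = [(10, 10), (12, 10), (11, 12)]

# Filter step: poly3 is discarded by the MBR test alone; only survivors
# would proceed to the costly exact-geometry refinement step.
candidates = [p for p in (poly2, poly3) if mbrs_intersect(mbr(poly1), mbr(p))]
```

R-trees organize exactly these MBRs hierarchically, so the filter step prunes whole subtrees rather than single objects.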
Marisa S. Franco, Simone Dominico, T. R. Kepe, L. Albini, E. Almeida, M. Alves
"Evaluation of Hash Join Operations Performance Executing on SDN Switches: A Cost Model Approach"
J. Inf. Data Manag., 2022-09-12. DOI: https://doi.org/10.5753/jidm.2022.2515
Abstract: Distributed database systems store and manipulate data on multiple machines. In these systems, the processing cost of query operations is mainly impacted by the data access latency between machines over the network. With recent technological advances in programmable network devices, network switches provide new opportunities for dynamically managing the network topology, enabling data processing on these devices at full network throughput. In this paper, we explore programmable network switches for query processing, evaluating with a cost model the performance of executing the hash join operation. We assume that the storage of the hash table built from the outer relation and the materialization of the join probing are done in switches, using advanced matching techniques similar to packet inspection enabled by Ternary Content-Addressable Memories (TCAM) or SRAM via hashing. Our results show that processing the hash join operation on network switches achieved the best results compared to traditional servers, with average time reductions of 91.82% (Query 10 from TPC-H) and 96.52% (Query 11 from TPC-H).
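The operation being offloaded to switches is the classic two-phase hash join: build a hash table on the outer (smaller) relation, then probe it with each tuple of the inner relation. In plain Python, with illustrative relations rather than TPC-H data:

```python
def hash_join(outer, inner, outer_key, inner_key):
    """Two-phase hash join on dict-shaped tuples."""
    table = {}
    for row in outer:                      # build phase
        table.setdefault(row[outer_key], []).append(row)
    result = []
    for row in inner:                      # probe phase
        for match in table.get(row[inner_key], []):
            result.append({**match, **row})
    return result

suppliers = [{"s_id": 1, "nation": "BR"}, {"s_id": 2, "nation": "CA"}]
parts = [{"p_id": 7, "s_id": 1}, {"p_id": 8, "s_id": 1}, {"p_id": 9, "s_id": 3}]

joined = hash_join(suppliers, parts, "s_id", "s_id")
```

The paper's insight is that the build-table lookup in the probe phase is structurally the same as the key matching a switch already performs on packet headers (via TCAM or hashed SRAM), so the join can run where the data is already flowing.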
M. Cazzolato, L. S. Rodrigues, M. X. Ribeiro, M. A. Gutierrez, C. Traina, A. J. Traina
"Sketch+ for Visual and Correlation-Based Exploratory Data Analysis: A Case Study with COVID-19 Databases"
J. Inf. Data Manag., 2022-09-12. DOI: https://doi.org/10.5753/jidm.2022.2484
Abstract: The amount of data generated daily by different sources grows exponentially and brings new challenges to information technology experts. The recorded data usually include heterogeneous attribute types, such as the traditional date, numerical, textual, and categorical information, as well as complex ones, such as images, videos, and multidimensional data. Simply posing similarity queries over such records can underestimate the semantics and potential usefulness of particular attributes. In this context, Exploratory Data Analysis (EDA) technology is well-suited to understand data and perform knowledge extraction and visualization of existing patterns. In this paper, we propose Sketch+, a technique and a corresponding supporting tool to compare electronic health records (provided by hospitals) by similarity, supporting correlation-based exploratory analysis over attributes of different types and allowing data preprocessing tasks for visualization and knowledge extraction. Sketch+ computes partial and overall data correlation considering distance spaces induced by the attributes. It employs both ANOVA and association rules with lift correlations to study relationships between variables, allowing extensive data analysis. Among the tools provided, a pixel-oriented one drives analysts to observe visual correlations among date, categorical, and numerical attributes. As a running case study, we employed three open databases of COVID-19 cases, showing that specialists can benefit from the inference modules of Sketch+ to analyze electronic records. The study highlights how Sketch+ can be employed to spot strong correlations among tuples and attributes, with statistically significant results. The exploratory analysis has been shown to be an essential complement to similarity search tasks, identifying and evaluating patterns from heterogeneous attributes.
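Lift, one of the correlation measures the abstract names, quantifies how much more often two items co-occur than independence would predict. A small worked example with synthetic transactions standing in for attribute-value pairs from health records:

```python
def lift(transactions, x, y):
    """Lift of the rule x -> y: P(x and y) / (P(x) * P(y)).
    A value above 1 means x and y co-occur more often than chance."""
    n = len(transactions)
    p_x = sum(x in t for t in transactions) / n
    p_y = sum(y in t for t in transactions) / n
    p_xy = sum(x in t and y in t for t in transactions) / n
    return p_xy / (p_x * p_y)

# Synthetic "records": sets of observed symptoms per case.
tx = [{"fever", "cough"}, {"fever", "cough"}, {"fever"},
      {"cough"}, {"headache"}]

# P(fever) = 3/5, P(cough) = 3/5, P(both) = 2/5, so lift = 0.4 / 0.36 = 10/9.
fever_cough_lift = lift(tx, "fever", "cough")
```

Sketch+ pairs measures like this with ANOVA so that both categorical co-occurrence and numerical group differences feed the same exploratory view.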
J. G. Jiménez, Luiz André Portes Paes Leme, Y. Izquierdo, A. B. Neves, M. Casanova
"A Framework to Compute Entity Relatedness in Large RDF Knowledge Bases"
J. Inf. Data Manag., 2022-09-12. DOI: https://doi.org/10.5753/jidm.2022.2435
Abstract: The entity relatedness problem refers to the question of exploring a knowledge base, represented as an RDF graph, to discover and understand how two entities are connected. This article addresses this problem by combining distributed RDF path search and ranking strategies in a framework called DCoEPinKB, which helps reduce overall execution time in large RDF graphs while maintaining adequate ranking accuracy. The framework allows the implementation of different strategies and enables their comparison. The article also reports experiments with data from DBpedia, which provide insights into the performance of the different strategies.
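At its simplest, the path-search half of the entity relatedness problem is finding connecting paths between two nodes of a triple graph. A minimal breadth-first sketch over made-up DBpedia-like triples; the distribution and path-ranking strategies that are DCoEPinKB's actual contribution are omitted:

```python
from collections import deque

# Illustrative (subject, predicate, object) triples, not real DBpedia data.
triples = [
    ("Alan_Turing", "field", "Computer_Science"),
    ("Ada_Lovelace", "field", "Computer_Science"),
    ("Alan_Turing", "bornIn", "London"),
]

def shortest_path(triples, source, target):
    """BFS for a shortest entity path connecting source and target."""
    adjacency = {}
    for s, p, o in triples:  # treat edges as undirected for connectivity
        adjacency.setdefault(s, []).append(o)
        adjacency.setdefault(o, []).append(s)
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for nxt in adjacency.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

path = shortest_path(triples, "Alan_Turing", "Ada_Lovelace")
```

On a real knowledge base many such paths exist between any two entities, which is why the framework's ranking strategies (deciding which connections are meaningful) matter as much as the search itself.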