{"title":"Cardinality estimation in ETL processes","authors":"Maik Thiele, Tim Kiefer, Wolfgang Lehner","doi":"10.1145/1651291.1651302","DOIUrl":"https://doi.org/10.1145/1651291.1651302","url":null,"abstract":"The cardinality estimation in ETL processes is particularly difficult. Aside from the well-known SQL operators, which are also used in ETL processes, there are a variety of operators without exact counterparts in the relational world. In addition to those, we find operators that support very specific data integration aspects. For such operators, there are no well-examined statistic approaches for cardinality estimations. Therefore, we propose a black-box approach and estimate the cardinality using a set of statistic models for each operator. We discuss different model granularities and develop an adaptive cardinality estimation framework for ETL processes. We map the abstract model operators to specific statistic learning approaches (regression, decision trees, support vector machines, etc.) and evaluate our cardinality estimations in an extensive experimental study.","PeriodicalId":335396,"journal":{"name":"International Workshop on Data Warehousing and OLAP","volume":"89 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124453872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Consistency-aware evaluation of OLAP queries in replicated data warehouses","authors":"Javier García-García, C. Ordonez","doi":"10.1145/1651291.1651305","DOIUrl":"https://doi.org/10.1145/1651291.1651305","url":null,"abstract":"OLAP tools for distributed data warehouses generally assume underlying replicated tables are up to date. Unfortunately, maintaining updated replicas is difficult due to the inherent tradeoff between consistency and availability. In this paper, we propose techniques to evaluate OLAP queries in distributed data warehouses assuming a lazy replication model. Considering that it may be admissible to evaluate OLAP queries with slightly outdated replicated tables, our technique first efficiently computes the degree of obsolescence of replicated local tables and when such result is acceptable, given an error threshold, then the query is evaluated locally, avoiding the transmission of large tables over the network. Otherwise, the query can be remotely evaluated less efficiently with the master copy of tables, provided they are stored at a single site. Inconsistency measurement is computed by adapting distributed set reconciliation algorithms to efficiently compute the symmetric difference between the master and replicated tables. Our improved distributed database algorithm has linear communication complexity and cubic time complexity in the size of the symmetric difference, which is expected to be small in a replicated data warehouse. Our technique is independent of the method employed to propagate data warehouse insertions, deletions and updates. We present experiments simulating distributed databases, with different CPU and transmission speeds, showing our method is effective to decide if the query should be evaluated either locally or remotely.","PeriodicalId":335396,"journal":{"name":"International Workshop on Data Warehousing and OLAP","volume":"2015 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127691472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"View usability and safety for the answering of top-k queries via materialized views","authors":"Eftychia Baikousi, Panos Vassiliadis","doi":"10.1145/1651291.1651308","DOIUrl":"https://doi.org/10.1145/1651291.1651308","url":null,"abstract":"In this paper, we investigate the problem of answering top-k queries via materialized views. We provide theoretical guarantees for the adequacy of a view to answer a top-k query, along with algorithmic techniques to compute the query via a view when this is possible. We explore the problem of answering a query via a combination of more than one view and show that it is impossible to improve our theoretical guarantees for the answering of a query via a combination of views. Finally, we experimentally assess our approach for its effectiveness and efficiency.","PeriodicalId":335396,"journal":{"name":"International Workshop on Data Warehousing and OLAP","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125881987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Generating data quality rules and integration into ETL process","authors":"J. Rodic, M. Baranović","doi":"10.1145/1651291.1651303","DOIUrl":"https://doi.org/10.1145/1651291.1651303","url":null,"abstract":"Many data quality projects are integrated into data warehouse projects without enough time allocated for the data quality part, which leads to a need for a quicker data quality process implementation that can be easily adopted as the first stage of data warehouse implementation. We will see that many data quality rules can be implemented in a similar way, and thus generated based on metadata tables that store information about the rules. These generated rules are then used to check data in designated tables and mark erroneous records, or to do certain updates of invalid data. We will also store information about the rules violations in order to provide analysis of such data. This could give a significant insight into our source systems. Entire data quality process will be integrated into ETL process in order to achieve load of data warehouse that is as automated, as correct and as quick as possible. Only small number of records would be left for manual inspection and reprocessing.","PeriodicalId":335396,"journal":{"name":"International Workshop on Data Warehousing and OLAP","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125915090","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Defining ETL worfklows using BPMN and BPEL","authors":"Z. E. Akkaoui, E. Zimányi","doi":"10.1145/1651291.1651299","DOIUrl":"https://doi.org/10.1145/1651291.1651299","url":null,"abstract":"Decisional systems are crucial for enterprise improvement. They allow the consolidation of heterogeneous data from distributed enterprise data stores into strategic indicators. An essential component of this data consolidation is the Extract, Transform, and Load (ETL) process. In the research literature there has been very few work defining conceptual models for ETL processes. At the same time, there are currently many tools that manage such processes. However, each tool uses its own model, which is not necessarily able to communicate with the models of other tools. In this paper, we propose a platform-independent conceptual model of ETL processes based on the Business Process Model Notation (BPMN) standard. We also show how such a conceptual model can be implemented using Business Process Execution Language (BPEL), a standard executable language for specifying interactions with web services.","PeriodicalId":335396,"journal":{"name":"International Workshop on Data Warehousing and OLAP","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133346817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A comprehensive approach to data warehouse testing","authors":"M. Golfarelli, S. Rizzi","doi":"10.1145/1651291.1651295","DOIUrl":"https://doi.org/10.1145/1651291.1651295","url":null,"abstract":"Testing is an essential part of the design life-cycle of any software product. Nevertheless, while most phases of data warehouse design have received considerable attention in the literature, not much has been said about data warehouse testing. In this paper we introduce a number of data mart-specific testing activities, we classify them in terms of what is tested and how it is tested, and we discuss how they can be framed within a reference design methodology.","PeriodicalId":335396,"journal":{"name":"International Workshop on Data Warehousing and OLAP","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130853626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discovering functional dependencies for multidimensional design","authors":"Oscar Romero, Diego Calvanese, A. Abelló, M. Rodriguez-Muro","doi":"10.1145/1651291.1651293","DOIUrl":"https://doi.org/10.1145/1651291.1651293","url":null,"abstract":"Nowadays, it is widely accepted that the data warehouse design task should be largely automated. Furthermore, the data warehouse conceptual schema must be structured according to the multidimensional model and as a consequence, the most common way to automatically look for subjects and dimensions of analysis is by discovering functional dependencies (as dimensions functionally depend on the fact) over the data sources. Most advanced methods for automating the design of the data warehouse carry out this process from relational OLTP systems, assuming that a RDBMS is the most common kind of data source we may find, and taking as starting point a relational schema. In contrast, in our approach we propose to rely instead on a conceptual representation of the domain of interest formalized through a domain ontology expressed in the DL-Lite Description Logic. We propose an algorithm to discover functional dependencies from the domain ontology that exploits the inference capabilities of DL-Lite, thus fully taking into account the semantics of the domain. We also provide an evaluation of our approach in a real-world scenario.","PeriodicalId":335396,"journal":{"name":"International Workshop on Data Warehousing and OLAP","volume":"339 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122543079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic generation of ETL processes from conceptual models","authors":"Lilia Muñoz, J. Mazón, J. Trujillo","doi":"10.1145/1651291.1651298","DOIUrl":"https://doi.org/10.1145/1651291.1651298","url":null,"abstract":"Data warehouses (DW) integrate different data sources in order to give a multidimensional view of them to the decision-maker. To this aim, the ETL (Extraction, Transformation and Load) processes are responsible for extracting data from heterogeneous operational data sources, their transformation (conversion, cleaning, standardization, etc.), and its load in the DW. In recent years, several conceptual modeling approaches have been proposed for designing ETL processes. Although these approaches are very useful for documenting ETL processes and supporting the designer tasks, these proposals fail to give mechanisms to carry out an automatic code generation stage. Such a stage should be required to both avoid fails and save development time in the implementation of complex ETL process. Therefore, in this paper we define an approach for the automatic code generation of ETL processes. To this aim, we align the modeling of ETL processes in DW with MDA (Model Driven Architecture) by formally defining a set of QVT (Query, View, Transformation) transformations.","PeriodicalId":335396,"journal":{"name":"International Workshop on Data Warehousing and OLAP","volume":"148 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128442363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A set of aggregation functions for spatial measures","authors":"J. Silva, V. Times, A. Salgado, Clenúbio Souza, R. Fidalgo, A. Oliveira","doi":"10.1145/1458432.1458438","DOIUrl":"https://doi.org/10.1145/1458432.1458438","url":null,"abstract":"A number of studies have been developed in recent years aimed at integrating pertinent concepts and technologies for analytical multidimensional (OLAP) and geographic (GIS) processing environments. This type of integrated environment has been identified as SOLAP (Spatial OLAP). However, due to the fact that these two technologies were conceived with different purposes in mind, the interaction of the two environments is not an easy task and even with so much research being developed, there remain unresolved issues that merit exploration. One such issue refers to aggregation functions for measures. These functions are currently used in the definition of multidimensional and geographic data cubes. The aim of this paper is to present a set of aggregation functions for geographic measures. We also show these functions in practice, by taking into account their use with a SOLAP architecture prototype. This SOLAP prototype is based on a model for Geographic Data Warehouse (GDW), a data cube model and a geographic multidimensional query language.","PeriodicalId":335396,"journal":{"name":"International Workshop on Data Warehousing and OLAP","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124479926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient OLAP with UDFs","authors":"Zhibo Chen, C. Ordonez","doi":"10.1145/1458432.1458440","DOIUrl":"https://doi.org/10.1145/1458432.1458440","url":null,"abstract":"Since the early 1990s, On-Line Analytical Processing (OLAP) has been a well studied research topic that has focused on implementation outside the database, either with OLAP servers or entirely within the client computers. Our approach involves the computation and storage of OLAP cubes using User-Defined Functions (UDF) with a database management system. UDFs offer users a chance to write their own code that can then called like any other standard SQL function. By generating OLAP cubes within a UDF, we are able to create the entire lattice in main memory. The UDF also allows the user to assert more control over the actual generation process than when using standard OLAP functions such as the CUBE operator. We introduce a data structure that can not only efficiently create an OLAP lattice in main memory, but also be adapted to generate association rule itemsets with minimal change. We experimentally show that the UDF approach is more efficient than SQL using one real dataset and a synthetic dataset. Also, we present several experiments showing that generating association rule itemsets using the UDF approach is comparable to a SQL approach. In this paper, we show that techniques such as OLAP and association rules can be efficiently pushed into the UDF, and has better performance, in most cases, compared to standard SQL functions.","PeriodicalId":335396,"journal":{"name":"International Workshop on Data Warehousing and OLAP","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116879903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}