Bijoux: Data Generator for Evaluating ETL Process Quality
Emona Nakuçi, V. Theodorou, P. Jovanovic, A. Abelló
International Workshop on Data Warehousing and OLAP, 2014-11-07. DOI: 10.1145/2666158.2666183

Abstract: Obtaining the right set of data for evaluating the fulfillment of different quality standards in the extract-transform-load (ETL) process design is rather challenging. First, the real data might be out of reach due to privacy constraints, while providing a synthetic set of data is known to be a labor-intensive task that needs to take various combinations of process parameters into account. Additionally, a single dataset usually does not represent the evolution of data throughout the complete process lifespan, hence missing the plethora of possible test cases. To facilitate this demanding task, in this paper we propose an automatic data generator (i.e., Bijoux). Starting from a given ETL process model, Bijoux extracts the semantics of data transformations, analyzes the constraints they imply over data, and automatically generates testing datasets. At the same time, it considers different dataset and transformation characteristics (e.g., size, distribution, selectivity) in order to cover a variety of test scenarios. We report experimental findings showing the effectiveness and scalability of our approach.
An Advanced Data Warehouse for Integrating Large Sets of GPS Data
O. Andersen, Benjamin B. Krogh, Christian Thomsen, K. Torp
International Workshop on Data Warehousing and OLAP, 2014-11-07. DOI: 10.1145/2666158.2666172

Abstract: GPS data recorded from driving vehicles is available from many sources and is a very good data foundation for answering traffic-related queries. However, most approaches so far have not considered combining GPS data from many sources into a single data warehouse. Further, the integration of GPS data with fuel-consumption data (from the so-called CAN bus in the vehicles) and weather data has not been done. In this paper, we propose a data warehouse design for handling GPS data, fuel-consumption data, and weather data. The design is fully implemented in a running system using the PostgreSQL DBMS. The system has been in production since March 2011, and the main fact table today contains approximately 3.4 billion rows from 16 different data sources. We show that the system can be used for a number of novel traffic-related analyses, such as relating the fuel consumption of vehicles to the road network and road congestion.
{"title":"Big Graph Analytics: The State of the Art and Future Research Agenda","authors":"A. Cuzzocrea, I. Song","doi":"10.1145/2666158.2668454","DOIUrl":"https://doi.org/10.1145/2666158.2668454","url":null,"abstract":"Analytics over big graphs is becoming a first-class challenge in database research, with fast-growing interest from both the academia and the industrial community. This problem arises in several application scenarios, ranging from social networks to large-scale network systems, from knowledge discovery to cybersecurity, and so forth. Following this major trend, this paper explores actual state-of-the-art results in the area of analytics over big graphs and discusses open research issues and actual trends in such area.","PeriodicalId":335396,"journal":{"name":"International Workshop on Data Warehousing and OLAP","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129407081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GOLAM: A Framework for Analyzing Genomic Data
Lorenzo Baldacci, M. Golfarelli, Simone Graziani, S. Rizzi
International Workshop on Data Warehousing and OLAP, 2014-11-07. DOI: 10.1145/2666158.2666175

Abstract: Emerging medical models aim to leverage high-throughput genome-sequencing technologies to better target drugs to patients' personal profiles and so increase their effectiveness. However, the huge amount of data made available by these technologies calls for sophisticated and automated analysis techniques. In this direction we present GOLAM, a framework for OLAP analysis and mining of matches between genomic regions extracted from ENCODE, a worldwide available collection of shared genomic data. The goal of GOLAM is to overcome the current limitations of genome-analysis methods, which are normally based on browsing, by partially automating and speeding up the analysis process on the one hand, and by making it more flexible and introducing a multi-resolution view of the data on the other. The framework has so far been partially implemented; in this paper we focus on conveying its potential and on describing its functional architecture and the underlying data models.
{"title":"Can we analyze big data inside a DBMS?","authors":"C. Ordonez","doi":"10.1145/2513190.2513198","DOIUrl":"https://doi.org/10.1145/2513190.2513198","url":null,"abstract":"Relational DBMSs remain the main data management technology, despite the big data analytics and no-SQL waves. On the other hand, for data analytics in a broad sense, there are plenty of non-DBMS tools including statistical languages, matrix packages, generic data mining programs and large-scale parallel systems, being the main technology for big data analytics. Such large-scale systems are mostly based on the Hadoop distributed file system and MapReduce. Thus it would seem a DBMS is not a good technology to analyze big data, going beyond SQL queries, acting just as a reliable and fast data repository. In this survey, we argue that is not the case, explaining important research that has enabled analytics on large databases inside a DBMS. However, we also argue DBMSs cannot compete with parallel systems like MapReduce to analyze web-scale text data. Therefore, each technology will keep influencing each other. We conclude with a proposal of long-term research issues, considering the \"big data analytics\" trend.","PeriodicalId":335396,"journal":{"name":"International Workshop on Data Warehousing and OLAP","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130100681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using REO on ETL conceptual modelling: a first approach","authors":"Bruno Oliveira, O. Belo","doi":"10.1145/2513190.2513202","DOIUrl":"https://doi.org/10.1145/2513190.2513202","url":null,"abstract":"The formalization of software patterns has proven to be very useful in software developing, improving systems communication, data interchange across platforms, and simplifying the integration of processes and data flows. Populating a data warehouse (ETL) is often a very complex task demanding significant computational resources. It faces many drawbacks during its design and implementation, involving not only large volumes of data that must be processed but also undesirable change of business requirements. All of this leads frequently to reuse significant parts of other ETL implementations, adapting data structures and processes to comply with new requirements. Additionally, we believe that it's necessary a more simply and reliable approach for ETL conceptual modelling covering the \"lack of mature\" of this important part of ETL development. In this paper we explored a new approach to ETL conceptual modelling using the Reo coordination language, trying to evaluate its adequacy and expressiveness on the coordination of ETL tasks. A pattern-based approach was designed to map typical operations used in real world ETL scenarios from an initial Reo specification. For demonstration purposes, we present and discuss as two case studies, a slowly changing dimension and a surrogated key pipelining processes.","PeriodicalId":335396,"journal":{"name":"International Workshop on Data Warehousing and OLAP","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127808087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"INDREX: in-database distributional relation extraction","authors":"T. Kilias, Alexander Löser, Periklis Andritsos","doi":"10.1145/2513190.2513196","DOIUrl":"https://doi.org/10.1145/2513190.2513196","url":null,"abstract":"Relation extraction transforms the textual representation of a relationship into the relational model of a data warehouse. Early systems, such as SystemT by IBM or the open source system GATE solve this task with handcrafted rule sets that the system executes document-by-document. Thereby the user must execute a highly interactive and iterative process of reading a document, of expressing rules, of testing these rules on the next document and of refining rules. Until now, these systems do neither leverage the full potential of built-in declarative query languages nor the indexing and query optimization techniques of a modern RDBMS that would enable a user interactive rule refinement across documents and on the entire corpus. We propose the INDREX system that enables a user for the first time to describe corpus-wide extraction tasks in a declarative language and permits the user to run interactive rule refinement queries. For enabling this powerful functionality we extend a standard PostgreSQL with a set of white-box user-defined functions that enable corpus-wide transformations from sentences into relationships. We store the text corpus and rules in the same RDBMS that already holds domain specific structured data. As a result, (1) the user can leverage this data to further adapt rules to the target domain, (2) the user does not need an additional system for rule extraction and (3) the INDREX system can leverage the full power of built-in indexing and query optimization techniques of the underlaying RDBMS. In a preliminary study we report on the feasibility of this disruptive approach and show multiple queries in INDREX on the Reuters Corpus, Volume 1.","PeriodicalId":335396,"journal":{"name":"International Workshop on Data Warehousing and OLAP","volume":"215 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116823655","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Meta-stars: multidimensional modeling for social business intelligence","authors":"E. Gallinucci, M. Golfarelli, S. Rizzi","doi":"10.1145/2513190.2513195","DOIUrl":"https://doi.org/10.1145/2513190.2513195","url":null,"abstract":"Social business intelligence is the discipline of combining corporate data with user-generated content (UGC) to let decision-makers improve their business based on the trends perceived from the environment. A key role in the analysis of textual UGC is played by topics, meant as specific concepts of interest within a subject area. To enable aggregations of topics at different levels, a topic hierarchy is to be defined. Some attempts have been made to address some of the peculiarities of topic hierarchies, but no comprehensive solution has been found so far. The approach we propose to model topic hierarchies in ROLAP systems is called meta-stars. Its basic idea is to use meta-modeling coupled with navigation tables and with traditional dimension tables: navigation tables support hierarchy instances with different lengths and with non-leaf facts, and allow different roll-up semantics to be explicitly annotated; meta-modeling enables hierarchy heterogeneity and dynamics to be accommodated; dimension tables are easily integrated with standard business hierarchies. After outlining a reference architecture for social business intelligence and describing the meta-star approach, we discuss its effectiveness and efficiency by showing its querying expressiveness and by presenting some experimental results for query performances.","PeriodicalId":335396,"journal":{"name":"International Workshop on Data Warehousing and OLAP","volume":"2011 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131871642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CXT-cube: contextual text cube model and aggregation operator for text OLAP
Lamia Oukid, Ounas Asfari, F. Bentayeb, N. Benblidia, Omar Boussaïd
International Workshop on Data Warehousing and OLAP, 2013-10-28. DOI: 10.1145/2513190.2513201

Abstract: Traditional data warehousing technologies and On-Line Analytical Processing (OLAP) are unable to analyze textual data. Moreover, as the OLAP queries of a decision-maker are generally related to a context, contextual information must be taken into account during the exploitation of data warehouses. We therefore propose a contextual text cube model, denoted CXT-Cube, which considers several contextual factors during OLAP analysis in order to better account for the contextual information associated with textual data. CXT-Cube is characterized by several contextual dimensions, each one related to a contextual factor. In addition, we extend our OLAP aggregation operator for textual data, ORank (OLAP-Rank), to consider all the contextual factors defined in the CXT-Cube model. To validate the model, we perform an experimental study; the preliminary results show the value of our approach for integrating textual data into a data warehouse and improving decision-making.
{"title":"Lazy data structure maintenance for main-memory analytics over sliding windows","authors":"Chang Ge, Lukasz Golab","doi":"10.1145/2513190.2513203","DOIUrl":"https://doi.org/10.1145/2513190.2513203","url":null,"abstract":"We address the problem of maintaining data structures used by memory-resident data warehouses that store sliding windows. We propose a framework that eagerly expires data from the sliding window to save space and/or satisfy data retention policies, but lazily maintains the associated data structures to reduce maintenance overhead. Using a dictionary as an example, we show that our framework enables maintenance algorithms that outperform existing approaches in terms of space overhead, maintenance overhead, and dictionary lookup overhead during query execution.","PeriodicalId":335396,"journal":{"name":"International Workshop on Data Warehousing and OLAP","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129961469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}