{"title":"PolarDBMS: Towards a cost-effective and policy-based data management in the cloud","authors":"Ilir Fetai, Filip-Martin Brinkmann, H. Schuldt","doi":"10.1109/ICDEW.2014.6818323","DOIUrl":"https://doi.org/10.1109/ICDEW.2014.6818323","url":null,"abstract":"The proliferation of Cloud computing has attracted a large variety of applications which are completely deployed on the resources of Cloud providers. As data management is an essential part of these applications, Cloud providers have to deal with many different requirements for data management, depending on the characteristics and guarantees these applications are supposed to have. The objective of a Cloud provider is to support these diverse requirements with a basic set of customizable modules and protocols that can be (dynamically) combined. With the pay-as-you-go cost model of the Cloud, literally every user action and resource usage has a price tag attached to it. Thus, for application providers, it is essential that the needs of their applications are met in a cost-optimized manner. In this paper, we present the work-in-progress PolarDBMS, a flexible and dynamically adaptable system for managing data in the Cloud. PolarDBMS derives policies from application and service objectives. Based on these policies, it will automatically deploy the most efficient and cost-optimized set of modules and protocols and monitor their compliance. If necessary, the modules and/or their customization are changed dynamically at run-time. Several modules and protocols that have already been developed are presented. Additionally, we discuss the challenges that have to be met to fully implement PolarDBMS.","PeriodicalId":302600,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering Workshops","volume":"151 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123741733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards optimization of RDF analytical queries on MapReduce","authors":"P. Ravindra","doi":"10.1109/ICDEW.2014.6818351","DOIUrl":"https://doi.org/10.1109/ICDEW.2014.6818351","url":null,"abstract":"The broadened use of Semantic Web technologies across domains has led to a shift in focus from simple pattern matching queries on RDF data to analytical queries with complex grouping and aggregations. An RDF analytical query involves graph pattern matching, which translates to several join operations due to the fine-grained nature of the RDF data model. Complex analytical queries involve multiple grouping-aggregations on different graph patterns, making such tasks join-intensive. Scale-out processing of RDF analytical queries on existing relational-style MapReduce platforms such as Apache Hive and Pig results in lengthy execution workflows with multiple cycles of I/O and network transfer. Additionally, certain graph patterns result in avoidable redundancy in intermediate results, which negatively impacts processing costs. The PhD thesis summarized in this paper proposes a two-pronged approach to minimize the costs of processing RDF queries on MapReduce: an algebraic approach based on a Nested TripleGroup Data Model and Algebra that reinterprets graph pattern queries in a way that reduces the required number of map-reduce cycles, and special strategies to minimize the redundancy in intermediate data while processing certain graph patterns. The proposed techniques are integrated into Apache Pig. Empirical evaluation of this work for processing graph pattern queries shows 45-60% performance gains over systems such as Pig and Hive.","PeriodicalId":302600,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering Workshops","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115167492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A provenance-based approach to manage long term preservation of scientific data","authors":"Renato Beserra Sousa, D. C. Cugler, Joana G. Malaverri, C. B. Medeiros","doi":"10.1109/ICDEW.2014.6818316","DOIUrl":"https://doi.org/10.1109/ICDEW.2014.6818316","url":null,"abstract":"Long term preservation of scientific data goes beyond the data, and extends to metadata preservation and curation. While several researchers emphasize curation processes, our work is geared towards assessing the quality of scientific (meta)data. The rationale behind this strategy is that scientific data are often accessible via metadata - and thus ensuring metadata quality is a means to provide long term accessibility. This paper discusses our quality assessment architecture, presenting a case study on animal sound recording metadata. Our case study is an example of the importance of periodically assessing (meta)data quality, since knowledge about the world may evolve, and quality may decrease with time, hampering long term preservation.","PeriodicalId":302600,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering Workshops","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128192066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Data and Software Preservation for Open Science (DASPOS)","authors":"M. Hildreth","doi":"10.1109/ICDEW.2014.6818318","DOIUrl":"https://doi.org/10.1109/ICDEW.2014.6818318","url":null,"abstract":"Data and Software Preservation for Open Science (DASPOS) represents a first attempt to establish a formal collaboration tying together physicists from the CMS and ATLAS experiments at the LHC and the Tevatron experiments with experts in digital curation, heterogeneous high-throughput storage systems, large-scale computing systems, and grid access and infrastructure. Recently funded by the National Science Foundation, the project is organizing multiple workshops aimed at understanding use cases for data, software, and knowledge preservation in High Energy Physics and other scientific disciplines, including BioInformatics and Astrophysics. The goal of this project is the technical development and specification of an architecture for curating HEP data and software to the point where the repetition of a physics analysis using only the archived data, software, and analysis description is possible. The novelty of this effort is its holistic approach, where not only data but also the software and frameworks necessary to use the data are part of the preservation effort, making it true \u201cphysics preservation\u201d rather than merely data preservation.","PeriodicalId":302600,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering Workshops","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134102234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Overlap versus partition: Marketing classification and customer profiling in complex networks of products","authors":"Diego Pennacchioli, M. Coscia, D. Pedreschi","doi":"10.1109/ICDEW.2014.6818312","DOIUrl":"https://doi.org/10.1109/ICDEW.2014.6818312","url":null,"abstract":"In recent years we have witnessed an explosion in the availability of data regarding human and customer behavior in the market. This data richness era has fostered the development of useful applications for understanding how markets and the minds of customers work. In this paper we focus on the analysis of complex networks based on customer behavior. Complex network analysis has provided a new and wide toolbox for the classic data mining task of clustering. With community discovery, i.e. the detection of functional modules in complex networks, we are now able to group together customers and products using a variety of different criteria. The aim of this paper is to explore this new analytic degree of freedom. We are interested in providing a case study uncovering the meaning of different community discovery algorithms on a network of products connected together because they are co-purchased by the same customers. We focus on the different interpretations of a partition approach, where each product belongs to a single community, versus an overlapping approach, where each product can belong to multiple communities. We found that the former is useful to improve the marketing classification of products, while the latter is able to create a collection of different customer profiles.","PeriodicalId":302600,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering Workshops","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134416106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterizing comparison shopping behavior: A case study","authors":"Mona Gupta, Happy Mittal, Parag Singla, A. Bagchi","doi":"10.1109/ICDEW.2014.6818314","DOIUrl":"https://doi.org/10.1109/ICDEW.2014.6818314","url":null,"abstract":"In this work we study the behavior of users of an online comparison shopping service using session traces collected over one year from an Indian mobile phone comparison website: http://smartprix.com. There are two aspects to our study: data analysis and behavior prediction. The first aspect, data analysis, is geared towards providing insights into user behavior that could enable vendors to offer the right kinds of products and prices, and that could help the comparison shopping engine to customize search based on user preferences. We discover the correlation between the search queries users write before arriving at the site and their subsequent behavior on it. We have also studied the distribution of users based on geographic location, time of the day, day of the week, number of sessions which have a click to buy (convert), repeat users, and phones/brands visited and compared. We analyze the impact of price change on the popularity of a product and how special events such as the launch of a new model affect the popularity of a brand. Our analysis corroborates intuitions such as an increasing price leading to a decrease in popularity and vice-versa. Further, we characterize the time lag in the effect of such phenomena on popularity. We characterize user behavior on the website in terms of sequences of transitions between multiple states (defined in terms of the kind of page being visited, e.g. home, visit, compare, etc.). We use KL divergence to show that a time-homogeneous Markov chain is the right model for session traces when the number of clicks varies from 5 to 30. Finally, we build a model using Markov logic that uses the history of the user's activity in a session to predict whether a user is going to click to convert in that session. Our methodology of combining data analysis with machine learning is, in our opinion, a new approach to the empirical study of such data sets.","PeriodicalId":302600,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering Workshops","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130977683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SortingHat: A framework for deep matching between classes of entities","authors":"Sumant Kulkarni, S. Srinivasa, Jyotiska Nath Khasnabish, K. Nagal, Sandeep G. Kurdagi","doi":"10.1109/ICDEW.2014.6818309","DOIUrl":"https://doi.org/10.1109/ICDEW.2014.6818309","url":null,"abstract":"This paper addresses the problem of \u201cdeep matching\u201d - matching different classes of entities based on latent underlying semantics, rather than just their visible attributes. An example of this is the \u201cautomatic task assignment\u201d problem, where several tasks have to be assigned to people with varied skill-sets and experiences. Datasets showing types of entities (tasks and people), along with their involvement with other concepts, are used as the basis for deep matching. This paper describes a work in progress, a deep matching application called SortingHat. We analyze issue tracking data of a large corporation containing task descriptions and assignments to people that were computed manually. We identify several entities and concepts from the dataset and build a co-occurrence graph as the basic data structure for computing deep matches. We then propose a set of query primitives that can establish several forms of semantic matching across different classes of entities.","PeriodicalId":302600,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering Workshops","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125190403","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SLA-driven workload management for cloud databases","authors":"D. Stamatakis, Olga Papaemmanouil","doi":"10.1109/ICDEW.2014.6818324","DOIUrl":"https://doi.org/10.1109/ICDEW.2014.6818324","url":null,"abstract":"Despite the fast growth and increased adoption of cloud databases, challenges related to Service-Level-Agreement (SLA) specification and management still exist. Supporting application-specific performance goals and SLAs, assigning incoming query processing workloads to the reserved resources to avoid SLA violations, and monitoring performance factors to ensure acceptable QoS levels are some of the critical tasks that have not yet been addressed by the database community. In this position paper, we argue that SLA management for cloud databases should itself be offered to developers as a cloud-based automated service. Towards this goal, we discuss the design of a framework that a) enables the specification of custom application-level performance SLAs and b) offers workload management mechanisms that can automatically customize their functionality towards meeting these application-specific SLAs.","PeriodicalId":302600,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering Workshops","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115133807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LODHub — A platform for sharing and integrated processing of linked open data","authors":"Stefan Hagedorn, K. Sattler","doi":"10.1109/ICDEW.2014.6818336","DOIUrl":"https://doi.org/10.1109/ICDEW.2014.6818336","url":null,"abstract":"In this paper we discuss the need for a new platform that combines existing solutions for publishing and sharing linked open data with the infrastructure of services for exploring, processing, and analyzing data across multiple data sets. We identify various requirements for such a platform, describe the architecture, and sketch initial results of our prototype.","PeriodicalId":302600,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering Workshops","volume":"253 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123098514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cinderella — Adaptive online partitioning of irregularly structured data","authors":"K. Herrmann, H. Voigt, Wolfgang Lehner","doi":"10.1109/ICDEW.2014.6818342","DOIUrl":"https://doi.org/10.1109/ICDEW.2014.6818342","url":null,"abstract":"In an increasing number of use cases, databases face the challenge of managing irregularly structured data. Irregularly structured data is characterized by a quickly evolving variety of entities without a common set of attributes. These entities do not show enough regularity to be captured in a traditional database schema. A common solution is to centralize the diverse entities in a universal table. Usually, this leads to a very sparse table. Although today's techniques allow efficient storage of sparse universal tables, query efficiency is still a problem. Queries that reference only a subset of attributes have to read the whole universal table, including many irrelevant entities. One possible solution is to use a partitioning of the table, which allows pruning partitions of irrelevant entities before they are touched. Creating and maintaining such a partitioning manually is very laborious or even infeasible due to the enormous complexity. Thus, an autonomous solution is desirable. In this paper, we define the Online Partitioning Problem for irregularly structured data and present Cinderella. Cinderella is an autonomous online algorithm for horizontal partitioning of irregularly structured entities in universal tables. It is designed to keep its overhead low by incrementally assigning entities to partitions while they are touched anyway during modifications. The achieved partitioning allows queries that retrieve only entities with a subset of attributes to easily prune partitions of irrelevant entities. Cinderella increases the locality of queries and reduces query execution cost.","PeriodicalId":302600,"journal":{"name":"2014 IEEE 30th International Conference on Data Engineering Workshops","volume":"287 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116565053","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}