{"title":"Query optimization using column statistics in hive","authors":"Anja Gruenheid, E. Omiecinski, L. Mark","doi":"10.1145/2076623.2076636","DOIUrl":"https://doi.org/10.1145/2076623.2076636","url":null,"abstract":"Hive is a data warehousing solution on top of the Hadoop MapReduce framework that has been designed to handle large amounts of data and store them in tables like a relational database management system or a conventional data warehouse while using the parallelization and batch processing functionalities of the Hadoop MapReduce framework to speed up the execution of queries. Data inserted into Hive is stored in the Hadoop FileSystem (HDFS), which is part of the Hadoop MapReduce framework. To make the data accessible to the user, Hive uses a query language similar to SQL, which is called HiveQL. When a query is issued in HiveQL, it is translated by a parser into a query execution plan that is optimized and then turned into a series of map and reduce iterations. These iterations are then executed on the data stored in the HDFS, writing the output to a file. The goal of this work is to develop an approach for improving the performance of the HiveQL queries executed in the Hive framework. For that purpose, we introduce an extension to the Hive MetaStore which stores metadata that has been extracted on the column level of the user database. These column level statistics are then used for example in combination with join ordering algorithms which are adapted to the specific needs of the Hadoop MapReduce environment to improve the overall performance of the HiveQL query execution.","PeriodicalId":93615,"journal":{"name":"Proceedings. International Database Engineering and Applications Symposium","volume":"31 1","pages":"97-105"},"PeriodicalIF":0.0,"publicationDate":"2011-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89770805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient incremental breadth-depth XML event mining","authors":"Rashed K. Salem, J. Darmont, Omar Boussaïd","doi":"10.1145/2076623.2076649","DOIUrl":"https://doi.org/10.1145/2076623.2076649","url":null,"abstract":"Many applications log a large amount of events continuously. Extracting interesting knowledge from logged events is an emerging active research area in data mining. In this context, we propose an approach for mining frequent events and association rules from logged events in XML format. This approach is composed of two main phases: I) constructing a novel tree structure called Frequency XML-based Tree (FXT), which contains the frequency of events to be mined; II) querying the constructed FXT using XQuery to discover frequent itemsets and association rules. The FXT is constructed with a single pass over logged data. We implement the proposed algorithm and study various performance issues. The performance study shows that the algorithm is efficient, for both constructing the FXT and discovering association rules.","PeriodicalId":93615,"journal":{"name":"Proceedings. International Database Engineering and Applications Symposium","volume":"10 1","pages":"197-203"},"PeriodicalIF":0.0,"publicationDate":"2011-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90361801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Databases on the web: national web domain survey","authors":"Denis Shestakov","doi":"10.1145/2076623.2076646","DOIUrl":"https://doi.org/10.1145/2076623.2076646","url":null,"abstract":"The deep Web, the part of the Web consisting of web pages filled with information from myriads of online databases, is to date relatively unexplored. Even its basic characteristics such as, for instance, the number of searchable databases on the Web are disputable. In this paper, we address the problem of accurate estimation of the deep Web by sampling one national web domain. We report some of our results obtained when surveying the Russian Web. The survey findings, namely the size estimates of the deep Web, could be useful for further studies to handle data in the deep Web.","PeriodicalId":93615,"journal":{"name":"Proceedings. International Database Engineering and Applications Symposium","volume":"6 1","pages":"179-184"},"PeriodicalIF":0.0,"publicationDate":"2011-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74210991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Addressing resource usage in stream processing systems: sizing window effect","authors":"Sabina Surdu, Vasile-Marian Scuturici","doi":"10.1145/2076623.2076658","DOIUrl":"https://doi.org/10.1145/2076623.2076658","url":null,"abstract":"Stream processing systems compute continuous queries over increasingly large volumes of data, as monitoring applications emerge in a broad array of fields. These systems need to satisfy application-dependent constraints, one of the most important ones being accuracy demands and query response times. As system resources are limited, various query optimization techniques are proposed. To the best of our knowledge, none of the existing methods takes into account the size of the window, which is input to a query. We believe resource usage can be tackled with a novel approach, that attempts to compute an optimal window size for a given continuous query, thereby placing a minimal upper bound on the resource consumption for that query.","PeriodicalId":93615,"journal":{"name":"Proceedings. International Database Engineering and Applications Symposium","volume":"374 1","pages":"247-248"},"PeriodicalIF":0.0,"publicationDate":"2011-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75512881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Union rewritings for XPath fragments","authors":"F. Afrati, M. Damigos, M. Gergatsoulis","doi":"10.1145/2076623.2076630","DOIUrl":"https://doi.org/10.1145/2076623.2076630","url":null,"abstract":"In this paper, we study the problem of finding an equivalent rewriting of an XPath query using multiple views, and we show that the union operator may be required in order to find such a rewriting. In particular, focusing on the fragment of XPath containing both descendant edges and wildcard labels, we propose an algorithm that outputs a union of single-view rewritings (if there exists any) which equivalently rewrites a given query. For the same fragment of XPath, we give necessary and sufficient conditions for query containment and equivalence of unions of queries.","PeriodicalId":93615,"journal":{"name":"Proceedings. International Database Engineering and Applications Symposium","volume":"5 1","pages":"43-51"},"PeriodicalIF":0.0,"publicationDate":"2011-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75860951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Query answering on trajectory cuboids using prime numbers encodings","authors":"E. Masciari","doi":"10.1145/2076623.2076652","DOIUrl":"https://doi.org/10.1145/2076623.2076652","url":null,"abstract":"Trajectory data streams are huge amounts of data pertaining to time and position of moving objects generated by different sources continuously using a wide variety of technologies (e.g., RFID tags, GPS, GSM networks). Mining such amounts of data is challenging, since the possibility to extract useful information from this peculiar kind of data is crucial in many application scenarios such as vehicle traffic management, hand-off in cellular networks, supply chain management. Moreover, spatial data streams pose interesting challenges both for their proper definition and acquisition, thus making the mining process harder than for classical point data. In this paper, we address the problem of trajectory data streams On Line Analytical Processing, which proved really challenging as we deal with data (trajectories) for which the order of elements is relevant. We propose an end to end framework in order to make the querying step quite effective. We performed several tests on real world datasets that confirmed the efficiency and effectiveness of the proposed techniques.","PeriodicalId":93615,"journal":{"name":"Proceedings. International Database Engineering and Applications Symposium","volume":"1 1","pages":"214-218"},"PeriodicalIF":0.0,"publicationDate":"2011-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89735483","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An efficient local region and clustering-based ensemble system for intrusion detection","authors":"H. Huu, Nouria Harbi, J. Darmont","doi":"10.1145/2076623.2076647","DOIUrl":"https://doi.org/10.1145/2076623.2076647","url":null,"abstract":"The dramatic proliferation of sophisticated cyber attacks, in conjunction with the ever growing use of Internet-based services and applications, is nowadays becoming a great concern in any organization. Among many efficient security solutions proposed in the literature to deal with this evolving threat, ensemble approaches, a particular family of data mining, have proven very successful in designing high performance intrusion detection systems (IDSs) resting on the mutual combination of multiple classifiers. However, the strength of ensemble systems depends heavily on the methods to generate and combine individual classifiers. In this thread, we propose a novel design method to generate a robust ensemble-based IDS. In our approach, individual classifiers are built using both the input feature space and additional features exploited from k-means clustering. In addition, the ensemble combination is calculated based on the classification ability of classifiers on different local data regions defined in form of k-means clustering. Experimental results prove that our solution is superior to several well-known methods.","PeriodicalId":93615,"journal":{"name":"Proceedings. International Database Engineering and Applications Symposium","volume":"17 1","pages":"185-191"},"PeriodicalIF":0.0,"publicationDate":"2011-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78254538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semantics-enabled web APIs selection patterns","authors":"D. Bianchini, V. D. Antonellis, M. Melchiori","doi":"10.1145/2076623.2076650","DOIUrl":"https://doi.org/10.1145/2076623.2076650","url":null,"abstract":"The design of Web applications from third-party Web APIs can be shortened by providing effective tools that abstract from heterogeneity of Web API descriptions and support the designer for their proactive selection. In this paper, we identify Web API selection patterns to support interactive and proactive Web application development according to an exploratory perspective. Selection patterns rely on a semantic characterization of Web API descriptions that abstracts from implementation details and semantics-enabled metrics for evaluation of coupling and similarity degree. A prototype tool that implements selection patterns is also presented.","PeriodicalId":93615,"journal":{"name":"Proceedings. International Database Engineering and Applications Symposium","volume":"230 1","pages":"204-208"},"PeriodicalIF":0.0,"publicationDate":"2011-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77507609","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A new architectural paradigm for content-based web applications: Borè","authors":"Antonio Bevacqua, M. Carnuccio, R. Ortale, E. Ritacco","doi":"10.1145/2076623.2076648","DOIUrl":"https://doi.org/10.1145/2076623.2076648","url":null,"abstract":"The Web is an evolving system, which tries to adapt to the needs of users. The transition to Web2.0, and, currently, to Web3.0, are the expression of this trend: the goal is to focus on the leading role of the end user in Web browsing, which should be supported by adequate tools. In this paper, we propose Borè, an architectural paradigm for developing content-based web applications based on cooperative interaction, whose foundations are based on the principles of the model Web3.0. The proposed architecture is extremely innovative in three respects. The first one is the possibility of defining, organizing, storing, querying and displaying the information as customizable objects and relations: a non-expert user can create the Web that he/she may prefer. A second aspect is the realization of social networks (Social Cooperations), which spontaneously arise, through user resource sharing. Finally, there is the possibility of analyzing users' browsing activities, through learning tools that enable the user to enrich his/her Web browsing experience with new knowledge.","PeriodicalId":93615,"journal":{"name":"Proceedings. International Database Engineering and Applications Symposium","volume":"104 1","pages":"192-196"},"PeriodicalIF":0.0,"publicationDate":"2011-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79194482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable queries for large datasets using cloud computing: a case study","authors":"James P. McGlothlin, L. Khan","doi":"10.1145/2076623.2076626","DOIUrl":"https://doi.org/10.1145/2076623.2076626","url":null,"abstract":"Cloud computing is rapidly growing in popularity as a solution for processing and retrieving huge amounts of data over clusters of inexpensive commodity hardware. The most common data model utilized by cloud computing software is the NoSQL data model. While this data model is extremely scalable, it is much more efficient for simple retrievals and scans than for the complex analytical queries typical in a relational database model. In this paper, we evaluate emerging cloud computing technologies using a representative use case. Our use case involves analyzing telecommunications logs for performance monitoring and quality assurance. Clearly, the size of such logs is growing exponentially as more devices communicate more frequently and the amount of data being transferred steadily increases. We analyze potential solutions to provide a scalable database which supports both retrieval and analysis. We will investigate and analyze all the major open source cloud computing solutions and designs. We then choose the most applicable subset of these technologies for experimentation. We provide a performance evaluation of these products, and we analyze our results and make recommendations. This paper provides a comprehensive survey of technologies for scalable data processing and an in-depth performance evaluation of these technologies.","PeriodicalId":93615,"journal":{"name":"Proceedings. International Database Engineering and Applications Symposium","volume":"57 6 1","pages":"8-16"},"PeriodicalIF":0.0,"publicationDate":"2011-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77750913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}